Speed up transaction completion after many relations are accessed in a transaction
Hello,
The attached patch speeds up transaction completion when any prior transaction accessed many relations in the same session.
The transaction records its own acquired lock information in LOCALLOCK structures (a pair of lock object and lock mode). It stores the LOCALLOCKs in a hash table in its local memory. The hash table automatically expands when the transaction acquires locks on many relations, but it never shrinks. When the transaction commits or aborts, it scans the whole hash table to find the LOCALLOCKs whose locks need to be released.
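For reference, the lookup that populates this hash table looks roughly like the following (condensed from LockAcquireExtended() in lock.c; this is existing code, not part of the patch):

    LOCALLOCKTAG localtag;
    LOCALLOCK  *locallock;
    bool        found;

    /* Identify the lock by its LOCKTAG and the requested lock mode */
    MemSet(&localtag, 0, sizeof(localtag));    /* must clear padding */
    localtag.lock = *locktag;
    localtag.mode = lockmode;

    /* Find or create a LOCALLOCK entry in the backend-local hash table */
    locallock = (LOCALLOCK *) hash_search(LockMethodLocalHash,
                                          (void *) &localtag,
                                          HASH_ENTER, &found);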
The problem is that once some transaction accesses many relations, even subsequent transactions in the same session that access only a few relations take an unreasonably long time to complete, because each of them has to scan the expanded hash table.
The attached patch links LOCALLOCKs to PGPROC, so that releasing locks only needs to scan that list instead of the whole hash table. The hash table is kept because some functions need to find a particular LOCALLOCK quickly by its hash value.
This problem was uncovered while evaluating partitioning performance. When the application PREPAREs a statement once and then runs EXECUTE-COMMIT repeatedly, the server creates a generic plan on the 6th EXECUTE. Unfortunately, creation of the generic plan of UPDATE/DELETE currently accesses all partitions of the target table (this itself needs improvement), expanding the LOCALLOCK hash table. As a result, the 7th and later EXECUTEs get slower.
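For illustration, a session like the following triggers the behavior (table and partition names are just examples; the many CREATE TABLE ... PARTITION OF statements are elided):

    CREATE TABLE t (id int, val int) PARTITION BY RANGE (id);
    -- ... create a large number of partitions of t here ...
    PREPARE upd (int) AS UPDATE t SET val = 0 WHERE id = $1;
    EXECUTE upd(1);   -- executions 1 through 5 use custom plans
    EXECUTE upd(2);
    -- ...
    EXECUTE upd(6);   -- the 6th execution builds the generic plan, locking
                      -- every partition and expanding the LOCALLOCK hash table
    EXECUTE upd(7);   -- this and every later transaction in the session pays
                      -- for scanning the bloated hash table at commit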
Imai-san confirmed performance improvement with this patch:
https://commitfest.postgresql.org/22/1993/
Regards
Takayuki Tsunakawa
Attachments:
faster-locallock-scan.patch (application/octet-stream)
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 3bb5ce3..9475fe1 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -794,6 +794,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
*/
if (!found)
{
+ dlist_push_head(&MyProc->localLocks, &locallock->procLink);
locallock->lock = NULL;
locallock->proclock = NULL;
locallock->hashcode = LockTagHashCode(&(localtag.lock));
@@ -1320,6 +1321,7 @@ RemoveLocalLock(LOCALLOCK *locallock)
SpinLockRelease(&FastPathStrongRelationLocks->mutex);
}
+ dlist_delete(&locallock->procLink);
if (!hash_search(LockMethodLocalHash,
(void *) &(locallock->tag),
HASH_REMOVE, NULL))
@@ -2088,7 +2090,7 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
void
LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
{
- HASH_SEQ_STATUS status;
+ dlist_mutable_iter iter;
LockMethod lockMethodTable;
int i,
numLockModes;
@@ -2126,10 +2128,10 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
* pointers. Fast-path locks are cleaned up during the locallock table
* scan, though.
*/
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &MyProc->localLocks)
{
+ locallock = dlist_container(LOCALLOCK, procLink, iter.cur);
+
/*
* If the LOCALLOCK entry is unused, we must've run out of shared
* memory while trying to set up this lock. Just forget the local
@@ -2362,16 +2364,16 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
void
LockReleaseSession(LOCKMETHODID lockmethodid)
{
- HASH_SEQ_STATUS status;
+ dlist_mutable_iter iter;
LOCALLOCK *locallock;
if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
elog(ERROR, "unrecognized lock method: %d", lockmethodid);
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &MyProc->localLocks)
{
+ locallock = dlist_container(LOCALLOCK, procLink, iter.cur);
+
/* Ignore items that are not of the specified lock method */
if (LOCALLOCK_LOCKMETHOD(*locallock) != lockmethodid)
continue;
@@ -2394,13 +2396,14 @@ LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks)
{
if (locallocks == NULL)
{
- HASH_SEQ_STATUS status;
+ dlist_mutable_iter iter;
LOCALLOCK *locallock;
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &MyProc->localLocks)
+ {
+ locallock = dlist_container(LOCALLOCK, procLink, iter.cur);
ReleaseLockIfHeld(locallock, false);
+ }
}
else
{
@@ -2493,13 +2496,14 @@ LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks)
if (locallocks == NULL)
{
- HASH_SEQ_STATUS status;
+ dlist_mutable_iter iter;
LOCALLOCK *locallock;
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &MyProc->localLocks)
+ {
+ locallock = dlist_container(LOCALLOCK, procLink, iter.cur);
LockReassignOwner(locallock, parent);
+ }
}
else
{
@@ -3133,8 +3137,7 @@ LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
void
AtPrepare_Locks(void)
{
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
+ dlist_mutable_iter iter;
/*
* For the most part, we don't need to touch shared memory for this ---
@@ -3142,10 +3145,9 @@ AtPrepare_Locks(void)
* Fast-path locks are an exception, however: we move any such locks to
* the main table before allowing PREPARE TRANSACTION to succeed.
*/
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &MyProc->localLocks)
{
+ LOCALLOCK *locallock = dlist_container(LOCALLOCK, procLink, iter.cur);
TwoPhaseLockRecord record;
LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
bool haveSessionLock;
@@ -3244,8 +3246,7 @@ void
PostPrepare_Locks(TransactionId xid)
{
PGPROC *newproc = TwoPhaseGetDummyProc(xid);
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
+ dlist_mutable_iter iter;
LOCK *lock;
PROCLOCK *proclock;
PROCLOCKTAG proclocktag;
@@ -3267,10 +3268,9 @@ PostPrepare_Locks(TransactionId xid)
* pointing to the same proclock, and we daren't end up with any dangling
* pointers.
*/
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &MyProc->localLocks)
{
+ LOCALLOCK *locallock = dlist_container(LOCALLOCK, procLink, iter.cur);
LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
bool haveSessionLock;
bool haveXactLock;
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 89c80fb..a22d73a 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -389,6 +389,7 @@ InitProcess(void)
MyProc->lwWaitMode = 0;
MyProc->waitLock = NULL;
MyProc->waitProcLock = NULL;
+ dlist_init(&MyProc->localLocks);
#ifdef USE_ASSERT_CHECKING
{
int i;
@@ -568,6 +569,7 @@ InitAuxiliaryProcess(void)
MyProc->lwWaitMode = 0;
MyProc->waitLock = NULL;
MyProc->waitProcLock = NULL;
+ dlist_init(&MyProc->localLocks);
#ifdef USE_ASSERT_CHECKING
{
int i;
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 16b927c..20887f4 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -18,6 +18,7 @@
#error "lock.h may not be included from frontend code"
#endif
+#include "lib/ilist.h"
#include "storage/lockdefs.h"
#include "storage/backendid.h"
#include "storage/lwlock.h"
@@ -406,6 +407,7 @@ typedef struct LOCALLOCK
/* data */
LOCK *lock; /* associated LOCK object, if any */
PROCLOCK *proclock; /* associated PROCLOCK object, if any */
+ dlist_node procLink; /* list link in PGPROC's list of LOCALLOCKs */
uint32 hashcode; /* copy of LOCKTAG's hash value */
int64 nLocks; /* total number of times lock is held */
bool holdsStrongLockCount; /* bumped FastPathStrongRelationLocks */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index d203acb..031e004 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -138,6 +138,7 @@ struct PGPROC
/* waitLock and waitProcLock are NULL if not currently waiting. */
LOCK *waitLock; /* Lock object we're sleeping on ... */
PROCLOCK *waitProcLock; /* Per-holder info for awaited lock */
+ dlist_head localLocks; /* List of LOCALLOCKs */
LOCKMODE waitLockMode; /* type of lock we're waiting for */
LOCKMASK heldLocks; /* bitmask for lock types already held on this
* lock object by this backend */
"Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> writes:
The attached patch speeds up transaction completion when any prior transaction accessed many relations in the same session.
Hm. Putting a list header for a purely-local data structure into shared
memory seems quite ugly. Isn't there a better place to keep that?
Do we really want a dlist here at all? I'm concerned that bloating
LOCALLOCK will cost us when there are many locks involved. This patch
increases the size of LOCALLOCK by 25% if I counted right, which does
not seem like a negligible penalty.
My own thought about how to improve this situation was just to destroy
and recreate LockMethodLocalHash at transaction end (or start)
if its size exceeded $some-value. Leaving it permanently bloated seems
like possibly a bad idea, even if we get rid of all the hash_seq_searches
on it.
regards, tom lane
On Tue, 19 Feb 2019 at 12:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:
My own thought about how to improve this situation was just to destroy
and recreate LockMethodLocalHash at transaction end (or start)
if its size exceeded $some-value. Leaving it permanently bloated seems
like possibly a bad idea, even if we get rid of all the hash_seq_searches
on it.
That seems like a good idea. Although, it would be good to know that
it didn't add too much overhead dropping and recreating the table when
every transaction happened to obtain more locks than $some-value. If
it did, then maybe we could track the average number of locks held by recent
transactions and just ditch the table after the locks are released if
the locks held by the last transaction exceeded the average *
1.something. No need to go near shared memory to do that.
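A minimal sketch of that bookkeeping, assuming it runs once per transaction just before the locks are released (hash_get_num_entries() is the existing dynahash entry counter; LocalLockAverage, LOCALLOCK_SHRINK_FACTOR and ShrinkLocalLockHash() are made-up names for illustration only):

    static double LocalLockAverage = 0;    /* recent locks-per-transaction */
    #define LOCALLOCK_SHRINK_FACTOR 1.5    /* the "1.something" above */

    static void
    ConsiderShrinkingLocalLockHash(void)
    {
        long    nlocks = hash_get_num_entries(LockMethodLocalHash);

        /* exponentially weighted moving average over recent transactions */
        LocalLockAverage = (LocalLockAverage * 9 + nlocks) / 10;

        if (nlocks > LocalLockAverage * LOCALLOCK_SHRINK_FACTOR)
            ShrinkLocalLockHash();    /* hash_destroy() + hash_create() */
    }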
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,
On 2019-02-19 12:52:08 +1300, David Rowley wrote:
On Tue, 19 Feb 2019 at 12:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:
My own thought about how to improve this situation was just to destroy
and recreate LockMethodLocalHash at transaction end (or start)
if its size exceeded $some-value. Leaving it permanently bloated seems
like possibly a bad idea, even if we get rid of all the hash_seq_searches
on it.
That seems like a good idea. Although, it would be good to know that
it didn't add too much overhead dropping and recreating the table when
every transaction happened to obtain more locks than $some-value. If
it did, then maybe we could track the average number of locks held by recent
transactions and just ditch the table after the locks are released if
the locks held by the last transaction exceeded the average *
1.something. No need to go near shared memory to do that.
Isn't a large portion of benefits in this patch going to be mooted by
the locking improvements discussed in the other threads? I.e. there's
hopefully not going to be a ton of cases with low overhead where we
acquire a lot of locks and release them very soon after. Sure, for DDL
etc we will, but I can't see this mattering from a performance POV?
I'm not against doing something like Tom proposes, but heuristics with
magic constants like this tend to age poorly / are hard to tune well
across systems.
Greetings,
Andres Freund
David Rowley <david.rowley@2ndquadrant.com> writes:
On Tue, 19 Feb 2019 at 12:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:
My own thought about how to improve this situation was just to destroy
and recreate LockMethodLocalHash at transaction end (or start)
if its size exceeded $some-value. Leaving it permanently bloated seems
like possibly a bad idea, even if we get rid of all the hash_seq_searches
on it.
That seems like a good idea. Although, it would be good to know that
it didn't add too much overhead dropping and recreating the table when
every transaction happened to obtain more locks than $some-value. If
it did, then maybe we could track the average number of locks held by recent
transactions and just ditch the table after the locks are released if
the locks held by the last transaction exceeded the average *
1.something. No need to go near shared memory to do that.
Yeah, I'd deliberately avoided saying how we'd choose $some-value ;-).
Making it adaptive might not be a bad plan.
regards, tom lane
Andres Freund <andres@anarazel.de> writes:
Isn't a large portion of benefits in this patch going to be mooted by
the locking improvements discussed in the other threads? I.e. there's
hopefully not going to be a ton of cases with low overhead where we
acquire a lot of locks and release them very soon after. Sure, for DDL
etc we will, but I can't see this mattering from a performance POV?
Mmm ... AIUI, the patches currently proposed can only help for what
David called "point lookup" queries. There are still going to be
queries that scan a large proportion of a partition tree, so if you've
got tons of partitions, you'll be concerned about this sort of thing.
I'm not against doing something like Tom proposes, but heuristics with
magic constants like this tend to age poorly / are hard to tune well
across systems.
I didn't say it had to be a constant ...
regards, tom lane
On Tue, 19 Feb 2019 at 12:56, Andres Freund <andres@anarazel.de> wrote:
Isn't a large portion of benefits in this patch going to be mooted by
the locking improvements discussed in the other threads? I.e. there's
hopefully not going to be a ton of cases with low overhead where we
acquire a lot of locks and release them very soon after. Sure, for DDL
etc we will, but I can't see this mattering from a performance POV?
I think this patch was born from Amit's partition planner improvement
patch. If not that one, which other threads did you have in mind?
A problem exists where, when using a PREPAREd statement to query a
partitioned table containing many partitions, a generic plan will never
be favoured over a custom plan, since the generic plan might not be able
to prune partitions like the custom plan can. The actual problem is that
at some point we do need to generate a generic plan in order to know that
it's more costly, and that requires locking possibly every partition.
When plan_cache_mode = auto, this is done
on the 6th execution of the statement. After Amit's partition planner
changes [1], the custom plan will only lock partitions that are not
pruned, so the 6th execution of the statement bloats the local lock
table.
[1]: https://commitfest.postgresql.org/22/1778/
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,
On 2019-02-18 19:01:06 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
Isn't a large portion of benefits in this patch going to be mooted by
the locking improvements discussed in the other threads? I.e. there's
hopefully not going to be a ton of cases with low overhead where we
acquire a lot of locks and release them very soon after. Sure, for DDL
etc we will, but I can't see this mattering from a performance POV?
Mmm ... AIUI, the patches currently proposed can only help for what
David called "point lookup" queries. There are still going to be
queries that scan a large proportion of a partition tree, so if you've
got tons of partitions, you'll be concerned about this sort of thing.
Agreed - but it seems not unlikely that for those the rest of the
planner / executor overhead will entirely swamp any improvement we could
make here. If I understand correctly the benchmarks here were made with
"point" update and select queries, although the reference in the first
post in this thread is a bit vague.
I'm not against doing something like Tom proposes, but heuristics with
magic constants like this tend to age poorly / are hard to tune well
across systems.
I didn't say it had to be a constant ...
Do you have a good idea?
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2019-02-18 19:01:06 -0500, Tom Lane wrote:
Mmm ... AIUI, the patches currently proposed can only help for what
David called "point lookup" queries. There are still going to be
queries that scan a large proportion of a partition tree, so if you've
got tons of partitions, you'll be concerned about this sort of thing.
Agreed - but it seems not unlikely that for those the rest of the
planner / executor overhead will entirely swamp any improvement we could
make here. If I understand correctly the benchmarks here were made with
"point" update and select queries, although the reference in the first
post in this thread is a bit vague.
I think what Maumau-san is on about here is that not only does your
$giant-query take a long time, but it has a permanent negative effect
on all subsequent transactions in the session. That seems worth
doing something about.
I didn't say it had to be a constant ...
Do you have a good idea?
I think David's on the right track --- keep some kind of moving average of
the LOCALLOCK table size for each transaction, and nuke it if it exceeds
some multiple of the recent average. Not sure offhand about how to get
the data cheaply --- it might not be sufficient to look at transaction
end, if we release LOCALLOCK entries before that (but do we?)
regards, tom lane
Hi,
On 2019-02-18 18:42:32 -0500, Tom Lane wrote:
"Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> writes:
The attached patch speeds up transaction completion when any prior transaction accessed many relations in the same session.
Hm. Putting a list header for a purely-local data structure into shared
memory seems quite ugly. Isn't there a better place to keep that?
Yea, I think it'd be just as fine to store that in a static
variable (best defined directly beside LockMethodLocalHash).
(Btw, I'd be entirely unsurprised if moving away from a dynahash for
LockMethodLocalHash would be beneficial)
Do we really want a dlist here at all? I'm concerned that bloating
LOCALLOCK will cost us when there are many locks involved. This patch
increases the size of LOCALLOCK by 25% if I counted right, which does
not seem like a negligible penalty.
It's currently
struct LOCALLOCK {
LOCALLOCKTAG tag; /* 0 20 */
/* XXX 4 bytes hole, try to pack */
LOCK * lock; /* 24 8 */
PROCLOCK * proclock; /* 32 8 */
uint32 hashcode; /* 40 4 */
/* XXX 4 bytes hole, try to pack */
int64 nLocks; /* 48 8 */
_Bool holdsStrongLockCount; /* 56 1 */
_Bool lockCleared; /* 57 1 */
/* XXX 2 bytes hole, try to pack */
int numLockOwners; /* 60 4 */
/* --- cacheline 1 boundary (64 bytes) --- */
int maxLockOwners; /* 64 4 */
/* XXX 4 bytes hole, try to pack */
LOCALLOCKOWNER * lockOwners; /* 72 8 */
/* size: 80, cachelines: 2, members: 10 */
/* sum members: 66, holes: 4, sum holes: 14 */
/* last cacheline: 16 bytes */
};
seems we could trivially squeeze most of the bytes for a dlist node out
of padding.
My own thought about how to improve this situation was just to destroy
and recreate LockMethodLocalHash at transaction end (or start)
if its size exceeded $some-value. Leaving it permanently bloated seems
like possibly a bad idea, even if we get rid of all the hash_seq_searches
on it.
OTOH, that'll force constant incremental resizing of the hashtable, for
workloads that regularly need a lot of locks. And I'd assume in most
cases if one transaction needs a lot of locks it's quite likely that
future ones will need a lot of locks, too.
Greetings,
Andres Freund
Hi,
On 2019-02-18 19:13:31 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2019-02-18 19:01:06 -0500, Tom Lane wrote:
Mmm ... AIUI, the patches currently proposed can only help for what
David called "point lookup" queries. There are still going to be
queries that scan a large proportion of a partition tree, so if you've
got tons of partitions, you'll be concerned about this sort of thing.
Agreed - but it seems not unlikely that for those the rest of the
planner / executor overhead will entirely swamp any improvement we could
make here. If I understand correctly the benchmarks here were made with
"point" update and select queries, although the reference in the first
post in this thread is a bit vague.
I think what Maumau-san is on about here is that not only does your
$giant-query take a long time, but it has a permanent negative effect
on all subsequent transactions in the session. That seems worth
doing something about.
Ah, yes, that makes sense. I'm inclined to think however that the
original approach presented in this thread is better than the
reset-the-whole-hashtable approach. Because:
I think David's on the right track --- keep some kind of moving average of
the LOCALLOCK table size for each transaction, and nuke it if it exceeds
some multiple of the recent average. Not sure offhand about how to get
the data cheaply --- it might not be sufficient to look at transaction
end, if we release LOCALLOCK entries before that (but do we?)
Seems too complicated for my taste. And it doesn't solve the issue of
having some transactions with few locks (say because the plan can be
nicely pruned) interspersed with transactions where a lot of locks are
acquired.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2019-02-18 18:42:32 -0500, Tom Lane wrote:
Do we really want a dlist here at all? I'm concerned that bloating
LOCALLOCK will cost us when there are many locks involved. This patch
increases the size of LOCALLOCK by 25% if I counted right, which does
not seem like a negligible penalty.
It's currently [ 80 bytes with several padding holes ]
seems we could trivially squeeze most of the bytes for a dlist node out
of padding.
Yeah, but if we want to rearrange the members into an illogical order
to save some space, we should do that independently of this patch ---
and then the overhead of this patch would be even worse than 25%.
regards, tom lane
Hi,
On 2019-02-18 19:24:54 -0500, Tom Lane wrote:
Yeah, but if we want to rearrange the members into an illogical order
to save some space, we should do that independently of this patch ---
Sure, we should do that. I don't buy the "illogical" bit, just moving
hashcode up to after tag isn't more or less logical, and saves most of
the padding, and moving the booleans to the end isn't better/worse
either.
You always bring up that argument. While I agree that sometimes the most
optimal ordering can be less natural, I think most of the time it vastly
overstates how intelligent the original ordering was. Often new elements
were either just added iteratively without consideration for padding, or
the attention to padding was paid in 32bit times.
I don't find
struct LOCALLOCK {
LOCALLOCKTAG tag; /* 0 20 */
uint32 hashcode; /* 20 4 */
LOCK * lock; /* 24 8 */
PROCLOCK * proclock; /* 32 8 */
int64 nLocks; /* 40 8 */
int numLockOwners; /* 48 4 */
int maxLockOwners; /* 52 4 */
LOCALLOCKOWNER * lockOwners; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
_Bool holdsStrongLockCount; /* 64 1 */
_Bool lockCleared; /* 65 1 */
/* size: 72, cachelines: 2, members: 10 */
/* padding: 6 */
/* last cacheline: 8 bytes */
};
less clear than
struct LOCALLOCK {
LOCALLOCKTAG tag; /* 0 20 */
/* XXX 4 bytes hole, try to pack */
LOCK * lock; /* 24 8 */
PROCLOCK * proclock; /* 32 8 */
uint32 hashcode; /* 40 4 */
/* XXX 4 bytes hole, try to pack */
int64 nLocks; /* 48 8 */
_Bool holdsStrongLockCount; /* 56 1 */
_Bool lockCleared; /* 57 1 */
/* XXX 2 bytes hole, try to pack */
int numLockOwners; /* 60 4 */
/* --- cacheline 1 boundary (64 bytes) --- */
int maxLockOwners; /* 64 4 */
/* XXX 4 bytes hole, try to pack */
LOCALLOCKOWNER * lockOwners; /* 72 8 */
/* size: 80, cachelines: 2, members: 10 */
/* sum members: 66, holes: 4, sum holes: 14 */
/* last cacheline: 16 bytes */
};
but it's smaller (although there's plenty of trailing space).
and then the overhead of this patch would be even worse than 25%.
IDK, we, including you, very often make largely independent improvements
to make the cost of something else more palatable. Why's that not OK
here? Especially because we're not comparing to an alternative where no
cost is added, keeping track of e.g. a running average of the hashtable
size isn't free either; nor does it help in the intermittent cases.
- Andres
Andres Freund <andres@anarazel.de> writes:
On 2019-02-18 19:24:54 -0500, Tom Lane wrote:
Yeah, but if we want to rearrange the members into an illogical order
to save some space, we should do that independently of this patch ---
Sure, we should do that. I don't buy the "illogical" bit, just moving
hashcode up to after tag isn't more or less logical, and saves most of
the padding, and moving the booleans to the end isn't better/worse
either.
I hadn't looked at the details closely, but if we can squeeze out the
padding space without any loss of intelligibility, sure let's do so.
I still say that's independent of whether to adopt this patch though.
but it's smaller (although there's plenty of trailing space).
I think you're supposing that these things are independently palloc'd, but
they aren't --- dynahash lays them out in arrays without palloc padding.
IDK, we, including you, very often make largely independent improvements
to make the cost of something else more palatable. Why's that not OK
here?
When we do that, we aren't normally talking about overheads as high as
25% (even more, if it's measured as I think it ought to be). What I'm
concerned about is that the patch is being advocated for cases where
there are lots of LOCALLOCK entries --- which is exactly where the
space overhead is going to hurt the most.
Especially because we're not comparing to an alternative where no
cost is added, keeping track of e.g. a running average of the hashtable
size isn't free either; nor does it help in the intermittent cases.
What I was hoping for --- though perhaps it's not achievable --- was
statistical overhead amounting to just a few more instructions per
transaction. Adding dlist linking would add more instructions per
hashtable entry/removal, which seems like it'd be a substantially
bigger time penalty. As for the intermittent-usage issue, that largely
depends on the details of the when-to-reset heuristic, which we don't
have a concrete design for yet. But I could certainly imagine it waiting
for a few transactions before deciding to chomp.
Anyway, I'm not trying to veto the patch in this form, just suggesting
that there are alternatives worth thinking about.
regards, tom lane
Hi,
On 2019-02-18 20:29:29 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
but it's smaller (although there's plenty of trailing space).
I think you're supposing that these things are independently palloc'd, but
they aren't --- dynahash lays them out in arrays without palloc padding.
I don't think that matters, given that the trailing six bytes are
included in sizeof() (and have to be, to guarantee suitable alignment in
arrays etc).
Greetings,
Andres Freund
On Tue, 19 Feb 2019 at 00:20, Andres Freund <andres@anarazel.de> wrote:
On 2019-02-18 19:13:31 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2019-02-18 19:01:06 -0500, Tom Lane wrote:
Mmm ... AIUI, the patches currently proposed can only help for what
David called "point lookup" queries. There are still going to be
queries that scan a large proportion of a partition tree, so if you've
got tons of partitions, you'll be concerned about this sort of thing.
I think what Maumau-san is on about here is that not only does your
$giant-query take a long time, but it has a permanent negative effect
on all subsequent transactions in the session. That seems worth
doing something about.
Ah, yes, that makes sense. I'm inclined to think however that the
original approach presented in this thread is better than the
reset-the-whole-hashtable approach.
If it was just many-tables then blowing away the hash table would work fine.
The main issue seems to be with partitioning, not with the general case of
many-tables. For that case, resetting the hash table seems like too much.
Can we use our knowledge of the structure of locks, i.e. partition locks
are all children of the partitioned table, to do a better job?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2/12/19 7:33 AM, Tsunakawa, Takayuki wrote:
...
This problem was uncovered while evaluating partitioning performance.
When the application PREPAREs a statement once and then
EXECUTE-COMMIT repeatedly, the server creates a generic plan on the
6th EXECUTE. Unfortunately, creation of the generic plan of
UPDATE/DELETE currently accesses all partitions of the target table
(this itself needs improvement), expanding the LOCALLOCK hash table.
As a result, the 7th and later EXECUTEs get slower.
Imai-san confirmed performance improvement with this patch:
Can you quantify the effects? That is, how much slower/faster does it get?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Tomas Vondra [mailto:tomas.vondra@2ndquadrant.com]
On 2/12/19 7:33 AM, Tsunakawa, Takayuki wrote:
Imai-san confirmed performance improvement with this patch:
Can you quantify the effects? That is, how much slower/faster does it get?
Ugh, sorry, I wrote a wrong URL. The correct page is:
/messages/by-id/0F97FA9ABBDBE54F91744A9B37151A512787EC@g01jpexmbkw24
The quoted figures are:
[v20 + faster-locallock-scan.patch]
auto: 9,069 TPS
custom: 9,015 TPS
[v20]
auto: 8,037 TPS
custom: 8,933 TPS
In the originally problematic case, plan_cache_mode = auto (the default), we can see about a 13% improvement.
Regards
Takayuki Tsunakawa
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Hm. Putting a list header for a purely-local data structure into shared
memory seems quite ugly. Isn't there a better place to keep that?
Agreed. I put it in a global variable.
Do we really want a dlist here at all? I'm concerned that bloating
LOCALLOCK will cost us when there are many locks involved. This patch
increases the size of LOCALLOCK by 25% if I counted right, which does
not seem like a negligible penalty.
To delete the LOCALLOCK in RemoveLocalLock(), we need a dlist; an slist would require the list iterator to be passed in from the callers.
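To spell that out (the dlist_delete() call is what the patch does; the slist lines show what a hypothetical slist-based version would need, using the existing ilist.h API with slist_head/slist_node fields instead):

    /* dlist: the node carries a back-link, so unlinking needs only the node */
    dlist_delete(&locallock->procLink);

    /* slist: no back-link, so removal is either an O(n) walk from the head... */
    slist_delete(&LocalLocks, &locallock->procLink);
    /* ...or needs the caller's slist_mutable_iter: slist_delete_current(&iter) */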
From: Andres Freund [mailto:andres@anarazel.de]
Sure, we should do that. I don't buy the "illogical" bit, just moving
hashcode up to after tag isn't more or less logical, and saves most of
the padding, and moving the booleans to the end isn't better/worse
either.
I don't find
Thanks, I've done it.
From: Simon Riggs [mailto:simon@2ndquadrant.com]
Can we use our knowledge of the structure of locks, i.e. partition locks
are all children of the partitioned table, to do a better job?
I couldn't come up with an idea.
Regards
Takayuki Tsunakawa
Attachments:
faster-locallock-scan_v2.patch (application/octet-stream)
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 3bb5ce3..8c55d50 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -255,6 +255,10 @@ static HTAB *LockMethodProcLockHash;
static HTAB *LockMethodLocalHash;
+/* list of LOCALLOCK structures that each backend acquired */
+static dlist_head LocalLocks = DLIST_STATIC_INIT(LocalLocks);
+
+
/* private state for error cleanup */
static LOCALLOCK *StrongLockInProgress;
static LOCALLOCK *awaitedLock;
@@ -794,6 +798,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
*/
if (!found)
{
+ dlist_push_head(&LocalLocks, &locallock->procLink);
locallock->lock = NULL;
locallock->proclock = NULL;
locallock->hashcode = LockTagHashCode(&(localtag.lock));
@@ -1320,6 +1325,7 @@ RemoveLocalLock(LOCALLOCK *locallock)
SpinLockRelease(&FastPathStrongRelationLocks->mutex);
}
+ dlist_delete(&locallock->procLink);
if (!hash_search(LockMethodLocalHash,
(void *) &(locallock->tag),
HASH_REMOVE, NULL))
@@ -2088,7 +2094,7 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
void
LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
{
- HASH_SEQ_STATUS status;
+ dlist_mutable_iter iter;
LockMethod lockMethodTable;
int i,
numLockModes;
@@ -2126,10 +2132,10 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
* pointers. Fast-path locks are cleaned up during the locallock table
* scan, though.
*/
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &LocalLocks)
{
+ locallock = dlist_container(LOCALLOCK, procLink, iter.cur);
+
/*
* If the LOCALLOCK entry is unused, we must've run out of shared
* memory while trying to set up this lock. Just forget the local
@@ -2362,16 +2368,16 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
void
LockReleaseSession(LOCKMETHODID lockmethodid)
{
- HASH_SEQ_STATUS status;
+ dlist_mutable_iter iter;
LOCALLOCK *locallock;
if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
elog(ERROR, "unrecognized lock method: %d", lockmethodid);
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &LocalLocks)
{
+ locallock = dlist_container(LOCALLOCK, procLink, iter.cur);
+
/* Ignore items that are not of the specified lock method */
if (LOCALLOCK_LOCKMETHOD(*locallock) != lockmethodid)
continue;
@@ -2394,13 +2400,14 @@ LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks)
{
if (locallocks == NULL)
{
- HASH_SEQ_STATUS status;
+ dlist_mutable_iter iter;
LOCALLOCK *locallock;
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &LocalLocks)
+ {
+ locallock = dlist_container(LOCALLOCK, procLink, iter.cur);
ReleaseLockIfHeld(locallock, false);
+ }
}
else
{
@@ -2493,13 +2500,14 @@ LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks)
if (locallocks == NULL)
{
- HASH_SEQ_STATUS status;
+ dlist_mutable_iter iter;
LOCALLOCK *locallock;
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &LocalLocks)
+ {
+ locallock = dlist_container(LOCALLOCK, procLink, iter.cur);
LockReassignOwner(locallock, parent);
+ }
}
else
{
@@ -3133,8 +3141,7 @@ LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
void
AtPrepare_Locks(void)
{
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
+ dlist_mutable_iter iter;
/*
* For the most part, we don't need to touch shared memory for this ---
@@ -3142,10 +3149,9 @@ AtPrepare_Locks(void)
* Fast-path locks are an exception, however: we move any such locks to
* the main table before allowing PREPARE TRANSACTION to succeed.
*/
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &LocalLocks)
{
+ LOCALLOCK *locallock = dlist_container(LOCALLOCK, procLink, iter.cur);
TwoPhaseLockRecord record;
LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
bool haveSessionLock;
@@ -3244,8 +3250,7 @@ void
PostPrepare_Locks(TransactionId xid)
{
PGPROC *newproc = TwoPhaseGetDummyProc(xid);
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
+ dlist_mutable_iter iter;
LOCK *lock;
PROCLOCK *proclock;
PROCLOCKTAG proclocktag;
@@ -3267,10 +3272,9 @@ PostPrepare_Locks(TransactionId xid)
* pointing to the same proclock, and we daren't end up with any dangling
* pointers.
*/
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &LocalLocks)
{
+ LOCALLOCK *locallock = dlist_container(LOCALLOCK, procLink, iter.cur);
LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
bool haveSessionLock;
bool haveXactLock;
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 16b927c..5e39e36 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -18,6 +18,7 @@
#error "lock.h may not be included from frontend code"
#endif
+#include "lib/ilist.h"
#include "storage/lockdefs.h"
#include "storage/backendid.h"
#include "storage/lwlock.h"
@@ -404,15 +405,16 @@ typedef struct LOCALLOCK
LOCALLOCKTAG tag; /* unique identifier of locallock entry */
/* data */
+ uint32 hashcode; /* copy of LOCKTAG's hash value */
LOCK *lock; /* associated LOCK object, if any */
PROCLOCK *proclock; /* associated PROCLOCK object, if any */
- uint32 hashcode; /* copy of LOCKTAG's hash value */
+ dlist_node procLink; /* list link in PGPROC's list of LOCALLOCKs */
int64 nLocks; /* total number of times lock is held */
- bool holdsStrongLockCount; /* bumped FastPathStrongRelationLocks */
- bool lockCleared; /* we read all sinval msgs for lock */
int numLockOwners; /* # of relevant ResourceOwners */
int maxLockOwners; /* allocated size of array */
LOCALLOCKOWNER *lockOwners; /* dynamically resizable array */
+ bool holdsStrongLockCount; /* bumped FastPathStrongRelationLocks */
+ bool lockCleared; /* we read all sinval msgs for lock */
} LOCALLOCK;
#define LOCALLOCK_LOCKMETHOD(llock) ((llock).tag.lock.locktag_lockmethodid)
On 2019-02-20 07:20, Tsunakawa, Takayuki wrote:
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Hm. Putting a list header for a purely-local data structure into shared
memory seems quite ugly. Isn't there a better place to keep that?
Agreed. I put it in a global variable.
I think there is agreement on the principles of this patch. Perhaps it
could be polished a bit.
Your changes in LOCALLOCK still refer to PGPROC, from your first version
of the patch.
I think the reordering of struct members could be done as a separate
preliminary patch.
Some more documentation in the comment before dlist_head LocalLocks to
explain this whole mechanism would be nice.
You posted a link to some performance numbers, but I didn't see the test
setup explained there. I'd like to get some more information on the
impact of this. Is there an effect with 100 tables, or do you need 100000?
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Peter, Imai-san,
From: Peter Eisentraut [mailto:peter.eisentraut@2ndquadrant.com]
Your changes in LOCALLOCK still refer to PGPROC, from your first version
of the patch.
I think the reordering of struct members could be done as a separate
preliminary patch.
Some more documentation in the comment before dlist_head LocalLocks to
explain this whole mechanism would be nice.
Fixed.
You posted a link to some performance numbers, but I didn't see the test
setup explained there. I'd like to get some more information on the
impact of this. Is there an effect with 100 tables, or do you need 100000?
Imai-san, can you tell us the test setup?
Regards
Takayuki Tsunakawa
Attachments:
0001-reorder-LOCALLOCK-structure-members-to-compact-the-s.patch (application/octet-stream)
From bdfb135bd1ab9aee6be01308a67edc9a3f479f2f Mon Sep 17 00:00:00 2001
From: Takayuki Tsunakawa <tsunakawa.takay@jp.fujitsu.com>
Date: Tue, 19 Mar 2019 16:43:01 +0900
Subject: [PATCH 1/2] reorder LOCALLOCK structure members to compact the size
---
src/include/storage/lock.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 16b927c..badf7fd 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -404,15 +404,15 @@ typedef struct LOCALLOCK
LOCALLOCKTAG tag; /* unique identifier of locallock entry */
/* data */
+ uint32 hashcode; /* copy of LOCKTAG's hash value */
LOCK *lock; /* associated LOCK object, if any */
PROCLOCK *proclock; /* associated PROCLOCK object, if any */
- uint32 hashcode; /* copy of LOCKTAG's hash value */
int64 nLocks; /* total number of times lock is held */
- bool holdsStrongLockCount; /* bumped FastPathStrongRelationLocks */
- bool lockCleared; /* we read all sinval msgs for lock */
int numLockOwners; /* # of relevant ResourceOwners */
int maxLockOwners; /* allocated size of array */
LOCALLOCKOWNER *lockOwners; /* dynamically resizable array */
+ bool holdsStrongLockCount; /* bumped FastPathStrongRelationLocks */
+ bool lockCleared; /* we read all sinval msgs for lock */
} LOCALLOCK;
#define LOCALLOCK_LOCKMETHOD(llock) ((llock).tag.lock.locktag_lockmethodid)
--
2.10.1
0002-speed-up-LOCALLOCK-scan.patch (application/octet-stream)
From 9e0cbbe3d6c4ec2f316f2a42d464076a0452d26e Mon Sep 17 00:00:00 2001
From: Takayuki Tsunakawa <tsunakawa.takay@jp.fujitsu.com>
Date: Tue, 19 Mar 2019 16:46:51 +0900
Subject: [PATCH 2/2] speed up LOCALLOCK scan
---
src/backend/storage/lmgr/lock.c | 63 ++++++++++++++++++++++++-----------------
src/include/storage/lock.h | 2 ++
2 files changed, 39 insertions(+), 26 deletions(-)
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 3bb5ce3..4b66577 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -255,6 +255,17 @@ static HTAB *LockMethodProcLockHash;
static HTAB *LockMethodLocalHash;
+/*
+ * List of LOCALLOCK structures that each backend acquired
+ *
+ * If a transaction acquires many locks, LockMethodLocalHash bloats, making
+ * the hash table scans in subsequent transactions (e.g., in LockReleaseAll)
+ * slow even though they only acquire a few locks. To speed up iteration over
+ * acquired locks in a backend, we use a list of LOCALLOCKs instead.
+ */
+static dlist_head LocalLocks = DLIST_STATIC_INIT(LocalLocks);
+
+
/* private state for error cleanup */
static LOCALLOCK *StrongLockInProgress;
static LOCALLOCK *awaitedLock;
@@ -794,6 +805,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
*/
if (!found)
{
+ dlist_push_head(&LocalLocks, &locallock->procLink);
locallock->lock = NULL;
locallock->proclock = NULL;
locallock->hashcode = LockTagHashCode(&(localtag.lock));
@@ -1320,6 +1332,7 @@ RemoveLocalLock(LOCALLOCK *locallock)
SpinLockRelease(&FastPathStrongRelationLocks->mutex);
}
+ dlist_delete(&locallock->procLink);
if (!hash_search(LockMethodLocalHash,
(void *) &(locallock->tag),
HASH_REMOVE, NULL))
@@ -2088,7 +2101,7 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
void
LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
{
- HASH_SEQ_STATUS status;
+ dlist_mutable_iter iter;
LockMethod lockMethodTable;
int i,
numLockModes;
@@ -2126,10 +2139,10 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
* pointers. Fast-path locks are cleaned up during the locallock table
* scan, though.
*/
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &LocalLocks)
{
+ locallock = dlist_container(LOCALLOCK, procLink, iter.cur);
+
/*
* If the LOCALLOCK entry is unused, we must've run out of shared
* memory while trying to set up this lock. Just forget the local
@@ -2362,16 +2375,16 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
void
LockReleaseSession(LOCKMETHODID lockmethodid)
{
- HASH_SEQ_STATUS status;
+ dlist_mutable_iter iter;
LOCALLOCK *locallock;
if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
elog(ERROR, "unrecognized lock method: %d", lockmethodid);
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &LocalLocks)
{
+ locallock = dlist_container(LOCALLOCK, procLink, iter.cur);
+
/* Ignore items that are not of the specified lock method */
if (LOCALLOCK_LOCKMETHOD(*locallock) != lockmethodid)
continue;
@@ -2394,13 +2407,14 @@ LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks)
{
if (locallocks == NULL)
{
- HASH_SEQ_STATUS status;
+ dlist_mutable_iter iter;
LOCALLOCK *locallock;
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &LocalLocks)
+ {
+ locallock = dlist_container(LOCALLOCK, procLink, iter.cur);
ReleaseLockIfHeld(locallock, false);
+ }
}
else
{
@@ -2493,13 +2507,14 @@ LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks)
if (locallocks == NULL)
{
- HASH_SEQ_STATUS status;
+ dlist_mutable_iter iter;
LOCALLOCK *locallock;
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &LocalLocks)
+ {
+ locallock = dlist_container(LOCALLOCK, procLink, iter.cur);
LockReassignOwner(locallock, parent);
+ }
}
else
{
@@ -3133,8 +3148,7 @@ LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
void
AtPrepare_Locks(void)
{
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
+ dlist_mutable_iter iter;
/*
* For the most part, we don't need to touch shared memory for this ---
@@ -3142,10 +3156,9 @@ AtPrepare_Locks(void)
* Fast-path locks are an exception, however: we move any such locks to
* the main table before allowing PREPARE TRANSACTION to succeed.
*/
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &LocalLocks)
{
+ LOCALLOCK *locallock = dlist_container(LOCALLOCK, procLink, iter.cur);
TwoPhaseLockRecord record;
LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
bool haveSessionLock;
@@ -3244,8 +3257,7 @@ void
PostPrepare_Locks(TransactionId xid)
{
PGPROC *newproc = TwoPhaseGetDummyProc(xid);
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
+ dlist_mutable_iter iter;
LOCK *lock;
PROCLOCK *proclock;
PROCLOCKTAG proclocktag;
@@ -3267,10 +3279,9 @@ PostPrepare_Locks(TransactionId xid)
* pointing to the same proclock, and we daren't end up with any dangling
* pointers.
*/
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &LocalLocks)
{
+ LOCALLOCK *locallock = dlist_container(LOCALLOCK, procLink, iter.cur);
LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
bool haveSessionLock;
bool haveXactLock;
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index badf7fd..6bb907d 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -18,6 +18,7 @@
#error "lock.h may not be included from frontend code"
#endif
+#include "lib/ilist.h"
#include "storage/lockdefs.h"
#include "storage/backendid.h"
#include "storage/lwlock.h"
@@ -407,6 +408,7 @@ typedef struct LOCALLOCK
uint32 hashcode; /* copy of LOCKTAG's hash value */
LOCK *lock; /* associated LOCK object, if any */
PROCLOCK *proclock; /* associated PROCLOCK object, if any */
+ dlist_node procLink; /* list link in a backend's list of LOCALLOCKs */
int64 nLocks; /* total number of times lock is held */
int numLockOwners; /* # of relevant ResourceOwners */
int maxLockOwners; /* allocated size of array */
--
2.10.1
Hi Tsunakawa-san, Peter
On Tue, Mar 19, 2019 at 7:53 AM, Tsunakawa, Takayuki wrote:
From: Peter Eisentraut [mailto:peter.eisentraut@2ndquadrant.com]
You posted a link to some performance numbers, but I didn't see the
test setup explained there. I'd like to get some more information on
the impact of this. Is there an effect with 100 tables, or do you need 100000?
Imai-san, can you tell us the test setup?
Maybe I used this test setup [1].
I tested again with those settings using prepared statements.
I used Tsunakawa-san's patch for locallock [2] (which couldn't be applied to current master so I fixed it) and Amit's v32 patch for speeding up planner [3].
[settings]
plan_cache_mode = 'auto' or 'force_custom_plan'
max_parallel_workers = 0
max_parallel_workers_per_gather = 0
max_locks_per_transaction = 4096
[partitioning table definitions(with 4096 partitions)]
create table rt (a int, b int, c int) partition by range (a);
\o /dev/null
select 'create table rt' || x::text || ' partition of rt for values from (' ||
(x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, 4096) x;
\gexec
\o
[select4096.sql]
\set a random(1, 4096)
select a from rt where a = :a;
[pgbench(with 4096 partitions)]
pgbench -n -f select4096.sql -T 60 -M prepared
[results (TPS)]
          master  locallock     v32  v32+locallock
          ------  ---------   -----  -------------
auto        21.9       22.9   6,834          7,355
custom      19.7       20.0   7,415          7,252
[1]: /messages/by-id/0F97FA9ABBDBE54F91744A9B37151A51256276@g01jpexmbkw24
[2]: /messages/by-id/0A3221C70F24FB45833433255569204D1FBDFA00@G01JPEXMBYT05
[3]: /messages/by-id/9feacaf6-ddb3-96dd-5b98-df5e927b1439@lab.ntt.co.jp
--
Yoshikazu Imai
From: Tsunakawa, Takayuki [mailto:tsunakawa.takay@jp.fujitsu.com]
Fixed.
Rebased on HEAD.
Regards
Takayuki Tsunakawa
Attachments:
0002-speed-up-LOCALLOCK-scan.patch (application/octet-stream)
From a585ba41faf640d34b319472343caadb38f6b1e8 Mon Sep 17 00:00:00 2001
From: Takayuki Tsunakawa <tsunakawa.takay@jp.fujitsu.com>
Date: Tue, 19 Mar 2019 16:46:51 +0900
Subject: [PATCH 2/2] speed up LOCALLOCK scan
---
src/backend/storage/lmgr/lock.c | 63 ++++++++++++++++++++++++-----------------
src/include/storage/lock.h | 2 ++
2 files changed, 39 insertions(+), 26 deletions(-)
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 78fdbd6..29a199b 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -255,6 +255,17 @@ static HTAB *LockMethodProcLockHash;
static HTAB *LockMethodLocalHash;
+/*
+ * List of LOCALLOCK structures that each backend acquired
+ *
+ * If a transaction acquires many locks, LockMethodLocalHash bloats, making
+ * the hash table scans in subsequent transactions (e.g., in LockReleaseAll)
+ * slow even though they only acquire a few locks. To speed up iteration over
+ * acquired locks in a backend, we use a list of LOCALLOCKs instead.
+ */
+static dlist_head LocalLocks = DLIST_STATIC_INIT(LocalLocks);
+
+
/* private state for error cleanup */
static LOCALLOCK *StrongLockInProgress;
static LOCALLOCK *awaitedLock;
@@ -794,6 +805,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
*/
if (!found)
{
+ dlist_push_head(&LocalLocks, &locallock->procLink);
locallock->lock = NULL;
locallock->proclock = NULL;
locallock->hashcode = LockTagHashCode(&(localtag.lock));
@@ -1320,6 +1332,7 @@ RemoveLocalLock(LOCALLOCK *locallock)
SpinLockRelease(&FastPathStrongRelationLocks->mutex);
}
+ dlist_delete(&locallock->procLink);
if (!hash_search(LockMethodLocalHash,
(void *) &(locallock->tag),
HASH_REMOVE, NULL))
@@ -2088,7 +2101,7 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
void
LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
{
- HASH_SEQ_STATUS status;
+ dlist_mutable_iter iter;
LockMethod lockMethodTable;
int i,
numLockModes;
@@ -2126,10 +2139,10 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
* pointers. Fast-path locks are cleaned up during the locallock table
* scan, though.
*/
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &LocalLocks)
{
+ locallock = dlist_container(LOCALLOCK, procLink, iter.cur);
+
/*
* If the LOCALLOCK entry is unused, we must've run out of shared
* memory while trying to set up this lock. Just forget the local
@@ -2362,16 +2375,16 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
void
LockReleaseSession(LOCKMETHODID lockmethodid)
{
- HASH_SEQ_STATUS status;
+ dlist_mutable_iter iter;
LOCALLOCK *locallock;
if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
elog(ERROR, "unrecognized lock method: %d", lockmethodid);
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &LocalLocks)
{
+ locallock = dlist_container(LOCALLOCK, procLink, iter.cur);
+
/* Ignore items that are not of the specified lock method */
if (LOCALLOCK_LOCKMETHOD(*locallock) != lockmethodid)
continue;
@@ -2394,13 +2407,14 @@ LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks)
{
if (locallocks == NULL)
{
- HASH_SEQ_STATUS status;
+ dlist_mutable_iter iter;
LOCALLOCK *locallock;
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &LocalLocks)
+ {
+ locallock = dlist_container(LOCALLOCK, procLink, iter.cur);
ReleaseLockIfHeld(locallock, false);
+ }
}
else
{
@@ -2493,13 +2507,14 @@ LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks)
if (locallocks == NULL)
{
- HASH_SEQ_STATUS status;
+ dlist_mutable_iter iter;
LOCALLOCK *locallock;
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &LocalLocks)
+ {
+ locallock = dlist_container(LOCALLOCK, procLink, iter.cur);
LockReassignOwner(locallock, parent);
+ }
}
else
{
@@ -3133,8 +3148,7 @@ LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
void
AtPrepare_Locks(void)
{
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
+ dlist_mutable_iter iter;
/*
* For the most part, we don't need to touch shared memory for this ---
@@ -3142,10 +3156,9 @@ AtPrepare_Locks(void)
* Fast-path locks are an exception, however: we move any such locks to
* the main table before allowing PREPARE TRANSACTION to succeed.
*/
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &LocalLocks)
{
+ LOCALLOCK *locallock = dlist_container(LOCALLOCK, procLink, iter.cur);
TwoPhaseLockRecord record;
LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
bool haveSessionLock;
@@ -3244,8 +3257,7 @@ void
PostPrepare_Locks(TransactionId xid)
{
PGPROC *newproc = TwoPhaseGetDummyProc(xid, false);
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
+ dlist_mutable_iter iter;
LOCK *lock;
PROCLOCK *proclock;
PROCLOCKTAG proclocktag;
@@ -3267,10 +3279,9 @@ PostPrepare_Locks(TransactionId xid)
* pointing to the same proclock, and we daren't end up with any dangling
* pointers.
*/
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &LocalLocks)
{
+ LOCALLOCK *locallock = dlist_container(LOCALLOCK, procLink, iter.cur);
LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
bool haveSessionLock;
bool haveXactLock;
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index badf7fd..6bb907d 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -18,6 +18,7 @@
#error "lock.h may not be included from frontend code"
#endif
+#include "lib/ilist.h"
#include "storage/lockdefs.h"
#include "storage/backendid.h"
#include "storage/lwlock.h"
@@ -407,6 +408,7 @@ typedef struct LOCALLOCK
uint32 hashcode; /* copy of LOCKTAG's hash value */
LOCK *lock; /* associated LOCK object, if any */
PROCLOCK *proclock; /* associated PROCLOCK object, if any */
+ dlist_node procLink; /* list link in a backend's list of LOCALLOCKs */
int64 nLocks; /* total number of times lock is held */
int numLockOwners; /* # of relevant ResourceOwners */
int maxLockOwners; /* allocated size of array */
--
2.10.1
0001-reorder-LOCALLOCK-structure-members-to-compact-the-s.patch (application/octet-stream)
From 62108574e61dc1852c868406af908bc874305227 Mon Sep 17 00:00:00 2001
From: Takayuki Tsunakawa <tsunakawa.takay@jp.fujitsu.com>
Date: Tue, 19 Mar 2019 16:43:01 +0900
Subject: [PATCH 1/2] reorder LOCALLOCK structure members to compact the size
---
src/include/storage/lock.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 16b927c..badf7fd 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -404,15 +404,15 @@ typedef struct LOCALLOCK
LOCALLOCKTAG tag; /* unique identifier of locallock entry */
/* data */
+ uint32 hashcode; /* copy of LOCKTAG's hash value */
LOCK *lock; /* associated LOCK object, if any */
PROCLOCK *proclock; /* associated PROCLOCK object, if any */
- uint32 hashcode; /* copy of LOCKTAG's hash value */
int64 nLocks; /* total number of times lock is held */
- bool holdsStrongLockCount; /* bumped FastPathStrongRelationLocks */
- bool lockCleared; /* we read all sinval msgs for lock */
int numLockOwners; /* # of relevant ResourceOwners */
int maxLockOwners; /* allocated size of array */
LOCALLOCKOWNER *lockOwners; /* dynamically resizable array */
+ bool holdsStrongLockCount; /* bumped FastPathStrongRelationLocks */
+ bool lockCleared; /* we read all sinval msgs for lock */
} LOCALLOCK;
#define LOCALLOCK_LOCKMETHOD(llock) ((llock).tag.lock.locktag_lockmethodid)
--
2.10.1
On 2019-03-19 10:21, Tsunakawa, Takayuki wrote:
From: Tsunakawa, Takayuki [mailto:tsunakawa.takay@jp.fujitsu.com]
Fixed.
Rebased on HEAD.
I have committed the first patch that reorganizes the struct. I'll have
to spend some time evaluating the performance impact of the second
patch, but it seems OK in principle. Performance tests from others welcome.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2019-03-19 16:38, Peter Eisentraut wrote:
On 2019-03-19 10:21, Tsunakawa, Takayuki wrote:
From: Tsunakawa, Takayuki [mailto:tsunakawa.takay@jp.fujitsu.com]
Fixed.
Rebased on HEAD.
I have committed the first patch that reorganizes the struct. I'll have
to spend some time evaluating the performance impact of the second
patch, but it seems OK in principle. Performance tests from others welcome.
I did a bit of performance testing, both a plain pgbench and the
suggested test case with 4096 partitions. I can't detect any
performance improvements. In fact, within the noise, it tends to be
just a bit on the slower side.
So I'd like to kick it back to the patch submitter now and ask for more
justification and performance analysis.
Perhaps "speeding up planning with partitions" needs to be accepted first?
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, 25 Mar 2019 at 23:44, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
I did a bit of performance testing, both a plain pgbench and the
suggested test case with 4096 partitions. I can't detect any
performance improvements. In fact, within the noise, it tends to be
just a bit on the slower side.
So I'd like to kick it back to the patch submitter now and ask for more
justification and performance analysis.
Perhaps "speeding up planning with partitions" needs to be accepted first?
Yeah, I think it likely will require that patch to be able to measure
the gains from this patch.
If planning a SELECT to a partitioned table with a large number of
partitions using PREPAREd statements, when we attempt the generic plan
on the 6th execution, it does cause the local lock table to expand to
fit all the locks for each partition. This does cause the
LockReleaseAll() to become slow due to the hash_seq_search having to
skip over many empty buckets. Since generating a custom plan for a
partitioned table with many partitions is still slow in master, then I
very much imagine you'll struggle to see the gains brought by this
patch.
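To make that concrete, here is a tiny standalone toy model (plain C, invented names, not the actual dynahash code) of why the cost of the release scan follows the bucket count rather than the number of locks a later transaction actually holds:

#include <stdio.h>
#include <stdlib.h>

/* toy stand-in for a hash table entry chained within a bucket */
typedef struct Entry
{
    struct Entry *next;
    int         lockid;
} Entry;

/* walk every bucket, "releasing" whatever live entries are found */
static long
release_all(Entry **buckets, long nbuckets)
{
    long        released = 0;

    for (long b = 0; b < nbuckets; b++)   /* cost is O(nbuckets), not O(live entries) */
    {
        for (Entry *e = buckets[b]; e != NULL; e = e->next)
            released++;
    }
    return released;
}

int
main(void)
{
    long        nbuckets = 8192;          /* table stayed big after touching many partitions */
    Entry     **buckets = calloc(nbuckets, sizeof(Entry *));
    Entry       one = {NULL, 42};

    buckets[123] = &one;                  /* a later transaction holds only one lock */
    printf("released %ld lock(s) after scanning %ld buckets\n",
           release_all(buckets, nbuckets), nbuckets);
    free(buckets);
    return 0;
}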
I did a quick benchmark too and couldn't measure anything:
create table hp (a int) partition by hash (a);
select 'create table hp'||x|| ' partition of hp for values with
(modulus 4096, remainder ' || x || ');' from generate_series(0,4095)
x;
bench.sql
\set p_a 13315
select * from hp where a = :p_a;
Master:
$ pgbench -M prepared -n -T 60 -f bench.sql postgres
tps = 31.844468 (excluding connections establishing)
tps = 32.950154 (excluding connections establishing)
tps = 31.534761 (excluding connections establishing)
Patched:
$ pgbench -M prepared -n -T 60 -f bench.sql postgres
tps = 30.099133 (excluding connections establishing)
tps = 32.157328 (excluding connections establishing)
tps = 32.329884 (excluding connections establishing)
The situation won't be any better with plan_cache_mode =
force_generic_plan either. In this case, we'll only plan once but
we'll also have to obtain and release a lock for each partition for
each execution of the prepared statement. LockReleaseAll() is going to
be slow in that case because it actually has to release a lot of
locks.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From: David Rowley [mailto:david.rowley@2ndquadrant.com]
On Mon, 25 Mar 2019 at 23:44, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
Perhaps "speeding up planning with partitions" needs to be accepted first?
Yeah, I think it likely will require that patch to be able to measure
the gains from this patch.
If planning a SELECT to a partitioned table with a large number of
partitions using PREPAREd statements, when we attempt the generic plan
on the 6th execution, it does cause the local lock table to expand to
fit all the locks for each partition. This does cause the
LockReleaseAll() to become slow due to the hash_seq_search having to
skip over many empty buckets. Since generating a custom plan for a
partitioned table with many partitions is still slow in master, then I
very much imagine you'll struggle to see the gains brought by this
patch.
Thank you David for explaining. Although I may not understand the effect of "speeding up planning with partitions" patch, this patch takes effect even without it. That is, perform the following in the same session:
1. SELECT count(*) FROM table; on a table with many partitions. That bloats the LocalLockHash.
2. PREPARE a point query, e.g., SELECT * FROM table WHERE pkey = $1;
3. EXECUTE the PREPAREd query repeatedly, with each EXECUTE in a separate transaction. Without the patch, each transaction's LockReleaseAll() has to scan the bloated large hash table.
Regards
Takayuki Tsunakawa
Tsunakawa-san,
On 2019/03/26 17:21, Tsunakawa, Takayuki wrote:
From: David Rowley [mailto:david.rowley@2ndquadrant.com]
On Mon, 25 Mar 2019 at 23:44, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
Perhaps "speeding up planning with partitions" needs to be accepted first?
Yeah, I think it likely will require that patch to be able to measure
the gains from this patch.
If planning a SELECT to a partitioned table with a large number of
partitions using PREPAREd statements, when we attempt the generic plan
on the 6th execution, it does cause the local lock table to expand to
fit all the locks for each partition. This does cause the
LockReleaseAll() to become slow due to the hash_seq_search having to
skip over many empty buckets. Since generating a custom plan for a
partitioned table with many partitions is still slow in master, then I
very much imagine you'll struggle to see the gains brought by this
patch.
Thank you David for explaining. Although I may not understand the effect of "speeding up planning with partitions" patch, this patch takes effect even without it. That is, perform the following in the same session:
1. SELECT count(*) FROM table; on a table with many partitions. That bloats the LocalLockHash.
2. PREPARE a point query, e.g., SELECT * FROM table WHERE pkey = $1;
3. EXECUTE the PREPAREd query repeatedly, with each EXECUTE in a separate transaction. Without the patch, each transaction's LockReleaseAll() has to scan the bloated large hash table.
My understanding of what David wrote is that the slowness of bloated hash
table is hard to notice, because planning itself is pretty slow. With the
"speeding up planning with partitions" patch, planning becomes quite fast,
so the bloated hash table overhead, and hence your patch's benefit, is easier
to notice. This patch is clearly helpful, but it's just hard to notice it
when the other big bottleneck is standing in the way.
Thanks,
Amit
On Tue, 26 Mar 2019 at 21:23, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
Thank you David for explaining. Although I may not understand the effect of "speeding up planning with partitions" patch, this patch takes effect even without it. That is, perform the following in the same session:
1. SELECT count(*) FROM table; on a table with many partitions. That bloats the LocalLockHash.
2. PREPARE a point query, e.g., SELECT * FROM table WHERE pkey = $1;
3. EXECUTE the PREPAREd query repeatedly, with each EXECUTE in a separate transaction. Without the patch, each transaction's LockReleaseAll() has to scan the bloated large hash table.
Oh. I think I see what you're saying. Really the table in #2 would
have to be some completely different table that's not partitioned. I
think in that case it should make a difference.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, 26 Mar 2019 at 21:55, David Rowley <david.rowley@2ndquadrant.com> wrote:
On Tue, 26 Mar 2019 at 21:23, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
Thank you David for explaining. Although I may not understand the effect of "speeding up planning with partitions" patch, this patch takes effect even without it. That is, perform the following in the same session:
1. SELECT count(*) FROM table; on a table with many partitions. That bloats the LocalLockHash.
2. PREPARE a point query, e.g., SELECT * FROM table WHERE pkey = $1;
3. EXECUTE the PREPAREd query repeatedly, with each EXECUTE in a separate transaction. Without the patch, each transaction's LockReleaseAll() has to scan the bloated large hash table.
Oh. I think I see what you're saying. Really the table in #2 would
have to be some completely different table that's not partitioned. I
think in that case it should make a difference.
Here's a benchmark doing that, using pgbench's script weight feature.
I've set this up so the query that hits the partitioned table runs
once for every 10k times the other script runs. I picked that number
so the lock table was expanded fairly early on in the benchmark.
setup:
create table t1 (a int primary key);
create table hp (a int) partition by hash (a);
select 'create table hp'||x|| ' partition of hp for values with
(modulus 4096, remainder ' || x || ');' from generate_series(0,4095)
x;
\gexec
hp.sql
select count(*) from hp;
t1.sql
\set p 1
select a from t1 where a = :p;
Master = c8c885b7a5
Master:
$ pgbench -T 60 -M prepared -n -f hp.sql@1 -f t1.sql@10000 postgres
SQL script 2: t1.sql
- 1057306 transactions (100.0% of total, tps = 17621.756473)
- 1081905 transactions (100.0% of total, tps = 18021.449914)
- 1122420 transactions (100.0% of total, tps = 18690.804699)
Master + 0002-speed-up-LOCALLOCK-scan.patch
$ pgbench -T 60 -M prepared -n -f hp.sql@1 -f t1.sql@10000 postgres
SQL script 2: t1.sql
- 1277014 transactions (100.0% of total, tps = 21283.551615)
- 1184052 transactions (100.0% of total, tps = 19734.185872)
- 1188523 transactions (100.0% of total, tps = 19785.835662)
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From: Amit Langote [mailto:Langote_Amit_f8@lab.ntt.co.jp]
My understanding of what David wrote is that the slowness of bloated hash
table is hard to notice, because planning itself is pretty slow. With the
"speeding up planning with partitions" patch, planning becomes quite fast,
so the bloated hash table overhead, and hence your patch's benefit, is easier
to notice. This patch is clearly helpful, but it's just hard to notice it
when the other big bottleneck is standing in the way.
Ah, I see. I failed to recognize the simple fact that without your patch, EXECUTE on a table with many partitions is slow because the custom planning time is proportional to the number of partitions. Thanks for waking up my sleeping head!
Regards
Takayuki Tsunakawa
From: David Rowley [mailto:david.rowley@2ndquadrant.com]
Here a benchmark doing that using pgbench's script weight feature.
Wow, I didn't know that pgbench has evolved to have such a convenient feature. Thanks for telling me how to utilize it in testing. PostgreSQL is cool!
Regards
Takayuki Tsunakawa
Hi Peter,
From: Peter Eisentraut [mailto:peter.eisentraut@2ndquadrant.com]
I did a bit of performance testing, both a plain pgbench and the
suggested test case with 4096 partitions. I can't detect any
performance improvements. In fact, within the noise, it tends to be
just a bit on the slower side.
So I'd like to kick it back to the patch submitter now and ask for more
justification and performance analysis.
Perhaps "speeding up planning with partitions" needs to be accepted first?
David kindly showed how to demonstrate the performance improvement on March 26, so I changed the status to needs review. I'd appreciate it if you could continue the final check.
Regards
Takayuki Tsunakawa
Hi,
On 2019/04/04 13:37, Tsunakawa, Takayuki wrote:
Hi Peter,
From: Peter Eisentraut [mailto:peter.eisentraut@2ndquadrant.com]
I did a bit of performance testing, both a plain pgbench and the
suggested test case with 4096 partitions. I can't detect any
performance improvements. In fact, within the noise, it tends to be
just a bit on the slower side.
So I'd like to kick it back to the patch submitter now and ask for more
justification and performance analysis.
Perhaps "speeding up planning with partitions" needs to be accepted first?
David kindly showed how to demonstrate the performance improvement on March 26, so I changed the status to needs review. I'd appreciate it if you could continue the final check.
Also, since the "speed up partition planning" patch went in (428b260f8),
it might be possible to see the performance boost even with the
partitioning example you cited upthread.
Thanks,
Amit
On 2019-04-04 06:58, Amit Langote wrote:
Also, since the "speed up partition planning" patch went in (428b260f8),
it might be possible to see the performance boost even with the
partitioning example you cited upthread.
I can't detect any performance improvement with the patch applied to
current master, using the test case from Yoshikazu Imai (2019-03-19).
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Peter, Imai-san,
From: Peter Eisentraut [mailto:peter.eisentraut@2ndquadrant.com]
I can't detect any performance improvement with the patch applied to
current master, using the test case from Yoshikazu Imai (2019-03-19).
That's strange... Peter, Imai-san, can you compare your test procedures?
Peter, can you check and see the performance improvement with David's method on March 26 instead?
Regards
Takayuki Tsunakawa
On 2019/04/05 5:42, Peter Eisentraut wrote:
On 2019-04-04 06:58, Amit Langote wrote:
Also, since the "speed up partition planning" patch went in (428b260f8),
it might be possible to see the performance boost even with the
partitioning example you cited upthread.
I can't detect any performance improvement with the patch applied to
current master, using the test case from Yoshikazu Imai (2019-03-19).
I was able to detect it as follows.
* partitioned table setup:
$ cat ht.sql
drop table ht cascade;
create table ht (a int primary key, b int, c int) partition by hash (a);
select 'create table ht' || x::text || ' partition of ht for values with
(modulus 8192, remainder ' || (x)::text || ');' from generate_series(0,
8191) x;
\gexec
* pgbench script:
$ cat select.sql
\set param random(1, 8192)
select * from ht where a = :param
* pgbench (5 minute run with -M prepared)
pgbench -n -M prepared -T 300 -f select.sql
* tps:
plan_cache_mode = auto
HEAD: 1915 tps
Patched: 2394 tps
plan_cache_mode = custom (non-problematic: generic plan is never created)
HEAD: 2402 tps
Patched: 2393 tps
Thanks,
Amit
On Fri, Apr 5, 2019 at 1:31 AM, Amit Langote wrote:
On 2019/04/05 5:42, Peter Eisentraut wrote:
On 2019-04-04 06:58, Amit Langote wrote:
Also, since the "speed up partition planning" patch went in
(428b260f8), it might be possible to see the performance boost even
with the partitioning example you cited upthread.
I can't detect any performance improvement with the patch applied to
current master, using the test case from Yoshikazu Imai (2019-03-19).
I was able to detect it as follows.
* partitioned table setup:
$ cat ht.sql
drop table ht cascade;
create table ht (a int primary key, b int, c int) partition by hash (a);
select 'create table ht' || x::text || ' partition of ht for values with
(modulus 8192, remainder ' || (x)::text || ');' from generate_series(0,
8191) x;
\gexec
* pgbench script:
$ cat select.sql
\set param random(1, 8192)
select * from ht where a = :param
* pgbench (5 minute run with -M prepared)
pgbench -n -M prepared -T 300 -f select.sql
* tps:
plan_cache_mode = auto
HEAD: 1915 tps
Patched: 2394 tps
plan_cache_mode = custom (non-problematic: generic plan is never created)
HEAD: 2402 tps
Patched: 2393 tps
Amit-san, thanks for testing this.
I also re-ran my tests (3/19) with HEAD (413ccaa) and HEAD (413ccaa) + patched, and I can still detect the performance difference with plan_cache_mode = auto.
Thanks
--
Yoshikazu Imai
Hi Amit-san, Imai-san,
From: Amit Langote [mailto:Langote_Amit_f8@lab.ntt.co.jp]
I was able to detect it as follows.
plan_cache_mode = auto
HEAD: 1915 tps
Patched: 2394 tps
plan_cache_mode = custom (non-problematic: generic plan is never created)
HEAD: 2402 tps
Patched: 2393 tps
Thanks a lot for very quick confirmation. I'm relieved to still see the good results.
Regards
Takayuki Tsunakawa
On Fri, Apr 5, 2019 at 0:05 AM, Tsunakawa, Takayuki wrote:
From: Peter Eisentraut [mailto:peter.eisentraut@2ndquadrant.com]
I can't detect any performance improvement with the patch applied to
current master, using the test case from Yoshikazu Imai (2019-03-19).
That's strange... Peter, Imai-san, can you compare your test procedures?
Just to make sure, I've described my test procedure in detail.
I install and setup HEAD and patched as follows.
[HEAD(413ccaa)]
(git pull)
./configure --prefix=/usr/local/pgsql-dev --enable-depend
make clean
make
make install
su postgres
export PATH=/usr/local/pgsql-dev/bin:$PATH
initdb -D /var/lib/pgsql/data-dev
vi /var/lib/pgsql/data-dev/postgresql.conf
====
port = 44201
plan_cache_mode = 'auto' or 'force_custom_plan'
max_parallel_workers = 0
max_parallel_workers_per_gather = 0
max_locks_per_transaction = 4096
====
pg_ctl -D /var/lib/pgsql/data-dev start
[HEAD(413ccaa) + patch]
(git pull)
patch -u -p1 < 0002.patch
./configure --prefix=/usr/local/pgsql-locallock --enable-depend
make clean
make
make install
su postgres
export PATH=/usr/local/pgsql-locallock/bin:$PATH
initdb -D /var/lib/pgsql/data-locallock
vi /var/lib/pgsql/data-locallock/postgresql.conf
====
port = 44301
plan_cache_mode = 'auto' or 'force_custom_plan'
max_parallel_workers = 0
max_parallel_workers_per_gather = 0
max_locks_per_transaction = 4096
====
pg_ctl -D /var/lib/pgsql/data-locallock start
And I tested as follows.
(creating partitioned table for port 44201)
(creating partitioned table for port 44301)
(creating select4096.sql)
for i in `seq 1 5`; do
pgbench -n -f select4096.sql -T 60 -M prepared -p 44201 | grep including;
pgbench -n -f select4096.sql -T 60 -M prepared -p 44301 | grep including;
done
tps = 8146.039546 (including connections establishing)
tps = 9021.340872 (including connections establishing)
tps = 8011.186017 (including connections establishing)
tps = 8926.191054 (including connections establishing)
tps = 8006.769690 (including connections establishing)
tps = 9028.716806 (including connections establishing)
tps = 8057.709961 (including connections establishing)
tps = 9017.713714 (including connections establishing)
tps = 7956.332863 (including connections establishing)
tps = 9126.650533 (including connections establishing)
Thanks
--
Yoshikazu Imai
On 2019-03-19 10:21, Tsunakawa, Takayuki wrote:
From: Tsunakawa, Takayuki [mailto:tsunakawa.takay@jp.fujitsu.com]
Fixed.
Rebased on HEAD.
Do you need the dlist_foreach_modify() calls? You are not actually
modifying the list structure.
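(For readers following the thread: a minimal sketch of the distinction in question, using the lib/ilist.h API with made-up node and predicate names. dlist_foreach_modify() caches the next pointer before each iteration, so the loop body may delete the current node; plain dlist_foreach() offers no such guarantee.)

#include "postgres.h"
#include "lib/ilist.h"

typedef struct MyNode
{
    dlist_node  link;
    int         value;
} MyNode;

static void
prune_list(dlist_head *head)
{
    dlist_mutable_iter iter;

    /* the _modify variant caches iter.next, so deleting iter.cur is safe */
    dlist_foreach_modify(iter, head)
    {
        MyNode     *node = dlist_container(MyNode, link, iter.cur);

        if (node->value < 0)            /* made-up removal condition */
            dlist_delete(&node->link);
    }
}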
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
I can't detect any performance improvement with the patch applied to
current master, using the test case from Yoshikazu Imai (2019-03-19).
FWIW, I tried this patch against current HEAD (959d00e9d).
Using the test case described by Amit at
<be25cadf-982e-3f01-88b4-443a6667e16a@lab.ntt.co.jp>
I do measure an undeniable speedup, close to 35%.
However ... I submit that that's a mighty extreme test case.
(I had to increase max_locks_per_transaction to get it to run
at all.) We should not be using that sort of edge case to drive
performance optimization choices.
If I reduce the number of partitions in Amit's example from 8192
to something more real-world, like 128, I do still measure a
performance gain, but it's ~ 1.5% which is below what I'd consider
a reproducible win. I'm accustomed to seeing changes up to 2%
in narrow benchmarks like this one, even when "nothing changes"
except unrelated code.
Trying a standard pgbench test case (pgbench -M prepared -S with
one client and an -s 10 database), it seems that the patch is about
0.5% slower than HEAD. Again, that's below the noise threshold,
but it's not promising for the net effects of this patch on workloads
that aren't specifically about large and prunable partition sets.
I'm also fairly concerned about the effects of the patch on
sizeof(LOCALLOCK) --- on a 64-bit machine it goes from 72 to 88
bytes, a 22% increase. That's a lot if you're considering cases
with many locks.
On the whole I don't think there's an adequate case for committing
this patch.
I'd also point out that this is hardly the only place where we've
seen hash_seq_search on nearly-empty hash tables become a bottleneck.
So I'm not thrilled about attacking that with one-table-at-time patches.
I'd rather see us do something to let hash_seq_search win across
the board.
I spent some time wondering whether we could adjust the data structure
so that all the live entries in a hashtable are linked into one chain,
but I don't quite see how to do it without adding another list link to
struct HASHELEMENT, which seems pretty expensive.
I'll sketch the idea I had, just in case it triggers a better idea
in someone else. Assuming we are willing to add another pointer to
HASHELEMENT, use the two pointers to create a doubly-linked circular
list that includes all live entries in the hashtable, with a list
header in the hashtable's control area. (Maybe we'd use a dlist for
this, but it's not essential.) Require this list to be organized so
that all entries that belong to the same hash bucket are consecutive in
the list, and make each non-null hash bucket header point to the first
entry in the list for its bucket. To allow normal searches to detect
when they've run through their bucket, add a flag to HASHELEMENT that
is set only in entries that are the first, or perhaps last, of their
bucket (so you'd detect end-of-bucket by checking the flag instead of
testing for a null pointer). Adding a bool field is free due to
alignment considerations, at least on 64-bit machines. Given this,
I think normal hash operations are more-or-less the same cost as
before, while hash_seq_search just has to follow the circular list.
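In rough C terms, the shape of that idea might be (field names invented; a sketch only, not the existing HASHELEMENT layout):

#include <stdbool.h>

typedef struct SketchElement
{
    struct SketchElement *next_live;   /* next live entry anywhere in the table */
    struct SketchElement *prev_live;   /* the extra pointer, closing the doubly-linked
                                        * circular list of all live entries */
    unsigned int hashvalue;            /* as today */
    bool        first_in_bucket;       /* flag marking the start of a bucket's run;
                                        * lookups detect end-of-bucket by seeing this
                                        * flag on the next entry instead of hitting a
                                        * NULL pointer */
} SketchElement;

With that layout, hash_seq_search() would just follow next_live from a list header in the control area, visiting only live entries, while a bucket lookup starts at the bucket's pointer and stops when the following entry begins the next bucket's run.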
I tried to figure out how to do the same thing with a singly-linked
instead of doubly-linked list, but it doesn't quite work: if you need
to remove the first element of a bucket, you have no cheap way to find
its predecessor in the overall list (which belongs to some other
bucket, but you don't know which one). Maybe we could just mark such
entries dead (there's plenty of room for another flag bit) and plan
to clean them up later? But it's not clear how to ensure that they'd
get cleaned up in any sort of timely fashion.
Another issue is that probably none of this works for the partitioned
hash tables we use for some of the larger shared-memory hashes. But
I'm not sure we care about hash_seq_search for those, so maybe we just
say those are a different data structure.
regards, tom lane
Hi,
On 2019-04-05 23:03:11 -0400, Tom Lane wrote:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
I can't detect any performance improvement with the patch applied to
current master, using the test case from Yoshikazu Imai (2019-03-19).
FWIW, I tried this patch against current HEAD (959d00e9d).
Using the test case described by Amit at
<be25cadf-982e-3f01-88b4-443a6667e16a@lab.ntt.co.jp>
I do measure an undeniable speedup, close to 35%.
However ... I submit that that's a mighty extreme test case.
(I had to increase max_locks_per_transaction to get it to run
at all.) We should not be using that sort of edge case to drive
performance optimization choices.
If I reduce the number of partitions in Amit's example from 8192
to something more real-world, like 128, I do still measure a
performance gain, but it's ~ 1.5% which is below what I'd consider
a reproducible win. I'm accustomed to seeing changes up to 2%
in narrow benchmarks like this one, even when "nothing changes"
except unrelated code.
I'm not sure it's actually that narrow these days. With all the
partitioning improvements happening, the numbers of locks commonly held
are going to rise. And while 8192 partitions is maybe on the more
extreme side, it's a workload with only a single table, and plenty of
workloads touch more than a single partitioned table.
Trying a standard pgbench test case (pgbench -M prepared -S with
one client and an -s 10 database), it seems that the patch is about
0.5% slower than HEAD. Again, that's below the noise threshold,
but it's not promising for the net effects of this patch on workloads
that aren't specifically about large and prunable partition sets.
Yea, that's concerning.
I'm also fairly concerned about the effects of the patch on
sizeof(LOCALLOCK) --- on a 64-bit machine it goes from 72 to 88
bytes, a 22% increase. That's a lot if you're considering cases
with many locks.
I'm not sure I'm quite that concerned. For one, a good bit of that space
was up for grabs until the recent reordering of LOCALLOCK and nobody
complained. But more importantly, I think commonly the amount of locks
around is fairly constrained, isn't it? We can't really have that many
concurrently held locks, due to the shared memory space, and the size of
a LOCALLOCK isn't that big compared to say relcache entries. We also
probably fairly easily could win some space back - e.g. make nLocks 32
bits.
I wonder if one approach to solve this wouldn't be to just make the
hashtable drastically smaller. Right now we'll often have lots of
empty entries that are 72 bytes + dynahash overhead. That's plenty of
memory that needs to be skipped over. I wonder if we instead should
have an array of held locallocks, and a hashtable with {hashcode,
offset_in_array} + custom comparator for lookups. That'd mean we could
either scan the array of locallocks at release (which'd need to skip
over entries that have already been released), or we could scan the much
smaller hashtable sequentially.
I don't think the above idea is quite there, and I'm tired, but I
thought it might still be worth bringing up.
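Roughly, the layout being floated might look like this (all names invented; just a sketch of the idea, not proposed code):

#include <stdbool.h>
#include <stdint.h>

/* the bulky per-lock state lives in a plain, densely packed array */
typedef struct LocalLockSlot
{
    bool        in_use;        /* cleared on release; a release scan skips it */
    int64_t     nLocks;        /* ... plus the rest of what LOCALLOCK holds today */
} LocalLockSlot;

/* the hash table stores only a small reference into that array */
typedef struct LocalLockRef
{
    uint32_t    hashcode;      /* hash of the lock tag */
    uint32_t    slot_index;    /* where the full entry lives in the slot array */
} LocalLockRef;

Lookups would hash the lock tag, probe the (much smaller) LocalLockRef table, and use a custom comparator that follows slot_index into the array to compare the full tag; LockReleaseAll() could then walk either the compact slot array or the small ref table instead of a bloated LOCALLOCK hash table.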
I spent some time wondering whether we could adjust the data structure
so that all the live entries in a hashtable are linked into one chain,
but I don't quite see how to do it without adding another list link to
struct HASHELEMENT, which seems pretty expensive.
Yea :(
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
I wonder if one approach to solve this wouldn't be to just make the
hashtable drastically smaller. Right now we'll often have lots of
empty entries that are 72 bytes + dynahash overhead. That's plenty of
memory that needs to be skipped over. I wonder if we instead should
have an array of held locallocks, and a hashtable with {hashcode,
offset_in_array} + custom comparator for lookups.
Well, that's not going to work all that well for retail lock releases;
you'll end up with holes in the array, maybe a lot of them.
However, it led me to think of another way we might approach the general
hashtable problem: right now, we are not making any use of the fact that
the hashtable's entries are laid out in big slabs (arrays). What if we
tried to ensure that the live entries are allocated fairly compactly in
those arrays, and then implemented hash_seq_search as a scan over the
arrays, ignoring the hash bucket structure per se?
We'd need a way to reliably tell a live entry from a free entry, but
again, there's plenty of space for a flag bit or two.
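As a sketch (toy types, not the real dynahash structures), the sequential scan would become something like:

#include <stdbool.h>
#include <stddef.h>

/* toy element as it might sit in one of the allocated element arrays */
typedef struct SlabEntry
{
    bool        live;          /* the flag bit distinguishing live from freed */
    int         payload;       /* stand-in for the real entry contents */
} SlabEntry;

/* visit every live entry by walking the storage array, ignoring buckets */
static void
seq_scan_slab(SlabEntry *slab, size_t nallocated, void (*visit) (SlabEntry *))
{
    for (size_t i = 0; i < nallocated; i++)
    {
        if (slab[i].live)
            visit(&slab[i]);   /* cost tracks allocated entries, not bucket count */
    }
}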
This might perform poorly if you allocated a bunch of entries,
freed most-but-not-all, and then wanted to seqscan the remainder;
you'd end up with the same problem I complained of above that
you're iterating over an array that's mostly uninteresting.
In principle we could keep count of the live vs free entries and
dynamically decide to scan via the hash bucket structure instead of
searching the storage array when the array is too sparse; but that
might be overly complicated.
I haven't tried to work this out in detail, it's just a late
night brainstorm. But, again, I'd much rather solve this in
dynahash.c than by layering some kind of hack on top of it.
regards, tom lane
On 2019-04-06 05:03, Tom Lane wrote:
Trying a standard pgbench test case (pgbench -M prepared -S with
one client and an -s 10 database), it seems that the patch is about
0.5% slower than HEAD. Again, that's below the noise threshold,
but it's not promising for the net effects of this patch on workloads
that aren't specifically about large and prunable partition sets.
In my testing, I've also noticed that it seems to be slightly on the
slower side for these simple tests.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, 6 Apr 2019 at 16:03, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'd also point out that this is hardly the only place where we've
seen hash_seq_search on nearly-empty hash tables become a bottleneck.
So I'm not thrilled about attacking that with one-table-at-time patches.
I'd rather see us do something to let hash_seq_search win across
the board.
Rewinding back to mid-Feb:
You wrote:
My own thought about how to improve this situation was just to destroy
and recreate LockMethodLocalHash at transaction end (or start)
if its size exceeded $some-value. Leaving it permanently bloated seems
like possibly a bad idea, even if we get rid of all the hash_seq_searches
on it.
Which I thought was an okay idea. I think the one advantage that
would have over making hash_seq_search() faster for large and mostly
empty tables is that over-sized hash tables are just not very cache
efficient, and if we don't need it to be that large then we should
probably consider making it smaller again.
I've had a go at implementing this and using Amit's benchmark the
performance looks pretty good. I can't detect any slowdown for the
general case.
master:
plan_cache_mode = auto:
$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 9373.698212 (excluding connections establishing)
tps = 9356.993148 (excluding connections establishing)
tps = 9367.579806 (excluding connections establishing)
plan_cache_mode = force_custom_plan:
$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 12863.758185 (excluding connections establishing)
tps = 12787.766054 (excluding connections establishing)
tps = 12817.878940 (excluding connections establishing)
shrink_bloated_locallocktable.patch:
plan_cache_mode = auto:
$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 12756.021211 (excluding connections establishing)
tps = 12800.939518 (excluding connections establishing)
tps = 12804.501977 (excluding connections establishing)
plan_cache_mode = force_custom_plan:
$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 12763.448836 (excluding connections establishing)
tps = 12901.673271 (excluding connections establishing)
tps = 12856.512745 (excluding connections establishing)
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
shrink_bloated_locallocktable.patch (application/octet-stream)
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index c895876..91e3924 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -254,6 +254,20 @@ static HTAB *LockMethodLockHash;
static HTAB *LockMethodProcLockHash;
static HTAB *LockMethodLocalHash;
+/* Initial size of local lock hash */
+#define LOCKMETHODLOCALHASH_INIT_SIZE 16
+
+/*
+ * Attempt to shrink the LockMethodLocalHash after this many calls to
+ * LockReleaseAll()
+ */
+#define LOCKMETHODLOCALHASH_SIZE_CHECK_FREQ 10
+
+/*
+ * Counters to track bloat in the LockMethodLocalHash table
+ */
+static unsigned int lock_release_count = 0;
+static uint64 locks_released = 0;
/* private state for error cleanup */
static LOCALLOCK *StrongLockInProgress;
@@ -339,6 +353,7 @@ PROCLOCK_PRINT(const char *where, const PROCLOCK *proclockP)
#endif /* not LOCK_DEBUG */
+static void CreateLocalLockHash(long size, bool copyOldLocks);
static uint32 proclock_hash(const void *key, Size keysize);
static void RemoveLocalLock(LOCALLOCK *locallock);
static PROCLOCK *SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
@@ -441,17 +456,66 @@ InitLocks(void)
* ought to be empty in the postmaster, but for safety let's zap it.)
*/
if (LockMethodLocalHash)
+ {
hash_destroy(LockMethodLocalHash);
+ LockMethodLocalHash = NULL;
+ }
+
+ CreateLocalLockHash(LOCKMETHODLOCALHASH_INIT_SIZE, false);
+}
+
+/*
+ * CreateLocalLockHash
+ * Build or rebuild the LockMethodLocalHash hash table. If copyOldLocks
+ * is true we populate the new table with the locks from the old version
+ * and then destroy it.
+ */
+static void
+CreateLocalLockHash(long size, bool copyOldLocks)
+{
+ static HTAB *htab;
+ HASHCTL info;
info.keysize = sizeof(LOCALLOCKTAG);
info.entrysize = sizeof(LOCALLOCK);
- LockMethodLocalHash = hash_create("LOCALLOCK hash",
- 16,
- &info,
- HASH_ELEM | HASH_BLOBS);
-}
+ htab = hash_create("LOCALLOCK hash", size, &info, HASH_ELEM | HASH_BLOBS);
+
+ if (copyOldLocks)
+ {
+ HASH_SEQ_STATUS status;
+ LOCALLOCK *locallock;
+
+ hash_seq_init(&status, LockMethodLocalHash);
+
+ /* scan over the old table and add all the locks into the new table */
+ while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ {
+ LOCALLOCK *new_lock;
+ bool found;
+
+ new_lock = hash_search(htab,
+ (void *) &locallock->tag,
+ HASH_ENTER, &found);
+ Assert(!found);
+ memcpy(new_lock, locallock, sizeof(LOCALLOCK));
+ }
+
+ hash_destroy(LockMethodLocalHash);
+ }
+ else
+ {
+ /*
+ * Ensure that if not copying old locks that the table contains no
+ * locks.
+ */
+ Assert(LockMethodLocalHash == NULL ||
+ hash_get_num_entries(LockMethodLocalHash) == 0);
+ }
+
+ LockMethodLocalHash = htab;
+}
/*
* Fetch the lock method table associated with a given lock
@@ -2097,6 +2161,7 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
PROCLOCK *proclock;
int partition;
bool have_fast_path_lwlock = false;
+ long total_locks;
if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
elog(ERROR, "unrecognized lock method: %d", lockmethodid);
@@ -2118,6 +2183,8 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
numLockModes = lockMethodTable->numLockModes;
+ total_locks = hash_get_num_entries(LockMethodLocalHash);
+
/*
* First we run through the locallock table and get rid of unwanted
* entries, then we scan the process's proclocks and get rid of those. We
@@ -2349,6 +2416,48 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
LWLockRelease(partitionLock);
} /* loop over partitions */
+ /* track average locks */
+ locks_released += total_locks;
+ lock_release_count++;
+
+ /* determine if it's worth shrinking the LockMethodLocalHash table */
+ if (lock_release_count >= LOCKMETHODLOCALHASH_SIZE_CHECK_FREQ)
+ {
+ long avglocks = (long) locks_released / lock_release_count;
+
+ /*
+ * The hash_seq_search can become inefficient when the hash table has
+ * grown significantly larger than the default size due to the backend
+ * having run queries which obtained a large number of locks at once.
+ * Here we'll check for that and shrink the table if we deem it a
+ * worthwhile thing to do.
+ *
+ * We need only bother checking this if the hash_seq_search is
+ * possibly becoming inefficient. We check this by looking if the
+ * curBucket is larger than the initial size of the table. We then
+ * only bother shrinking the table if the average locks for the
+ * previous few transactions is lower than half this value.
+ */
+ if (status.curBucket > LOCKMETHODLOCALHASH_INIT_SIZE &&
+ avglocks < status.curBucket / 2)
+ {
+ long newsize = LOCKMETHODLOCALHASH_INIT_SIZE;
+
+ while (newsize < avglocks)
+ newsize *= 2;
+
+ /*
+ * If we're releasing all locks then the table will be empty, so
+ * no need to copy out the old locks into the new table.
+ */
+ CreateLocalLockHash(newsize, !allLocks);
+ }
+
+ /* Reset the counters */
+ locks_released = 0;
+ lock_release_count = 0;
+ }
+
#ifdef LOCK_DEBUG
if (*(lockMethodTable->trace_flag))
elog(LOG, "LockReleaseAll done");
David Rowley <david.rowley@2ndquadrant.com> writes:
On Sat, 6 Apr 2019 at 16:03, Tom Lane <tgl@sss.pgh.pa.us> wrote:
My own thought about how to improve this situation was just to destroy
and recreate LockMethodLocalHash at transaction end (or start)
if its size exceeded $some-value. Leaving it permanently bloated seems
like possibly a bad idea, even if we get rid of all the hash_seq_searches
on it.
Which I thought was an okay idea. I think the one advantage that
would have over making hash_seq_search() faster for large and mostly
empty tables is that over-sized hash tables are just not very cache
efficient, and if we don't need it to be that large then we should
probably consider making it smaller again.
I've had a go at implementing this and using Amit's benchmark the
performance looks pretty good. I can't detect any slowdown for the
general case.
I like the concept ... but the particular implementation, not so much.
It seems way overcomplicated. In the first place, why should we
add code to copy entries? Just don't do it except when the table
is empty. In the second, I think we could probably have a far
cheaper test for how big the table is --- maybe we'd need to
expose some function in dynahash.c, but the right way here is just
to see how many buckets there are. I don't like adding statistics
counting for this, because it's got basically nothing to do with
what the actual problem is. (If you acquire and release one lock,
and do that over and over, you don't have a bloat problem no
matter how many times you do it.)
LockMethodLocalHash is special in that it predictably goes to empty
at the end of every transaction, so that de-bloating at that point
is a workable strategy. I think we'd probably need something more
robust if we were trying to fix this generally for all hash tables.
But if we're going to go with the one-off hack approach, we should
certainly try to keep that hack as simple as possible.
regards, tom lane
On Mon, 8 Apr 2019 at 02:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I like the concept ... but the particular implementation, not so much.
It seems way overcomplicated. In the first place, why should we
add code to copy entries? Just don't do it except when the table
is empty. In the second, I think we could probably have a far
cheaper test for how big the table is --- maybe we'd need to
expose some function in dynahash.c, but the right way here is just
to see how many buckets there are. I don't like adding statistics
counting for this, because it's got basically nothing to do with
what the actual problem is. (If you acquire and release one lock,
and do that over and over, you don't have a bloat problem no
matter how many times you do it.)
hash_get_num_entries() looks cheap enough to me. Can you explain why
you think that's too expensive?
LockMethodLocalHash is special in that it predictably goes to empty
at the end of every transaction, so that de-bloating at that point
is a workable strategy. I think we'd probably need something more
robust if we were trying to fix this generally for all hash tables.
But if we're going to go with the one-off hack approach, we should
certainly try to keep that hack as simple as possible.
As cheap as possible sounds good, but I'm confused at why you think
the table will always be empty at the end of transaction. It's my
understanding and I see from debugging that session level locks remain
in there. If I don't copy those into the new table they'll be lost.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, 8 Apr 2019 at 02:36, David Rowley <david.rowley@2ndquadrant.com> wrote:
LockMethodLocalHash is special in that it predictably goes to empty
at the end of every transaction, so that de-bloating at that point
is a workable strategy. I think we'd probably need something more
robust if we were trying to fix this generally for all hash tables.
But if we're going to go with the one-off hack approach, we should
certainly try to keep that hack as simple as possible.
As cheap as possible sounds good, but I'm confused at why you think
the table will always be empty at the end of transaction. It's my
understanding and I see from debugging that session level locks remain
in there. If I don't copy those into the new table they'll be lost.
Or we could just skip the table recreation if there are no
session-levels. That would require calling hash_get_num_entries() on
the table again and just recreating the table if there are 0 locks.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
David Rowley <david.rowley@2ndquadrant.com> writes:
On Mon, 8 Apr 2019 at 02:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I like the concept ... but the particular implementation, not so much.
It seems way overcomplicated. In the first place, why should we
add code to copy entries? Just don't do it except when the table
is empty. In the second, I think we could probably have a far
cheaper test for how big the table is --- maybe we'd need to
expose some function in dynahash.c, but the right way here is just
to see how many buckets there are. I don't like adding statistics
counting for this, because it's got basically nothing to do with
what the actual problem is. (If you acquire and release one lock,
and do that over and over, you don't have a bloat problem no
matter how many times you do it.)
hash_get_num_entries() looks cheap enough to me. Can you explain why
you think that's too expensive?
What I objected to cost-wise was counting the number of lock
acquisitions/releases, which seems entirely beside the point.
We *should* be using hash_get_num_entries(), but only to verify
that the table is empty before resetting it. The additional bit
that is needed is to see whether the number of buckets is large
enough to justify calling the table bloated.
As cheap as possible sounds good, but I'm confused at why you think
the table will always be empty at the end of transaction.
It's conceivable that it won't be, which is why we need a test.
I'm simply arguing that if it is not, we can just postpone de-bloating
till it is. Session-level locks are so rarely used that there's no
need to sweat about that case.
regards, tom lane
On Mon, 8 Apr 2019 at 02:59, Tom Lane <tgl@sss.pgh.pa.us> wrote:
David Rowley <david.rowley@2ndquadrant.com> writes:
hash_get_num_entries() looks cheap enough to me. Can you explain why
you think that's too expensive?
What I objected to cost-wise was counting the number of lock
acquisitions/releases, which seems entirely beside the point.
We *should* be using hash_get_num_entries(), but only to verify
that the table is empty before resetting it. The additional bit
that is needed is to see whether the number of buckets is large
enough to justify calling the table bloated.
The reason I thought it was a good idea to track some history there
was to stop the lock table constantly being shrunk back to the default
size every time a simple single table query was executed. For example,
a workload repeatably doing:
SELECT * FROM table_with_lots_of_partitions;
SELECT * FROM non_partitioned_table;
I was worried that obtaining locks on the partitioned table would
become a little slower because it would have to expand the hash table
each time the query is executed.
As cheap as possible sounds good, but I'm confused at why you think
the table will always be empty at the end of transaction.
It's conceivable that it won't be, which is why we need a test.
I'm simply arguing that if it is not, we can just postpone de-bloating
till it is. Session-level locks are so rarely used that there's no
need to sweat about that case.
That seems fair. It would certainly simplify the patch.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
David Rowley <david.rowley@2ndquadrant.com> writes:
On Mon, 8 Apr 2019 at 02:59, Tom Lane <tgl@sss.pgh.pa.us> wrote:
We *should* be using hash_get_num_entries(), but only to verify
that the table is empty before resetting it. The additional bit
that is needed is to see whether the number of buckets is large
enough to justify calling the table bloated.
The reason I thought it was a good idea to track some history there
was to stop the lock table constantly being shrunk back to the default
size every time a simple single table query was executed.
I think that's probably gilding the lily, considering that this whole
issue is pretty new. There's no evidence that expanding the local
lock table is a significant drag on queries that need a lot of locks.
regards, tom lane
On Mon, 8 Apr 2019 at 03:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:
David Rowley <david.rowley@2ndquadrant.com> writes:
The reason I thought it was a good idea to track some history there
was to stop the lock table constantly being shrunk back to the default
size every time a simple single table query was executed.
I think that's probably gilding the lily, considering that this whole
issue is pretty new. There's no evidence that expanding the local
lock table is a significant drag on queries that need a lot of locks.
Okay. Here's another version with all the average locks code removed
that only recreates the table when it's completely empty.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
shrink_bloated_locallocktable_v2.patch (application/octet-stream)
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index c895876..a85d792 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -254,6 +254,8 @@ static HTAB *LockMethodLockHash;
static HTAB *LockMethodProcLockHash;
static HTAB *LockMethodLocalHash;
+/* Initial size of LockMethodLocalHash table */
+#define LOCKMETHODLOCALHASH_INIT_SIZE 16
/* private state for error cleanup */
static LOCALLOCK *StrongLockInProgress;
@@ -339,6 +341,7 @@ PROCLOCK_PRINT(const char *where, const PROCLOCK *proclockP)
#endif /* not LOCK_DEBUG */
+static void CreateLocalLockHash(void);
static uint32 proclock_hash(const void *key, Size keysize);
static void RemoveLocalLock(LOCALLOCK *locallock);
static PROCLOCK *SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
@@ -431,6 +434,19 @@ InitLocks(void)
if (!found)
SpinLockInit(&FastPathStrongRelationLocks->mutex);
+ CreateLocalLockHash();
+}
+
+/*
+ * CreateLocalLockHash
+ * Create the LockMethodLocalHash hash table.
+ */
+static void
+CreateLocalLockHash(void)
+{
+ static HTAB *htab;
+ HASHCTL info;
+
/*
* Allocate non-shared hash table for LOCALLOCK structs. This stores lock
* counts and resource owner information.
@@ -447,12 +463,11 @@ InitLocks(void)
info.entrysize = sizeof(LOCALLOCK);
LockMethodLocalHash = hash_create("LOCALLOCK hash",
- 16,
+ LOCKMETHODLOCALHASH_INIT_SIZE,
&info,
HASH_ELEM | HASH_BLOBS);
}
-
/*
* Fetch the lock method table associated with a given lock
*/
@@ -2349,6 +2364,17 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
LWLockRelease(partitionLock);
} /* loop over partitions */
+ /*
+ * The hash_seq_search can become inefficient when the hash table has
+ * grown significantly larger than the default size due to the backend
+ * having run queries which obtained large numbers of locks at once. Here
+ * we'll build a new table at its initial size whenever the table is empty
+ * and has expanded from its original size.
+ */
+ if (status.curBucket > LOCKMETHODLOCALHASH_INIT_SIZE &&
+ hash_get_num_entries(LockMethodLocalHash) == 0)
+ CreateLocalLockHash();
+
#ifdef LOCK_DEBUG
if (*(lockMethodTable->trace_flag))
elog(LOG, "LockReleaseAll done");
Hi,
On 2019-04-08 03:40:52 +1200, David Rowley wrote:
On Mon, 8 Apr 2019 at 03:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:
David Rowley <david.rowley@2ndquadrant.com> writes:
The reason I thought it was a good idea to track some history there
was to stop the lock table constantly being shrunk back to the default
size every time a simple single table query was executed.
I think that's probably gilding the lily, considering that this whole
issue is pretty new. There's no evidence that expanding the local
lock table is a significant drag on queries that need a lot of locks.
Okay. Here's another version with all the average locks code removed
that only recreates the table when it's completely empty.
Could you benchmark your adversarial case?
- Andres
On Mon, 8 Apr 2019 at 03:47, Andres Freund <andres@anarazel.de> wrote:
Could you benchmark your adversarial case?
Which case?
I imagine the worst case for v2 is a query that just constantly asks
for over 16 locks. Perhaps a prepared plan, so as not to add planner
overhead.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
David Rowley <david.rowley@2ndquadrant.com> writes:
Okay. Here's another version with all the average locks code removed
that only recreates the table when it's completely empty.
Um ... I don't see where you're destroying the old hash?
Also, I entirely dislike wiring in assumptions about hash_seq_search's
private state structure here. I think it's worth having an explicit
entry point in dynahash.c to get the current number of buckets.
Also, I would not define "significantly bloated" as "the table has
grown at all". I think the threshold ought to be at least ~100
buckets, if we're starting at 16.
Probably we ought to try to gather some evidence to inform the
choice of cutoff here. Maybe instrument the regression tests to
see how big the table typically gets?
regards, tom lane
On Mon, 8 Apr 2019 at 04:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Um ... I don't see where you're destroying the old hash?
In CreateLocalLockHash.
Also, I entirely dislike wiring in assumptions about hash_seq_search's
private state structure here. I think it's worth having an explicit
entry point in dynahash.c to get the current number of buckets.
Okay. Added hash_get_max_bucket()
Also, I would not define "significantly bloated" as "the table has
grown at all". I think the threshold ought to be at least ~100
buckets, if we're starting at 16.
I wouldn't either. I don't think the comment says that. It says there
can be slowdowns when it's significantly bloated, and then goes on to
say that we just resize when it's bigger than standard.
Probably we ought to try to gather some evidence to inform the
choice of cutoff here. Maybe instrument the regression tests to
see how big the table typically gets?
In partition_prune.sql I see use of a bucket as high as 285 on my machine with:
drop table lp, coll_pruning, rlp, mc3p, mc2p, boolpart, rp,
coll_pruning_multi, like_op_noprune, lparted_by_int2, rparted_by_int2;
I've not added any sort of cut-off though as I benchmarked it and
surprisingly I don't see any slowdown with the worst case. So I'm
thinking there might not be any point.
alter system set plan_cache_mode = 'force_generic_plan';
create table hp (a int primary key) partition by hash (a);
select 'create table hp' || x::text || ' partition of hp for values
with (modulus 32, remainder ' || (x)::text || ');' from
generate_series(0,31) x;
\gexec
select.sql
\set p 1
select * from hp where a = :p
Master
$ pgbench -n -M prepared -f select.sql -T 60 postgres
tps = 11834.764309 (excluding connections establishing)
tps = 12279.212223 (excluding connections establishing)
tps = 12007.263547 (excluding connections establishing)
Patched:
$ pgbench -n -M prepared -f select.sql -T 60 postgres
tps = 13380.684817 (excluding connections establishing)
tps = 12790.999279 (excluding connections establishing)
tps = 12568.892558 (excluding connections establishing)
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
shrink_bloated_locallocktable_v3.patch (application/octet-stream)
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index c895876..01adb8a 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -254,6 +254,8 @@ static HTAB *LockMethodLockHash;
static HTAB *LockMethodProcLockHash;
static HTAB *LockMethodLocalHash;
+/* Initial size of LockMethodLocalHash table */
+#define LOCKMETHODLOCALHASH_INIT_SIZE 16
/* private state for error cleanup */
static LOCALLOCK *StrongLockInProgress;
@@ -339,6 +341,7 @@ PROCLOCK_PRINT(const char *where, const PROCLOCK *proclockP)
#endif /* not LOCK_DEBUG */
+static void CreateLocalLockHash(void);
static uint32 proclock_hash(const void *key, Size keysize);
static void RemoveLocalLock(LOCALLOCK *locallock);
static PROCLOCK *SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
@@ -431,14 +434,28 @@ InitLocks(void)
if (!found)
SpinLockInit(&FastPathStrongRelationLocks->mutex);
+ CreateLocalLockHash();
+}
+
+/*
+ * CreateLocalLockHash
+ * Create the LockMethodLocalHash hash table.
+ */
+static void
+CreateLocalLockHash(void)
+{
+ HASHCTL info;
+
/*
* Allocate non-shared hash table for LOCALLOCK structs. This stores lock
* counts and resource owner information.
*
- * The non-shared table could already exist in this process (this occurs
- * when the postmaster is recreating shared memory after a backend crash).
- * If so, delete and recreate it. (We could simply leave it, since it
- * ought to be empty in the postmaster, but for safety let's zap it.)
+ * First destroy any old table that may exist. We might just be
+ * recreating the table or it could already exist in this process (this
+ * occurs when the postmaster is recreating shared memory after a backend
+ * crash). If so, delete and recreate it. (We could simply leave it,
+ * since it ought to be empty in the postmaster, but for safety let's zap
+ * it.)
*/
if (LockMethodLocalHash)
hash_destroy(LockMethodLocalHash);
@@ -447,12 +464,11 @@ InitLocks(void)
info.entrysize = sizeof(LOCALLOCK);
LockMethodLocalHash = hash_create("LOCALLOCK hash",
- 16,
+ LOCKMETHODLOCALHASH_INIT_SIZE,
&info,
HASH_ELEM | HASH_BLOBS);
}
-
/*
* Fetch the lock method table associated with a given lock
*/
@@ -2349,6 +2365,18 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
LWLockRelease(partitionLock);
} /* loop over partitions */
+ /*
+ * The hash_seq_search can become inefficient when the hash table has
+ * grown significantly larger than the default size due to the backend
+ * having run queries which obtained large numbers of locks at once. Here
+ * we'll build a new table at its initial size whenever the table is empty
+ * and the maximum used bucket is beyond the original table size.
+ */
+ if (hash_get_num_entries(LockMethodLocalHash) == 0 &&
+ hash_get_max_bucket(LockMethodLocalHash) >
+ LOCKMETHODLOCALHASH_INIT_SIZE)
+ CreateLocalLockHash();
+
#ifdef LOCK_DEBUG
if (*(lockMethodTable->trace_flag))
elog(LOG, "LockReleaseAll done");
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 9dc2a55..5631258 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -1352,6 +1352,15 @@ hash_get_num_entries(HTAB *hashp)
}
/*
+ * hash_get_max_bucket -- get the maximum used bucket in a hashtable
+ */
+uint32
+hash_get_max_bucket(HTAB *hashp)
+{
+ return hashp->hctl->max_bucket;
+}
+
+/*
* hash_seq_init/_search/_term
* Sequentially search through hash table and return
* all the elements one by one, return NULL when no more.
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 373854c..07c38d6 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -132,6 +132,7 @@ extern void *hash_search_with_hash_value(HTAB *hashp, const void *keyPtr,
extern bool hash_update_hash_key(HTAB *hashp, void *existingEntry,
const void *newKeyPtr);
extern long hash_get_num_entries(HTAB *hashp);
+extern uint32 hash_get_max_bucket(HTAB *hashp);
extern void hash_seq_init(HASH_SEQ_STATUS *status, HTAB *hashp);
extern void *hash_seq_search(HASH_SEQ_STATUS *status);
extern void hash_seq_term(HASH_SEQ_STATUS *status);
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
On the whole I don't think there's an adequate case for committing
this patch.
From: Andres Freund [mailto:andres@anarazel.de]
On 2019-04-05 23:03:11 -0400, Tom Lane wrote:
If I reduce the number of partitions in Amit's example from 8192
to something more real-world, like 128, I do still measure a
performance gain, but it's ~ 1.5% which is below what I'd consider
a reproducible win. I'm accustomed to seeing changes up to 2%
in narrow benchmarks like this one, even when "nothing changes"
except unrelated code.
I'm not sure it's actually that narrow these days. With all the
partitioning improvements happening, the numbers of locks commonly held
are going to rise. And while 8192 partitions is maybe on the more
extreme side, it's a workload with only a single table, and plenty
workloads touch more than a single partitioned table.
I would be happy if I could dismiss such a many-partitions use case as narrow or impractical and ignore it, but it's not narrow. Two of our customers are actually requesting such usage: one uses 5,500 partitions and is trying to migrate from a commercial database on Linux, and the other requires 200,000 partitions to migrate from a legacy database on a mainframe. At first, I thought so many partitions indicated a bad application design, but their reasoning sounded valid (or at least I can't insist it's bad). PostgreSQL is now expected to handle such huge workloads.
From: Andres Freund [mailto:andres@anarazel.de]
I'm not sure I'm quite that concerned. For one, a good bit of that space
was up for grabs until the recent reordering of LOCALLOCK and nobody
complained. But more importantly, I think commonly the amount of locks
around is fairly constrained, isn't it? We can't really have that many
concurrently held locks, due to the shared memory space, and the size of
a LOCALLOCK isn't that big compared to say relcache entries. We also
probably fairly easily could win some space back - e.g. make nLocks 32
bits.
+1
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
I'd also point out that this is hardly the only place where we've
seen hash_seq_search on nearly-empty hash tables become a bottleneck.
So I'm not thrilled about attacking that with one-table-at-time patches.
I'd rather see us do something to let hash_seq_search win across
the board.
I spent some time wondering whether we could adjust the data structure
so that all the live entries in a hashtable are linked into one chain,
but I don't quite see how to do it without adding another list link to
struct HASHELEMENT, which seems pretty expensive.
I think the linked list of LOCALLOCK approach is natural, simple, and good. In the Jim Gray's classic book "Transaction processing: concepts and techniques", we can find the following sentence in "8.4.5 Lock Manager Internal Logic." The sample implementation code in the book uses a similar linked list to remember and release a transaction's acquired locks.
"All the locks of a transaction are kept in a list so they can be quickly found and released at commit or rollback."
And handling this issue with the LOCALLOCK linked list is more natural than with the hash table resize. We just want to quickly find all grabbed locks, so we use a linked list. A hash table is a mechanism to find a particular item quickly. So it was merely wrong to use the hash table to iterate all grabbed locks. Also, the hash table got big because some operation in the session needed it, and some subsequent operations in the same session may need it again. So we wouldn't be relieved with shrinking the hash table.
Regards
Takayuki Tsunakawa
Hi,
On 2019-04-08 02:28:12 +0000, Tsunakawa, Takayuki wrote:
I think the linked list of LOCALLOCK approach is natural, simple, and
good.
Did you see that people measured slowdowns?
Greetings,
Andres Freund
From: 'Andres Freund' [mailto:andres@anarazel.de]
On 2019-04-08 02:28:12 +0000, Tsunakawa, Takayuki wrote:
I think the linked list of LOCALLOCK approach is natural, simple, and
good.
Did you see that people measured slowdowns?
Yeah, 0.5% decrease with pgbench -M prepared -S (select-only), which feels like a somewhat extreme test case. And that might be within noise as was mentioned.
If we want to remove even the noise, we may have to think of removing the LocalLockHash completely. But it doesn't seem feasible...
Regards
Takayuki Tsunakawa
On Mon, 8 Apr 2019 at 14:54, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
From: 'Andres Freund' [mailto:andres@anarazel.de]
Did you see that people measured slowdowns?
Yeah, 0.5% decrease with pgbench -M prepared -S (select-only), which feels like a somewhat extreme test case. And that might be within noise as was mentioned.
If we want to remove even the noise, we may have to think of removing the LocalLockHash completely. But it doesn't seem feasible...
It would be good to get your view on the
shrink_bloated_locallocktable_v3.patch I worked on last night. I was
unable to measure any overhead to solving the problem that way.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From: David Rowley [mailto:david.rowley@2ndquadrant.com]
It would be good to get your view on the
shrink_bloated_locallocktable_v3.patch I worked on last night. I was
unable to measure any overhead to solving the problem that way.
Thanks, it looks super simple and good. I understand the idea behind your patch to be:
* Transactions that touch many partitions and/or tables are a special event and not normal, and the hash table bloat is an unlucky accident. So it's reasonable to revert the bloated hash table back to the original size.
* Repeated transactions that acquire many locks have to enlarge the hash table every time. However, the overhead of hash table expansion should be hidden behind various other processing (acquiring/releasing locks, reading/writing the relations, accessing the catalogs of those relations).
TBH, I think the linked list approach feels more intuitive, because the resulting code shows what it wants to do (efficiently iterate over the acquired locks) and follows the classic book. But your approach seems to ease people's concerns, so I'm OK with your patch.
I'm registering you as another author and me as a reviewer, and marking this ready for committer.
Regards
Takayuki Tsunakawa
On 2019-04-08 05:46, Tsunakawa, Takayuki wrote:
I'm registering you as another author and me as a reviewer, and marking this ready for committer.
Moved to next commit fest.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, 8 Apr 2019 at 04:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Also, I would not define "significantly bloated" as "the table has
grown at all". I think the threshold ought to be at least ~100
buckets, if we're starting at 16.
I've revised the patch to add a new constant named
LOCKMETHODLOCALHASH_SHRINK_SIZE. I've set this to 64 for now. Once the
hash table grows over that size we shrink it back down to
LOCKMETHODLOCALHASH_INIT_SIZE, which I've kept at 16.
I'm not opposed to setting it to 128. For this particular benchmark,
it won't make any difference as it's only going to affect something
that does not quite use 128 locks and has to work with a slightly
bloated local lock table. I think hitting 64 locks in a transaction is
a good indication that it's not a simple transaction so users are
probably unlikely to notice the small slowdown from the hash table
reinitialisation.
Since quite a bit has changed around partition planning lately, I've
taken a fresh set of benchmarks on today's master. I'm using something
very close to Amit's benchmark from upthread. I just changed the query
so we hit the same partition each time instead of a random one.
create table ht (a int primary key, b int, c int) partition by hash (a);
select 'create table ht' || x::text || ' partition of ht for values
with (modulus 8192, remainder ' || (x)::text || ');' from
generate_series(0,8191) x;
\gexec
select.sql:
\set p 1
select * from ht where a = :p
master @ a193cbec119 + shrink_bloated_locallocktable_v4.patch:
plan_cache_mode = 'auto';
ubuntu@ip-10-0-0-201:~$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 14101.226982 (excluding connections establishing)
tps = 14034.250962 (excluding connections establishing)
tps = 14107.937755 (excluding connections establishing)
plan_cache_mode = 'force_custom_plan';
ubuntu@ip-10-0-0-201:~$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 14240.366770 (excluding connections establishing)
tps = 14272.244886 (excluding connections establishing)
tps = 14130.684315 (excluding connections establishing)
master @ a193cbec119:
plan_cache_mode = 'auto';
ubuntu@ip-10-0-0-201:~$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 10467.027666 (excluding connections establishing)
tps = 10333.700917 (excluding connections establishing)
tps = 10633.084426 (excluding connections establishing)
plan_cache_mode = 'force_custom_plan';
ubuntu@ip-10-0-0-201:~$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 13938.083272 (excluding connections establishing)
tps = 14143.241802 (excluding connections establishing)
tps = 14097.406758 (excluding connections establishing)
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
shrink_bloated_locallocktable_v4.patchapplication/octet-stream; name=shrink_bloated_locallocktable_v4.patchDownload
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 6745a2432e..ea14792fc2 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -254,6 +254,20 @@ static HTAB *LockMethodLockHash;
static HTAB *LockMethodProcLockHash;
static HTAB *LockMethodLocalHash;
+/* Initial size of LockMethodLocalHash table */
+#define LOCKMETHODLOCALHASH_INIT_SIZE 16
+
+/*
+ * If the size of the LockMethodLocalHash table grows beyond this then try
+ * to shrink the table back down to LOCKMETHODLOCALHASH_INIT_SIZE. This must
+ * not be less than LOCKMETHODLOCALHASH_INIT_SIZE
+ */
+#define LOCKMETHODLOCALHASH_SHRINK_SIZE 64
+
+/* Complain if the above are not set to something sane */
+#if LOCKMETHODLOCALHASH_SHRINK_SIZE < LOCKMETHODLOCALHASH_INIT_SIZE
+#error "invalid LOCKMETHODLOCALHASH_SHRINK_SIZE"
+#endif
/* private state for error cleanup */
static LOCALLOCK *StrongLockInProgress;
@@ -339,6 +353,7 @@ PROCLOCK_PRINT(const char *where, const PROCLOCK *proclockP)
#endif /* not LOCK_DEBUG */
+static void CreateLocalLockHash(void);
static uint32 proclock_hash(const void *key, Size keysize);
static void RemoveLocalLock(LOCALLOCK *locallock);
static PROCLOCK *SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
@@ -431,14 +446,28 @@ InitLocks(void)
if (!found)
SpinLockInit(&FastPathStrongRelationLocks->mutex);
+ CreateLocalLockHash();
+}
+
+/*
+ * CreateLocalLockHash
+ * Create the LockMethodLocalHash hash table.
+ */
+static void
+CreateLocalLockHash(void)
+{
+ HASHCTL info;
+
/*
* Allocate non-shared hash table for LOCALLOCK structs. This stores lock
* counts and resource owner information.
*
- * The non-shared table could already exist in this process (this occurs
- * when the postmaster is recreating shared memory after a backend crash).
- * If so, delete and recreate it. (We could simply leave it, since it
- * ought to be empty in the postmaster, but for safety let's zap it.)
+ * First destroy any old table that may exist. We might just be
+ * recreating the table or it could already exist in this process (this
+ * occurs when the postmaster is recreating shared memory after a backend
+ * crash). If so, delete and recreate it. (We could simply leave it,
+ * since it ought to be empty in the postmaster, but for safety let's zap
+ * it.)
*/
if (LockMethodLocalHash)
hash_destroy(LockMethodLocalHash);
@@ -447,12 +476,11 @@ InitLocks(void)
info.entrysize = sizeof(LOCALLOCK);
LockMethodLocalHash = hash_create("LOCALLOCK hash",
- 16,
+ LOCKMETHODLOCALHASH_INIT_SIZE,
&info,
HASH_ELEM | HASH_BLOBS);
}
-
/*
* Fetch the lock method table associated with a given lock
*/
@@ -2349,6 +2377,18 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
LWLockRelease(partitionLock);
} /* loop over partitions */
+ /*
+ * The hash_seq_search can become inefficient when the hash table has
+ * grown significantly larger than the default size due to the backend
+ * having run queries which obtained large numbers of locks at once. Here
+ * we'll build a new table at its initial size whenever the table is empty
+ * and the maximum used bucket is beyond LOCKMETHODLOCALHASH_SHRINK_SIZE.
+ */
+ if (hash_get_num_entries(LockMethodLocalHash) == 0 &&
+ hash_get_max_bucket(LockMethodLocalHash) >
+ LOCKMETHODLOCALHASH_SHRINK_SIZE)
+ CreateLocalLockHash();
+
#ifdef LOCK_DEBUG
if (*(lockMethodTable->trace_flag))
elog(LOG, "LockReleaseAll done");
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 0dfbec8e3e..d93e5279ee 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -1351,6 +1351,15 @@ hash_get_num_entries(HTAB *hashp)
return sum;
}
+/*
+ * hash_get_max_bucket -- get the maximum used bucket in a hashtable
+ */
+uint32
+hash_get_max_bucket(HTAB *hashp)
+{
+ return hashp->hctl->max_bucket;
+}
+
/*
* hash_seq_init/_search/_term
* Sequentially search through hash table and return
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index fe5ab9c868..941f99398d 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -132,6 +132,7 @@ extern void *hash_search_with_hash_value(HTAB *hashp, const void *keyPtr,
extern bool hash_update_hash_key(HTAB *hashp, void *existingEntry,
const void *newKeyPtr);
extern long hash_get_num_entries(HTAB *hashp);
+extern uint32 hash_get_max_bucket(HTAB *hashp);
extern void hash_seq_init(HASH_SEQ_STATUS *status, HTAB *hashp);
extern void *hash_seq_search(HASH_SEQ_STATUS *status);
extern void hash_seq_term(HASH_SEQ_STATUS *status);
From: David Rowley [mailto:david.rowley@2ndquadrant.com]
I've revised the patch to add a new constant named
LOCKMETHODLOCALHASH_SHRINK_SIZE. I've set this to 64 for now. Once the hash
Thank you, and good performance. The patch passed make check.
I'm OK with the current patch, but I have a few comments. Please take them as you see fit (I wouldn't mind if you don't.)
(1)
+#define LOCKMETHODLOCALHASH_SHRINK_SIZE 64
How about LOCKMETHODLOCALHASH_SHRINK_THRESHOLD, because this determines the threshold value to trigger shrinkage? Code in PostgreSQL seems to use the term threshold.
(2)
+/* Complain if the above are not set to something sane */
+#if LOCKMETHODLOCALHASH_SHRINK_SIZE < LOCKMETHODLOCALHASH_INIT_SIZE
+#error "invalid LOCKMETHODLOCALHASH_SHRINK_SIZE"
+#endif
I don't think these are necessary, because these are fixed and not configurable. FYI, src/include/utils/memutils.h doesn't have #error to test these macros.
#define ALLOCSET_DEFAULT_MINSIZE 0
#define ALLOCSET_DEFAULT_INITSIZE (8 * 1024)
#define ALLOCSET_DEFAULT_MAXSIZE (8 * 1024 * 1024)
(3)
+ if (hash_get_num_entries(LockMethodLocalHash) == 0 &&
+ hash_get_max_bucket(LockMethodLocalHash) >
+ LOCKMETHODLOCALHASH_SHRINK_SIZE)
+ CreateLocalLockHash();
I get an impression that Create just creates something where there's nothing. How about Init or Recreate?
Regards
Takayuki Tsunakawa
On Mon, 17 Jun 2019 at 15:05, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
(1)
+#define LOCKMETHODLOCALHASH_SHRINK_SIZE 64
How about LOCKMETHODLOCALHASH_SHRINK_THRESHOLD, because this determines the threshold value to trigger shrinkage? Code in PostgreSQL seems to use the term threshold.
That's probably better. I've renamed it to that.
(2)
+/* Complain if the above are not set to something sane */
+#if LOCKMETHODLOCALHASH_SHRINK_SIZE < LOCKMETHODLOCALHASH_INIT_SIZE
+#error "invalid LOCKMETHODLOCALHASH_SHRINK_SIZE"
+#endif
I don't think these are necessary, because these are fixed and not configurable. FYI, src/include/utils/memutils.h doesn't have #error to test these macros.
Yeah. I was thinking it was overkill when I wrote it, but somehow
couldn't bring myself to remove it. Done now.
(3)
+ if (hash_get_num_entries(LockMethodLocalHash) == 0 &&
+ hash_get_max_bucket(LockMethodLocalHash) >
+ LOCKMETHODLOCALHASH_SHRINK_SIZE)
+ CreateLocalLockHash();
I get an impression that Create just creates something where there's nothing. How about Init or Recreate?
Renamed to InitLocalLockHash().
v5 is attached.
Thank you for the review.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
shrink_bloated_locallocktable_v5.patchapplication/octet-stream; name=shrink_bloated_locallocktable_v5.patchDownload
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 6745a2432e..7e1f64ced1 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -254,6 +254,15 @@ static HTAB *LockMethodLockHash;
static HTAB *LockMethodProcLockHash;
static HTAB *LockMethodLocalHash;
+/* Initial size of the LockMethodLocalHash table */
+#define LOCKMETHODLOCALHASH_INIT_SIZE 16
+
+/*
+ * If the size of the LockMethodLocalHash table grows beyond this then try
+ * to shrink the table back down to LOCKMETHODLOCALHASH_INIT_SIZE. This must
+ * not be less than LOCKMETHODLOCALHASH_INIT_SIZE
+ */
+#define LOCKMETHODLOCALHASH_SHRINK_THRESHOLD 64
/* private state for error cleanup */
static LOCALLOCK *StrongLockInProgress;
@@ -339,6 +348,7 @@ PROCLOCK_PRINT(const char *where, const PROCLOCK *proclockP)
#endif /* not LOCK_DEBUG */
+static void InitLocalLockHash(void);
static uint32 proclock_hash(const void *key, Size keysize);
static void RemoveLocalLock(LOCALLOCK *locallock);
static PROCLOCK *SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
@@ -431,14 +441,28 @@ InitLocks(void)
if (!found)
SpinLockInit(&FastPathStrongRelationLocks->mutex);
+ InitLocalLockHash();
+}
+
+/*
+ * InitLocalLockHash
+ * Initialize the LockMethodLocalHash hash table.
+ */
+static void
+InitLocalLockHash(void)
+{
+ HASHCTL info;
+
/*
* Allocate non-shared hash table for LOCALLOCK structs. This stores lock
* counts and resource owner information.
*
- * The non-shared table could already exist in this process (this occurs
- * when the postmaster is recreating shared memory after a backend crash).
- * If so, delete and recreate it. (We could simply leave it, since it
- * ought to be empty in the postmaster, but for safety let's zap it.)
+ * First destroy any old table that may exist. We might just be
+ * recreating the table or it could already exist in this process (this
+ * occurs when the postmaster is recreating shared memory after a backend
+ * crash). If so, delete and recreate it. (We could simply leave it,
+ * since it ought to be empty in the postmaster, but for safety let's zap
+ * it.)
*/
if (LockMethodLocalHash)
hash_destroy(LockMethodLocalHash);
@@ -447,12 +471,11 @@ InitLocks(void)
info.entrysize = sizeof(LOCALLOCK);
LockMethodLocalHash = hash_create("LOCALLOCK hash",
- 16,
+ LOCKMETHODLOCALHASH_INIT_SIZE,
&info,
HASH_ELEM | HASH_BLOBS);
}
-
/*
* Fetch the lock method table associated with a given lock
*/
@@ -2349,6 +2372,18 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
LWLockRelease(partitionLock);
} /* loop over partitions */
+ /*
+ * The hash_seq_search can become inefficient when the hash table has
+ * grown significantly larger than the default size due to the backend
+ * having obtained large numbers of locks at once. Here we'll build a new
+ * table at its initial size whenever the table is empty and the maximum
+ * used bucket is beyond LOCKMETHODLOCALHASH_SHRINK_THRESHOLD.
+ */
+ if (hash_get_num_entries(LockMethodLocalHash) == 0 &&
+ hash_get_max_bucket(LockMethodLocalHash) >
+ LOCKMETHODLOCALHASH_SHRINK_THRESHOLD)
+ InitLocalLockHash();
+
#ifdef LOCK_DEBUG
if (*(lockMethodTable->trace_flag))
elog(LOG, "LockReleaseAll done");
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 0dfbec8e3e..d93e5279ee 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -1351,6 +1351,15 @@ hash_get_num_entries(HTAB *hashp)
return sum;
}
+/*
+ * hash_get_max_bucket -- get the maximum used bucket in a hashtable
+ */
+uint32
+hash_get_max_bucket(HTAB *hashp)
+{
+ return hashp->hctl->max_bucket;
+}
+
/*
* hash_seq_init/_search/_term
* Sequentially search through hash table and return
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index fe5ab9c868..941f99398d 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -132,6 +132,7 @@ extern void *hash_search_with_hash_value(HTAB *hashp, const void *keyPtr,
extern bool hash_update_hash_key(HTAB *hashp, void *existingEntry,
const void *newKeyPtr);
extern long hash_get_num_entries(HTAB *hashp);
+extern uint32 hash_get_max_bucket(HTAB *hashp);
extern void hash_seq_init(HASH_SEQ_STATUS *status, HTAB *hashp);
extern void *hash_seq_search(HASH_SEQ_STATUS *status);
extern void hash_seq_term(HASH_SEQ_STATUS *status);
From: David Rowley [mailto:david.rowley@2ndquadrant.com]
v5 is attached.
Thank you, looks good. I find it ready for committer (I noticed the status is already set accordingly).
Regards
Takayuki Tsunakawa
On Thu, 27 Jun 2019 at 12:59, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
From: David Rowley [mailto:david.rowley@2ndquadrant.com]
Thank you, looks good. I find it ready for committer (I noticed the status is already set accordingly).
Thanks for looking.
I've just been looking at this again and I thought I'd better check
the performance of the worst case for the patch, where the hash table
is rebuilt each query.
To do this I first created a single column 70 partition partitioned
table ("p") and left it empty.
I then checked the performance of:
SELECT * FROM p;
Having 70 partitions means that the lock table's max bucket goes over
the LOCKMETHODLOCALHASH_SHRINK_THRESHOLD which is set to 64 and
results in the table being rebuilt each time the query is run.
The performance was as follows:
70 partitions: LOCKMETHODLOCALHASH_SHRINK_THRESHOLD = 64
master + shrink_bloated_locallocktable_v5.patch:
ubuntu@ip-10-0-0-201:~$ pgbench -n -T 60 -f select1.sql -M prepared postgres
tps = 8427.053378 (excluding connections establishing)
tps = 8583.251821 (excluding connections establishing)
tps = 8569.587268 (excluding connections establishing)
tps = 8552.988483 (excluding connections establishing)
tps = 8527.735108 (excluding connections establishing)
master (93907478):
ubuntu@ip-10-0-0-201:~$ pgbench -n -T 60 -f select1.sql -M prepared postgres
tps = 8712.919411 (excluding connections establishing)
tps = 8760.190372 (excluding connections establishing)
tps = 8755.069470 (excluding connections establishing)
tps = 8747.389735 (excluding connections establishing)
tps = 8758.275202 (excluding connections establishing)
patched is 2.45% slower
If I increase the partition count to 140 and put the
LOCKMETHODLOCALHASH_SHRINK_THRESHOLD up to 128, then the performance
is as follows:
master + shrink_bloated_locallocktable_v5.patch:
ubuntu@ip-10-0-0-201:~$ pgbench -n -T 60 -f select1.sql -M prepared postgres
tps = 2548.917856 (excluding connections establishing)
tps = 2561.283564 (excluding connections establishing)
tps = 2549.669870 (excluding connections establishing)
tps = 2421.971864 (excluding connections establishing)
tps = 2428.983660 (excluding connections establishing)
Master (93907478):
ubuntu@ip-10-0-0-201:~$ pgbench -n -T 60 -f select1.sql -M prepared postgres
tps = 2605.407529 (excluding connections establishing)
tps = 2600.691426 (excluding connections establishing)
tps = 2594.123983 (excluding connections establishing)
tps = 2455.745644 (excluding connections establishing)
tps = 2450.061483 (excluding connections establishing)
patched is 1.54% slower
I'd rather not put the LOCKMETHODLOCALHASH_SHRINK_THRESHOLD up any
higher than 128 since it can detract from the improvement we're trying
to make with this patch.
Now, this case of querying a partitioned table that happens to be
completely empty seems a bit unrealistic. Something more realistic
might be index scanning all partitions to find a value that only
exists in a single partition. Assuming the partitions actually have
some records, then that's going to be a more expensive query, so the
overhead of rebuilding the table will be less noticeable.
A previous version of the patch has already had some heuristics to try
to only rebuild the hash table when it's likely beneficial. I'd rather
not go exploring in that area again.
Is anyone particularly concerned about the worst-case slowdown here
being about 1.54%? The best case, and arguably a more realistic case,
above showed a 34% speedup.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, 18 Jul 2019 at 14:53, David Rowley <david.rowley@2ndquadrant.com> wrote:
Is anyone particularly concerned about the worst-case slowdown here
being about 1.54%? The best case, and arguably a more realistic case,
above showed a 34% speedup.
I took a bit more time to test the performance on this. I thought I
might have been a bit unfair on the patch by giving it completely
empty tables to look at. It just seems too unrealistic to have a large
number of completely empty partitions. I decided to come up with a
more realistic case where there are 140 partitions but we're
performing an index scan to find a record that can exist in only 1 of
the 140 partitions.
create table hp (a int, b int) partition by hash(a);
select 'create table hp'||x||' partition of hp for values with
(modulus 140, remainder ' || x || ');' from generate_series(0,139)x;
create index on hp (b);
insert into hp select x,x from generate_series(1, 140000) x;
analyze hp;
select3.sql: select * from hp where b = 1
Master:
$ pgbench -n -f select3.sql -T 60 -M prepared postgres
tps = 2124.591367 (excluding connections establishing)
tps = 2158.754837 (excluding connections establishing)
tps = 2146.348465 (excluding connections establishing)
tps = 2148.995800 (excluding connections establishing)
tps = 2154.223600 (excluding connections establishing)
Patched:
$ pgbench -n -f select3.sql -T 60 -M prepared postgres
tps = 2002.480729 (excluding connections establishing)
tps = 1997.272944 (excluding connections establishing)
tps = 1992.527079 (excluding connections establishing)
tps = 1995.789367 (excluding connections establishing)
tps = 2001.195760 (excluding connections establishing)
so it turned out it's even slower, and not by a small amount either!
Patched is 6.93% slower on average with this case :-(
I went back to the drawing board on this and I've added some code that
counts the number of times we've seen the table to be oversized and
just shrinks the table back down on the 1000th time. 6.93% / 1000 is
not all that much. Of course, not all the extra overhead might be from
rebuilding the table, so here's a test with the updated patch.
$ pgbench -n -f select3.sql -T 60 -M prepared postgres
tps = 2142.414323 (excluding connections establishing)
tps = 2139.666899 (excluding connections establishing)
tps = 2138.744789 (excluding connections establishing)
tps = 2138.300299 (excluding connections establishing)
tps = 2137.346278 (excluding connections establishing)
Just a 0.34% drop. Pretty hard to pick that out of the noise.
Testing the original case that shows the speedup:
create table ht (a int primary key, b int, c int) partition by hash (a);
select 'create table ht' || x::text || ' partition of ht for values
with (modulus 8192, remainder ' || (x)::text || ');' from
generate_series(0,8191) x;
\gexec
select.sql:
\set p 1
select * from ht where a = :p
Master:
$ pgbench -n -f select.sql -T 60 -M prepared postgres
tps = 10172.035036 (excluding connections establishing)
tps = 10192.780529 (excluding connections establishing)
tps = 10331.306003 (excluding connections establishing)
Patched:
$ pgbench -n -f select.sql -T 60 -M prepared postgres
tps = 15080.765549 (excluding connections establishing)
tps = 14994.404069 (excluding connections establishing)
tps = 14982.923480 (excluding connections establishing)
That seems fine, 46% faster.
v6 is attached.
I plan to push this in a few days unless someone objects.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
shrink_bloated_locallocktable_v6.patchapplication/octet-stream; name=shrink_bloated_locallocktable_v6.patchDownload
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 1b7053cb1c..048979c716 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -254,6 +254,22 @@ static HTAB *LockMethodLockHash;
static HTAB *LockMethodProcLockHash;
static HTAB *LockMethodLocalHash;
+/* Initial size of the LockMethodLocalHash table */
+#define LOCKMETHODLOCALHASH_INIT_SIZE 16
+
+/*
+ * If the size of the LockMethodLocalHash table grows beyond this then try
+ * to shrink the table back down to LOCKMETHODLOCALHASH_INIT_SIZE. This must
+ * not be less than LOCKMETHODLOCALHASH_INIT_SIZE
+ */
+#define LOCKMETHODLOCALHASH_SHRINK_THRESHOLD 64
+
+/*
+ * How many times must TryShrinkLocalLockHash() be called while
+ * LockMethodLocalHash has exceeded LOCKMETHODLOCALHASH_SHRINK_THRESHOLD
+ * before we rebuild the hash table.
+ */
+#define LOCKMETHODLOCALHASH_SHRINK_FREQUENCY 1000
/* private state for error cleanup */
static LOCALLOCK *StrongLockInProgress;
@@ -339,6 +355,8 @@ PROCLOCK_PRINT(const char *where, const PROCLOCK *proclockP)
#endif /* not LOCK_DEBUG */
+static void InitLocalLockHash(void);
+static inline void TryShrinkLocalLockHash(void);
static uint32 proclock_hash(const void *key, Size keysize);
static void RemoveLocalLock(LOCALLOCK *locallock);
static PROCLOCK *SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
@@ -431,14 +449,28 @@ InitLocks(void)
if (!found)
SpinLockInit(&FastPathStrongRelationLocks->mutex);
+ InitLocalLockHash();
+}
+
+/*
+ * InitLocalLockHash
+ * Initialize the LockMethodLocalHash hash table.
+ */
+static void
+InitLocalLockHash(void)
+{
+ HASHCTL info;
+
/*
* Allocate non-shared hash table for LOCALLOCK structs. This stores lock
* counts and resource owner information.
*
- * The non-shared table could already exist in this process (this occurs
- * when the postmaster is recreating shared memory after a backend crash).
- * If so, delete and recreate it. (We could simply leave it, since it
- * ought to be empty in the postmaster, but for safety let's zap it.)
+ * First destroy any old table that may exist. We might just be
+ * recreating the table or it could already exist in this process (this
+ * occurs when the postmaster is recreating shared memory after a backend
+ * crash). If so, delete and recreate it. (We could simply leave it,
+ * since it ought to be empty in the postmaster, but for safety let's zap
+ * it.)
*/
if (LockMethodLocalHash)
hash_destroy(LockMethodLocalHash);
@@ -447,11 +479,47 @@ InitLocks(void)
info.entrysize = sizeof(LOCALLOCK);
LockMethodLocalHash = hash_create("LOCALLOCK hash",
- 16,
+ LOCKMETHODLOCALHASH_INIT_SIZE,
&info,
HASH_ELEM | HASH_BLOBS);
}
+/*
+ * TryShrinkLocalLockHash
+ * Consider rebuilding LockMethodLocalHash.
+ *
+ * NB: We only rebuild the table if; 1) The table's max bucket has gone
+ * beyond the defined threshold, and; 2) The number of times the function
+ * has been called while meeting case #1 has exceeded the defined frequency.
+ * Without #2 we may rebuild the table too often and since rebuilding the hash
+ * table is not free, we may slow down workloads that frequently obtain large
+ * numbers of locks at once.
+ */
+static inline void
+TryShrinkLocalLockHash(void)
+{
+ static int ntimes_exceeded = 0;
+
+ /*
+ * 1. Consider shrinking the table whenever the table is empty and the
+ * maximum used bucket is beyond LOCKMETHODLOCALHASH_SHRINK_THRESHOLD.
+ */
+ if (hash_get_num_entries(LockMethodLocalHash) == 0 &&
+ hash_get_max_bucket(LockMethodLocalHash) >
+ LOCKMETHODLOCALHASH_SHRINK_THRESHOLD)
+ {
+ /* Increment the number of times we've exceeded the threshold */
+ ntimes_exceeded++;
+
+ /* 2. Shrink if we've exceeded the threshold enough times */
+ if (ntimes_exceeded >= LOCKMETHODLOCALHASH_SHRINK_FREQUENCY)
+ {
+ /* Rebuild the table and zero the counter */
+ InitLocalLockHash();
+ ntimes_exceeded = 0;
+ }
+ }
+}
/*
* Fetch the lock method table associated with a given lock
@@ -2349,6 +2417,13 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
LWLockRelease(partitionLock);
} /* loop over partitions */
+ /*
+ * The hash_seq_search can become inefficient when the hash table has
+ * grown significantly larger than the default size due to the backend
+ * having obtained large numbers of locks at once. Consider shrinking it.
+ */
+ TryShrinkLocalLockHash();
+
#ifdef LOCK_DEBUG
if (*(lockMethodTable->trace_flag))
elog(LOG, "LockReleaseAll done");
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 0dfbec8e3e..d93e5279ee 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -1351,6 +1351,15 @@ hash_get_num_entries(HTAB *hashp)
return sum;
}
+/*
+ * hash_get_max_bucket -- get the maximum used bucket in a hashtable
+ */
+uint32
+hash_get_max_bucket(HTAB *hashp)
+{
+ return hashp->hctl->max_bucket;
+}
+
/*
* hash_seq_init/_search/_term
* Sequentially search through hash table and return
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index fe5ab9c868..941f99398d 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -132,6 +132,7 @@ extern void *hash_search_with_hash_value(HTAB *hashp, const void *keyPtr,
extern bool hash_update_hash_key(HTAB *hashp, void *existingEntry,
const void *newKeyPtr);
extern long hash_get_num_entries(HTAB *hashp);
+extern uint32 hash_get_max_bucket(HTAB *hashp);
extern void hash_seq_init(HASH_SEQ_STATUS *status, HTAB *hashp);
extern void *hash_seq_search(HASH_SEQ_STATUS *status);
extern void hash_seq_term(HASH_SEQ_STATUS *status);
From: David Rowley [mailto:david.rowley@2ndquadrant.com]
I went back to the drawing board on this and I've added some code that counts
the number of times we've seen the table to be oversized and just shrinks
the table back down on the 1000th time. 6.93% / 1000 is not all that much.
I'm afraid this kind of hidden behavior would appear mysterious to users. They may wonder "Why is the same query fast at first in the session (5 or 6 times of execution), then gets slower for a while, and gets faster again? Is there something to tune? Am I missing something wrong with my app (e.g. how to use prepared statements)?" So I prefer v5.
Of course, not all the extra overhead might be from rebuilding the table,
so here's a test with the updated patch.
Where else does the extra overhead come from?
Regards
Takayuki Tsunakawa
On Mon, 22 Jul 2019 at 12:48, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
From: David Rowley [mailto:david.rowley@2ndquadrant.com]
I went back to the drawing board on this and I've added some code that counts
the number of times we've seen the table to be oversized and just shrinks
the table back down on the 1000th time. 6.93% / 1000 is not all that much.
I'm afraid this kind of hidden behavior would appear mysterious to users. They may wonder "Why is the same query fast at first in the session (5 or 6 times of execution), then gets slower for a while, and gets faster again? Is there something to tune? Am I missing something wrong with my app (e.g. how to use prepared statements)?" So I prefer v5.
I personally don't think that's true. The only way you'll notice the
LockReleaseAll() overhead is to execute very fast queries with a
bloated lock table. It's pretty hard to notice that a single 0.1ms
query is slow. You'll need to execute thousands of them before you'll
be able to measure it, and once you've done that, the lock shrink code
will have run and the query will be performing optimally again.
I voiced my concerns with v5 and I wasn't really willing to push it
with a known performance regression of 7% in a fairly valid case. v6
does not suffer from that.
Of course, not all the extra overhead might be from rebuilding the table,
so here's a test with the updated patch.
Where else does the extra overhead come from?
hash_get_num_entries(LockMethodLocalHash) == 0 &&
+ hash_get_max_bucket(LockMethodLocalHash) >
+ LOCKMETHODLOCALHASH_SHRINK_THRESHOLD)
that's executed every time, not every 1000 times.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From: David Rowley [mailto:david.rowley@2ndquadrant.com]
I personally don't think that's true. The only way you'll notice the
LockReleaseAll() overhead is to execute very fast queries with a
bloated lock table. It's pretty hard to notice that a single 0.1ms
query is slow. You'll need to execute thousands of them before you'll
be able to measure it, and once you've done that, the lock shrink code
will have run and the query will be performing optimally again.
Maybe so. Will the difference be noticeable between plan_cache_mode=auto (default) and plan_cache_mode=custom?
I voiced my concerns with v5 and I wasn't really willing to push it
with a known performance regression of 7% in a fairly valid case. v6
does not suffer from that.
You're right. We may have to accept the unpredictability this hidden behavior presents to users as a compromise for higher throughput.
Where else does the extra overhead come from?
hash_get_num_entries(LockMethodLocalHash) == 0 &&
+ hash_get_max_bucket(LockMethodLocalHash) >
+ LOCKMETHODLOCALHASH_SHRINK_THRESHOLD)
that's executed every time, not every 1000 times.
I see. Thanks.
Regards
Takayuki Tsunakawa
On Mon, 22 Jul 2019 at 14:21, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
From: David Rowley [mailto:david.rowley@2ndquadrant.com]
I personally don't think that's true. The only way you'll notice the
LockReleaseAll() overhead is to execute very fast queries with a
bloated lock table. It's pretty hard to notice that a single 0.1ms
query is slow. You'll need to execute thousands of them before you'll
be able to measure it, and once you've done that, the lock shrink code
will have run and the query will be performing optimally again.
Maybe so. Will the difference be noticeable between plan_cache_mode=auto (default) and plan_cache_mode=custom?
For the use case we've been measuring with partitioned tables and the
generic plan generation causing a sudden spike in the number of
obtained locks, then having plan_cache_mode = force_custom_plan will
cause the lock table not to become bloated. I'm not sure there's
anything interesting to measure there. The only additional code that
gets executed is the hash_get_num_entries() and possibly
hash_get_max_bucket. Maybe it's worth swapping the order of those
calls since most of the time the entry will be 0 and the max bucket
count threshold won't be exceeded.
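In code, that reordering would look something like this (just a sketch of the idea, ignoring the every-1000-xacts counter for brevity; it is the ordering the later v7 patch adopts):

	/*
	 * Put the rarely-true condition first so that, in the common case where
	 * the table has not grown past the threshold, we short-circuit and never
	 * call hash_get_num_entries() at all.
	 */
	if (hash_get_max_bucket(LockMethodLocalHash) >
		LOCKMETHODLOCALHASH_SHRINK_THRESHOLD &&
		hash_get_num_entries(LockMethodLocalHash) == 0)
		InitLocalLockHash();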
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From: David Rowley [mailto:david.rowley@2ndquadrant.com]
For the use case we've been measuring with partitioned tables and the
generic plan generation causing a sudden spike in the number of
obtained locks, then having plan_cache_mode = force_custom_plan will
cause the lock table not to become bloated. I'm not sure there's
anything interesting to measure there.
I meant the difference between the following two cases, where the query only touches one partition (e.g. SELECT ... WHERE pkey = value):
* plan_cache_mode=force_custom_plan: LocalLockHash won't bloat. The query execution time is steady.
* plan_cache_mode=auto: LocalLockHash bloats on the sixth execution due to the creation of the generic plan. The generic plan is not adopted because its cost is high. Later executions of the query will suffer from the bloat until the 1006th execution when LocalLockHash is shrunk.
Depending on the number of transactions and what each transaction does, I thought the difference will be noticeable or not.
Regards
Takayuki Tsunakawa
On Mon, 22 Jul 2019 at 16:36, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
From: David Rowley [mailto:david.rowley@2ndquadrant.com]
For the use case we've been measuring with partitioned tables and the
generic plan generation causing a sudden spike in the number of
obtained locks, then having plan_cache_mode = force_custom_plan will
cause the lock table not to become bloated. I'm not sure there's
anything interesting to measure there.
I meant the difference between the following two cases, where the query only touches one partition (e.g. SELECT ... WHERE pkey = value):
* plan_cache_mode=force_custom_plan: LocalLockHash won't bloat. The query execution time is steady.
* plan_cache_mode=auto: LocalLockHash bloats on the sixth execution due to the creation of the generic plan. The generic plan is not adopted because its cost is high. Later executions of the query will suffer from the bloat until the 1006th execution when LocalLockHash is shrunk.
I measured this again in
/messages/by-id/CAKJS1f_ycJ-6QTKC70pZRYdwsAwUo+t0_CV0eXC=J-b5A2MegQ@mail.gmail.com
where I posted the v6 patch. It's the final results in the email. I
didn't measure plan_cache_mode = force_custom_plan. There'd be no lock
table bloat in that case and the additional overhead would just be
from the hash_get_num_entries() && hash_get_max_bucket() calls, which
the first results show next to no overhead for.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, 22 Jul 2019 at 12:48, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
From: David Rowley [mailto:david.rowley@2ndquadrant.com]
I went back to the drawing board on this and I've added some code that counts
the number of times we've seen the table to be oversized and just shrinks
the table back down on the 1000th time. 6.93% / 1000 is not all that much.
I'm afraid this kind of hidden behavior would appear mysterious to users. They may wonder "Why is the same query fast at first in the session (5 or 6 times of execution), then gets slower for a while, and gets faster again? Is there something to tune? Am I missing something wrong with my app (e.g. how to use prepared statements)?" So I prefer v5.
Another counter-argument to this is that there's already an
unexplainable slowdown after you run a query which obtains a large
number of locks in a session or use prepared statements and a
partitioned table with the default plan_cache_mode setting. Are we not
just righting a wrong here? Albeit, possibly 1000 queries later.
I am, of course, open to other ideas which solve the problem that v5
has, but failing that, I don't see v6 as all that bad. At least all
the logic is contained in one function. I know Tom wanted to steer
clear of heuristics to reinitialise the table, but most of the stuff
that was in the patch back when he complained was trying to track the
average number of locks over the previous N transactions, and his
concerns were voiced before I showed the 7% performance regression
with unconditionally rebuilding the table.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,
On 2019-07-21 21:37:28 +1200, David Rowley wrote:
select.sql:
\set p 1
select * from ht where a = :p
Master:
$ pgbench -n -f select.sql -T 60 -M prepared postgres
tps = 10172.035036 (excluding connections establishing)
tps = 10192.780529 (excluding connections establishing)
tps = 10331.306003 (excluding connections establishing)
Patched:
$ pgbench -n -f select.sql -T 60 -M prepared postgres
tps = 15080.765549 (excluding connections establishing)
tps = 14994.404069 (excluding connections establishing)
tps = 14982.923480 (excluding connections establishing)
That seems fine, 46% faster.
v6 is attached.
I plan to push this in a few days unless someone objects.
It does seem far less objectionable than the other case. I hate to
throw in one more wrench into a topic finally making progress, but: Have
either of you considered just replacing the dynahash table with a
simplehash style one? Given the obvious speed sensitivity, and the fact
that for it (in contrast to the shared lock table) no partitioning is
needed, that seems like a good thing to try. It seems quite possible
that both the iteration and plain manipulations are going to be faster,
due to far less indirections - e.g. the iteration through the array will
just be an array walk with a known stride, far easier for the CPU to
prefetch.
Greetings,
Andres Freund
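For what it's worth, a rough, untested sketch of what that could look like inside lock.c follows. None of this is from a posted patch: the LocalLockEnt wrapper, the locallockhash prefix, and the choice of hash function are all hypothetical, and simplehash.h requires the element struct to carry a status byte, which LOCALLOCK itself does not have, hence the wrapper.

/* Hypothetical sketch only -- not part of any patch in this thread. */
typedef struct LocalLockEnt
{
	LOCALLOCKTAG tag;		/* key: lock identity plus lock mode */
	char		status;		/* entry status, required by simplehash */
	LOCALLOCK  *locallock;	/* the existing LOCALLOCK data */
} LocalLockEnt;

#define SH_PREFIX		locallockhash
#define SH_ELEMENT_TYPE	LocalLockEnt
#define SH_KEY_TYPE		LOCALLOCKTAG
#define SH_KEY			tag
/*
 * Hash only the LOCKTAG part; SH_EQUAL still compares the whole key, and the
 * tag must be zero-initialized (as the existing code already does) so that
 * memcmp() comparison is safe despite any padding.
 */
#define SH_HASH_KEY(tb, key)	LockTagHashCode(&(key).lock)
#define SH_EQUAL(tb, a, b)		(memcmp(&(a), &(b), sizeof(LOCALLOCKTAG)) == 0)
#define SH_SCOPE		static inline
#define SH_DECLARE
#define SH_DEFINE
#include "lib/simplehash.h"

LockReleaseAll() would then iterate the table with locallockhash_start_iterate()/locallockhash_iterate(), which walks a flat array rather than chasing dynahash bucket chains; whether that actually wins would of course need the same kind of benchmarking as above.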
From: David Rowley [mailto:david.rowley@2ndquadrant.com]
Another counter-argument to this is that there's already an
unexplainable slowdown after you run a query which obtains a large
number of locks in a session or use prepared statements and a
partitioned table with the default plan_cache_mode setting. Are we not
just righting a wrong here? Albeit, possibly 1000 queries later.
I am, of course, open to other ideas which solve the problem that v5
has, but failing that, I don't see v6 as all that bad. At least all
the logic is contained in one function. I know Tom wanted to steer
clear of heuristics to reinitialise the table, but most of the stuff
that was in the patch back when he complained was trying to track the
average number of locks over the previous N transactions, and his
concerns were voiced before I showed the 7% performance regression
with unconditionally rebuilding the table.
I think I understand what you mean. Sorry, I don't have a better idea. This unexplainability is probably what we should accept to avoid the shocking 7% slowdown.
OTOH, how about my original patch that is based on the local lock list? I expect that it won't cause that significant a slowdown in the same test case. If it's not satisfactory, then v6 is the best to commit.
Regards
Takayuki Tsunakawa
On Tue, 23 Jul 2019 at 15:47, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
OTOH, how about my original patch that is based on the local lock list? I expect that it won't cause that significant a slowdown in the same test case. If it's not satisfactory, then v6 is the best to commit.
I think we need to move beyond the versions that have been reviewed
and rejected. I don't think the reasons for their rejection will have
changed.
I've attached v7, which really is v6 with some comments adjusted and
the order of the hash_get_num_entries and hash_get_max_bucket function
calls swapped. I think hash_get_num_entries() will return 0 most of
the time where we're calling it, so it makes sense to put the
condition that's less likely to be true first in the if condition.
I'm pretty happy with v7 now. If anyone has any objections to it,
please speak up very soon. Otherwise, I plan to push it about this
time tomorrow.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
shrink_bloated_locallocktable_v7.patchapplication/octet-stream; name=shrink_bloated_locallocktable_v7.patchDownload
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 1b7053cb1c..e787bc6ee7 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -254,6 +254,22 @@ static HTAB *LockMethodLockHash;
static HTAB *LockMethodProcLockHash;
static HTAB *LockMethodLocalHash;
+/* Initial size of the LockMethodLocalHash table */
+#define LOCKMETHODLOCALHASH_INIT_SIZE 16
+
+/*
+ * If the size of the LockMethodLocalHash table grows beyond this then try
+ * to shrink the table back down to LOCKMETHODLOCALHASH_INIT_SIZE. This must
+ * not be less than LOCKMETHODLOCALHASH_INIT_SIZE
+ */
+#define LOCKMETHODLOCALHASH_SHRINK_THRESHOLD 64
+
+/*
+ * How many times must TryShrinkLocalLockHash() be called while
+ * LockMethodLocalHash has exceeded LOCKMETHODLOCALHASH_SHRINK_THRESHOLD
+ * before we rebuild the hash table.
+ */
+#define LOCKMETHODLOCALHASH_SHRINK_FREQUENCY 1000
/* private state for error cleanup */
static LOCALLOCK *StrongLockInProgress;
@@ -339,6 +355,8 @@ PROCLOCK_PRINT(const char *where, const PROCLOCK *proclockP)
#endif /* not LOCK_DEBUG */
+static void InitLocalLockHash(void);
+static inline void TryShrinkLocalLockHash(void);
static uint32 proclock_hash(const void *key, Size keysize);
static void RemoveLocalLock(LOCALLOCK *locallock);
static PROCLOCK *SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
@@ -431,14 +449,26 @@ InitLocks(void)
if (!found)
SpinLockInit(&FastPathStrongRelationLocks->mutex);
+ InitLocalLockHash();
+}
+
+/*
+ * InitLocalLockHash
+ * Initialize the LockMethodLocalHash hash table.
+ */
+static void
+InitLocalLockHash(void)
+{
+ HASHCTL info;
+
/*
* Allocate non-shared hash table for LOCALLOCK structs. This stores lock
* counts and resource owner information.
*
- * The non-shared table could already exist in this process (this occurs
- * when the postmaster is recreating shared memory after a backend crash).
- * If so, delete and recreate it. (We could simply leave it, since it
- * ought to be empty in the postmaster, but for safety let's zap it.)
+ * First destroy any old table that may exist. We might just be
+ * recreating the table or it could already exist in this process (this
+ * occurs when the postmaster is recreating shared memory after a backend
+ * crash). In either case, delete and recreate it.
*/
if (LockMethodLocalHash)
hash_destroy(LockMethodLocalHash);
@@ -447,11 +477,53 @@ InitLocks(void)
info.entrysize = sizeof(LOCALLOCK);
LockMethodLocalHash = hash_create("LOCALLOCK hash",
- 16,
+ LOCKMETHODLOCALHASH_INIT_SIZE,
&info,
HASH_ELEM | HASH_BLOBS);
}
+/*
+ * TryShrinkLocalLockHash
+ * Rebuild LockMethodLocalHash with its initial size if it has been
+ * enlarged enough to go beyond the defined shrink threshold.
+ *
+ * We only rebuild the table if:
+ *
+ * 1) The max bucket has gone beyond the defined threshold and the table does
+ * not contain any locks, and;
+ *
+ * 2) The number of times the function has been called while meeting case #1
+ * has exceeded the defined frequency.
+ *
+ * Without #2 we may rebuild the table too often and since rebuilding the hash
+ * table is not free, we may slow down workloads that frequently obtain a
+ * large number of locks.
+ */
+static inline void
+TryShrinkLocalLockHash(void)
+{
+ static int ntimes_exceeded = 0;
+
+ /*
+ * 1. Consider shrinking the table whenever the maximum used bucket is
+ * beyond LOCKMETHODLOCALHASH_SHRINK_THRESHOLD and the table is empty.
+ */
+ if (hash_get_max_bucket(LockMethodLocalHash) >
+ LOCKMETHODLOCALHASH_SHRINK_THRESHOLD &&
+ hash_get_num_entries(LockMethodLocalHash) == 0)
+ {
+ /* Increment the number of times we've exceeded the threshold */
+ ntimes_exceeded++;
+
+ /* 2. Shrink if we've exceeded the threshold enough times */
+ if (ntimes_exceeded >= LOCKMETHODLOCALHASH_SHRINK_FREQUENCY)
+ {
+ /* Rebuild the table and zero the counter */
+ InitLocalLockHash();
+ ntimes_exceeded = 0;
+ }
+ }
+}
/*
* Fetch the lock method table associated with a given lock
@@ -2349,6 +2421,13 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
LWLockRelease(partitionLock);
} /* loop over partitions */
+ /*
+ * The hash_seq_search can become inefficient when the hash table has
+ * grown significantly larger than the default size due to the backend
+ * having obtained a large number of locks. Consider shrinking it.
+ */
+ TryShrinkLocalLockHash();
+
#ifdef LOCK_DEBUG
if (*(lockMethodTable->trace_flag))
elog(LOG, "LockReleaseAll done");
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 0dfbec8e3e..d93e5279ee 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -1351,6 +1351,15 @@ hash_get_num_entries(HTAB *hashp)
return sum;
}
+/*
+ * hash_get_max_bucket -- get the maximum used bucket in a hashtable
+ */
+uint32
+hash_get_max_bucket(HTAB *hashp)
+{
+ return hashp->hctl->max_bucket;
+}
+
/*
* hash_seq_init/_search/_term
* Sequentially search through hash table and return
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index fe5ab9c868..941f99398d 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -132,6 +132,7 @@ extern void *hash_search_with_hash_value(HTAB *hashp, const void *keyPtr,
extern bool hash_update_hash_key(HTAB *hashp, void *existingEntry,
const void *newKeyPtr);
extern long hash_get_num_entries(HTAB *hashp);
+extern uint32 hash_get_max_bucket(HTAB *hashp);
extern void hash_seq_init(HASH_SEQ_STATUS *status, HTAB *hashp);
extern void *hash_seq_search(HASH_SEQ_STATUS *status);
extern void hash_seq_term(HASH_SEQ_STATUS *status);
David Rowley <david.rowley@2ndquadrant.com> writes:
I've attached v7, which really is v6 with some comments adjusted and
the order of the hash_get_num_entries and hash_get_max_bucket function
calls swapped. I think hash_get_num_entries() will return 0 most of
the time where we're calling it, so it makes sense to put the
condition that's less likely to be true first in the if condition.
I'm pretty happy with v7 now. If anyone has any objections to it,
please speak up very soon. Otherwise, I plan to push it about this
time tomorrow.
I dunno, this seems close to useless in this form. As it stands,
once hash_get_max_bucket has exceeded the threshold, you will
arbitrarily reset the table 1000 transactions later (since the
max bucket is certainly not gonna decrease otherwise). So that's
got two big problems:
1. In the assumed-common case where most of the transactions take
few locks, you wasted cycles for 999 transactions.
2. You'll reset the table independently of subsequent history,
even if the session's usage is pretty much always over the
threshold. Admittedly, if you do this only once per 1K
transactions, it's probably not a horrible overhead --- but you
can't improve point 1 without making it a bigger overhead.
I did complain about the cost of tracking the stats proposed by
some earlier patches, but I don't think the answer to that is to
not track any stats at all. The proposed hash_get_max_bucket()
function is quite cheap, so I think it wouldn't be out of line to
track the average value of that at transaction end over the
session's lifespan, and reset if the current value is more than
some-multiple of the average.
The tricky part here is that if some xact kicks that up to
say 64 entries, and we don't choose to reset, then the reading
for subsequent transactions will be 64 even if they use very
few locks. So we probably need to not use a plain average,
but account for that effect somehow. Maybe we could look at
how quickly the number goes up after we reset?
[ thinks for awhile... ] As a straw-man proposal, I suggest
the following (completely untested) idea:
* Make the table size threshold variable, not constant.
* If, at end-of-transaction when the table is empty,
the table bucket count exceeds the threshold, reset
immediately; but if it's been less than K transactions
since the last reset, increase the threshold (by say 10%).
I think K can be a constant; somewhere between 10 and 100 would
probably work. At process start, we should act as though the last
reset were more than K transactions ago (i.e., don't increase the
threshold at the first reset).
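A minimal sketch of that idea, completely untested, reusing the v7 names (InitLocalLockHash, hash_get_max_bucket, LOCKMETHODLOCALHASH_SHRINK_THRESHOLD) and an arbitrary K of 50; the K constant and variable names here are made up:

#define LOCALLOCKHASH_RESET_MIN_XACTS	50	/* "K"; arbitrary value */

static uint32 localLockHashThreshold = LOCKMETHODLOCALHASH_SHRINK_THRESHOLD;
static int	xactsSinceReset = LOCALLOCKHASH_RESET_MIN_XACTS; /* as if the last reset were long ago */

static inline void
TryShrinkLocalLockHash(void)
{
	xactsSinceReset++;

	if (hash_get_max_bucket(LockMethodLocalHash) > localLockHashThreshold &&
		hash_get_num_entries(LockMethodLocalHash) == 0)
	{
		/* resetting again too soon: raise the threshold by ~10% */
		if (xactsSinceReset < LOCALLOCKHASH_RESET_MIN_XACTS)
			localLockHashThreshold += localLockHashThreshold / 10;

		/* reset immediately, and remember that we did so */
		InitLocalLockHash();
		xactsSinceReset = 0;
	}
}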
The main advantage this has over v7 is that we don't have the
1000-transaction delay before reset, which ISTM is giving up
much of the benefit of the whole idea. Also, if the session
is consistently using lots of locks, we'll adapt to that after
awhile and not do useless table resets.
regards, tom lane
Thanks for having a look at this.
On Wed, 24 Jul 2019 at 04:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
David Rowley <david.rowley@2ndquadrant.com> writes:
I'm pretty happy with v7 now. If anyone has any objections to it,
please speak up very soon. Otherwise, I plan to push it about this
time tomorrow.
I dunno, this seems close to useless in this form. As it stands,
once hash_get_max_bucket has exceeded the threshold, you will
arbitrarily reset the table 1000 transactions later (since the
max bucket is certainly not gonna decrease otherwise). So that's
got two big problems:
1. In the assumed-common case where most of the transactions take
few locks, you wasted cycles for 999 transactions.
2. You'll reset the table independently of subsequent history,
even if the session's usage is pretty much always over the
threshold. Admittedly, if you do this only once per 1K
transactions, it's probably not a horrible overhead --- but you
can't improve point 1 without making it a bigger overhead.
This is true, but I think you might be overestimating just how much
effort is wasted with #1. We're only seeing this overhead in small,
very-fast-to-execute xacts. In the test case in [1] (/messages/by-id/CAKJS1f_ycJ-6QTKC70pZRYdwsAwUo+t0_CV0eXC=J-b5A2MegQ@mail.gmail.com), I was getting
about 10k TPS unpatched and about 15k patched. This means that, on
average, an unpatched xact takes 100 microseconds and a patched
xact takes 66 microseconds, so the additional time spent doing the
hash_seq_search() must be about 34 microseconds. So we'll waste a
total of 34 *milliseconds* if we wait for 1000 xacts before we reset
the table. With 10k TPS we're going to react to the change in 0.1
seconds.
I think you'd struggle to measure that 1 xact is taking 34
microseconds longer without running a few thousand queries. My view is
that nobody is ever going to notice that it takes 1k xacts to shrink
the table, and I've already shown that the overhead of the shrink
every 1k xacts is tiny. I mentioned 0.34% in [1] using v6. This is
likely a bit smaller in v7 due to swapping the order of the if
condition to put the less likely case first. Since the overhead of
rebuilding the table was 7% when done every xact, then it stands to
reason that this has become 0.007% doing it every 1k xacts and that the
additional overhead to make up that 0.34% is from testing if the reset
condition has been met (or noise). That's not something we can remove
completely with any solution that resets the hash table.
I did complain about the cost of tracking the stats proposed by
some earlier patches, but I don't think the answer to that is to
not track any stats at all. The proposed hash_get_max_bucket()
function is quite cheap, so I think it wouldn't be out of line to
track the average value of that at transaction end over the
session's lifespan, and reset if the current value is more than
some-multiple of the average.
The tricky part here is that if some xact kicks that up to
say 64 entries, and we don't choose to reset, then the reading
for subsequent transactions will be 64 even if they use very
few locks. So we probably need to not use a plain average,
but account for that effect somehow. Maybe we could look at
how quickly the number goes up after we reset?
[ thinks for awhile... ] As a straw-man proposal, I suggest
the following (completely untested) idea:
* Make the table size threshold variable, not constant.
* If, at end-of-transaction when the table is empty,
the table bucket count exceeds the threshold, reset
immediately; but if it's been less than K transactions
since the last reset, increase the threshold (by say 10%).
I think K can be a constant; somewhere between 10 and 100 would
probably work. At process start, we should act as though the last
reset were more than K transactions ago (i.e., don't increase the
threshold at the first reset).
I think the problem with this idea is that once the threshold has
been enlarged, there is no way to recover and work well again for
workloads that require very few locks. If we end up with some
large value for the variable threshold, there's no way to decrease
it again. All this method stops is the needless hash table resets
if the typical case requires many locks. The only way to know if we
can reduce the threshold again is to count the locks released during
LockReleaseAll(). Some version of the patch did that, and you
objected.
The main advantage this has over v7 is that we don't have the
1000-transaction delay before reset, which ISTM is giving up
much of the benefit of the whole idea. Also, if the session
is consistently using lots of locks, we'll adapt to that after
awhile and not do useless table resets.
True, but you neglected to mention the looming and critical drawback,
which pretty much makes that idea useless. All we'd need to do is give
this a workload that throws that variable threshold up high so that it
can't recover. It would be pretty simple then to show that
LockReleaseAll() is still slow with workloads that just require a
small number of locks... permanently with no means to recover.
To be able to reduce the threshold down again we'd need to make a
hash_get_num_entries(LockMethodLocalHash) call before performing the
guts of LockReleaseAll(). We could then weight that onto some running
average counter with a weight of, say... 10, so we react to changes
fairly quickly, but not instantly. We could then have some sort of
logic like "rebuild the hash table if running average 4 times less
than max_bucket"
I've attached a spreadsheet of that idea and the algorithm we could
use to track the running average. Initially, I've mocked it up a
series of transactions that use 1000 locks, then at row 123 dropped
that to 10 locks. If we assume the max_bucket is 1000, then it takes
until row 136 for the running average to drop below the max_bucket
count, i.e. 13 xacts. At that point we'd reset the table to the init size of 16. If
the average went up again, then we'd automatically expand the table as
we do now. To make this work we'd need an additional call to
hash_get_num_entries(), before we release the locks, so there is more
overhead.
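(For anyone without the spreadsheet handy, here is a tiny standalone simulation of that running average. The starting values are assumed from the description above, not taken from any posted patch: xacts had been taking ~1000 locks, max_bucket is assumed to be 1000, and the workload suddenly drops to 10 locks per xact.)

#include <stdio.h>

int
main(void)
{
	double	running_avg_locks = 1000.0;	/* prior xacts took ~1000 locks */
	int		max_bucket = 1000;			/* assumed bloated bucket count */
	int		xact;

	for (xact = 1; xact <= 30; xact++)
	{
		/* running average with a weight of 10, fed 10 locks per xact */
		running_avg_locks -= running_avg_locks / 10.0;
		running_avg_locks += 10 / 10.0;
		/* shrink condition: max_bucket more than 4x the running average */
		if (running_avg_locks * 4 < max_bucket)
		{
			printf("would reset after %d small xacts (avg %.1f)\n",
				   xact, running_avg_locks);
			break;
		}
	}
	return 0;
}

With these assumed starting values the condition trips after roughly a dozen small xacts, the same ballpark as the 13 rows seen in the spreadsheet.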
[1]: /messages/by-id/CAKJS1f_ycJ-6QTKC70pZRYdwsAwUo+t0_CV0eXC=J-b5A2MegQ@mail.gmail.com
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
On Wed, 24 Jul 2019 at 15:05, David Rowley <david.rowley@2ndquadrant.com> wrote:
To be able to reduce the threshold down again we'd need to make a
hash_get_num_entries(LockMethodLocalHash) call before performing the
guts of LockReleaseAll(). We could then weight that onto some running
average counter with a weight of, say... 10, so we react to changes
fairly quickly, but not instantly. We could then have some sort of
logic like "rebuild the hash table if running average 4 times less
than max_bucket"
I've attached a spreadsheet of that idea and the algorithm we could
use to track the running average. Initially, I've mocked it up a
series of transactions that use 1000 locks, then at row 123 dropped
that to 10 locks. If we assume the max_bucket is 1000, then it takes
until row 136 for the running average to drop below the max_bucket
count, i.e 13 xacts. There we'd reset there at the init size of 16. If
the average went up again, then we'd automatically expand the table as
we do now. To make this work we'd need an additional call to
hash_get_num_entries(), before we release the locks, so there is more
overhead.
Here's a patch with this implemented. I've left a NOTICE in there to
make it easier for people to follow along at home and see when the
lock table is reset.
There will be a bit of additional overhead to the reset detection
logic over the v7 patch. Namely: additional hash_get_num_entries()
call before releasing the locks, and more complex floating-point
maths instead of the simpler integer maths in v7.
Here's a demo with the debug NOTICE in there to show us what's going on.
-- setup
create table a (a int) partition by range (a);
select 'create table a'||x||' partition of a for values from('||x||')
to ('||x+1||');' from generate_Series(1,1000) x;
\gexec
$ psql postgres
NOTICE: max_bucket = 15, threshold = 64.000000, running_avg_locks
0.100000 Reset? No
psql (13devel)
# \o /dev/null
# select * from a where a < 100;
NOTICE: max_bucket = 101, threshold = 64.000000, running_avg_locks
10.090000 Reset? Yes
# select * from a where a < 100;
NOTICE: max_bucket = 99, threshold = 76.324005, running_avg_locks
19.081001 Reset? Yes
# select * from a where a < 100;
A couple of needless resets there... Maybe we can get rid of those by
setting the initial running average up to something higher than 0.0.
NOTICE: max_bucket = 99, threshold = 108.691605, running_avg_locks
27.172901 Reset? No
# select * from a where a < 100;
NOTICE: max_bucket = 99, threshold = 137.822449, running_avg_locks
34.455612 Reset? No
# select * from a where a < 100;
NOTICE: max_bucket = 99, threshold = 164.040207, running_avg_locks
41.010052 Reset? No
# select * from a where a < 100;
NOTICE: max_bucket = 99, threshold = 187.636185, running_avg_locks
46.909046 Reset? No
# select * from a where a < 100;
NOTICE: max_bucket = 99, threshold = 208.872559, running_avg_locks
52.218140 Reset? No
# select * from a where a < 100;
NOTICE: max_bucket = 99, threshold = 227.985306, running_avg_locks
56.996326 Reset? No
# select * from a where a < 100;
NOTICE: max_bucket = 99, threshold = 245.186768, running_avg_locks
61.296692 Reset? No
# select * from a where a < 100;
NOTICE: max_bucket = 99, threshold = 260.668091, running_avg_locks
65.167023 Reset? No
# select * from a where a < 100;
NOTICE: max_bucket = 99, threshold = 274.601288, running_avg_locks
68.650322 Reset? No
# select * from a where a < 100;
NOTICE: max_bucket = 99, threshold = 287.141174, running_avg_locks
71.785294 Reset? No
# select * from a where a < 100;
NOTICE: max_bucket = 99, threshold = 298.427063, running_avg_locks
74.606766 Reset? No
# select * from a where a < 100;
NOTICE: max_bucket = 99, threshold = 308.584351, running_avg_locks
77.146088 Reset? No
# select * from a where a < 100;
NOTICE: max_bucket = 99, threshold = 317.725922, running_avg_locks
79.431480 Reset? No
# select * from a where a < 100;
NOTICE: max_bucket = 99, threshold = 325.953339, running_avg_locks
81.488335 Reset? No
# select * from a where a < 100;
NOTICE: max_bucket = 99, threshold = 333.358002, running_avg_locks
83.339500 Reset? No
# select * from a where a < 100;
NOTICE: max_bucket = 99, threshold = 340.022217, running_avg_locks
85.005554 Reset? No
# select * from a where a < 100;
NOTICE: max_bucket = 99, threshold = 346.019989, running_avg_locks
86.504997 Reset? No
# select * from a where a < 100;
NOTICE: max_bucket = 99, threshold = 351.417999, running_avg_locks
87.854500 Reset? No
# select * from a where a < 100;
NOTICE: max_bucket = 99, threshold = 356.276184, running_avg_locks
89.069046 Reset? No
# select * from a where a < 100;
NOTICE: max_bucket = 99, threshold = 360.648560, running_avg_locks
90.162140 Reset? No
# select * from a where a < 100;
NOTICE: max_bucket = 99, threshold = 364.583710, running_avg_locks
91.145927 Reset? No
# select * from a where a < 100;
NOTICE: max_bucket = 99, threshold = 368.125336, running_avg_locks
92.031334 Reset? No
# select * from a where a < 100;
NOTICE: max_bucket = 99, threshold = 371.312805, running_avg_locks
92.828201 Reset? No
# select * from a where a < 100;
NOTICE: max_bucket = 99, threshold = 374.181519, running_avg_locks
93.545380 Reset? No
# select * from a where a < 100;
NOTICE: max_bucket = 99, threshold = 376.763367, running_avg_locks
94.190842 Reset? No
# select * from a where a < 100;
NOTICE: max_bucket = 99, threshold = 379.087036, running_avg_locks
94.771759 Reset? No
# select * from a where a < 100;
NOTICE: max_bucket = 99, threshold = 381.178345, running_avg_locks
95.294586 Reset? No
# select * from a where a < 100;
NOTICE: max_bucket = 99, threshold = 383.060516, running_avg_locks
95.765129 Reset? No
# select * from a where a < 100;
NOTICE: max_bucket = 99, threshold = 384.754456, running_avg_locks
96.188614 Reset? No
# select * from a where a < 100;
NOTICE: max_bucket = 99, threshold = 386.279022, running_avg_locks
96.569756 Reset? No
-- Here I switch to only selecting from 9 partitions instead of 99.
# select * from a where a < 10;
NOTICE: max_bucket = 99, threshold = 351.651123, running_avg_locks
87.912781 Reset? No
# select * from a where a < 10;
NOTICE: max_bucket = 99, threshold = 320.486023, running_avg_locks
80.121506 Reset? No
# select * from a where a < 10;
NOTICE: max_bucket = 99, threshold = 292.437408, running_avg_locks
73.109352 Reset? No
# select * from a where a < 10;
NOTICE: max_bucket = 99, threshold = 267.193665, running_avg_locks
66.798416 Reset? No
# select * from a where a < 10;
NOTICE: max_bucket = 99, threshold = 244.474304, running_avg_locks
61.118576 Reset? No
# select * from a where a < 10;
NOTICE: max_bucket = 99, threshold = 224.026871, running_avg_locks
56.006718 Reset? No
# select * from a where a < 10;
NOTICE: max_bucket = 99, threshold = 205.624176, running_avg_locks
51.406044 Reset? No
# select * from a where a < 10;
NOTICE: max_bucket = 99, threshold = 189.061752, running_avg_locks
47.265438 Reset? No
# select * from a where a < 10;
NOTICE: max_bucket = 99, threshold = 174.155579, running_avg_locks
43.538895 Reset? No
# select * from a where a < 10;
NOTICE: max_bucket = 99, threshold = 160.740021, running_avg_locks
40.185005 Reset? No
# select * from a where a < 10;
NOTICE: max_bucket = 99, threshold = 148.666016, running_avg_locks
37.166504 Reset? No
# select * from a where a < 10;
NOTICE: max_bucket = 99, threshold = 137.799408, running_avg_locks
34.449852 Reset? No
# select * from a where a < 10;
NOTICE: max_bucket = 99, threshold = 128.019470, running_avg_locks
32.004868 Reset? No
# select * from a where a < 10;
NOTICE: max_bucket = 99, threshold = 119.217522, running_avg_locks
29.804380 Reset? No
# select * from a where a < 10;
NOTICE: max_bucket = 99, threshold = 111.295769, running_avg_locks
27.823942 Reset? No
# select * from a where a < 10;
NOTICE: max_bucket = 99, threshold = 104.166191, running_avg_locks
26.041548 Reset? No
# select * from a where a < 10;
NOTICE: max_bucket = 99, threshold = 97.749573, running_avg_locks
24.437393 Reset? Yes
It took 17 xacts to react to the change and reset the lock table.
# select * from a where a < 10;
NOTICE: max_bucket = 15, threshold = 91.974617, running_avg_locks
22.993654 Reset? No
Notice that max_bucket is back at 15 again.
Any thoughts on this?
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
shrink_bloated_locallocktable_v8_demo.patchapplication/octet-stream; name=shrink_bloated_locallocktable_v8_demo.patchDownload
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 1b7053cb1c..7e916a883d 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -254,6 +254,14 @@ static HTAB *LockMethodLockHash;
static HTAB *LockMethodProcLockHash;
static HTAB *LockMethodLocalHash;
+/* Initial size of the LockMethodLocalHash table */
+#define LOCKMETHODLOCALHASH_INIT_SIZE 16
+
+/*
+ * Multiplier of hash table size to control when we should shrink the
+ * LockMethodLocalHash back down to the initial size.
+ */
+#define LOCKMETHODLOCALHASH_SHRINK_MULTIPLIER 4.0
/* private state for error cleanup */
static LOCALLOCK *StrongLockInProgress;
@@ -339,6 +347,8 @@ PROCLOCK_PRINT(const char *where, const PROCLOCK *proclockP)
#endif /* not LOCK_DEBUG */
+static void InitLocalLockHash(void);
+static inline void TryShrinkLocalLockHash(void);
static uint32 proclock_hash(const void *key, Size keysize);
static void RemoveLocalLock(LOCALLOCK *locallock);
static PROCLOCK *SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
@@ -431,14 +441,26 @@ InitLocks(void)
if (!found)
SpinLockInit(&FastPathStrongRelationLocks->mutex);
+ InitLocalLockHash();
+}
+
+/*
+ * InitLocalLockHash
+ * Initialize the LockMethodLocalHash hash table.
+ */
+static void
+InitLocalLockHash(void)
+{
+ HASHCTL info;
+
/*
* Allocate non-shared hash table for LOCALLOCK structs. This stores lock
* counts and resource owner information.
*
- * The non-shared table could already exist in this process (this occurs
- * when the postmaster is recreating shared memory after a backend crash).
- * If so, delete and recreate it. (We could simply leave it, since it
- * ought to be empty in the postmaster, but for safety let's zap it.)
+ * First destroy any old table that may exist. We might just be
+ * recreating the table or it could already exist in this process (this
+ * occurs when the postmaster is recreating shared memory after a backend
+ * crash). In either case, delete and recreate it.
*/
if (LockMethodLocalHash)
hash_destroy(LockMethodLocalHash);
@@ -447,11 +469,59 @@ InitLocks(void)
info.entrysize = sizeof(LOCALLOCK);
LockMethodLocalHash = hash_create("LOCALLOCK hash",
- 16,
+ LOCKMETHODLOCALHASH_INIT_SIZE,
&info,
HASH_ELEM | HASH_BLOBS);
}
+/*
+ * TryShrinkLocalLockHash
+ * Rebuild LockMethodLocalHash with its initial size if it has grown
+ * significantly larger than the average locks that recent xacts have
+ * been obtaining.
+ *
+ * 'numLocksHeld' is the number of locks that was held during this xact.
+ */
+static inline void
+TryShrinkLocalLockHash(long numLocksHeld)
+{
+ static float running_avg_locks = 0.0;
+
+ /*
+ * Only consider shrinking if there's actually zero locks in the table.
+ * (Session level locks will remain after end of xact.)
+ */
+ if (hash_get_num_entries(LockMethodLocalHash) == 0)
+ {
+ uint32 max_bucket;
+ float threshold;
+
+ /*
+ * Calculate an approximate running average of the number of locks.
+ * Here the constant 10.0 controls the "reaction rate" of the average.
+ * Higher values will have it react more slowly, lower values will
+ * cause it to react more quickly to changes in the number of locks.
+ */
+ running_avg_locks -= running_avg_locks / 10.0;
+ running_avg_locks += numLocksHeld / 10.0;
+
+ max_bucket = hash_get_max_bucket(LockMethodLocalHash);
+
+ /*
+ * Don't shrink unless the table is N times larger than its initial
+ * size, and N times larger than the running_avg_lock count. This
+ * ensures we don't shrink it unless its worth doing.
+ */
+ threshold = Max(LOCKMETHODLOCALHASH_INIT_SIZE *
+ LOCKMETHODLOCALHASH_SHRINK_MULTIPLIER,
+ running_avg_locks * 4);
+
+ elog(NOTICE, "max_bucket = %u, threshold = %f, running_avg_locks %f Reset? %s", max_bucket, threshold, running_avg_locks, max_bucket > threshold ? "Yes" : "No");
+ /* Rebuild the table if the max_bucket is beyond the threshold */
+ if (max_bucket > threshold)
+ InitLocalLockHash();
+ }
+}
/*
* Fetch the lock method table associated with a given lock
@@ -2095,6 +2165,7 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
LOCALLOCK *locallock;
LOCK *lock;
PROCLOCK *proclock;
+ long numLocksHeld;
int partition;
bool have_fast_path_lwlock = false;
@@ -2114,10 +2185,16 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
* ends.
*/
if (lockmethodid == DEFAULT_LOCKMETHOD)
+ {
VirtualXactLockTableCleanup();
+ /* Record the number of locks currently held */
+ numLocksHeld = hash_get_num_entries(LockMethodLocalHash);
+ }
+
numLockModes = lockMethodTable->numLockModes;
+
/*
* First we run through the locallock table and get rid of unwanted
* entries, then we scan the process's proclocks and get rid of those. We
@@ -2349,6 +2426,14 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
LWLockRelease(partitionLock);
} /* loop over partitions */
+ /*
+ * The hash_seq_search can become inefficient when the hash table has
+ * grown significantly larger than the default size due to the backend
+ * having obtained a large number of locks. Consider shrinking it.
+ */
+ if (lockmethodid == DEFAULT_LOCKMETHOD)
+ TryShrinkLocalLockHash(numLocksHeld);
+
#ifdef LOCK_DEBUG
if (*(lockMethodTable->trace_flag))
elog(LOG, "LockReleaseAll done");
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 0dfbec8e3e..d93e5279ee 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -1351,6 +1351,15 @@ hash_get_num_entries(HTAB *hashp)
return sum;
}
+/*
+ * hash_get_max_bucket -- get the maximum used bucket in a hashtable
+ */
+uint32
+hash_get_max_bucket(HTAB *hashp)
+{
+ return hashp->hctl->max_bucket;
+}
+
/*
* hash_seq_init/_search/_term
* Sequentially search through hash table and return
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index fe5ab9c868..941f99398d 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -132,6 +132,7 @@ extern void *hash_search_with_hash_value(HTAB *hashp, const void *keyPtr,
extern bool hash_update_hash_key(HTAB *hashp, void *existingEntry,
const void *newKeyPtr);
extern long hash_get_num_entries(HTAB *hashp);
+extern uint32 hash_get_max_bucket(HTAB *hashp);
extern void hash_seq_init(HASH_SEQ_STATUS *status, HTAB *hashp);
extern void *hash_seq_search(HASH_SEQ_STATUS *status);
extern void hash_seq_term(HASH_SEQ_STATUS *status);
On Wed, 24 Jul 2019 at 16:16, David Rowley <david.rowley@2ndquadrant.com> wrote:
On Wed, 24 Jul 2019 at 15:05, David Rowley <david.rowley@2ndquadrant.com> wrote:
To be able to reduce the threshold down again we'd need to make a
hash_get_num_entries(LockMethodLocalHash) call before performing the
guts of LockReleaseAll(). We could then weight that onto some running
average counter with a weight of, say... 10, so we react to changes
fairly quickly, but not instantly. We could then have some sort of
logic like "rebuild the hash table if running average 4 times less
than max_bucket"
I've attached a spreadsheet of that idea and the algorithm we could
use to track the running average. Initially, I've mocked it up a
series of transactions that use 1000 locks, then at row 123 dropped
that to 10 locks. If we assume the max_bucket is 1000, then it takes
until row 136 for the running average to drop below the max_bucket
count, i.e 13 xacts. There we'd reset there at the init size of 16. If
the average went up again, then we'd automatically expand the table as
we do now. To make this work we'd need an additional call to
hash_get_num_entries(), before we release the locks, so there is more
overhead.
Here's a patch with this implemented. I've left a NOTICE in there to
make it easier for people to follow along at home and see when the
lock table is reset.
Here's a more polished version with the debug code removed, complete
with benchmarks.
-- Test 1. Select 1 record from a 140 partitioned table. Tests
creating a large number of locks with a fast query.
create table hp (a int, b int) partition by hash(a);
select 'create table hp'||x||' partition of hp for values with
(modulus 140, remainder ' || x || ');' from generate_series(0,139)x;
create index on hp (b);
insert into hp select x,x from generate_series(1, 140000) x;
analyze hp;
select3.sql: select * from hp where b = 1
-
Master:
$ pgbench -n -f select3.sql -T 60 -M prepared postgres
tps = 2098.628975 (excluding connections establishing)
tps = 2101.797672 (excluding connections establishing)
tps = 2085.317292 (excluding connections establishing)
tps = 2094.931999 (excluding connections establishing)
tps = 2092.328908 (excluding connections establishing)
Patched:
$ pgbench -n -f select3.sql -T 60 -M prepared postgres
tps = 2101.691459 (excluding connections establishing)
tps = 2104.533249 (excluding connections establishing)
tps = 2106.499123 (excluding connections establishing)
tps = 2104.033459 (excluding connections establishing)
tps = 2105.463629 (excluding connections establishing)
(I'm surprised there is not more overhead in the additional tracking
added to calculate the running average)
-- Test 2. Tests a prepared query which will perform a generic plan on
the 6th execution then fallback on a custom plan due to it pruning all
but one partition. Master suffers from the lock table becoming
bloated after locking all partitions when planning the generic plan.
create table ht (a int primary key, b int, c int) partition by hash (a);
select 'create table ht' || x::text || ' partition of ht for values
with (modulus 8192, remainder ' || (x)::text || ');' from
generate_series(0,8191) x;
\gexec
select.sql:
\set p 1
select * from ht where a = :p
Master:
$ pgbench -n -f select.sql -T 60 -M prepared postgres
tps = 10207.780843 (excluding connections establishing)
tps = 10205.772688 (excluding connections establishing)
tps = 10214.896449 (excluding connections establishing)
tps = 10157.572153 (excluding connections establishing)
tps = 10147.764477 (excluding connections establishing)
Patched:
$ pgbench -n -f select.sql -T 60 -M prepared postgres
tps = 14677.636570 (excluding connections establishing)
tps = 14661.437186 (excluding connections establishing)
tps = 14647.202877 (excluding connections establishing)
tps = 14784.165753 (excluding connections establishing)
tps = 14850.355344 (excluding connections establishing)
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
shrink_bloated_locallocktable_v8.patchapplication/octet-stream; name=shrink_bloated_locallocktable_v8.patchDownload
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 1b7053cb1c..83a9d7ae58 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -254,6 +254,14 @@ static HTAB *LockMethodLockHash;
static HTAB *LockMethodProcLockHash;
static HTAB *LockMethodLocalHash;
+/* Initial size of the LockMethodLocalHash table */
+#define LOCKMETHODLOCALHASH_INIT_SIZE 16
+
+/*
+ * If we see that the LockMethodLocalHash table is this many times bigger than
+ * it needs to be, then we'll shrink it down to LOCKMETHODLOCALHASH_INIT_SIZE.
+ */
+#define LOCKMETHODLOCALHASH_SHRINK_MULTIPLIER 4.0
/* private state for error cleanup */
static LOCALLOCK *StrongLockInProgress;
@@ -339,6 +347,8 @@ PROCLOCK_PRINT(const char *where, const PROCLOCK *proclockP)
#endif /* not LOCK_DEBUG */
+static void InitLocalLockHash(void);
+static inline void TryShrinkLocalLockHash(void);
static uint32 proclock_hash(const void *key, Size keysize);
static void RemoveLocalLock(LOCALLOCK *locallock);
static PROCLOCK *SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
@@ -431,14 +441,26 @@ InitLocks(void)
if (!found)
SpinLockInit(&FastPathStrongRelationLocks->mutex);
+ InitLocalLockHash();
+}
+
+/*
+ * InitLocalLockHash
+ * Initialize the LockMethodLocalHash hash table.
+ */
+static void
+InitLocalLockHash(void)
+{
+ HASHCTL info;
+
/*
* Allocate non-shared hash table for LOCALLOCK structs. This stores lock
* counts and resource owner information.
*
- * The non-shared table could already exist in this process (this occurs
- * when the postmaster is recreating shared memory after a backend crash).
- * If so, delete and recreate it. (We could simply leave it, since it
- * ought to be empty in the postmaster, but for safety let's zap it.)
+ * First destroy any old table that may exist. We might just be
+ * recreating the table or it could already exist in this process (this
+ * occurs when the postmaster is recreating shared memory after a backend
+ * crash). In either case, delete and recreate it.
*/
if (LockMethodLocalHash)
hash_destroy(LockMethodLocalHash);
@@ -447,11 +469,65 @@ InitLocks(void)
info.entrysize = sizeof(LOCALLOCK);
LockMethodLocalHash = hash_create("LOCALLOCK hash",
- 16,
+ LOCKMETHODLOCALHASH_INIT_SIZE,
&info,
HASH_ELEM | HASH_BLOBS);
}
+/*
+ * TryShrinkLocalLockHash
+ * Rebuild LockMethodLocalHash with its initial size if it has grown
+ * significantly larger than the average locks that recent xacts have
+ * been obtaining.
+ *
+ * 'numLocksHeld' is the number of locks that was held during this xact.
+ */
+static inline void
+TryShrinkLocalLockHash(long numLocksHeld)
+{
+ /*
+ * Start the running average at something above zero so we don't rebuild
+ * the lock table until we've gotten a more meaningful running average.
+ * Twice the LOCKMETHODLOCALHASH_INIT_SIZE should do the trick.
+ */
+ static float running_avg_locks = LOCKMETHODLOCALHASH_INIT_SIZE * 2.0;
+
+ /*
+ * Only consider shrinking if there's actually zero locks in the table.
+ * (Session level locks will remain after COMMIT)
+ */
+ if (hash_get_num_entries(LockMethodLocalHash) == 0)
+ {
+ uint32 max_bucket;
+ float threshold;
+
+ /*
+ * Calculate an approximate running average of the number of locks.
+ * Here the constant 10.0 controls the "reaction rate" of the average.
+ * Higher values will have it react more slowly, lower values will
+ * cause it to react more quickly to changes in the number of locks.
+ */
+ running_avg_locks -= running_avg_locks / 10.0;
+ running_avg_locks += numLocksHeld / 10.0;
+
+ max_bucket = hash_get_max_bucket(LockMethodLocalHash);
+
+ /*
+ * Don't shrink unless the table is N times larger than its initial
+ * size, and N times larger than the running_avg_locks counter. This
+ * ensure we don't shrink it unless the table is at least N times
+ * bigger than it needs to be.
+ */
+ threshold = Max(LOCKMETHODLOCALHASH_INIT_SIZE *
+ LOCKMETHODLOCALHASH_SHRINK_MULTIPLIER,
+ running_avg_locks *
+ LOCKMETHODLOCALHASH_SHRINK_MULTIPLIER);
+
+ /* Rebuild the table if the max_bucket is beyond the threshold */
+ if (max_bucket > threshold)
+ InitLocalLockHash();
+ }
+}
/*
* Fetch the lock method table associated with a given lock
@@ -2095,6 +2171,7 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
LOCALLOCK *locallock;
LOCK *lock;
PROCLOCK *proclock;
+ long numLocksHeld;
int partition;
bool have_fast_path_lwlock = false;
@@ -2114,10 +2191,16 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
* ends.
*/
if (lockmethodid == DEFAULT_LOCKMETHOD)
+ {
VirtualXactLockTableCleanup();
+ /* Record the number of locks currently held */
+ numLocksHeld = hash_get_num_entries(LockMethodLocalHash);
+ }
+
numLockModes = lockMethodTable->numLockModes;
+
/*
* First we run through the locallock table and get rid of unwanted
* entries, then we scan the process's proclocks and get rid of those. We
@@ -2349,6 +2432,14 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
LWLockRelease(partitionLock);
} /* loop over partitions */
+ /*
+ * The hash_seq_search can become inefficient when the hash table has
+ * grown significantly larger than the default size due to the backend
+ * having obtained a large number of locks. Consider shrinking it.
+ */
+ if (lockmethodid == DEFAULT_LOCKMETHOD)
+ TryShrinkLocalLockHash(numLocksHeld);
+
#ifdef LOCK_DEBUG
if (*(lockMethodTable->trace_flag))
elog(LOG, "LockReleaseAll done");
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 0dfbec8e3e..d93e5279ee 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -1351,6 +1351,15 @@ hash_get_num_entries(HTAB *hashp)
return sum;
}
+/*
+ * hash_get_max_bucket -- get the maximum used bucket in a hashtable
+ */
+uint32
+hash_get_max_bucket(HTAB *hashp)
+{
+ return hashp->hctl->max_bucket;
+}
+
/*
* hash_seq_init/_search/_term
* Sequentially search through hash table and return
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index fe5ab9c868..941f99398d 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -132,6 +132,7 @@ extern void *hash_search_with_hash_value(HTAB *hashp, const void *keyPtr,
extern bool hash_update_hash_key(HTAB *hashp, void *existingEntry,
const void *newKeyPtr);
extern long hash_get_num_entries(HTAB *hashp);
+extern uint32 hash_get_max_bucket(HTAB *hashp);
extern void hash_seq_init(HASH_SEQ_STATUS *status, HTAB *hashp);
extern void *hash_seq_search(HASH_SEQ_STATUS *status);
extern void hash_seq_term(HASH_SEQ_STATUS *status);
David Rowley <david.rowley@2ndquadrant.com> writes:
Here's a more polished version with the debug code removed, complete
with benchmarks.
A few gripes:
You're measuring the number of locks held at completion of the
transaction, which fails to account for locks transiently taken and
released, so that the actual peak usage will be more. I think we mostly
only do that for system catalog accesses, so *maybe* the number of extra
locks involved isn't very large, but that's a very shaky assumption.
I don't especially like the fact that this seems to have a hard-wired
(and undocumented) assumption that buckets == entries, ie that the
fillfactor of the table is set at 1.0. lock.c has no business knowing
that. Perhaps instead of returning the raw bucket count, you could have
dynahash.c return bucket count times fillfactor, so that the number is in
the same units as the entry count.
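(A minimal sketch of that suggestion, with a hypothetical function name: it would live in dynahash.c next to hash_get_max_bucket(), and it assumes the HASHHDR still carries the table's target fill factor in its ffactor field, as it did at the time.)

uint32
hash_get_max_bucket_entries(HTAB *hashp)
{
	/*
	 * Scale the bucket count by the table's fill factor so the result is
	 * in the same units as hash_get_num_entries().
	 */
	return (uint32) ((hashp->hctl->max_bucket + 1) * hashp->hctl->ffactor);
}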
This:
running_avg_locks -= running_avg_locks / 10.0;
running_avg_locks += numLocksHeld / 10.0;
seems like a weird way to do the calculation. Personally I'd write
running_avg_locks += (numLocksHeld - running_avg_locks) / 10.0;
which is more the way I'm used to seeing exponential moving averages
computed --- at least, it seems clearer to me why this will converge
towards the average value of numLocksHeld over time. It also makes
it clear that it wouldn't be sane to use two different divisors,
whereas your formulation makes it look like maybe they could be
set independently.
Your compiler might not complain that LockReleaseAll's numLocksHeld
is potentially uninitialized, but other people's compilers will.
On the whole, I don't especially like this approach, because of the
confusion between peak lock count and end-of-xact lock count. That
seems way too likely to cause problems.
regards, tom lane
On Thu, Jul 25, 2019 at 5:49 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
David Rowley <david.rowley@2ndquadrant.com> writes:
Here's a more polished version with the debug code removed, complete
with benchmarks.
A few gripes:
[gripes]
Based on the above, I've set this to "Waiting on Author", in the next CF.
--
Thomas Munro
https://enterprisedb.com
On Thu, 25 Jul 2019 at 05:49, Tom Lane <tgl@sss.pgh.pa.us> wrote:
On the whole, I don't especially like this approach, because of the
confusion between peak lock count and end-of-xact lock count. That
seems way too likely to cause problems.
Thanks for having a look at this. I've not addressed the points
you've mentioned due to what you mention above. The only way I can
think of so far to resolve that would be to add something to track
peak lock usage. The best I can think of to do that, short of adding
something to dynahash.c is to check how many locks are held each time
we obtain a lock, then if that count is higher than the previous time
we checked, then update the maximum locks held, (probably a global
variable). That seems pretty horrible to me and adds overhead each
time we obtain a lock, which is a pretty performance-critical path.
I've not tested what Andres mentioned about simplehash instead of
dynahash yet. I did a quick scan of simplehash and it looked like
SH_START_ITERATE would suffer the same problems as dynahash's
hash_seq_search(), albeit, perhaps with some more efficient memory
lookups. i.e it still has to skip over empty buckets, which might be
costly in a bloated table.
For now, I'm out of ideas. If anyone else feels like suggesting
something of picking this up, feel free.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Aug 14, 2019 at 07:25:10PM +1200, David Rowley wrote:
On Thu, 25 Jul 2019 at 05:49, Tom Lane <tgl@sss.pgh.pa.us> wrote:
On the whole, I don't especially like this approach, because of the
confusion between peak lock count and end-of-xact lock count. That
seems way too likely to cause problems.
Thanks for having a look at this. I've not addressed the points
you've mentioned due to what you mention above. The only way I can
think of so far to resolve that would be to add something to track
peak lock usage. The best I can think of to do that, short of adding
something to dynahash.c is to check how many locks are held each time
we obtain a lock, then if that count is higher than the previous time
we checked, then update the maximum locks held, (probably a global
variable). That seems pretty horrible to me and adds overhead each
time we obtain a lock, which is a pretty performance-critical path.
Would it really be a measurable overhead? I mean, we only really need
one int counter, and you don't need to do the check on every lock
acquisition - you just need to recheck on the first lock release. But
maybe I'm underestimating how expensive it is ...
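(Presumably something along these lines is what's meant -- hypothetical names, not from any posted patch: one counter bumped whenever a local lock entry is created, with the peak captured by a single cheap comparison on the release side before the count starts to fall.)

static long numLocalLocks = 0;
static long peakLocalLocks = 0;

static inline void
LocalLockCreated(void)
{
	numLocalLocks++;
}

static inline void
LocalLockRemoved(void)
{
	/* capture the peak before the count starts dropping */
	if (numLocalLocks > peakLocalLocks)
		peakLocalLocks = numLocalLocks;
	numLocalLocks--;
}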
Talking about dynahash - doesn't it already track this information?
Maybe not directly but surely it has to track the number of entries in
the hash table, in order to compute fill factor. Can't we piggy-back on
that and track the highest fill-factor for a particular period of time?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2019-Aug-14, David Rowley wrote:
For now, I'm out of ideas. If anyone else feels like suggesting
something of picking this up, feel free.
Hmm ... is this patch rejected, or is somebody still trying to get it to
committable state? David, you're listed as committer.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Alvaro Herrera [mailto:alvherre@2ndquadrant.com]
Hmm ... is this patch rejected, or is somebody still trying to get it to
committable state? David, you're listed as committer.
I don't think it's rejected. It would be a pity (mottainai) to refuse this, because it provides significant speedup despite its simple modification.
Again, I think the v2 patch is OK. Tom's comment was as follows:
[Tom's comment against v2]
----------------------------------------
FWIW, I tried this patch against current HEAD (959d00e9d).
Using the test case described by Amit at
<be25cadf-982e-3f01-88b4-443a6667e16a(at)lab(dot)ntt(dot)co(dot)jp>
I do measure an undeniable speedup, close to 35%.
However ... I submit that that's a mighty extreme test case.
(I had to increase max_locks_per_transaction to get it to run
at all.) We should not be using that sort of edge case to drive
performance optimization choices.
If I reduce the number of partitions in Amit's example from 8192
to something more real-world, like 128, I do still measure a
performance gain, but it's ~ 1.5% which is below what I'd consider
a reproducible win. I'm accustomed to seeing changes up to 2%
in narrow benchmarks like this one, even when "nothing changes"
except unrelated code.
Trying a standard pgbench test case (pgbench -M prepared -S with
one client and an -s 10 database), it seems that the patch is about
0.5% slower than HEAD. Again, that's below the noise threshold,
but it's not promising for the net effects of this patch on workloads
that aren't specifically about large and prunable partition sets.
I'm also fairly concerned about the effects of the patch on
sizeof(LOCALLOCK) --- on a 64-bit machine it goes from 72 to 88
bytes, a 22% increase. That's a lot if you're considering cases
with many locks.
On the whole I don't think there's an adequate case for committing
this patch.
I'd also point out that this is hardly the only place where we've
seen hash_seq_search on nearly-empty hash tables become a bottleneck.
So I'm not thrilled about attacking that with one-table-at-time patches.
I'd rather see us do something to let hash_seq_search win across
the board.
----------------------------------------
* Extreme test case:
Not extreme. Two of our customers, who are considering using PostgreSQL, are using thousands of partitions now. We hit this issue -- a point query gets nearly 20% slower after automatically creating a generic plan. That's the reason for this proposal.
* 0.5% slowdown with pgbench:
I think it's below the noise, as Tom said.
* sizeof(LOCALLOCK):
As Andres replied to Tom in the immediately following mail, LOCALLOCK was bigger in PG 11.
* Use case is narrow:
No. The bloated LockMethodLocalHash affects the performance of the items below as well as transaction commit/abort:
- AtPrepare_Locks() and PostPrepare_Locks(): the hash table is scanned twice in PREPARE!
- LockReleaseSession: advisory lock
- LockReleaseCurrentOwner: ??
- LockReassignCurrentOwner: ??
Regards
Takayuki Tsunakawa
On 2019-Sep-03, Tsunakawa, Takayuki wrote:
From: Alvaro Herrera [mailto:alvherre@2ndquadrant.com]
Hmm ... is this patch rejected, or is somebody still trying to get it to
committable state? David, you're listed as committer.
I don't think it's rejected. It would be a pity (mottainai) to refuse
this, because it provides significant speedup despite its simple
modification.
I don't necessarily disagree with your argumentation, but Travis is
complaining thusly:
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -O2 -Wall -Werror -I../../../../src/include -I/usr/include/x86_64-linux-gnu -D_GNU_SOURCE -c -o lock.o lock.c
lock.c:486:1: error: conflicting types for ‘TryShrinkLocalLockHash’
 TryShrinkLocalLockHash(long numLocksHeld)
 ^
lock.c:351:20: note: previous declaration of ‘TryShrinkLocalLockHash’ was here
 static inline void TryShrinkLocalLockHash(void);
 ^
<builtin>: recipe for target 'lock.o' failed
Please fix.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Alvaro Herrera [mailto:alvherre@2ndquadrant.com]
On 2019-Sep-03, Tsunakawa, Takayuki wrote:
I don't think it's rejected. It would be a pity (mottainai) to refuse
this, because it provides significant speedup despite its simple
modification.
I don't necessarily disagree with your argumentation, but Travis is
complaining thusly:
I tried to revise David's latest patch (v8) and address Tom's comments in his last mail. But I'm a bit at a loss.
First, to accurately count the maximum number of acquired locks in a transaction, we need to track the maximum entries in the hash table, and make it available via a new function like hash_get_max_entries(). However, to cover the shared partitioned hash table (that is not necessary for LockMethodLocalHash), we must add a spin lock in hashhdr and lock/unlock it when entering and removing entries in the hash table. It spoils the effort to decrease contention by hashhdr->freelists[].mutex. Do we want to track the maximum number of acquired locks in the global variable in lock.c, not in the hash table?
Second, I couldn't understand the comment about the fill factor well. I can understand that it's not correct to compare the number of hash buckets and the number of locks. But what can we do?
I'm sorry to repeat what I mentioned in my previous mail, but my v2 patch's approach is based on the database textbook and seems intuitive. So I attached the rebased version.
Regards
Takayuki Tsunakawa
Attachments:
faster-locallock-scan_v3.patchapplication/octet-stream; name=faster-locallock-scan_v3.patchDownload
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 9089733..4e30dc5 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -255,6 +255,17 @@ static HTAB *LockMethodProcLockHash;
static HTAB *LockMethodLocalHash;
+/*
+ * List of LOCALLOCK structures that each backend acquired
+ *
+ * If a transaction acquires many locks, LockMethodLocalHash bloats, making
+ * the hash table scans in subsequent transactions (e.g., in LockReleaseAll)
+ * even though they only acquire a few locks. To speed up iteration over
+ * acquired locks in a backend, we use a list of LOCALLOCKs instead.
+ */
+static dlist_head LocalLocks = DLIST_STATIC_INIT(LocalLocks);
+
+
/* private state for error cleanup */
static LOCALLOCK *StrongLockInProgress;
static LOCALLOCK *awaitedLock;
@@ -794,6 +805,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
*/
if (!found)
{
+ dlist_push_head(&LocalLocks, &locallock->procLink);
locallock->lock = NULL;
locallock->proclock = NULL;
locallock->hashcode = LockTagHashCode(&(localtag.lock));
@@ -1320,6 +1332,7 @@ RemoveLocalLock(LOCALLOCK *locallock)
SpinLockRelease(&FastPathStrongRelationLocks->mutex);
}
+ dlist_delete(&locallock->procLink);
if (!hash_search(LockMethodLocalHash,
(void *) &(locallock->tag),
HASH_REMOVE, NULL))
@@ -2088,7 +2101,7 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
void
LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
{
- HASH_SEQ_STATUS status;
+ dlist_mutable_iter iter;
LockMethod lockMethodTable;
int i,
numLockModes;
@@ -2126,10 +2139,10 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
* pointers. Fast-path locks are cleaned up during the locallock table
* scan, though.
*/
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &LocalLocks)
{
+ locallock = dlist_container(LOCALLOCK, procLink, iter.cur);
+
/*
* If the LOCALLOCK entry is unused, we must've run out of shared
* memory while trying to set up this lock. Just forget the local
@@ -2362,16 +2375,16 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
void
LockReleaseSession(LOCKMETHODID lockmethodid)
{
- HASH_SEQ_STATUS status;
+ dlist_mutable_iter iter;
LOCALLOCK *locallock;
if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
elog(ERROR, "unrecognized lock method: %d", lockmethodid);
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &LocalLocks)
{
+ locallock = dlist_container(LOCALLOCK, procLink, iter.cur);
+
/* Ignore items that are not of the specified lock method */
if (LOCALLOCK_LOCKMETHOD(*locallock) != lockmethodid)
continue;
@@ -2394,13 +2407,14 @@ LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks)
{
if (locallocks == NULL)
{
- HASH_SEQ_STATUS status;
+ dlist_mutable_iter iter;
LOCALLOCK *locallock;
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &LocalLocks)
+ {
+ locallock = dlist_container(LOCALLOCK, procLink, iter.cur);
ReleaseLockIfHeld(locallock, false);
+ }
}
else
{
@@ -2493,13 +2507,14 @@ LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks)
if (locallocks == NULL)
{
- HASH_SEQ_STATUS status;
+ dlist_mutable_iter iter;
LOCALLOCK *locallock;
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &LocalLocks)
+ {
+ locallock = dlist_container(LOCALLOCK, procLink, iter.cur);
LockReassignOwner(locallock, parent);
+ }
}
else
{
@@ -3138,8 +3153,7 @@ LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
void
AtPrepare_Locks(void)
{
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
+ dlist_mutable_iter iter;
/*
* For the most part, we don't need to touch shared memory for this ---
@@ -3147,10 +3161,9 @@ AtPrepare_Locks(void)
* Fast-path locks are an exception, however: we move any such locks to
* the main table before allowing PREPARE TRANSACTION to succeed.
*/
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &LocalLocks)
{
+ LOCALLOCK *locallock = dlist_container(LOCALLOCK, procLink, iter.cur);
TwoPhaseLockRecord record;
LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
bool haveSessionLock;
@@ -3249,8 +3262,7 @@ void
PostPrepare_Locks(TransactionId xid)
{
PGPROC *newproc = TwoPhaseGetDummyProc(xid, false);
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
+ dlist_mutable_iter iter;
LOCK *lock;
PROCLOCK *proclock;
PROCLOCKTAG proclocktag;
@@ -3272,10 +3284,9 @@ PostPrepare_Locks(TransactionId xid)
* pointing to the same proclock, and we daren't end up with any dangling
* pointers.
*/
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &LocalLocks)
{
+ LOCALLOCK *locallock = dlist_container(LOCALLOCK, procLink, iter.cur);
LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
bool haveSessionLock;
bool haveXactLock;
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 6efb7a9..d2c4652 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -18,6 +18,7 @@
#error "lock.h may not be included from frontend code"
#endif
+#include "lib/ilist.h"
#include "storage/lockdefs.h"
#include "storage/backendid.h"
#include "storage/lwlock.h"
@@ -411,6 +412,7 @@ typedef struct LOCALLOCK
uint32 hashcode; /* copy of LOCKTAG's hash value */
LOCK *lock; /* associated LOCK object, if any */
PROCLOCK *proclock; /* associated PROCLOCK object, if any */
+ dlist_node procLink; /* list link in a backend's list of LOCALLOCKs */
int64 nLocks; /* total number of times lock is held */
int numLockOwners; /* # of relevant ResourceOwners */
int maxLockOwners; /* allocated size of array */
On Thu, Sep 26, 2019 at 07:11:53AM +0000, Tsunakawa, Takayuki wrote:
I'm sorry to repeat what I mentioned in my previous mail, but my v2
patch's approach is based on the database textbook and seems
intuitive. So I attached the rebased version.
If you wish to do so, that's fine by me but I have not dived into the
details of the thread much. Please note anyway that the patch does not
apply anymore and that it needs a rebase. So for now I have moved the
patch to next CF, waiting on author.
--
Michael
Hi, the patch was in WoA since December, waiting for a rebase. I've
marked it as returned with feedback. Feel free to re-submit an updated
version into the next CF.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, 14 Aug 2019 at 19:25, David Rowley <david.rowley@2ndquadrant.com> wrote:
For now, I'm out of ideas. If anyone else feels like suggesting
something of picking this up, feel free.
This is a pretty old thread, so we might need a recap:
# Recap
Basically LockReleaseAll() becomes slow after a large number of locks
have all been held at once by a backend. The problem is that the
LockMethodLocalHash dynahash table must grow to store all the locks,
and when later transactions only take a few locks, LockReleaseAll() is
slow due to hash_seq_search() having to skip the sparsely filled
buckets in the bloated hash table.
The following things were tried on this thread. Each one failed:
1) Use a dlist in LOCALLOCK to track the next and prev LOCALLOCK.
Simply loop over the dlist in LockReleaseAll().
2) Try dropping and recreating the dynahash table if it becomes
bloated using some heuristics to decide what "bloated" means and if
recreating is worthwhile.
#1 failed due to concerns with increasing the size of LOCALLOCK to
store the dlist pointers. Performance regressions were seen too,
possibly due to the size increase or the additional overhead of
pushing onto the dlist.
#2 failed because it was difficult to agree on what the heuristics
would be and we had no efficient way to determine the maximum number
of locks that a given transaction held at any one time. We only know
how many were still held at LockReleaseAll().
There were also some suggestions to fix dynahash's hash_seq_search()
slowness, and also a suggestion to try using simplehash.h instead of
dynahash.c. Unfortunately simplehash.h would suffer the same issues as
it too would have to skip over empty buckets in a sparsely populated
hash table.
I'd like to revive this effort as I have a new idea on how to solve the problem.
# Background
Over in [1] I'm trying to improve the performance of smgropen() during
recovery. The slowness there comes from the dynahash table lookups to
find the correct SMgrRelation. Over there I proposed to use simplehash
instead of dynahash because it's quite a good bit faster and far
lessens the hash lookup overhead during recovery. One problem on that
thread is that relcache keeps a pointer into the SMgrRelation
(RelationData.rd_smgr) and because simplehash moves things around
during inserts and deletes, then we can't have anything point to
simplehash entries, they're unstable. I fixed that over on the other
thread by having the simplehash entry point to a palloced
SMgrRelationData... My problem is, I don't really like that idea as it
means we need to palloc() pfree() lots of little chunks of memory.
To fix the above, I really think we need a version of simplehash that
has stable pointers. Providing that implementation is faster than
dynahash, then it will help solve the smgropen() slowness during
recovery.
# A new hashtable implementation
I ended up thinking of this thread again because the implementation of
the stable pointer hash that I ended up writing for [1] happens to be
lightning fast for hash table sequential scans, even if the table has
become bloated. The reason the seq scans are so fast is that the
implementation loops over the data arrays, which are tightly packed
and store the actual data rather than pointers to the data. The code
does not need to loop over the bucket array for this at all, so how
large that has become is irrelevant to hash table seq scan
performance.
The patch stores elements in "segments" which is set to some power of
2 value. When we run out of space to store new items in a segment, we
just allocate another segment. When we remove items from the table,
new items reuse the first unused item in the first segment with free
space. This helps to keep the used elements tightly packed. A
segment keeps a bitmap of used items so that means scanning all used
items is very fast. If you flip the bits in the used_item bitmap,
then you get a free list for that segment, so it's also very fast to
find a free element when inserting a new item into the table.
I've called the hash table implementation "generichash". It uses the
same preprocessor tricks as simplehash.h does and borrows the same
linear probing code that's used in simplehash. The bucket array just
stores the hash value and a uint32 index into the segment item that
stores the data. Since segments store a power of 2 items, we can
easily address both the segment number and the item within the segment
from the single uint32 index value. The 'index' field just stores a
special value when the bucket is empty. No need to add another field
for that. This means the bucket type is just 8 bytes wide.
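(To make that description a bit more concrete, here is a rough sketch of the layout being described. The names and the stand-in element type are illustrative only; the real definitions live in the attached generichash.h.)

#include <stdint.h>

#define GH_ITEMS_PER_SEGMENT 256

/* stand-in for the stored element type; lock.c instantiates this as LOCALLOCK */
typedef struct GH_ELEMENT_TYPE { int payload; } GH_ELEMENT_TYPE;

typedef struct GH_Bucket
{
	uint32_t	hashvalue;	/* hash of the key held by the referenced item */
	uint32_t	index;		/* segment number and item-within-segment packed
							 * into one value; a special value marks the
							 * bucket as empty */
} GH_Bucket;				/* 8 bytes wide, cheap to move during probing */

typedef struct GH_Segment
{
	uint64_t	used_items[GH_ITEMS_PER_SEGMENT / 64];	/* bitmap of used slots */
	uint32_t	nitems;		/* how many slots are currently in use */
	GH_ELEMENT_TYPE items[GH_ITEMS_PER_SEGMENT];		/* stable element storage */
} GH_Segment;

A sequential scan then just walks each segment's items array guided by the used_items bitmap, so the size of the bucket array never matters.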
One thing I will mention about the new hash table implementation is
that GH_ITEMS_PER_SEGMENT is, by default, set to 256. This means
that's the minimum size for the table. I could drop this down to 64,
but that's still quite a bit more than the default size of the
dynahash table of 16. I think 16 is a bit on the low side and that it
might be worth making this 64 anyway. I'd just need to lower
GH_ITEMS_PER_SEGMENT down. The current code does not allow it to go
lower as I've done nothing to allow partial bitmap words, they're
64-bits on a 64-bit machine.
I've not done too much benchmarking between it and simplehash.h, but I
think in some cases it will be faster. Since the bucket type is just 8
bytes, moving stuff around during insert/deletes will be cheaper than
with simplehash. Lookups are likely to be a bit slower due to having
to lookup the item within the segment, which is a few pointer
dereferences away.
A quick test shows an improvement when compared to dynahash.
# select count(pg_try_advisory_lock(99999,99999)) from
generate_series(1,1000000);
Master:
Time: 190.257 ms
Time: 189.440 ms
Time: 188.768 ms
Patched:
Time: 182.779 ms
Time: 182.267 ms
Time: 186.902 ms
This is just hitting the local lock table. The advisory lock key is
the same each time, so it remains a lock check. Also, it's a pretty
small gain, but I'm mostly trying to show that the new hash table is
not any slower than dynahash for probing for inserts.
The real wins come from what we're trying to solve in this thread --
the performance of LockReleaseAll().
Benchmarking results measuring the TPS of a simple select from an
empty table after another transaction has bloated the locallock table.
Master:
127544 tps
113971 tps
123255 tps
121615 tps
Patched:
170803 tps
167305 tps
165002 tps
171976 tps
About 38% faster.
The benchmark I used was:
t1.sql:
\set p 1
select a from t1 where a = :p
hp.sql:
select count(*) from hp
"hp" is a hash partitioned table with 10k partitions.
pgbench -j 20 -c 20 -T 60 -M prepared -n -f hp.sql@1 -f t1.sql@100000 postgres
I'm using the query to the hp table to bloat the locallock table. It's
only executed every 1 in 100,000 queries. The tps numbers above are
the ones to run t1.sql
I've not quite looked into why yet, but the hp.sql improved
performance by 58%. It went from an average of 1.061377 in master to
an average of 1.683616 in the patched version. I can't quite see where
this gain is coming from. It's pretty hard to get good stable
performance results out of this AMD machine, so it might be related to
that. That's why I ran 20 threads. It seems slightly better. The
machine seems to have trouble waking up properly for a single thread.
It would be good if someone could repeat the tests to see if the gains
appear on other hardware.
Also, it would be good to hear what people think about solving the
problem this way.
Patch attached.
David
[1]: /messages/by-id/CAApHDvpkWOGLh_bYg7jproXN8B2g2T9dWDcqsmKsXG5+WwZaqw@mail.gmail.com
Attachments:
v1-0001-Add-a-new-hash-table-type-which-has-stable-pointe.patchapplication/octet-stream; name=v1-0001-Add-a-new-hash-table-type-which-has-stable-pointe.patchDownload
From f4890e8ea453b013e9e4c7eba641584332dede09 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Fri, 18 Jun 2021 04:22:11 +1200
Subject: [PATCH v4] Add a new hash table type which has stable pointers
This is named generichash. It's similar to, and takes most of its code
from, simplehash.h but provides stable pointers to hashed elements.
simplehash moves entries around, which means it's not possible to have
anything point to your hash entry.
generichash.h allocates elements in "segments", by default 256 at a time.
When those are filled another segment is allocated. When items are
removed from the table, new items will try to fill from the lowest segment
with available space. This should help reduce fragmentation of the data.
Sequential scans over the table should remain fast. We use a bitmap to record
which elements of each segment are in use. This allows us to quickly loop
over only used elements and skip to the next segment.
Make use of this new hash table type to help speed up the locallock table
in lock.c
---
src/backend/storage/lmgr/lock.c | 115 ++-
src/backend/utils/cache/relcache.c | 9 +-
src/include/lib/generichash.h | 1409 ++++++++++++++++++++++++++++
src/include/storage/lock.h | 2 +-
4 files changed, 1484 insertions(+), 51 deletions(-)
create mode 100644 src/include/lib/generichash.h
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 108b4d9023..081a06b417 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -37,6 +37,7 @@
#include "access/twophase_rmgr.h"
#include "access/xact.h"
#include "access/xlog.h"
+#include "common/hashfn.h"
#include "miscadmin.h"
#include "pg_trace.h"
#include "pgstat.h"
@@ -270,6 +271,19 @@ typedef struct
static volatile FastPathStrongRelationLockData *FastPathStrongRelationLocks;
+#define GH_PREFIX locallocktable
+#define GH_ELEMENT_TYPE LOCALLOCK
+#define GH_KEY_TYPE LOCALLOCKTAG
+#define GH_KEY tag
+#define GH_HASH_KEY(tb, key) hash_bytes((unsigned char *) &key, sizeof(LOCALLOCKTAG))
+#define GH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(LOCALLOCKTAG)) == 0)
+#define GH_ALLOCATE(b) MemoryContextAllocExtended(TopMemoryContext, b, MCXT_ALLOC_HUGE)
+#define GH_ALLOCATE_ZERO(b) MemoryContextAllocExtended(TopMemoryContext, b, MCXT_ALLOC_HUGE | MCXT_ALLOC_ZERO)
+#define GH_FREE(p) pfree(p)
+#define GH_SCOPE static inline
+#define GH_DECLARE
+#define GH_DEFINE
+#include "lib/generichash.h"
/*
* Pointers to hash tables containing lock state
@@ -279,7 +293,7 @@ static volatile FastPathStrongRelationLockData *FastPathStrongRelationLocks;
*/
static HTAB *LockMethodLockHash;
static HTAB *LockMethodProcLockHash;
-static HTAB *LockMethodLocalHash;
+static locallocktable_hash *LockMethodLocalHash;
/* private state for error cleanup */
@@ -467,15 +481,9 @@ InitLocks(void)
* ought to be empty in the postmaster, but for safety let's zap it.)
*/
if (LockMethodLocalHash)
- hash_destroy(LockMethodLocalHash);
+ locallocktable_destroy(LockMethodLocalHash);
- info.keysize = sizeof(LOCALLOCKTAG);
- info.entrysize = sizeof(LOCALLOCK);
-
- LockMethodLocalHash = hash_create("LOCALLOCK hash",
- 16,
- &info,
- HASH_ELEM | HASH_BLOBS);
+ LockMethodLocalHash = locallocktable_create(16);
}
@@ -606,22 +614,37 @@ LockHeldByMe(const LOCKTAG *locktag, LOCKMODE lockmode)
localtag.lock = *locktag;
localtag.mode = lockmode;
- locallock = (LOCALLOCK *) hash_search(LockMethodLocalHash,
- (void *) &localtag,
- HASH_FIND, NULL);
+ locallock = locallocktable_lookup(LockMethodLocalHash, localtag);
return (locallock && locallock->nLocks > 0);
}
#ifdef USE_ASSERT_CHECKING
/*
- * GetLockMethodLocalHash -- return the hash of local locks, for modules that
- * evaluate assertions based on all locks held.
+ * GetLockMethodLocalLocks -- returns an array of all LOCALLOCKs stored in
+ * LockMethodLocalHash.
+ *
+ * The caller must pfree the return value when done. *size is set to the
+ * number of elements in the returned array.
*/
-HTAB *
-GetLockMethodLocalHash(void)
+LOCALLOCK **
+GetLockMethodLocalLocks(uint32 *size)
{
- return LockMethodLocalHash;
+ locallocktable_iterator iterator;
+ LOCALLOCK **locallocks;
+ LOCALLOCK *locallock;
+ uint32 i = 0;
+
+ locallocks = (LOCALLOCK **) palloc(sizeof(LOCALLOCK *) *
+ LockMethodLocalHash->members);
+
+ locallocktable_start_iterate(LockMethodLocalHash, &iterator);
+ while ((locallock = locallocktable_iterate(LockMethodLocalHash,
+ &iterator)) != NULL)
+ locallocks[i++] = locallock;
+
+ *size = i;
+ return locallocks;
}
#endif
@@ -661,9 +684,7 @@ LockHasWaiters(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
localtag.lock = *locktag;
localtag.mode = lockmode;
- locallock = (LOCALLOCK *) hash_search(LockMethodLocalHash,
- (void *) &localtag,
- HASH_FIND, NULL);
+ locallock = locallocktable_lookup(LockMethodLocalHash, localtag);
/*
* let the caller print its own error message, too. Do not ereport(ERROR).
@@ -823,9 +844,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
localtag.lock = *locktag;
localtag.mode = lockmode;
- locallock = (LOCALLOCK *) hash_search(LockMethodLocalHash,
- (void *) &localtag,
- HASH_ENTER, &found);
+ locallock = locallocktable_insert(LockMethodLocalHash, localtag, &found);
/*
* if it's a new locallock object, initialize it
@@ -1390,9 +1409,7 @@ RemoveLocalLock(LOCALLOCK *locallock)
SpinLockRelease(&FastPathStrongRelationLocks->mutex);
}
- if (!hash_search(LockMethodLocalHash,
- (void *) &(locallock->tag),
- HASH_REMOVE, NULL))
+ if (!locallocktable_delete(LockMethodLocalHash, locallock->tag))
elog(WARNING, "locallock table corrupted");
/*
@@ -2002,9 +2019,7 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
localtag.lock = *locktag;
localtag.mode = lockmode;
- locallock = (LOCALLOCK *) hash_search(LockMethodLocalHash,
- (void *) &localtag,
- HASH_FIND, NULL);
+ locallock = locallocktable_lookup(LockMethodLocalHash, localtag);
/*
* let the caller print its own error message, too. Do not ereport(ERROR).
@@ -2178,7 +2193,7 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
void
LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
{
- HASH_SEQ_STATUS status;
+ locallocktable_iterator iterator;
LockMethod lockMethodTable;
int i,
numLockModes;
@@ -2216,9 +2231,10 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
* pointers. Fast-path locks are cleaned up during the locallock table
* scan, though.
*/
- hash_seq_init(&status, LockMethodLocalHash);
+ locallocktable_start_iterate(LockMethodLocalHash, &iterator);
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ while ((locallock = locallocktable_iterate(LockMethodLocalHash,
+ &iterator)) != NULL)
{
/*
* If the LOCALLOCK entry is unused, we must've run out of shared
@@ -2452,15 +2468,16 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
void
LockReleaseSession(LOCKMETHODID lockmethodid)
{
- HASH_SEQ_STATUS status;
+ locallocktable_iterator iterator;
LOCALLOCK *locallock;
if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
elog(ERROR, "unrecognized lock method: %d", lockmethodid);
- hash_seq_init(&status, LockMethodLocalHash);
+ locallocktable_start_iterate(LockMethodLocalHash, &iterator);
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ while ((locallock = locallocktable_iterate(LockMethodLocalHash,
+ &iterator)) != NULL)
{
/* Ignore items that are not of the specified lock method */
if (LOCALLOCK_LOCKMETHOD(*locallock) != lockmethodid)
@@ -2484,12 +2501,13 @@ LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks)
{
if (locallocks == NULL)
{
- HASH_SEQ_STATUS status;
+ locallocktable_iterator iterator;
LOCALLOCK *locallock;
- hash_seq_init(&status, LockMethodLocalHash);
+ locallocktable_start_iterate(LockMethodLocalHash, &iterator);
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ while ((locallock = locallocktable_iterate(LockMethodLocalHash,
+ &iterator)) != NULL)
ReleaseLockIfHeld(locallock, false);
}
else
@@ -2583,12 +2601,13 @@ LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks)
if (locallocks == NULL)
{
- HASH_SEQ_STATUS status;
+ locallocktable_iterator iterator;
LOCALLOCK *locallock;
- hash_seq_init(&status, LockMethodLocalHash);
+ locallocktable_start_iterate(LockMethodLocalHash, &iterator);
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ while ((locallock = locallocktable_iterate(LockMethodLocalHash,
+ &iterator)) != NULL)
LockReassignOwner(locallock, parent);
}
else
@@ -3220,7 +3239,7 @@ LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
void
AtPrepare_Locks(void)
{
- HASH_SEQ_STATUS status;
+ locallocktable_iterator iterator;
LOCALLOCK *locallock;
/*
@@ -3229,9 +3248,10 @@ AtPrepare_Locks(void)
* Fast-path locks are an exception, however: we move any such locks to
* the main table before allowing PREPARE TRANSACTION to succeed.
*/
- hash_seq_init(&status, LockMethodLocalHash);
+ locallocktable_start_iterate(LockMethodLocalHash, &iterator);
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ while ((locallock = locallocktable_iterate(LockMethodLocalHash,
+ &iterator)) != NULL)
{
TwoPhaseLockRecord record;
LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
@@ -3331,7 +3351,7 @@ void
PostPrepare_Locks(TransactionId xid)
{
PGPROC *newproc = TwoPhaseGetDummyProc(xid, false);
- HASH_SEQ_STATUS status;
+ locallocktable_iterator iterator;
LOCALLOCK *locallock;
LOCK *lock;
PROCLOCK *proclock;
@@ -3354,9 +3374,10 @@ PostPrepare_Locks(TransactionId xid)
* pointing to the same proclock, and we daren't end up with any dangling
* pointers.
*/
- hash_seq_init(&status, LockMethodLocalHash);
+ locallocktable_start_iterate(LockMethodLocalHash, &iterator);
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ while ((locallock = locallocktable_iterate(LockMethodLocalHash,
+ &iterator)) != NULL)
{
LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
bool haveSessionLock;
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index d55ae016d0..85b1c52870 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -3004,12 +3004,13 @@ void
AssertPendingSyncs_RelationCache(void)
{
HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
+ LOCALLOCK **locallocks;
Relation *rels;
int maxrels;
int nrels;
RelIdCacheEnt *idhentry;
int i;
+ uint32 nlocallocks;
/*
* Open every relation that this transaction has locked. If, for some
@@ -3022,9 +3023,10 @@ AssertPendingSyncs_RelationCache(void)
maxrels = 1;
rels = palloc(maxrels * sizeof(*rels));
nrels = 0;
- hash_seq_init(&status, GetLockMethodLocalHash());
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ locallocks = GetLockMethodLocalLocks(&nlocallocks);
+ for (i = 0; i < nlocallocks; i++)
{
+ LOCALLOCK *locallock = locallocks[i];
Oid relid;
Relation r;
@@ -3044,6 +3046,7 @@ AssertPendingSyncs_RelationCache(void)
}
rels[nrels++] = r;
}
+ pfree(locallocks);
hash_seq_init(&status, RelationIdCache);
while ((idhentry = (RelIdCacheEnt *) hash_seq_search(&status)) != NULL)
diff --git a/src/include/lib/generichash.h b/src/include/lib/generichash.h
new file mode 100644
index 0000000000..3e075d1676
--- /dev/null
+++ b/src/include/lib/generichash.h
@@ -0,0 +1,1409 @@
+/*
+ * generichash.h
+ *
+ * A hashtable implementation which can be included into .c files to
+ * provide a fast hash table implementation specific to the given type.
+ *
+ * GH_ELEMENT_TYPE defines the data type that the hashtable stores. Each
+ * instance of GH_ELEMENT_TYPE which is stored in the hash table is done so
+ * inside a GH_SEGMENT. These GH_SEGMENTs are allocated on demand and
+ * store GH_ITEMS_PER_SEGMENT each. After items are removed from the hash
+ * table, the next inserted item's data will be stored in the earliest free
+ * item in the earliest free segment. This helps keep the actual data
+ * compact even when the bucket array has become large.
+ *
+ * The bucket array is an array of GH_BUCKET and is dynamically allocated
+ * and may grow as more items are added to the table. The GH_BUCKET type
+ * is very narrow and stores just 2 uint32 values. One of these is the
+ * hash value and the other is the index into the segments which are used
+ * to directly look up the stored GH_ELEMENT_TYPE type.
+ *
+ * During inserts, hash table collisions are dealt with using linear
+ * probing, this means that instead of doing something like chaining with a
+ * linked list, we use the first free bucket which comes after the optimal
+ * bucket. This is much more CPU cache efficient than traversing a linked
+ * list. When we're unable to use the most optimal bucket, we may also
+ * move the contents of subsequent buckets around so that we keep items as
+ * close to their most optimal position as possible. This prevents
+ * excessively long linear probes during lookups.
+ *
+ * During hash table deletes, we must attempt to move the contents of
+ * buckets that are not in their optimal position up to either their
+ * optimal position, or as close as we can get to it. During lookups, this
+ * means that we can stop searching for a non-existing item as soon as we
+ * find an empty bucket.
+ *
+ * Empty buckets are denoted by their 'index' field being set to
+ * GH_UNUSED_BUCKET_INDEX. This is done rather than adding a special field
+ * so that we can keep the GH_BUCKET type as narrow as possible.
+ * Conveniently sizeof(GH_BUCKET) is 8, which allows 8 of these to fit on a
+ * single 64-byte cache line. It's important to keep this type as narrow as
+ * possible so that we can perform hash lookups by hitting as few
+ * cache lines as possible.
+ *
+ * The implementation here is similar to simplehash.h but has the following
+ * benefits:
+ *
+ * - Pointers to elements are stable and are not moved around like they are
+ * in simplehash.h
+ * - Sequential scans of the hash table remain very fast even when the
+ * table is sparsely populated.
+ * - Moving the contents of buckets around during inserts and deletes is
+ * generally cheaper here due to GH_BUCKET being very narrow.
+ *
+ * If none of the above points are important for the given use case then,
+ * please consider using simplehash.h instead.
+ *
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/generichash.h
+ *
+ */
+
+#include "port/pg_bitutils.h"
+
+/* helpers */
+#define GH_MAKE_PREFIX(a) CppConcat(a,_)
+#define GH_MAKE_NAME(name) GH_MAKE_NAME_(GH_MAKE_PREFIX(GH_PREFIX),name)
+#define GH_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* type declarations */
+#define GH_TYPE GH_MAKE_NAME(hash)
+#define GH_BUCKET GH_MAKE_NAME(bucket)
+#define GH_SEGMENT GH_MAKE_NAME(segment)
+#define GH_ITERATOR GH_MAKE_NAME(iterator)
+
+/* function declarations */
+#define GH_CREATE GH_MAKE_NAME(create)
+#define GH_DESTROY GH_MAKE_NAME(destroy)
+#define GH_RESET GH_MAKE_NAME(reset)
+#define GH_INSERT GH_MAKE_NAME(insert)
+#define GH_INSERT_HASH GH_MAKE_NAME(insert_hash)
+#define GH_DELETE GH_MAKE_NAME(delete)
+#define GH_LOOKUP GH_MAKE_NAME(lookup)
+#define GH_LOOKUP_HASH GH_MAKE_NAME(lookup_hash)
+#define GH_GROW GH_MAKE_NAME(grow)
+#define GH_START_ITERATE GH_MAKE_NAME(start_iterate)
+#define GH_ITERATE GH_MAKE_NAME(iterate)
+
+/* internal helper functions (no externally visible prototypes) */
+#define GH_NEXT_ONEBIT GH_MAKE_NAME(next_onebit)
+#define GH_NEXT_ZEROBIT GH_MAKE_NAME(next_zerobit)
+#define GH_INDEX_TO_ELEMENT GH_MAKE_NAME(index_to_element)
+#define GH_MARK_SEGMENT_ITEM_USED GH_MAKE_NAME(mark_segment_item_used)
+#define GH_MARK_SEGMENT_ITEM_UNUSED GH_MAKE_NAME(mark_segment_item_unused)
+#define GH_GET_NEXT_UNUSED_ENTRY GH_MAKE_NAME(get_next_unused_entry)
+#define GH_REMOVE_ENTRY GH_MAKE_NAME(remove_entry)
+#define GH_SET_BUCKET_IN_USE GH_MAKE_NAME(set_bucket_in_use)
+#define GH_SET_BUCKET_EMPTY GH_MAKE_NAME(set_bucket_empty)
+#define GH_IS_BUCKET_IN_USE GH_MAKE_NAME(is_bucket_in_use)
+#define GH_COMPUTE_PARAMETERS GH_MAKE_NAME(compute_parameters)
+#define GH_NEXT GH_MAKE_NAME(next)
+#define GH_PREV GH_MAKE_NAME(prev)
+#define GH_DISTANCE_FROM_OPTIMAL GH_MAKE_NAME(distance)
+#define GH_INITIAL_BUCKET GH_MAKE_NAME(initial_bucket)
+#define GH_INSERT_HASH_INTERNAL GH_MAKE_NAME(insert_hash_internal)
+#define GH_LOOKUP_HASH_INTERNAL GH_MAKE_NAME(lookup_hash_internal)
+
+/*
+ * When allocating memory to store instances of GH_ELEMENT_TYPE, how many
+ * should we allocate at once? This must be a power of 2 and at least
+ * GH_BITS_PER_WORD.
+ */
+#ifndef GH_ITEMS_PER_SEGMENT
+#define GH_ITEMS_PER_SEGMENT 256
+#endif
+
+/* A special index to set GH_BUCKET->index to when it's not in use */
+#define GH_UNUSED_BUCKET_INDEX PG_UINT32_MAX
+
+/*
+ * Macros for translating a bucket's index into the segment and another to
+ * determine the item number within the segment.
+ */
+#define GH_INDEX_SEGMENT(i) (i) / GH_ITEMS_PER_SEGMENT
+#define GH_INDEX_ITEM(i) (i) % GH_ITEMS_PER_SEGMENT
+
+ /*
+ * How many elements do we need in the bitmap array to store a bit for each
+ * of GH_ITEMS_PER_SEGMENT. Keep the word size native to the processor.
+ */
+#if SIZEOF_VOID_P >= 8
+
+#define GH_BITS_PER_WORD 64
+#define GH_BITMAP_WORD uint64
+#define GH_RIGHTMOST_ONE_POS(x) pg_rightmost_one_pos64(x)
+
+#else
+
+#define GH_BITS_PER_WORD 32
+#define GH_BITMAP_WORD uint32
+#define GH_RIGHTMOST_ONE_POS(x) pg_rightmost_one_pos32(x)
+
+#endif
+
+/* Sanity check on GH_ITEMS_PER_SEGMENT setting */
+#if GH_ITEMS_PER_SEGMENT < GH_BITS_PER_WORD
+#error "GH_ITEMS_PER_SEGMENT must be >= than GH_BITS_PER_WORD"
+#endif
+
+/* Ensure GH_ITEMS_PER_SEGMENT is a power of 2 */
+#if GH_ITEMS_PER_SEGMENT & (GH_ITEMS_PER_SEGMENT - 1) != 0
+#error "GH_ITEMS_PER_SEGMENT must be a power of 2"
+#endif
+
+#define GH_BITMAP_WORDS (GH_ITEMS_PER_SEGMENT / GH_BITS_PER_WORD)
+#define GH_WORDNUM(x) ((x) / GH_BITS_PER_WORD)
+#define GH_BITNUM(x) ((x) % GH_BITS_PER_WORD)
+
+/* generate forward declarations necessary to use the hash table */
+#ifdef GH_DECLARE
+
+typedef struct GH_BUCKET
+{
+ uint32 hashvalue; /* Hash value for this bucket */
+ uint32 index; /* Index to the actual data */
+} GH_BUCKET;
+
+typedef struct GH_SEGMENT
+{
+ uint32 nitems; /* Number of items stored */
+ GH_BITMAP_WORD used_items[GH_BITMAP_WORDS]; /* A 1-bit for each used item
+ * in the items array */
+ GH_ELEMENT_TYPE items[GH_ITEMS_PER_SEGMENT]; /* the actual data */
+} GH_SEGMENT;
+
+/* type definitions */
+
+/*
+ * GH_TYPE
+ * Hash table metadata type
+ */
+typedef struct GH_TYPE
+{
+ /*
+ * Size of bucket array. Note that the maximum number of elements is
+ * lower (GH_MAX_FILLFACTOR)
+ */
+ uint32 size;
+
+ /* mask for bucket and size calculations, based on size */
+ uint32 sizemask;
+
+ /* the number of elements stored */
+ uint32 members;
+
+ /* boundary after which to grow hashtable */
+ uint32 grow_threshold;
+
+ /* how many elements are there in the segments array */
+ uint32 nsegments;
+
+ /* the number of elements in the used_segments array */
+ uint32 used_segment_words;
+
+ /*
+ * The first segment we should search in for an empty slot. This will be
+ * the first segment that GH_GET_NEXT_UNUSED_ENTRY will search in when
+ * looking for an unused entry. We'll increase the value of this when we
+ * fill a segment and we'll lower it down when we delete an item from a
+ * segment lower than this value.
+ */
+ uint32 first_free_segment;
+
+ /* dynamically allocated array of hash buckets */
+ GH_BUCKET *buckets;
+
+ /* an array of segment pointers to store data */
+ GH_SEGMENT **segments;
+
+ /*
+ * A bitmap of non-empty segments. A 1-bit denotes that the corresponding
+ * segment is non-empty.
+ */
+ GH_BITMAP_WORD *used_segments;
+
+#ifdef GH_HAVE_PRIVATE_DATA
+ /* user defined data, useful for callbacks */
+ void *private_data;
+#endif
+} GH_TYPE;
+
+/*
+ * GH_ITERATOR
+ * Used when looping over the contents of the hash table.
+ */
+typedef struct GH_ITERATOR
+{
+ int32 cursegidx; /* current segment. -1 means not started */
+ int32 curitemidx; /* current item within cursegidx, -1 means not
+ * started */
+ uint32 found_members; /* number of items visited so far in the loop */
+ uint32 total_members; /* number of items that existed at the start
+ * iteration. */
+} GH_ITERATOR;
+
+/* externally visible function prototypes */
+
+#ifdef GH_HAVE_PRIVATE_DATA
+/* <prefix>_hash <prefix>_create(uint32 nbuckets, void *private_data) */
+GH_SCOPE GH_TYPE *GH_CREATE(uint32 nbuckets, void *private_data);
+#else
+/* <prefix>_hash <prefix>_create(uint32 nbuckets) */
+GH_SCOPE GH_TYPE *GH_CREATE(uint32 nbuckets);
+#endif
+
+/* void <prefix>_destroy(<prefix>_hash *tb) */
+GH_SCOPE void GH_DESTROY(GH_TYPE * tb);
+
+/* void <prefix>_reset(<prefix>_hash *tb) */
+GH_SCOPE void GH_RESET(GH_TYPE * tb);
+
+/* void <prefix>_grow(<prefix>_hash *tb) */
+GH_SCOPE void GH_GROW(GH_TYPE * tb, uint32 newsize);
+
+/* <element> *<prefix>_insert(<prefix>_hash *tb, <key> key, bool *found) */
+GH_SCOPE GH_ELEMENT_TYPE *GH_INSERT(GH_TYPE * tb, GH_KEY_TYPE key,
+ bool *found);
+
+/*
+ * <element> *<prefix>_insert_hash(<prefix>_hash *tb, <key> key, uint32 hash,
+ * bool *found)
+ */
+GH_SCOPE GH_ELEMENT_TYPE *GH_INSERT_HASH(GH_TYPE * tb, GH_KEY_TYPE key,
+ uint32 hash, bool *found);
+
+/* <element> *<prefix>_lookup(<prefix>_hash *tb, <key> key) */
+GH_SCOPE GH_ELEMENT_TYPE *GH_LOOKUP(GH_TYPE * tb, GH_KEY_TYPE key);
+
+/* <element> *<prefix>_lookup_hash(<prefix>_hash *tb, <key> key, uint32 hash) */
+GH_SCOPE GH_ELEMENT_TYPE *GH_LOOKUP_HASH(GH_TYPE * tb, GH_KEY_TYPE key,
+ uint32 hash);
+
+/* bool <prefix>_delete(<prefix>_hash *tb, <key> key) */
+GH_SCOPE bool GH_DELETE(GH_TYPE * tb, GH_KEY_TYPE key);
+
+/* void <prefix>_start_iterate(<prefix>_hash *tb, <prefix>_iterator *iter) */
+GH_SCOPE void GH_START_ITERATE(GH_TYPE * tb, GH_ITERATOR * iter);
+
+/* <element> *<prefix>_iterate(<prefix>_hash *tb, <prefix>_iterator *iter) */
+GH_SCOPE GH_ELEMENT_TYPE *GH_ITERATE(GH_TYPE * tb, GH_ITERATOR * iter);
+
+#endif /* GH_DECLARE */
+
+/* generate implementation of the hash table */
+#ifdef GH_DEFINE
+
+/*
+ * The maximum size for the hash table. This must be a power of 2. We cannot
+ * make this PG_UINT32_MAX + 1 because we use GH_UNUSED_BUCKET_INDEX to denote an
+ * empty bucket. Doing so would mean we could accidentally set a used
+ * bucket's index to GH_UNUSED_BUCKET_INDEX.
+ */
+#define GH_MAX_SIZE ((uint32) PG_INT32_MAX + 1)
+
+/* normal fillfactor, unless already close to maximum */
+#ifndef GH_FILLFACTOR
+#define GH_FILLFACTOR (0.9)
+#endif
+/* increase fillfactor if we otherwise would error out */
+#define GH_MAX_FILLFACTOR (0.98)
+/* grow if actual and optimal location bigger than */
+#ifndef GH_GROW_MAX_DIB
+#define GH_GROW_MAX_DIB 25
+#endif
+/*
+ * Grow if more than this number of buckets needs to be moved when inserting.
+ */
+#ifndef GH_GROW_MAX_MOVE
+#define GH_GROW_MAX_MOVE 150
+#endif
+#ifndef GH_GROW_MIN_FILLFACTOR
+/* but do not grow due to GH_GROW_MAX_* if below */
+#define GH_GROW_MIN_FILLFACTOR 0.1
+#endif
+
+/*
+ * Wrap the following definitions in include guards, to avoid multiple
+ * definition errors if this header is included more than once. The rest of
+ * the file deliberately has no include guards, because it can be included
+ * with different parameters to define functions and types with non-colliding
+ * names.
+ */
+#ifndef GENERICHASH_H
+#define GENERICHASH_H
+
+#ifdef FRONTEND
+#define gh_error(...) pg_log_error(__VA_ARGS__)
+#define gh_log(...) pg_log_info(__VA_ARGS__)
+#else
+#define gh_error(...) elog(ERROR, __VA_ARGS__)
+#define gh_log(...) elog(LOG, __VA_ARGS__)
+#endif
+
+#endif /* GENERICHASH_H */
+
+/*
+ * Gets the position of the first 1-bit which comes after 'prevbit' in the
+ * 'words' array. 'nwords' is the size of the 'words' array.
+ */
+static inline int32
+GH_NEXT_ONEBIT(GH_BITMAP_WORD * words, uint32 nwords, int32 prevbit)
+{
+ uint32 wordnum;
+
+ prevbit++;
+
+ wordnum = GH_WORDNUM(prevbit);
+ if (wordnum < nwords)
+ {
+ GH_BITMAP_WORD mask = (~(GH_BITMAP_WORD) 0) << GH_BITNUM(prevbit);
+ GH_BITMAP_WORD word = words[wordnum] & mask;
+
+ if (word != 0)
+ return wordnum * GH_BITS_PER_WORD + GH_RIGHTMOST_ONE_POS(word);
+
+ for (++wordnum; wordnum < nwords; wordnum++)
+ {
+ word = words[wordnum];
+
+ if (word != 0)
+ {
+ int32 result = wordnum * GH_BITS_PER_WORD;
+
+ result += GH_RIGHTMOST_ONE_POS(word);
+ return result;
+ }
+ }
+ }
+ return -1;
+}
+
+/*
+ * Gets the position of the first 0-bit which comes after 'prevbit' in the
+ * 'words' array. 'nwords' is the size of the 'words' array.
+ *
+ * This is similar to GH_NEXT_ONEBIT but flips the bits before operating on
+ * each GH_BITMAP_WORD.
+ */
+static inline int32
+GH_NEXT_ZEROBIT(GH_BITMAP_WORD * words, uint32 nwords, int32 prevbit)
+{
+ uint32 wordnum;
+
+ prevbit++;
+
+ wordnum = GH_WORDNUM(prevbit);
+ if (wordnum < nwords)
+ {
+ GH_BITMAP_WORD mask = (~(GH_BITMAP_WORD) 0) << GH_BITNUM(prevbit);
+ GH_BITMAP_WORD word = ~(words[wordnum] & mask); /* flip bits */
+
+ if (word != 0)
+ return wordnum * GH_BITS_PER_WORD + GH_RIGHTMOST_ONE_POS(word);
+
+ for (++wordnum; wordnum < nwords; wordnum++)
+ {
+ word = ~words[wordnum]; /* flip bits */
+
+ if (word != 0)
+ {
+ int32 result = wordnum * GH_BITS_PER_WORD;
+
+ result += GH_RIGHTMOST_ONE_POS(word);
+ return result;
+ }
+ }
+ }
+ return -1;
+}
+
+/*
+ * Finds the hash table entry for a given GH_BUCKET's 'index'.
+ */
+static inline GH_ELEMENT_TYPE *
+GH_INDEX_TO_ELEMENT(GH_TYPE * tb, uint32 index)
+{
+ GH_SEGMENT *seg;
+ uint32 segidx;
+ uint32 item;
+
+ segidx = GH_INDEX_SEGMENT(index);
+ item = GH_INDEX_ITEM(index);
+
+ Assert(segidx < tb->nsegments);
+
+ seg = tb->segments[segidx];
+
+ Assert(seg != NULL);
+
+ /* ensure this segment is marked as used */
+ Assert(seg->used_items[GH_WORDNUM(item)] & (((GH_BITMAP_WORD) 1) << GH_BITNUM(item)));
+
+ return &seg->items[item];
+}
+
+static inline void
+GH_MARK_SEGMENT_ITEM_USED(GH_TYPE * tb, GH_SEGMENT * seg, uint32 segidx,
+ uint32 segitem)
+{
+ uint32 word = GH_WORDNUM(segitem);
+ uint32 bit = GH_BITNUM(segitem);
+
+ /* ensure this item is not marked as used */
+ Assert((seg->used_items[word] & (((GH_BITMAP_WORD) 1) << bit)) == 0);
+
+ /* switch on the used bit */
+ seg->used_items[word] |= (((GH_BITMAP_WORD) 1) << bit);
+
+ /* if the segment was previously empty then mark it as used */
+ if (seg->nitems == 0)
+ {
+ word = GH_WORDNUM(segidx);
+ bit = GH_BITNUM(segidx);
+
+ /* switch on the used bit for this segment */
+ tb->used_segments[word] |= (((GH_BITMAP_WORD) 1) << bit);
+ }
+ seg->nitems++;
+}
+
+static inline void
+GH_MARK_SEGMENT_ITEM_UNUSED(GH_TYPE * tb, GH_SEGMENT * seg, uint32 segidx,
+ uint32 segitem)
+{
+ uint32 word = GH_WORDNUM(segitem);
+ uint32 bit = GH_BITNUM(segitem);
+
+ /* ensure this item is marked as used */
+ Assert((seg->used_items[word] & (((GH_BITMAP_WORD) 1) << bit)) != 0);
+
+ /* switch off the used bit */
+ seg->used_items[word] &= ~(((GH_BITMAP_WORD) 1) << bit);
+
+ /* when removing the last item mark the segment as unused */
+ if (seg->nitems == 1)
+ {
+ word = GH_WORDNUM(segidx);
+ bit = GH_BITNUM(segidx);
+
+ /* switch off the used bit for this segment */
+ tb->used_segments[word] &= ~(((GH_BITMAP_WORD) 1) << bit);
+ }
+
+ seg->nitems--;
+}
+
+/*
+ * Returns the first unused entry from the first non-full segment and set
+ * *index to the index of the returned entry.
+ */
+static inline GH_ELEMENT_TYPE *
+GH_GET_NEXT_UNUSED_ENTRY(GH_TYPE * tb, uint32 *index)
+{
+ GH_SEGMENT *seg;
+ uint32 segidx = tb->first_free_segment;
+ uint32 itemidx;
+
+ seg = tb->segments[segidx];
+
+ /* find the first segment with an unused item */
+ while (seg != NULL && seg->nitems == GH_ITEMS_PER_SEGMENT)
+ seg = tb->segments[++segidx];
+
+ tb->first_free_segment = segidx;
+
+ /* allocate the segment if it's not already */
+ if (seg == NULL)
+ {
+ seg = GH_ALLOCATE(sizeof(GH_SEGMENT));
+ tb->segments[segidx] = seg;
+
+ seg->nitems = 0;
+ memset(seg->used_items, 0, sizeof(seg->used_items));
+ /* no need to zero the items array */
+
+ /* use the first slot in this segment */
+ itemidx = 0;
+ }
+ else
+ {
+ /* find the first unused item in this segment */
+ itemidx = GH_NEXT_ZEROBIT(seg->used_items, GH_BITMAP_WORDS, -1);
+ Assert(itemidx >= 0);
+ }
+
+ /* this is a good spot to ensure nitems matches the bits in used_items */
+ Assert(seg->nitems == pg_popcount((const char *) seg->used_items, GH_ITEMS_PER_SEGMENT / 8));
+
+ GH_MARK_SEGMENT_ITEM_USED(tb, seg, segidx, itemidx);
+
+ *index = segidx * GH_ITEMS_PER_SEGMENT + itemidx;
+ return &seg->items[itemidx];
+
+}
+
+/*
+ * Remove the entry denoted by 'index' from its segment.
+ */
+static inline void
+GH_REMOVE_ENTRY(GH_TYPE * tb, uint32 index)
+{
+ GH_SEGMENT *seg;
+ uint32 segidx = GH_INDEX_SEGMENT(index);
+ uint32 item = GH_INDEX_ITEM(index);
+
+ Assert(segidx < tb->nsegments);
+ seg = tb->segments[segidx];
+ Assert(seg != NULL);
+
+ GH_MARK_SEGMENT_ITEM_UNUSED(tb, seg, segidx, item);
+
+ /*
+ * Lower the first free segment index to point to this segment so that the
+ * next insert will store in this segment. If it's already pointing to an
+ * earlier segment, then leave it be.
+ */
+ if (tb->first_free_segment > segidx)
+ tb->first_free_segment = segidx;
+}
+
+/*
+ * Set 'bucket' as in use by 'index'.
+ */
+static inline void
+GH_SET_BUCKET_IN_USE(GH_BUCKET * bucket, uint32 index)
+{
+ bucket->index = index;
+}
+
+/*
+ * Mark 'bucket' as unused.
+ */
+static inline void
+GH_SET_BUCKET_EMPTY(GH_BUCKET * bucket)
+{
+ bucket->index = GH_UNUSED_BUCKET_INDEX;
+}
+
+/*
+ * Return true if 'bucket' is in use.
+ */
+static inline bool
+GH_IS_BUCKET_IN_USE(GH_BUCKET * bucket)
+{
+ return bucket->index != GH_UNUSED_BUCKET_INDEX;
+}
+
+ /*
+ * Compute sizing parameters for hashtable. Called when creating and growing
+ * the hashtable.
+ */
+static inline void
+GH_COMPUTE_PARAMETERS(GH_TYPE * tb, uint32 newsize)
+{
+ uint32 size;
+
+ /*
+ * Ensure the bucket array size has not exceeded GH_MAX_SIZE or wrapped
+ * back to zero.
+ */
+ if (newsize == 0 || newsize > GH_MAX_SIZE)
+ gh_error("hash table too large");
+
+ /*
+ * Ensure we don't build a table that can't store an entire single segment
+ * worth of data.
+ */
+ size = Max(newsize, GH_ITEMS_PER_SEGMENT);
+
+ /* round up size to the next power of 2 */
+ size = pg_nextpower2_32(size);
+
+ /* now set size */
+ tb->size = size;
+ tb->sizemask = tb->size - 1;
+
+ /* calculate how many segments we'll need to store 'size' items */
+ tb->nsegments = pg_nextpower2_32(size / GH_ITEMS_PER_SEGMENT);
+
+ /*
+ * Calculate the number of bitmap words needed to store a bit for each
+ * segment.
+ */
+ tb->used_segment_words = (tb->nsegments + GH_BITS_PER_WORD - 1) / GH_BITS_PER_WORD;
+
+ /*
+ * Compute the next threshold at which we need to grow the hash table
+ * again.
+ */
+ if (tb->size == GH_MAX_SIZE)
+ tb->grow_threshold = (uint32) (((double) tb->size) * GH_MAX_FILLFACTOR);
+ else
+ tb->grow_threshold = (uint32) (((double) tb->size) * GH_FILLFACTOR);
+}
+
+/* return the optimal bucket for the hash */
+static inline uint32
+GH_INITIAL_BUCKET(GH_TYPE * tb, uint32 hash)
+{
+ return hash & tb->sizemask;
+}
+
+/* return the next bucket after the current, handling wraparound */
+static inline uint32
+GH_NEXT(GH_TYPE * tb, uint32 curelem, uint32 startelem)
+{
+ curelem = (curelem + 1) & tb->sizemask;
+
+ Assert(curelem != startelem);
+
+ return curelem;
+}
+
+/* return the bucket before the current, handling wraparound */
+static inline uint32
+GH_PREV(GH_TYPE * tb, uint32 curelem, uint32 startelem)
+{
+ curelem = (curelem - 1) & tb->sizemask;
+
+ Assert(curelem != startelem);
+
+ return curelem;
+}
+
+/* return the distance between a bucket and its optimal position */
+static inline uint32
+GH_DISTANCE_FROM_OPTIMAL(GH_TYPE * tb, uint32 optimal, uint32 bucket)
+{
+ if (optimal <= bucket)
+ return bucket - optimal;
+ else
+ return (tb->size + bucket) - optimal;
+}
+
+/*
+ * Create a hash table with 'nbuckets' buckets.
+ */
+GH_SCOPE GH_TYPE *
+#ifdef GH_HAVE_PRIVATE_DATA
+GH_CREATE(uint32 nbuckets, void *private_data)
+#else
+GH_CREATE(uint32 nbuckets)
+#endif
+{
+ GH_TYPE *tb;
+ uint32 size;
+ uint32 i;
+
+ tb = GH_ALLOCATE_ZERO(sizeof(GH_TYPE));
+
+#ifdef GH_HAVE_PRIVATE_DATA
+ tb->private_data = private_data;
+#endif
+
+ /* increase nelements by fillfactor, want to store nelements elements */
+ size = (uint32) Min((double) GH_MAX_SIZE, ((double) nbuckets) / GH_FILLFACTOR);
+
+ GH_COMPUTE_PARAMETERS(tb, size);
+
+ tb->buckets = GH_ALLOCATE(sizeof(GH_BUCKET) * tb->size);
+
+ /* ensure all the buckets are set to empty */
+ for (i = 0; i < tb->size; i++)
+ GH_SET_BUCKET_EMPTY(&tb->buckets[i]);
+
+ tb->segments = GH_ALLOCATE_ZERO(sizeof(GH_SEGMENT *) * tb->nsegments);
+ tb->used_segments = GH_ALLOCATE_ZERO(sizeof(GH_BITMAP_WORD) * tb->used_segment_words);
+ return tb;
+}
+
+/* destroy a previously created hash table */
+GH_SCOPE void
+GH_DESTROY(GH_TYPE * tb)
+{
+ GH_FREE(tb->buckets);
+
+ /* Free each segment one by one */
+ for (uint32 n = 0; n < tb->nsegments; n++)
+ {
+ if (tb->segments[n] != NULL)
+ GH_FREE(tb->segments[n]);
+ }
+
+ GH_FREE(tb->segments);
+ GH_FREE(tb->used_segments);
+
+ pfree(tb);
+}
+
+/* reset the contents of a previously created hash table */
+GH_SCOPE void
+GH_RESET(GH_TYPE * tb)
+{
+ int32 i = -1;
+ uint32 x;
+
+ /* reset each used segment one by one */
+ while ((i = GH_NEXT_ONEBIT(tb->used_segments, tb->used_segment_words,
+ i)) >= 0)
+ {
+ GH_SEGMENT *seg = tb->segments[i];
+
+ Assert(seg != NULL);
+
+ seg->nitems = 0;
+ memset(seg->used_items, 0, sizeof(seg->used_items));
+ }
+
+ /* empty every bucket */
+ for (x = 0; x < tb->size; x++)
+ GH_SET_BUCKET_EMPTY(&tb->buckets[x]);
+
+ /* zero the used segment bits */
+ memset(tb->used_segments, 0, sizeof(GH_BITMAP_WORD) * tb->used_segment_words);
+
+ /* and mark the table as having zero members */
+ tb->members = 0;
+
+ /* ensure we start putting any new items in the first segment */
+ tb->first_free_segment = 0;
+}
+
+/*
+ * Grow a hash table to at least 'newsize' buckets.
+ *
+ * Usually this will automatically be called by insertions/deletions, when
+ * necessary. But resizing to the exact input size can be advantageous
+ * performance-wise, when known at some point.
+ */
+GH_SCOPE void
+GH_GROW(GH_TYPE * tb, uint32 newsize)
+{
+ uint32 oldsize = tb->size;
+ uint32 oldnsegments = tb->nsegments;
+ uint32 oldusedsegmentwords = tb->used_segment_words;
+ GH_BUCKET *oldbuckets = tb->buckets;
+ GH_SEGMENT **oldsegments = tb->segments;
+ GH_BITMAP_WORD *oldusedsegments = tb->used_segments;
+ GH_BUCKET *newbuckets;
+ uint32 i;
+ uint32 startelem = 0;
+ uint32 copyelem;
+
+ Assert(oldsize == pg_nextpower2_32(oldsize));
+
+ /* compute parameters for new table */
+ GH_COMPUTE_PARAMETERS(tb, newsize);
+
+ tb->buckets = GH_ALLOCATE(sizeof(GH_BUCKET) * tb->size);
+
+ /* Ensure all the buckets are set to empty */
+ for (i = 0; i < tb->size; i++)
+ GH_SET_BUCKET_EMPTY(&tb->buckets[i]);
+
+ newbuckets = tb->buckets;
+
+ /*
+ * Copy buckets from the old buckets to newbuckets. We theoretically could
+ * use GH_INSERT here, to avoid code duplication, but that's more general
+ * than we need. We neither want tb->members increased, nor do we need to
+ * do deal with deleted elements, nor do we need to compare keys. So a
+ * special-cased implementation is a lot faster. Resizing can be time
+ * consuming and frequent, that's worthwhile to optimize.
+ *
+ * To be able to simply move buckets over, we have to start not at the
+ * first bucket (i.e oldbuckets[0]), but find the first bucket that's
+ * either empty or is occupied by an entry at its optimal position. Such a
+ * bucket has to exist in any table with a load factor under 1, as not all
+ * buckets are occupied, i.e. there always has to be an empty bucket. By
+ * starting at such a bucket we can move the entries to the larger table,
+ * without having to deal with conflicts.
+ */
+
+ /* search for the first element in the hash that's not wrapped around */
+ for (i = 0; i < oldsize; i++)
+ {
+ GH_BUCKET *oldbucket = &oldbuckets[i];
+ uint32 hash;
+ uint32 optimal;
+
+ if (!GH_IS_BUCKET_IN_USE(oldbucket))
+ {
+ startelem = i;
+ break;
+ }
+
+ hash = oldbucket->hashvalue;
+ optimal = GH_INITIAL_BUCKET(tb, hash);
+
+ if (optimal == i)
+ {
+ startelem = i;
+ break;
+ }
+ }
+
+ /* and copy all elements in the old table */
+ copyelem = startelem;
+ for (i = 0; i < oldsize; i++)
+ {
+ GH_BUCKET *oldbucket = &oldbuckets[copyelem];
+
+ if (GH_IS_BUCKET_IN_USE(oldbucket))
+ {
+ uint32 hash;
+ uint32 startelem;
+ uint32 curelem;
+ GH_BUCKET *newbucket;
+
+ hash = oldbucket->hashvalue;
+ startelem = GH_INITIAL_BUCKET(tb, hash);
+ curelem = startelem;
+
+ /* find empty element to put data into */
+ for (;;)
+ {
+ newbucket = &newbuckets[curelem];
+
+ if (!GH_IS_BUCKET_IN_USE(newbucket))
+ break;
+
+ curelem = GH_NEXT(tb, curelem, startelem);
+ }
+
+ /* copy entry to new slot */
+ memcpy(newbucket, oldbucket, sizeof(GH_BUCKET));
+ }
+
+ /* can't use GH_NEXT here, would use new size */
+ copyelem++;
+ if (copyelem >= oldsize)
+ copyelem = 0;
+ }
+
+ GH_FREE(oldbuckets);
+
+ /*
+ * Enlarge the segment array so we can store enough segments for the new
+ * hash table capacity.
+ */
+ tb->segments = GH_ALLOCATE(sizeof(GH_SEGMENT *) * tb->nsegments);
+ memcpy(tb->segments, oldsegments, sizeof(GH_SEGMENT *) * oldnsegments);
+ /* zero the newly extended part of the array */
+ memset(&tb->segments[oldnsegments], 0, sizeof(GH_SEGMENT *) *
+ (tb->nsegments - oldnsegments));
+ GH_FREE(oldsegments);
+
+ /*
+ * The majority of tables will only ever need one bitmap word to store
+ * used segments, so we only bother to reallocate the used_segments array
+ * if the number of bitmap words has actually changed.
+ */
+ if (tb->used_segment_words != oldusedsegmentwords)
+ {
+ tb->used_segments = GH_ALLOCATE(sizeof(GH_BITMAP_WORD) *
+ tb->used_segment_words);
+ memcpy(tb->used_segments, oldusedsegments, sizeof(GH_BITMAP_WORD) *
+ oldusedsegmentwords);
+ memset(&tb->used_segments[oldusedsegmentwords], 0,
+ sizeof(GH_BITMAP_WORD) * (tb->used_segment_words -
+ oldusedsegmentwords));
+
+ GH_FREE(oldusedsegments);
+ }
+}
+
+/*
+ * This is a separate static inline function, so it can reliably be inlined
+ * into its wrapper functions even if GH_SCOPE is extern.
+ */
+static inline GH_ELEMENT_TYPE *
+GH_INSERT_HASH_INTERNAL(GH_TYPE * tb, GH_KEY_TYPE key, uint32 hash, bool *found)
+{
+ uint32 startelem;
+ uint32 curelem;
+ GH_BUCKET *buckets;
+ uint32 insertdist;
+
+restart:
+ insertdist = 0;
+
+ /*
+ * To avoid doing the grow check inside the loop, we do the grow check
+ * regardless of if the key is present. This also lets us avoid having to
+ * re-find our position in the hashtable after resizing.
+ *
+ * Note that this is also reached when resizing the table due to
+ * GH_GROW_MAX_DIB / GH_GROW_MAX_MOVE.
+ */
+ if (unlikely(tb->members >= tb->grow_threshold))
+ {
+ /* this may wrap back to 0 when we're already at GH_MAX_SIZE */
+ GH_GROW(tb, tb->size * 2);
+ }
+
+ /* perform the insert starting the bucket search at optimal location */
+ buckets = tb->buckets;
+ startelem = GH_INITIAL_BUCKET(tb, hash);
+ curelem = startelem;
+ for (;;)
+ {
+ GH_BUCKET *bucket = &buckets[curelem];
+ GH_ELEMENT_TYPE *entry;
+ uint32 curdist;
+ uint32 curhash;
+ uint32 curoptimal;
+
+ /* any empty bucket can directly be used */
+ if (!GH_IS_BUCKET_IN_USE(bucket))
+ {
+ uint32 index;
+
+ /* and add the new entry */
+ tb->members++;
+
+ entry = GH_GET_NEXT_UNUSED_ENTRY(tb, &index);
+ entry->GH_KEY = key;
+ bucket->hashvalue = hash;
+ GH_SET_BUCKET_IN_USE(bucket, index);
+ *found = false;
+ return entry;
+ }
+
+ curhash = bucket->hashvalue;
+
+ if (curhash == hash)
+ {
+ /*
+ * The hash value matches so we just need to ensure the key
+ * matches too. To do that, we need to lookup the entry in the
+ * segments using the index stored in the bucket.
+ */
+ entry = GH_INDEX_TO_ELEMENT(tb, bucket->index);
+
+ /* if we find a match, we're done */
+ if (GH_EQUAL(tb, key, entry->GH_KEY))
+ {
+ Assert(GH_IS_BUCKET_IN_USE(bucket));
+ *found = true;
+ return entry;
+ }
+ }
+
+ /*
+ * For non-empty, non-matching buckets we have to decide whether to
+ * skip over or move the colliding entry. When the colliding
+ * element's distance to its optimal position is smaller than the
+ * to-be-inserted entry's, we shift the colliding entry (and its
+ * followers) one bucket closer to their optimal position.
+ */
+ curoptimal = GH_INITIAL_BUCKET(tb, curhash);
+ curdist = GH_DISTANCE_FROM_OPTIMAL(tb, curoptimal, curelem);
+
+ if (insertdist > curdist)
+ {
+ GH_ELEMENT_TYPE *entry;
+ GH_BUCKET *lastbucket = bucket;
+ uint32 emptyelem = curelem;
+ uint32 moveelem;
+ int32 emptydist = 0;
+ uint32 index;
+
+ /* find next empty bucket */
+ for (;;)
+ {
+ GH_BUCKET *emptybucket;
+
+ emptyelem = GH_NEXT(tb, emptyelem, startelem);
+ emptybucket = &buckets[emptyelem];
+
+ if (!GH_IS_BUCKET_IN_USE(emptybucket))
+ {
+ lastbucket = emptybucket;
+ break;
+ }
+
+ /*
+ * To avoid negative consequences from overly imbalanced
+ * hashtables, grow the hashtable if collisions would require
+ * us to move a lot of entries. The most likely cause of such
+ * imbalance is filling a (currently) small table, from a
+ * currently big one, in hashtable order. Don't grow if the
+ * hashtable would be too empty, to prevent quick space
+ * explosion for some weird edge cases.
+ */
+ if (unlikely(++emptydist > GH_GROW_MAX_MOVE) &&
+ ((double) tb->members / tb->size) >= GH_GROW_MIN_FILLFACTOR)
+ {
+ tb->grow_threshold = 0;
+ goto restart;
+ }
+ }
+
+ /* shift forward, starting at last occupied element */
+
+ /*
+ * TODO: This could be optimized to be one memcpy in many cases,
+ * excepting wrapping around at the end of ->data. Hasn't shown up
+ * in profiles so far though.
+ */
+ moveelem = emptyelem;
+ while (moveelem != curelem)
+ {
+ GH_BUCKET *movebucket;
+
+ moveelem = GH_PREV(tb, moveelem, startelem);
+ movebucket = &buckets[moveelem];
+
+ memcpy(lastbucket, movebucket, sizeof(GH_BUCKET));
+ lastbucket = movebucket;
+ }
+
+ /* and add the new entry */
+ tb->members++;
+
+ entry = GH_GET_NEXT_UNUSED_ENTRY(tb, &index);
+ entry->GH_KEY = key;
+ bucket->hashvalue = hash;
+ GH_SET_BUCKET_IN_USE(bucket, index);
+ *found = false;
+ return entry;
+ }
+
+ curelem = GH_NEXT(tb, curelem, startelem);
+ insertdist++;
+
+ /*
+ * To avoid negative consequences from overly imbalanced hashtables,
+ * grow the hashtable if collisions lead to large runs. The most
+ * likely cause of such imbalance is filling a (currently) small
+ * table, from a currently big one, in hashtable order. Don't grow if
+ * the hashtable would be too empty, to prevent quick space explosion
+ * for some weird edge cases.
+ */
+ if (unlikely(insertdist > GH_GROW_MAX_DIB) &&
+ ((double) tb->members / tb->size) >= GH_GROW_MIN_FILLFACTOR)
+ {
+ tb->grow_threshold = 0;
+ goto restart;
+ }
+ }
+}
+
+/*
+ * Insert the key into the hashtable, set *found to true if the key already
+ * exists, false otherwise. Returns the hashtable entry in either case.
+ */
+GH_SCOPE GH_ELEMENT_TYPE *
+GH_INSERT(GH_TYPE * tb, GH_KEY_TYPE key, bool *found)
+{
+ uint32 hash = GH_HASH_KEY(tb, key);
+
+ return GH_INSERT_HASH_INTERNAL(tb, key, hash, found);
+}
+
+/*
+ * Insert the key into the hashtable using an already-calculated hash. Set
+ * *found to true if the key already exists, false otherwise. Returns the
+ * hashtable entry in either case.
+ */
+GH_SCOPE GH_ELEMENT_TYPE *
+GH_INSERT_HASH(GH_TYPE * tb, GH_KEY_TYPE key, uint32 hash, bool *found)
+{
+ return GH_INSERT_HASH_INTERNAL(tb, key, hash, found);
+}
+
+/*
+ * This is a separate static inline function, so it can reliably be inlined
+ * into its wrapper functions even if GH_SCOPE is extern.
+ */
+static inline GH_ELEMENT_TYPE *
+GH_LOOKUP_HASH_INTERNAL(GH_TYPE * tb, GH_KEY_TYPE key, uint32 hash)
+{
+ const uint32 startelem = GH_INITIAL_BUCKET(tb, hash);
+ uint32 curelem = startelem;
+
+ for (;;)
+ {
+ GH_BUCKET *bucket = &tb->buckets[curelem];
+
+ if (!GH_IS_BUCKET_IN_USE(bucket))
+ return NULL;
+
+ if (bucket->hashvalue == hash)
+ {
+ GH_ELEMENT_TYPE *entry;
+
+ /*
+ * The hash value matches so we just need to ensure the key
+ * matches too. To do that, we need to lookup the entry in the
+ * segments using the index stored in the bucket.
+ */
+ entry = GH_INDEX_TO_ELEMENT(tb, bucket->index);
+
+ /* if we find a match, we're done */
+ if (GH_EQUAL(tb, key, entry->GH_KEY))
+ return entry;
+ }
+
+ /*
+ * TODO: we could stop search based on distance. If the current
+ * bucket's distance-from-optimal is smaller than what we've skipped
+ * already, the entry doesn't exist.
+ */
+
+ curelem = GH_NEXT(tb, curelem, startelem);
+ }
+}
+
+/*
+ * Lookup an entry in the hash table. Returns NULL if key not present.
+ */
+GH_SCOPE GH_ELEMENT_TYPE *
+GH_LOOKUP(GH_TYPE * tb, GH_KEY_TYPE key)
+{
+ uint32 hash = GH_HASH_KEY(tb, key);
+
+ return GH_LOOKUP_HASH_INTERNAL(tb, key, hash);
+}
+
+/*
+ * Lookup an entry in the hash table using an already-calculated hash.
+ *
+ * Returns NULL if key not present.
+ */
+GH_SCOPE GH_ELEMENT_TYPE *
+GH_LOOKUP_HASH(GH_TYPE * tb, GH_KEY_TYPE key, uint32 hash)
+{
+ return GH_LOOKUP_HASH_INTERNAL(tb, key, hash);
+}
+
+/*
+ * Delete an entry from hash table by key. Returns whether to-be-deleted key
+ * was present.
+ */
+GH_SCOPE bool
+GH_DELETE(GH_TYPE * tb, GH_KEY_TYPE key)
+{
+ uint32 hash = GH_HASH_KEY(tb, key);
+ uint32 startelem = GH_INITIAL_BUCKET(tb, hash);
+ uint32 curelem = startelem;
+
+ for (;;)
+ {
+ GH_BUCKET *bucket = &tb->buckets[curelem];
+
+ if (!GH_IS_BUCKET_IN_USE(bucket))
+ return false;
+
+ if (bucket->hashvalue == hash)
+ {
+ GH_ELEMENT_TYPE *entry;
+
+ entry = GH_INDEX_TO_ELEMENT(tb, bucket->index);
+
+ if (GH_EQUAL(tb, key, entry->GH_KEY))
+ {
+ GH_BUCKET *lastbucket = bucket;
+
+ /* mark the entry as unused */
+ GH_REMOVE_ENTRY(tb, bucket->index);
+ /* and mark the bucket unused */
+ GH_SET_BUCKET_EMPTY(bucket);
+
+ tb->members--;
+
+ /*
+ * Backward shift following buckets till either an empty
+ * bucket or a bucket at its optimal position is encountered.
+ *
+ * While that sounds expensive, the average chain length is
+ * short, and deletions would otherwise require tombstones.
+ */
+ for (;;)
+ {
+ GH_BUCKET *curbucket;
+ uint32 curhash;
+ uint32 curoptimal;
+
+ curelem = GH_NEXT(tb, curelem, startelem);
+ curbucket = &tb->buckets[curelem];
+
+ if (!GH_IS_BUCKET_IN_USE(curbucket))
+ break;
+
+ curhash = curbucket->hashvalue;
+ curoptimal = GH_INITIAL_BUCKET(tb, curhash);
+
+ /* current is at optimal position, done */
+ if (curoptimal == curelem)
+ {
+ GH_SET_BUCKET_EMPTY(lastbucket);
+ break;
+ }
+
+ /* shift */
+ memcpy(lastbucket, curbucket, sizeof(GH_BUCKET));
+ GH_SET_BUCKET_EMPTY(curbucket);
+
+ lastbucket = curbucket;
+ }
+
+ return true;
+ }
+ }
+ /* TODO: return false; if the distance is too big */
+
+ curelem = GH_NEXT(tb, curelem, startelem);
+ }
+}
+
+/*
+ * Initialize iterator.
+ */
+GH_SCOPE void
+GH_START_ITERATE(GH_TYPE * tb, GH_ITERATOR * iter)
+{
+ iter->cursegidx = -1;
+ iter->curitemidx = -1;
+ iter->found_members = 0;
+ iter->total_members = tb->members;
+}
+
+/*
+ * Iterate over all entries in the hashtable. Return the next occupied entry,
+ * or NULL if there are no more entries.
+ *
+ * During iteration, only the current entry in the hash table and any entry
+ * which was previously visited in the loop may be deleted. Deletion of items
+ * not yet visited is prohibited, as are insertions of new entries.
+ */
+GH_SCOPE GH_ELEMENT_TYPE *
+GH_ITERATE(GH_TYPE * tb, GH_ITERATOR * iter)
+{
+ /*
+ * Bail if we've already visited all members. This check allows us to
+ * exit quickly in cases where the table is large but it only contains a
+ * small number of records. This also means that inserts into the table
+ * are not possible during iteration. If that is done then we may not
+ * visit all items in the table. Rather than ever removing this check to
+ * allow table insertions during iteration, we should add another iterator
+ * where insertions are safe.
+ */
+ if (iter->found_members == iter->total_members)
+ return NULL;
+
+ for (;;)
+ {
+ GH_SEGMENT *seg;
+
+ /* need a new segment? */
+ if (iter->curitemidx == -1)
+ {
+ iter->cursegidx = GH_NEXT_ONEBIT(tb->used_segments,
+ tb->used_segment_words,
+ iter->cursegidx);
+
+ /* no more segments with items? We're done */
+ if (iter->cursegidx == -1)
+ return NULL;
+ }
+
+ seg = tb->segments[iter->cursegidx];
+
+ /* if the segment has items then it certainly shouldn't be NULL */
+ Assert(seg != NULL);
+ /* advance to the next used item in this segment */
+ iter->curitemidx = GH_NEXT_ONEBIT(seg->used_items, GH_BITMAP_WORDS,
+ iter->curitemidx);
+ if (iter->curitemidx >= 0)
+ {
+ iter->found_members++;
+ return &seg->items[iter->curitemidx];
+ }
+
+ /*
+ * GH_NEXT_ONEBIT returns -1 when there are no more bits. We just
+ * loop again to fetch the next segment.
+ */
+ }
+}
+
+#endif /* GH_DEFINE */
+
+/* undefine external parameters, so next hash table can be defined */
+#undef GH_PREFIX
+#undef GH_KEY_TYPE
+#undef GH_KEY
+#undef GH_ELEMENT_TYPE
+#undef GH_HASH_KEY
+#undef GH_SCOPE
+#undef GH_DECLARE
+#undef GH_DEFINE
+#undef GH_EQUAL
+#undef GH_ALLOCATE
+#undef GH_ALLOCATE_ZERO
+#undef GH_FREE
+
+/* undefine locally declared macros */
+#undef GH_MAKE_PREFIX
+#undef GH_MAKE_NAME
+#undef GH_MAKE_NAME_
+#undef GH_ITEMS_PER_SEGMENT
+#undef GH_UNUSED_BUCKET_INDEX
+#undef GH_INDEX_SEGMENT
+#undef GH_INDEX_ITEM
+#undef GH_BITS_PER_WORD
+#undef GH_BITMAP_WORD
+#undef GH_RIGHTMOST_ONE_POS
+#undef GH_BITMAP_WORDS
+#undef GH_WORDNUM
+#undef GH_BITNUM
+#undef GH_RAW_ALLOCATOR
+#undef GH_MAX_SIZE
+#undef GH_FILLFACTOR
+#undef GH_MAX_FILLFACTOR
+#undef GH_GROW_MAX_DIB
+#undef GH_GROW_MAX_MOVE
+#undef GH_GROW_MIN_FILLFACTOR
+
+/* types */
+#undef GH_TYPE
+#undef GH_BUCKET
+#undef GH_SEGMENT
+#undef GH_ITERATOR
+
+/* external function names */
+#undef GH_CREATE
+#undef GH_DESTROY
+#undef GH_RESET
+#undef GH_INSERT
+#undef GH_INSERT_HASH
+#undef GH_DELETE
+#undef GH_LOOKUP
+#undef GH_LOOKUP_HASH
+#undef GH_GROW
+#undef GH_START_ITERATE
+#undef GH_ITERATE
+
+/* internal function names */
+#undef GH_NEXT_ONEBIT
+#undef GH_NEXT_ZEROBIT
+#undef GH_INDEX_TO_ELEMENT
+#undef GH_MARK_SEGMENT_ITEM_USED
+#undef GH_MARK_SEGMENT_ITEM_UNUSED
+#undef GH_GET_NEXT_UNUSED_ENTRY
+#undef GH_REMOVE_ENTRY
+#undef GH_SET_BUCKET_IN_USE
+#undef GH_SET_BUCKET_EMPTY
+#undef GH_IS_BUCKET_IN_USE
+#undef GH_COMPUTE_PARAMETERS
+#undef GH_NEXT
+#undef GH_PREV
+#undef GH_DISTANCE_FROM_OPTIMAL
+#undef GH_INITIAL_BUCKET
+#undef GH_INSERT_HASH_INTERNAL
+#undef GH_LOOKUP_HASH_INTERNAL
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 9b2a421c32..a268879b1c 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -561,7 +561,7 @@ extern void LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks);
extern void LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks);
extern bool LockHeldByMe(const LOCKTAG *locktag, LOCKMODE lockmode);
#ifdef USE_ASSERT_CHECKING
-extern HTAB *GetLockMethodLocalHash(void);
+extern LOCALLOCK **GetLockMethodLocalLocks(uint32 *size);
#endif
extern bool LockHasWaiters(const LOCKTAG *locktag,
LOCKMODE lockmode, bool sessionLock);
--
2.27.0
On Sun, Jun 20, 2021 at 6:56 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Wed, 14 Aug 2019 at 19:25, David Rowley <david.rowley@2ndquadrant.com>
wrote:For now, I'm out of ideas. If anyone else feels like suggesting
something of picking this up, feel free.This is a pretty old thread, so we might need a recap:
# Recap
Basically LockReleaseAll() becomes slow after a large number of locks
have all been held at once by a backend. The problem is that the
LockMethodLocalHash dynahash table must grow to store all the locks
and when later transactions only take a few locks, LockReleaseAll() is
slow due to hash_seq_search() having to skip the sparsely filled
buckets in the bloated hash table.The following things were tried on this thread. Each one failed:
1) Use a dlist in LOCALLOCK to track the next and prev LOCALLOCK.
Simply loop over the dlist in LockReleaseAll().
2) Try dropping and recreating the dynahash table if it becomes
bloated using some heuristics to decide what "bloated" means and if
recreating is worthwhile.#1 failed due to concerns with increasing the size of LOCALLOCK to
store the dlist pointers. Performance regressions were seen too.
Possibly due to size increase or additional overhead from pushing onto
the dlist.
#2 failed because it was difficult to agree on what the heuristics
would be and we had no efficient way to determine the maximum number
of locks that a given transaction held at any one time. We only know
how many were still held at LockReleaseAll().There were also some suggestions to fix dynahash's hash_seq_search()
slowness, and also a suggestion to try using simplehash.h instead of
dynahash.c. Unfortunately simplehash.h would suffer the same issues as
it too would have to skip over empty buckets in a sparsely populated
hash table.I'd like to revive this effort as I have a new idea on how to solve the
problem.# Background
Over in [1] I'm trying to improve the performance of smgropen() during
recovery. The slowness there comes from the dynahash table lookups to
find the correct SMgrRelation. Over there I proposed to use simplehash
instead of dynahash because it's quite a good bit faster and far
lessens the hash lookup overhead during recovery. One problem on that
thread is that relcache keeps a pointer into the SMgrRelation
(RelationData.rd_smgr) and because simplehash moves things around
during inserts and deletes, then we can't have anything point to
simplehash entries, they're unstable. I fixed that over on the other
thread by having the simplehash entry point to a palloced
SMgrRelationData... My problem is, I don't really like that idea as it
means we need to palloc() pfree() lots of little chunks of memory.To fix the above, I really think we need a version of simplehash that
has stable pointers. Providing that implementation is faster than
dynahash, then it will help solve the smgropen() slowness during
recovery.# A new hashtable implementation
I ended up thinking of this thread again because the implementation of
the stable pointer hash that I ended up writing for [1] happens to be
lightning fast for hash table sequential scans, even if the table has
become bloated. The reason the seq scans are so fast is that the
implementation loops over the data arrays, which are tightly packed
and store the actual data rather than pointers to the data. The code
does not need to loop over the bucket array for this at all, so how
large that has become is irrelevant to hash table seq scan
performance.The patch stores elements in "segments" which is set to some power of
2 value. When we run out of space to store new items in a segment, we
just allocate another segment. When we remove items from the table,
new items reuse the first unused item in the first segment with free
space. This helps to keep the used elements tightly packed. A
segment keeps a bitmap of used items so that means scanning all used
items is very fast. If you flip the bits in the used_item bitmap,
then you get a free list for that segment, so it's also very fast to
find a free element when inserting a new item into the table.I've called the hash table implementation "generichash". It uses the
same preprocessor tricks as simplehash.h does and borrows the same
linear probing code that's used in simplehash. The bucket array just
stores the hash value and a uint32 index into the segment item that
stores the data. Since segments store a power of 2 items, we can
easily address both the segment number and the item within the segment
from the single uint32 index value. The 'index' field just stores a
special value when the bucket is empty. No need to add another field
for that. This means the bucket type is just 8 bytes wide.One thing I will mention about the new hash table implementation is
that GH_ITEMS_PER_SEGMENT is, by default, set to 256. This means
that's the minimum size for the table. I could drop this downto 64,
but that's still quite a bit more than the default size of the
dynahash table of 16. I think 16 is a bit on the low side and that it
might be worth making this 64 anyway. I'd just need to lower
GH_ITEMS_PER_SEGMENT down. The current code does not allow it to go
lower as I've done nothing to allow partial bitmap words, they're
64-bits on a 64-bit machine.I've not done too much benchmarking between it and simplehash.h, but I
think in some cases it will be faster. Since the bucket type is just 8
bytes, moving stuff around during insert/deletes will be cheaper than
with simplehash. Lookups are likely to be a bit slower due to having
to lookup the item within the segment, which is a few pointer
dereferences away.A quick test shows an improvement when compared to dynahash.
# select count(pg_try_advisory_lock(99999,99999)) from
generate_series(1,1000000);Master:
Time: 190.257 ms
Time: 189.440 ms
Time: 188.768 msPatched:
Time: 182.779 ms
Time: 182.267 ms
Time: 186.902 msThis is just hitting the local lock table. The advisory lock key is
This is just hitting the local lock table. The advisory lock key is
the same each time, so it remains a lock check. Also, it's a pretty
small gain, but I'm mostly trying to show that the new hash table is
not any slower than dynahash for probing for inserts.

The real wins come from what we're trying to solve in this thread --
the performance of LockReleaseAll().
Benchmarking results measuring the TPS of a simple select from an
empty table after another transaction has bloated the locallock table.

Master:
127544 tps
113971 tps
123255 tps
121615 tps
Patched:
170803 tps
167305 tps
165002 tps
171976 tps
About 38% faster.

The benchmark I used was:

t1.sql:
\set p 1
select a from t1 where a = :p

hp.sql:
select count(*) from hp

"hp" is a hash partitioned table with 10k partitions.
pgbench -j 20 -c 20 -T 60 -M prepared -n -f hp.sql@1 -f t1.sql@100000
postgres

I'm using the query on the hp table to bloat the locallock table. It's
only executed about once in every 100,000 queries. The tps numbers
above are the ones for t1.sql.
I've not quite looked into why yet, but the hp.sql query's performance
improved by 58%. It went from an average of 1.061377 in master to an
average of 1.683616 in the patched version. I can't quite see where
this gain is coming from. It's pretty hard to get good stable
performance results out of this AMD machine, so it might be related to
that. That's why I ran 20 threads; that seems slightly better. The
machine seems to have trouble waking up properly for a single thread.
It would be good if someone could repeat the tests to see if the gains
appear on other hardware.

Also, it would be good to hear what people think about solving the
problem this way.

Patch attached.
David
[1]
/messages/by-id/CAApHDvpkWOGLh_bYg7jproXN8B2g2T9dWDcqsmKsXG5+WwZaqw@mail.gmail.com
Hi,
+ * GH_ELEMENT_TYPE defines the data type that the hashtable stores. Each
+ * instance of GH_ELEMENT_TYPE which is stored in the hash table is done
so
+ * inside a GH_SEGMENT.
I think the second sentence can be written as (since done means stored, it
is redundant):
Each instance of GH_ELEMENT_TYPE is stored in the hash table inside a
GH_SEGMENT.
+ * Macros for translating a bucket's index into the segment and another to
+ * determine the item number within the segment.
+ */
+#define GH_INDEX_SEGMENT(i) (i) / GH_ITEMS_PER_SEGMENT
into the segment -> into the segment number (in the code I see segidx but I
wonder if segment index may cause slight confusion).
+ GH_BITMAP_WORD used_items[GH_BITMAP_WORDS]; /* A 1-bit for each used
item
+ * in the items array */
'A 1-bit' -> One bit (A and 1 mean the same)
+ uint32 first_free_segment;
Since the segment may not be totally free, maybe name the field
first_segment_with_free_slot
+ * This is similar to GH_NEXT_ONEBIT but flips the bits before operating on
+ * each GH_BITMAP_WORD.
It seems the only difference from GH_NEXT_ONEBIT is in this line:
+ GH_BITMAP_WORD word = ~(words[wordnum] & mask); /* flip bits */
If a 4th parameter is added to signify whether the flipping should be done,
these two functions can be unified.
+ * next insert will store in this segment. If it's already pointing to
an
+ * earlier segment, then leave it be.
The last sentence is confusing: first_free_segment cannot point to some
segment and earlier segment at the same time.
Maybe drop the last sentence.
Cheers
Thanks for having a look at this.
On Mon, 21 Jun 2021 at 05:02, Zhihong Yu <zyu@yugabyte.com> wrote:
+ * GH_ELEMENT_TYPE defines the data type that the hashtable stores. Each
+ * instance of GH_ELEMENT_TYPE which is stored in the hash table is done so
+ * inside a GH_SEGMENT.

I think the second sentence can be written as (since done means stored, it is redundant):
Each instance of GH_ELEMENT_TYPE is stored in the hash table inside a GH_SEGMENT.

I've reworded this entire paragraph slightly.
+ * Macros for translating a bucket's index into the segment and another to
+ * determine the item number within the segment.
+ */
+#define GH_INDEX_SEGMENT(i) (i) / GH_ITEMS_PER_SEGMENT

into the segment -> into the segment number (in the code I see segidx but I wonder if segment index may cause slight confusion).

I've adjusted this comment.
+ GH_BITMAP_WORD used_items[GH_BITMAP_WORDS]; /* A 1-bit for each used item
+ * in the items array */

'A 1-bit' -> One bit (A and 1 mean the same)
I think you might have misread this. We're storing a 1-bit for each
used item rather than a 0-bit. If I remove the 'A' then it's not
clear what the meaning of each bit's value is.
+ uint32 first_free_segment;
Since the segment may not be totally free, maybe name the field first_segment_with_free_slot
I don't really like that. It feels too long to me.
+ * This is similar to GH_NEXT_ONEBIT but flips the bits before operating on
+ * each GH_BITMAP_WORD.

It seems the only difference from GH_NEXT_ONEBIT is in this line:
+ GH_BITMAP_WORD word = ~(words[wordnum] & mask); /* flip bits */
If a 4th parameter is added to signify whether the flipping should be done, these two functions can be unified.
I don't want to do that. I'd rather have them separate to ensure the
compiler does not create any additional needless branching. Those
functions are pretty hot.
+ * next insert will store in this segment. If it's already pointing to an
+ * earlier segment, then leave it be.

The last sentence is confusing: first_free_segment cannot point to some segment and an earlier segment at the same time.
Maybe drop the last sentence.
I've adjusted this comment to become:
* Check if we need to lower the first_free_segment. We want new inserts
* to fill the lower segments first, so only change the first_free_segment
* if the removed entry was stored in an earlier segment.
Thanks for having a look at this.
I'll attach an updated patch soon.
David
On Mon, 21 Jun 2021 at 01:56, David Rowley <dgrowleyml@gmail.com> wrote:
# A new hashtable implementation
Also, it would be good to hear what people think about solving the
problem this way.
Because over on [1] I'm also trying to improve the performance of
smgropen(), I posted the patch for the new hash table over there too.
Between that thread and discussions with Thomas and Andres off-list, I
get the idea that pretty much nobody likes the idea of having another
hash table implementation. Thomas wants to solve it another way and
Andres has concerns that working with bitmaps is too slow. Andres
suggested freelists would be faster, but I'm not really a fan of the
idea because, unless I have a freelist per array segment, I won't be
able to quickly identify the earliest segment slot with which to
re-fill unused slots. That would mean memory would get more fragmented over time
instead of the fragmentation being slowly fixed as new items are added
after deletes. So I've not really tried implementing that to see how
it performs.
Both Andres and Thomas expressed a dislike of the name "generichash" too.
Anyway, since I did make a few small changes to the hash table
implementation before doing all that off-list talking, I thought I
should at least post where I got to, so that anything else that
comes up can be compared against this version rather than the earlier
one.
I did end up renaming the hash table to "densehash" rather than
generichash. Naming is hard, but I went with dense as memory density
was on my mind when I wrote it: it has a compact 8-byte bucket width
and packs the data into arrays in a dense way. The word dense came
up a few times, so I went with that.
I also adjusted the hash seq scan code so that it performs better when
faced with a non-sparsely populated table. Previously my benchmark for
that case didn't do well [2].
I've attached the benchmark results from running the benchmark that's
included in hashbench.tar.bz2. I ran this 10 times using the included
test.sh with ./test.sh 10. I included the results I got on my AMD
machine in the attached bz2 file in results.csv.
You can see from the attached dense_vs_generic_vs_simple.png that
dense hash is quite comparable to simplehash for inserts/deletes and
lookups. It's not quite as fast as simplehash at iterations when the
table is not bloated, but blows simplehash out of the water when the
hashtables have become bloated due to having once contained a large
number of records but no longer do.
Anyway, unless there is some interest in me taking this idea further
then, due to the general feedback received on [1]/messages/by-id/CAApHDvowgRaQupC=L37iZPUzx1z7-N8deD7TxQSm8LR+f4L3-A@mail.gmail.com, I'm not planning on
pushing this any further. I'll leave the commitfest entry as is for
now to give others a chance to see this.
David
[1]: /messages/by-id/CAApHDvowgRaQupC=L37iZPUzx1z7-N8deD7TxQSm8LR+f4L3-A@mail.gmail.com
[2]: /messages/by-id/CAApHDvpuzJTQNKQ_bnAccvi-68xuh+v87B4P6ycU-UiN0dqyTg@mail.gmail.com
Attachments:
hashbench.tar.bz2application/octet-stream; name=hashbench.tar.bz2Download
densehash_for_lockreleaseall.patchapplication/octet-stream; name=densehash_for_lockreleaseall.patchDownload
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 108b4d9023..5652bfe22e 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -37,6 +37,7 @@
#include "access/twophase_rmgr.h"
#include "access/xact.h"
#include "access/xlog.h"
+#include "common/hashfn.h"
#include "miscadmin.h"
#include "pg_trace.h"
#include "pgstat.h"
@@ -270,6 +271,19 @@ typedef struct
static volatile FastPathStrongRelationLockData *FastPathStrongRelationLocks;
+#define DH_PREFIX locallocktable
+#define DH_ELEMENT_TYPE LOCALLOCK
+#define DH_KEY_TYPE LOCALLOCKTAG
+#define DH_KEY tag
+#define DH_HASH_KEY(tb, key) hash_bytes((unsigned char *) &key, sizeof(LOCALLOCKTAG))
+#define DH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(LOCALLOCKTAG)) == 0)
+#define DH_ALLOCATE(b) MemoryContextAllocExtended(TopMemoryContext, b, MCXT_ALLOC_HUGE)
+#define DH_ALLOCATE_ZERO(b) MemoryContextAllocExtended(TopMemoryContext, b, MCXT_ALLOC_HUGE | MCXT_ALLOC_ZERO)
+#define DH_FREE(p) pfree(p)
+#define DH_SCOPE static inline
+#define DH_DECLARE
+#define DH_DEFINE
+#include "lib/densehash.h"
/*
* Pointers to hash tables containing lock state
@@ -279,7 +293,7 @@ static volatile FastPathStrongRelationLockData *FastPathStrongRelationLocks;
*/
static HTAB *LockMethodLockHash;
static HTAB *LockMethodProcLockHash;
-static HTAB *LockMethodLocalHash;
+static locallocktable_hash *LockMethodLocalHash;
/* private state for error cleanup */
@@ -467,15 +481,9 @@ InitLocks(void)
* ought to be empty in the postmaster, but for safety let's zap it.)
*/
if (LockMethodLocalHash)
- hash_destroy(LockMethodLocalHash);
+ locallocktable_destroy(LockMethodLocalHash);
- info.keysize = sizeof(LOCALLOCKTAG);
- info.entrysize = sizeof(LOCALLOCK);
-
- LockMethodLocalHash = hash_create("LOCALLOCK hash",
- 16,
- &info,
- HASH_ELEM | HASH_BLOBS);
+ LockMethodLocalHash = locallocktable_create(16);
}
@@ -606,22 +614,37 @@ LockHeldByMe(const LOCKTAG *locktag, LOCKMODE lockmode)
localtag.lock = *locktag;
localtag.mode = lockmode;
- locallock = (LOCALLOCK *) hash_search(LockMethodLocalHash,
- (void *) &localtag,
- HASH_FIND, NULL);
+ locallock = locallocktable_lookup(LockMethodLocalHash, localtag);
return (locallock && locallock->nLocks > 0);
}
#ifdef USE_ASSERT_CHECKING
/*
- * GetLockMethodLocalHash -- return the hash of local locks, for modules that
- * evaluate assertions based on all locks held.
+ * GetLockMethodLocalLocks -- returns an array of all LOCALLOCKs stored in
+ * LockMethodLocalHash.
+ *
+ * The caller must pfree the return value when done. *size is set to the
+ * number of elements in the returned array.
*/
-HTAB *
-GetLockMethodLocalHash(void)
+LOCALLOCK **
+GetLockMethodLocalLocks(uint32 *size)
{
- return LockMethodLocalHash;
+ locallocktable_iterator iterator;
+ LOCALLOCK **locallocks;
+ LOCALLOCK *locallock;
+ uint32 i = 0;
+
+ locallocks = (LOCALLOCK **) palloc(sizeof(LOCALLOCK *) *
+ LockMethodLocalHash->members);
+
+ locallocktable_start_iterate(LockMethodLocalHash, &iterator);
+ while ((locallock = locallocktable_iterate(LockMethodLocalHash,
+ &iterator)) != NULL)
+ locallocks[i++] = locallock;
+
+ *size = i;
+ return locallocks;
}
#endif
@@ -661,9 +684,7 @@ LockHasWaiters(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
localtag.lock = *locktag;
localtag.mode = lockmode;
- locallock = (LOCALLOCK *) hash_search(LockMethodLocalHash,
- (void *) &localtag,
- HASH_FIND, NULL);
+ locallock = locallocktable_lookup(LockMethodLocalHash, localtag);
/*
* let the caller print its own error message, too. Do not ereport(ERROR).
@@ -823,9 +844,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
localtag.lock = *locktag;
localtag.mode = lockmode;
- locallock = (LOCALLOCK *) hash_search(LockMethodLocalHash,
- (void *) &localtag,
- HASH_ENTER, &found);
+ locallock = locallocktable_insert(LockMethodLocalHash, localtag, &found);
/*
* if it's a new locallock object, initialize it
@@ -1390,9 +1409,7 @@ RemoveLocalLock(LOCALLOCK *locallock)
SpinLockRelease(&FastPathStrongRelationLocks->mutex);
}
- if (!hash_search(LockMethodLocalHash,
- (void *) &(locallock->tag),
- HASH_REMOVE, NULL))
+ if (!locallocktable_delete(LockMethodLocalHash, locallock->tag))
elog(WARNING, "locallock table corrupted");
/*
@@ -2002,9 +2019,7 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
localtag.lock = *locktag;
localtag.mode = lockmode;
- locallock = (LOCALLOCK *) hash_search(LockMethodLocalHash,
- (void *) &localtag,
- HASH_FIND, NULL);
+ locallock = locallocktable_lookup(LockMethodLocalHash, localtag);
/*
* let the caller print its own error message, too. Do not ereport(ERROR).
@@ -2178,7 +2193,7 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
void
LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
{
- HASH_SEQ_STATUS status;
+ locallocktable_iterator iterator;
LockMethod lockMethodTable;
int i,
numLockModes;
@@ -2216,9 +2231,10 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
* pointers. Fast-path locks are cleaned up during the locallock table
* scan, though.
*/
- hash_seq_init(&status, LockMethodLocalHash);
+ locallocktable_start_iterate(LockMethodLocalHash, &iterator);
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ while ((locallock = locallocktable_iterate(LockMethodLocalHash,
+ &iterator)) != NULL)
{
/*
* If the LOCALLOCK entry is unused, we must've run out of shared
@@ -2452,15 +2468,16 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
void
LockReleaseSession(LOCKMETHODID lockmethodid)
{
- HASH_SEQ_STATUS status;
+ locallocktable_iterator iterator;
LOCALLOCK *locallock;
if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
elog(ERROR, "unrecognized lock method: %d", lockmethodid);
- hash_seq_init(&status, LockMethodLocalHash);
+ locallocktable_start_iterate(LockMethodLocalHash, &iterator);
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ while ((locallock = locallocktable_iterate(LockMethodLocalHash,
+ &iterator)) != NULL)
{
/* Ignore items that are not of the specified lock method */
if (LOCALLOCK_LOCKMETHOD(*locallock) != lockmethodid)
@@ -2484,12 +2501,13 @@ LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks)
{
if (locallocks == NULL)
{
- HASH_SEQ_STATUS status;
+ locallocktable_iterator iterator;
LOCALLOCK *locallock;
- hash_seq_init(&status, LockMethodLocalHash);
+ locallocktable_start_iterate(LockMethodLocalHash, &iterator);
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ while ((locallock = locallocktable_iterate(LockMethodLocalHash,
+ &iterator)) != NULL)
ReleaseLockIfHeld(locallock, false);
}
else
@@ -2583,12 +2601,13 @@ LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks)
if (locallocks == NULL)
{
- HASH_SEQ_STATUS status;
+ locallocktable_iterator iterator;
LOCALLOCK *locallock;
- hash_seq_init(&status, LockMethodLocalHash);
+ locallocktable_start_iterate(LockMethodLocalHash, &iterator);
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ while ((locallock = locallocktable_iterate(LockMethodLocalHash,
+ &iterator)) != NULL)
LockReassignOwner(locallock, parent);
}
else
@@ -3220,7 +3239,7 @@ LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
void
AtPrepare_Locks(void)
{
- HASH_SEQ_STATUS status;
+ locallocktable_iterator iterator;
LOCALLOCK *locallock;
/*
@@ -3229,9 +3248,10 @@ AtPrepare_Locks(void)
* Fast-path locks are an exception, however: we move any such locks to
* the main table before allowing PREPARE TRANSACTION to succeed.
*/
- hash_seq_init(&status, LockMethodLocalHash);
+ locallocktable_start_iterate(LockMethodLocalHash, &iterator);
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ while ((locallock = locallocktable_iterate(LockMethodLocalHash,
+ &iterator)) != NULL)
{
TwoPhaseLockRecord record;
LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
@@ -3331,7 +3351,7 @@ void
PostPrepare_Locks(TransactionId xid)
{
PGPROC *newproc = TwoPhaseGetDummyProc(xid, false);
- HASH_SEQ_STATUS status;
+ locallocktable_iterator iterator;
LOCALLOCK *locallock;
LOCK *lock;
PROCLOCK *proclock;
@@ -3354,9 +3374,10 @@ PostPrepare_Locks(TransactionId xid)
* pointing to the same proclock, and we daren't end up with any dangling
* pointers.
*/
- hash_seq_init(&status, LockMethodLocalHash);
+ locallocktable_start_iterate(LockMethodLocalHash, &iterator);
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ while ((locallock = locallocktable_iterate(LockMethodLocalHash,
+ &iterator)) != NULL)
{
LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
bool haveSessionLock;
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 5dac9f0696..4a924fbffb 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -3008,12 +3008,13 @@ void
AssertPendingSyncs_RelationCache(void)
{
HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
+ LOCALLOCK **locallocks;
Relation *rels;
int maxrels;
int nrels;
RelIdCacheEnt *idhentry;
int i;
+ uint32 nlocallocks;
/*
* Open every relation that this transaction has locked. If, for some
@@ -3026,9 +3027,10 @@ AssertPendingSyncs_RelationCache(void)
maxrels = 1;
rels = palloc(maxrels * sizeof(*rels));
nrels = 0;
- hash_seq_init(&status, GetLockMethodLocalHash());
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ locallocks = GetLockMethodLocalLocks(&nlocallocks);
+ for (i = 0; i < nlocallocks; i++)
{
+ LOCALLOCK *locallock = locallocks[i];
Oid relid;
Relation r;
@@ -3048,6 +3050,7 @@ AssertPendingSyncs_RelationCache(void)
}
rels[nrels++] = r;
}
+ pfree(locallocks);
hash_seq_init(&status, RelationIdCache);
while ((idhentry = (RelIdCacheEnt *) hash_seq_search(&status)) != NULL)
diff --git a/src/include/lib/densehash.h b/src/include/lib/densehash.h
new file mode 100644
index 0000000000..26fab94479
--- /dev/null
+++ b/src/include/lib/densehash.h
@@ -0,0 +1,1436 @@
+/*
+ * densehash.h
+ *
+ * A hashtable implementation which can be included into .c files to
+ * provide a fast hash table implementation specific to the given type.
+ *
+ * DH_ELEMENT_TYPE defines the data type that the hashtable stores. These
+ * are allocated DH_ITEMS_PER_SEGMENT at a time and stored inside a
+ * DH_SEGMENT. Each DH_SEGMENT is allocated on demand only when there are
+ * no free slots to store another DH_ELEMENT_TYPE in an existing segment.
+ * After items are removed from the hash table, the next inserted item's
+ * data will be stored in the earliest free item in the earliest segment
+ * with a free slot. This helps keep the actual data compact, or "dense"
+ * even when the bucket array has become large.
+ *
+ * The bucket array is an array of DH_BUCKET and is dynamically allocated
+ * and may grow as more items are added to the table. The DH_BUCKET type
+ * is very narrow and stores just 2 uint32 values. One of these is the
+ * hash value and the other is the index into the segments which are used
+ * to directly look up the stored DH_ELEMENT_TYPE type.
+ *
+ * During inserts, hash table collisions are dealt with using linear
+ * probing, this means that instead of doing something like chaining with a
+ * linked list, we use the first free bucket which comes after the optimal
+ * bucket. This is much more CPU cache efficient than traversing a linked
+ * list. When we're unable to use the most optimal bucket, we may also
+ * move the contents of subsequent buckets around so that we keep items as
+ * close to their most optimal position as possible. This prevents
+ * excessively long linear probes during lookups.
+ *
+ * During hash table deletes, we must attempt to move the contents of
+ * buckets that are not in their optimal position up to either their
+ * optimal position, or as close as we can get to it. During lookups, this
+ * means that we can stop searching for a non-existing item as soon as we
+ * find an empty bucket.
+ *
+ * Empty buckets are denoted by their 'index' field being set to
+ * DH_UNUSED_BUCKET_INDEX. This is done rather than adding a special field
+ * so that we can keep the DH_BUCKET type as narrow as possible.
+ * Conveniently sizeof(DH_BUCKET) is 8, which allows 8 of these to fit on a
+ * single 64-byte cache line. It's important to keep this type as narrow as
+ * possible so that we can perform hash lookups by hitting as few
+ * cache lines as possible.
+ *
+ * The implementation here is similar to simplehash.h but has the following
+ * benefits:
+ *
+ * - Pointers to elements are stable and are not moved around like they are
+ * in simplehash.h
+ * - Sequential scans of the hash table remain very fast even when the
+ * table is sparsely populated.
+ * - Both simplehash.h and densehash.h may move items around during inserts
+ * and deletes. If DH_ELEMENT_TYPE is large, since simplehash.h stores
+ * the data in the hash bucket, these operations may become expensive in
+ * simplehash.h. In densehash.h these remain fairly cheap as the bucket
+ * is always 8 bytes wide due to the hash entry being stored in the
+ * DH_SEGMENT.
+ *
+ * If none of the above points are important for the given use case then,
+ * please consider using simplehash.h instead.
+ *
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/densehash.h
+ *
+ */
+
+#include "port/pg_bitutils.h"
+
+/* helpers */
+#define DH_MAKE_PREFIX(a) CppConcat(a,_)
+#define DH_MAKE_NAME(name) DH_MAKE_NAME_(DH_MAKE_PREFIX(DH_PREFIX),name)
+#define DH_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* type declarations */
+#define DH_TYPE DH_MAKE_NAME(hash)
+#define DH_BUCKET DH_MAKE_NAME(bucket)
+#define DH_SEGMENT DH_MAKE_NAME(segment)
+#define DH_ITERATOR DH_MAKE_NAME(iterator)
+
+/* function declarations */
+#define DH_CREATE DH_MAKE_NAME(create)
+#define DH_DESTROY DH_MAKE_NAME(destroy)
+#define DH_RESET DH_MAKE_NAME(reset)
+#define DH_INSERT DH_MAKE_NAME(insert)
+#define DH_INSERT_HASH DH_MAKE_NAME(insert_hash)
+#define DH_DELETE DH_MAKE_NAME(delete)
+#define DH_LOOKUP DH_MAKE_NAME(lookup)
+#define DH_LOOKUP_HASH DH_MAKE_NAME(lookup_hash)
+#define DH_GROW DH_MAKE_NAME(grow)
+#define DH_START_ITERATE DH_MAKE_NAME(start_iterate)
+#define DH_ITERATE DH_MAKE_NAME(iterate)
+
+/* internal helper functions (no externally visible prototypes) */
+#define DH_NEXT_ONEBIT DH_MAKE_NAME(next_onebit)
+#define DH_NEXT_ZEROBIT DH_MAKE_NAME(next_zerobit)
+#define DH_INDEX_TO_ELEMENT DH_MAKE_NAME(index_to_element)
+#define DH_MARK_SEGMENT_ITEM_USED DH_MAKE_NAME(mark_segment_item_used)
+#define DH_MARK_SEGMENT_ITEM_UNUSED DH_MAKE_NAME(mark_segment_item_unused)
+#define DH_GET_NEXT_UNUSED_ENTRY DH_MAKE_NAME(get_next_unused_entry)
+#define DH_REMOVE_ENTRY DH_MAKE_NAME(remove_entry)
+#define DH_SET_BUCKET_IN_USE DH_MAKE_NAME(set_bucket_in_use)
+#define DH_SET_BUCKET_EMPTY DH_MAKE_NAME(set_bucket_empty)
+#define DH_IS_BUCKET_IN_USE DH_MAKE_NAME(is_bucket_in_use)
+#define DH_COMPUTE_PARAMETERS DH_MAKE_NAME(compute_parameters)
+#define DH_NEXT DH_MAKE_NAME(next)
+#define DH_PREV DH_MAKE_NAME(prev)
+#define DH_DISTANCE_FROM_OPTIMAL DH_MAKE_NAME(distance)
+#define DH_INITIAL_BUCKET DH_MAKE_NAME(initial_bucket)
+#define DH_INSERT_HASH_INTERNAL DH_MAKE_NAME(insert_hash_internal)
+#define DH_LOOKUP_HASH_INTERNAL DH_MAKE_NAME(lookup_hash_internal)
+
+/*
+ * When allocating memory to store instances of DH_ELEMENT_TYPE, how many
+ * should we allocate at once? This must be a power of 2 and at least
+ * DH_BITS_PER_WORD.
+ */
+#ifndef DH_ITEMS_PER_SEGMENT
+#define DH_ITEMS_PER_SEGMENT 256
+#endif
+
+/* A special index to set DH_BUCKET->index to when it's not in use */
+#define DH_UNUSED_BUCKET_INDEX PG_UINT32_MAX
+
+/*
+ * Macros for translating a bucket's index into the segment index and another
+ * to determine the item number within the segment.
+ */
+#define DH_INDEX_SEGMENT(i) (i) / DH_ITEMS_PER_SEGMENT
+#define DH_INDEX_ITEM(i) (i) % DH_ITEMS_PER_SEGMENT
+
+ /*
+ * How many elements do we need in the bitmap array to store a bit for each
+ * of the DH_ITEMS_PER_SEGMENT items. Keep the word size native to the processor.
+ */
+#if SIZEOF_VOID_P >= 8
+
+#define DH_BITS_PER_WORD 64
+#define DH_BITMAP_WORD uint64
+#define DH_RIGHTMOST_ONE_POS(x) pg_rightmost_one_pos64(x)
+
+#else
+
+#define DH_BITS_PER_WORD 32
+#define DH_BITMAP_WORD uint32
+#define DH_RIGHTMOST_ONE_POS(x) pg_rightmost_one_pos32(x)
+
+#endif
+
+/* Sanity check on DH_ITEMS_PER_SEGMENT setting */
+#if DH_ITEMS_PER_SEGMENT < DH_BITS_PER_WORD
+#error "DH_ITEMS_PER_SEGMENT must be >= DH_BITS_PER_WORD"
+#endif
+
+/* Ensure DH_ITEMS_PER_SEGMENT is a power of 2 */
+#if (DH_ITEMS_PER_SEGMENT & (DH_ITEMS_PER_SEGMENT - 1)) != 0
+#error "DH_ITEMS_PER_SEGMENT must be a power of 2"
+#endif
+
+#define DH_BITMAP_WORDS (DH_ITEMS_PER_SEGMENT / DH_BITS_PER_WORD)
+#define DH_WORDNUM(x) ((x) / DH_BITS_PER_WORD)
+#define DH_BITNUM(x) ((x) % DH_BITS_PER_WORD)
+
+/* generate forward declarations necessary to use the hash table */
+#ifdef DH_DECLARE
+
+typedef struct DH_BUCKET
+{
+ uint32 hashvalue; /* Hash value for this bucket */
+ uint32 index; /* Index to the actual data */
+} DH_BUCKET;
+
+typedef struct DH_SEGMENT
+{
+ uint32 nitems; /* Number of items stored */
+ DH_BITMAP_WORD used_items[DH_BITMAP_WORDS]; /* A 1-bit for each used item
+ * in the items array */
+ DH_ELEMENT_TYPE items[DH_ITEMS_PER_SEGMENT]; /* the actual data */
+} DH_SEGMENT;
+
+/* type definitions */
+
+/*
+ * DH_TYPE
+ * Hash table metadata type
+ */
+typedef struct DH_TYPE
+{
+ /*
+ * Size of bucket array. Note that the maximum number of elements is
+ * lower (DH_MAX_FILLFACTOR)
+ */
+ uint32 size;
+
+ /* mask for bucket and size calculations, based on size */
+ uint32 sizemask;
+
+ /* the number of elements stored */
+ uint32 members;
+
+ /* boundary after which to grow hashtable */
+ uint32 grow_threshold;
+
+ /* how many elements are there in the segments array */
+ uint32 nsegments;
+
+ /* the number of elements in the used_segments array */
+ uint32 used_segment_words;
+
+ /*
+ * The first segment we should search in for an empty slot. This will be
+ * the first segment that DH_GET_NEXT_UNUSED_ENTRY will search in when
+ * looking for an unused entry. We'll increase the value of this when we
+ * fill a segment and we'll lower it down when we delete an item from a
+ * segment lower than this value.
+ */
+ uint32 first_free_segment;
+
+ /* dynamically allocated array of hash buckets */
+ DH_BUCKET *buckets;
+
+ /* an array of segment pointers to store data */
+ DH_SEGMENT **segments;
+
+ /*
+ * A bitmap of non-empty segments. A 1-bit denotes that the corresponding
+ * segment is non-empty.
+ */
+ DH_BITMAP_WORD *used_segments;
+
+#ifdef DH_HAVE_PRIVATE_DATA
+ /* user defined data, useful for callbacks */
+ void *private_data;
+#endif
+} DH_TYPE;
+
+/*
+ * DH_ITERATOR
+ * Used when looping over the contents of the hash table.
+ */
+typedef struct DH_ITERATOR
+{
+ int32 cursegidx; /* current segment. -1 means not started */
+ int32 curitemidx; /* current item within cursegidx, -1 means not
+ * started */
+ uint32 found_members; /* number of items visited so far in the loop */
+ uint32 total_members; /* number of items that existed at the start of
+ * the iteration. */
+} DH_ITERATOR;
+
+/* externally visible function prototypes */
+
+#ifdef DH_HAVE_PRIVATE_DATA
+/* <prefix>_hash <prefix>_create(uint32 nbuckets, void *private_data) */
+DH_SCOPE DH_TYPE *DH_CREATE(uint32 nbuckets, void *private_data);
+#else
+/* <prefix>_hash <prefix>_create(uint32 nbuckets) */
+DH_SCOPE DH_TYPE *DH_CREATE(uint32 nbuckets);
+#endif
+
+/* void <prefix>_destroy(<prefix>_hash *tb) */
+DH_SCOPE void DH_DESTROY(DH_TYPE * tb);
+
+/* void <prefix>_reset(<prefix>_hash *tb) */
+DH_SCOPE void DH_RESET(DH_TYPE * tb);
+
+/* void <prefix>_grow(<prefix>_hash *tb) */
+DH_SCOPE void DH_GROW(DH_TYPE * tb, uint32 newsize);
+
+/* <element> *<prefix>_insert(<prefix>_hash *tb, <key> key, bool *found) */
+DH_SCOPE DH_ELEMENT_TYPE *DH_INSERT(DH_TYPE * tb, DH_KEY_TYPE key,
+ bool *found);
+
+/*
+ * <element> *<prefix>_insert_hash(<prefix>_hash *tb, <key> key, uint32 hash,
+ * bool *found)
+ */
+DH_SCOPE DH_ELEMENT_TYPE *DH_INSERT_HASH(DH_TYPE * tb, DH_KEY_TYPE key,
+ uint32 hash, bool *found);
+
+/* <element> *<prefix>_lookup(<prefix>_hash *tb, <key> key) */
+DH_SCOPE DH_ELEMENT_TYPE *DH_LOOKUP(DH_TYPE * tb, DH_KEY_TYPE key);
+
+/* <element> *<prefix>_lookup_hash(<prefix>_hash *tb, <key> key, uint32 hash) */
+DH_SCOPE DH_ELEMENT_TYPE *DH_LOOKUP_HASH(DH_TYPE * tb, DH_KEY_TYPE key,
+ uint32 hash);
+
+/* bool <prefix>_delete(<prefix>_hash *tb, <key> key) */
+DH_SCOPE bool DH_DELETE(DH_TYPE * tb, DH_KEY_TYPE key);
+
+/* void <prefix>_start_iterate(<prefix>_hash *tb, <prefix>_iterator *iter) */
+DH_SCOPE void DH_START_ITERATE(DH_TYPE * tb, DH_ITERATOR * iter);
+
+/* <element> *<prefix>_iterate(<prefix>_hash *tb, <prefix>_iterator *iter) */
+DH_SCOPE DH_ELEMENT_TYPE *DH_ITERATE(DH_TYPE * tb, DH_ITERATOR * iter);
+
+#endif /* DH_DECLARE */
+
+/* generate implementation of the hash table */
+#ifdef DH_DEFINE
+
+/*
+ * The maximum size for the hash table. This must be a power of 2. We cannot
+ * make this PG_UINT32_MAX + 1 because we use DH_UNUSED_BUCKET_INDEX to denote an
+ * empty bucket. Doing so would mean we could accidentally set a used
+ * bucket's index to DH_UNUSED_BUCKET_INDEX.
+ */
+#define DH_MAX_SIZE ((uint32) PG_INT32_MAX + 1)
+
+/* normal fillfactor, unless already close to maximum */
+#ifndef DH_FILLFACTOR
+#define DH_FILLFACTOR (0.9)
+#endif
+/* increase fillfactor if we otherwise would error out */
+#define DH_MAX_FILLFACTOR (0.98)
+/* grow if actual and optimal location bigger than */
+#ifndef DH_GROW_MAX_DIB
+#define DH_GROW_MAX_DIB 25
+#endif
+/*
+ * Grow if more than this number of buckets needs to be moved when inserting.
+ */
+#ifndef DH_GROW_MAX_MOVE
+#define DH_GROW_MAX_MOVE 150
+#endif
+#ifndef DH_GROW_MIN_FILLFACTOR
+/* but do not grow due to DH_GROW_MAX_* if below */
+#define DH_GROW_MIN_FILLFACTOR 0.1
+#endif
+
+/*
+ * Wrap the following definitions in include guards, to avoid multiple
+ * definition errors if this header is included more than once. The rest of
+ * the file deliberately has no include guards, because it can be included
+ * with different parameters to define functions and types with non-colliding
+ * names.
+ */
+#ifndef DENSEHASH_H
+#define DENSEHASH_H
+
+#ifdef FRONTEND
+#define gh_error(...) pg_log_error(__VA_ARGS__)
+#define gh_log(...) pg_log_info(__VA_ARGS__)
+#else
+#define gh_error(...) elog(ERROR, __VA_ARGS__)
+#define gh_log(...) elog(LOG, __VA_ARGS__)
+#endif
+
+#endif /* DENSEHASH_H */
+
+/*
+ * Gets the position of the first 1-bit which comes after 'prevbit' in the
+ * 'words' array. 'nwords' is the size of the 'words' array.
+ */
+static inline int32
+DH_NEXT_ONEBIT(DH_BITMAP_WORD * words, uint32 nwords, int32 prevbit)
+{
+ uint32 wordnum;
+
+ prevbit++;
+
+ wordnum = DH_WORDNUM(prevbit);
+ if (wordnum < nwords)
+ {
+ DH_BITMAP_WORD mask = (~(DH_BITMAP_WORD) 0) << DH_BITNUM(prevbit);
+ DH_BITMAP_WORD word = words[wordnum] & mask;
+
+ if (word != 0)
+ return wordnum * DH_BITS_PER_WORD + DH_RIGHTMOST_ONE_POS(word);
+
+ for (++wordnum; wordnum < nwords; wordnum++)
+ {
+ word = words[wordnum];
+
+ if (word != 0)
+ {
+ int32 result = wordnum * DH_BITS_PER_WORD;
+
+ result += DH_RIGHTMOST_ONE_POS(word);
+ return result;
+ }
+ }
+ }
+ return -1;
+}
+
+/*
+ * Gets the position of the first 0-bit which comes after 'prevbit' in the
+ * 'words' array. 'nwords' is the size of the 'words' array.
+ *
+ * This is similar to DH_NEXT_ONEBIT but flips the bits before operating on
+ * each DH_BITMAP_WORD.
+ */
+static inline int32
+DH_NEXT_ZEROBIT(DH_BITMAP_WORD * words, uint32 nwords, int32 prevbit)
+{
+ uint32 wordnum;
+
+ prevbit++;
+
+ wordnum = DH_WORDNUM(prevbit);
+ if (wordnum < nwords)
+ {
+ DH_BITMAP_WORD mask = (~(DH_BITMAP_WORD) 0) << DH_BITNUM(prevbit);
+ DH_BITMAP_WORD word = ~(words[wordnum] & mask); /* flip bits */
+
+ if (word != 0)
+ return wordnum * DH_BITS_PER_WORD + DH_RIGHTMOST_ONE_POS(word);
+
+ for (++wordnum; wordnum < nwords; wordnum++)
+ {
+ word = ~words[wordnum]; /* flip bits */
+
+ if (word != 0)
+ {
+ int32 result = wordnum * DH_BITS_PER_WORD;
+
+ result += DH_RIGHTMOST_ONE_POS(word);
+ return result;
+ }
+ }
+ }
+ return -1;
+}
+
+/*
+ * Finds the hash table entry for a given DH_BUCKET's 'index'.
+ */
+static inline DH_ELEMENT_TYPE *
+DH_INDEX_TO_ELEMENT(DH_TYPE * tb, uint32 index)
+{
+ DH_SEGMENT *seg;
+ uint32 segidx;
+ uint32 item;
+
+ segidx = DH_INDEX_SEGMENT(index);
+ item = DH_INDEX_ITEM(index);
+
+ Assert(segidx < tb->nsegments);
+
+ seg = tb->segments[segidx];
+
+ Assert(seg != NULL);
+
+ /* ensure this segment is marked as used */
+ Assert(seg->used_items[DH_WORDNUM(item)] & (((DH_BITMAP_WORD) 1) << DH_BITNUM(item)));
+
+ return &seg->items[item];
+}
+
+static inline void
+DH_MARK_SEGMENT_ITEM_USED(DH_TYPE * tb, DH_SEGMENT * seg, uint32 segidx,
+ uint32 segitem)
+{
+ uint32 word = DH_WORDNUM(segitem);
+ uint32 bit = DH_BITNUM(segitem);
+
+ /* ensure this item is not marked as used */
+ Assert((seg->used_items[word] & (((DH_BITMAP_WORD) 1) << bit)) == 0);
+
+ /* switch on the used bit */
+ seg->used_items[word] |= (((DH_BITMAP_WORD) 1) << bit);
+
+ /* if the segment was previously empty then mark it as used */
+ if (seg->nitems == 0)
+ {
+ word = DH_WORDNUM(segidx);
+ bit = DH_BITNUM(segidx);
+
+ /* switch on the used bit for this segment */
+ tb->used_segments[word] |= (((DH_BITMAP_WORD) 1) << bit);
+ }
+ seg->nitems++;
+}
+
+static inline void
+DH_MARK_SEGMENT_ITEM_UNUSED(DH_TYPE * tb, DH_SEGMENT * seg, uint32 segidx,
+ uint32 segitem)
+{
+ uint32 word = DH_WORDNUM(segitem);
+ uint32 bit = DH_BITNUM(segitem);
+
+ /* ensure this item is marked as used */
+ Assert((seg->used_items[word] & (((DH_BITMAP_WORD) 1) << bit)) != 0);
+
+ /* switch off the used bit */
+ seg->used_items[word] &= ~(((DH_BITMAP_WORD) 1) << bit);
+
+ /* when removing the last item mark the segment as unused */
+ if (seg->nitems == 1)
+ {
+ word = DH_WORDNUM(segidx);
+ bit = DH_BITNUM(segidx);
+
+ /* switch off the used bit for this segment */
+ tb->used_segments[word] &= ~(((DH_BITMAP_WORD) 1) << bit);
+ }
+
+ seg->nitems--;
+}
+
+/*
+ * Returns the first unused entry from the first non-full segment and set
+ * *index to the index of the returned entry.
+ */
+static inline DH_ELEMENT_TYPE *
+DH_GET_NEXT_UNUSED_ENTRY(DH_TYPE * tb, uint32 *index)
+{
+ DH_SEGMENT *seg;
+ uint32 segidx = tb->first_free_segment;
+ uint32 itemidx;
+
+ seg = tb->segments[segidx];
+
+ /* find the first segment with an unused item */
+ while (seg != NULL && seg->nitems == DH_ITEMS_PER_SEGMENT)
+ seg = tb->segments[++segidx];
+
+ tb->first_free_segment = segidx;
+
+ /* allocate the segment if it's not already allocated */
+ if (seg == NULL)
+ {
+ seg = DH_ALLOCATE(sizeof(DH_SEGMENT));
+ tb->segments[segidx] = seg;
+
+ seg->nitems = 0;
+ memset(seg->used_items, 0, sizeof(seg->used_items));
+ /* no need to zero the items array */
+
+ /* use the first slot in this segment */
+ itemidx = 0;
+ }
+ else
+ {
+ /* find the first unused item in this segment */
+ itemidx = DH_NEXT_ZEROBIT(seg->used_items, DH_BITMAP_WORDS, -1);
+ Assert(itemidx >= 0);
+ }
+
+ /* this is a good spot to ensure nitems matches the bits in used_items */
+ Assert(seg->nitems == pg_popcount((const char *) seg->used_items, DH_ITEMS_PER_SEGMENT / 8));
+
+ DH_MARK_SEGMENT_ITEM_USED(tb, seg, segidx, itemidx);
+
+ *index = segidx * DH_ITEMS_PER_SEGMENT + itemidx;
+ return &seg->items[itemidx];
+
+}
+
+/*
+ * Remove the entry denoted by 'index' from its segment.
+ */
+static inline void
+DH_REMOVE_ENTRY(DH_TYPE * tb, uint32 index)
+{
+ DH_SEGMENT *seg;
+ uint32 segidx = DH_INDEX_SEGMENT(index);
+ uint32 item = DH_INDEX_ITEM(index);
+
+ Assert(segidx < tb->nsegments);
+ seg = tb->segments[segidx];
+ Assert(seg != NULL);
+
+ DH_MARK_SEGMENT_ITEM_UNUSED(tb, seg, segidx, item);
+
+ /*
+ * Lower the first free segment index to point to this segment so that the
+ * next insert will store in this segment. If it's already set to a lower
+ * segment number then don't adjust as we want to consume slots from the
+ * earliest segment first.
+ */
+ if (tb->first_free_segment > segidx)
+ tb->first_free_segment = segidx;
+}
+
+/*
+ * Set 'bucket' as in use by 'index'.
+ */
+static inline void
+DH_SET_BUCKET_IN_USE(DH_BUCKET * bucket, uint32 index)
+{
+ bucket->index = index;
+}
+
+/*
+ * Mark 'bucket' as unused.
+ */
+static inline void
+DH_SET_BUCKET_EMPTY(DH_BUCKET * bucket)
+{
+ bucket->index = DH_UNUSED_BUCKET_INDEX;
+}
+
+/*
+ * Return true if 'bucket' is in use.
+ */
+static inline bool
+DH_IS_BUCKET_IN_USE(DH_BUCKET * bucket)
+{
+ return bucket->index != DH_UNUSED_BUCKET_INDEX;
+}
+
+/*
+ * Compute sizing parameters for hashtable. Called when creating and growing
+ * the hashtable.
+ */
+static inline void
+DH_COMPUTE_PARAMETERS(DH_TYPE * tb, uint32 newsize)
+{
+ uint32 size;
+
+ /*
+ * Ensure the bucket array size has not exceeded DH_MAX_SIZE or wrapped
+ * back to zero.
+ */
+ if (newsize == 0 || newsize > DH_MAX_SIZE)
+ gh_error("hash table too large");
+
+ /*
+ * Ensure we don't build a table that can't store an entire single segment
+ * worth of data.
+ */
+ size = Max(newsize, DH_ITEMS_PER_SEGMENT);
+
+ /* round up size to the next power of 2 */
+ size = pg_nextpower2_32(size);
+
+ /* now set size */
+ tb->size = size;
+ tb->sizemask = tb->size - 1;
+
+ /* calculate how many segments we'll need to store 'size' items */
+ tb->nsegments = pg_nextpower2_32(size / DH_ITEMS_PER_SEGMENT);
+
+ /*
+ * Calculate the number of bitmap words needed to store a bit for each
+ * segment.
+ */
+ tb->used_segment_words = (tb->nsegments + DH_BITS_PER_WORD - 1) / DH_BITS_PER_WORD;
+
+ /*
+ * Compute the next threshold at which we need to grow the hash table
+ * again.
+ */
+ if (tb->size == DH_MAX_SIZE)
+ tb->grow_threshold = (uint32) (((double) tb->size) * DH_MAX_FILLFACTOR);
+ else
+ tb->grow_threshold = (uint32) (((double) tb->size) * DH_FILLFACTOR);
+}
+
+/* return the optimal bucket for the hash */
+static inline uint32
+DH_INITIAL_BUCKET(DH_TYPE * tb, uint32 hash)
+{
+ return hash & tb->sizemask;
+}
+
+/* return the next bucket after the current, handling wraparound */
+static inline uint32
+DH_NEXT(DH_TYPE * tb, uint32 curelem, uint32 startelem)
+{
+ curelem = (curelem + 1) & tb->sizemask;
+
+ Assert(curelem != startelem);
+
+ return curelem;
+}
+
+/* return the bucket before the current, handling wraparound */
+static inline uint32
+DH_PREV(DH_TYPE * tb, uint32 curelem, uint32 startelem)
+{
+ curelem = (curelem - 1) & tb->sizemask;
+
+ Assert(curelem != startelem);
+
+ return curelem;
+}
+
+/* return the distance between a bucket and its optimal position */
+static inline uint32
+DH_DISTANCE_FROM_OPTIMAL(DH_TYPE * tb, uint32 optimal, uint32 bucket)
+{
+ if (optimal <= bucket)
+ return bucket - optimal;
+ else
+ return (tb->size + bucket) - optimal;
+}
+
+/*
+ * Create a hash table with 'nbuckets' buckets.
+ */
+DH_SCOPE DH_TYPE *
+#ifdef DH_HAVE_PRIVATE_DATA
+DH_CREATE(uint32 nbuckets, void *private_data)
+#else
+DH_CREATE(uint32 nbuckets)
+#endif
+{
+ DH_TYPE *tb;
+ uint32 size;
+ uint32 i;
+
+ tb = DH_ALLOCATE_ZERO(sizeof(DH_TYPE));
+
+#ifdef DH_HAVE_PRIVATE_DATA
+ tb->private_data = private_data;
+#endif
+
+ /* increase nbuckets by fillfactor, want to store nbuckets elements */
+ size = (uint32) Min((double) DH_MAX_SIZE, ((double) nbuckets) / DH_FILLFACTOR);
+
+ DH_COMPUTE_PARAMETERS(tb, size);
+
+ tb->buckets = DH_ALLOCATE(sizeof(DH_BUCKET) * tb->size);
+
+ /* ensure all the buckets are set to empty */
+ for (i = 0; i < tb->size; i++)
+ DH_SET_BUCKET_EMPTY(&tb->buckets[i]);
+
+ tb->segments = DH_ALLOCATE_ZERO(sizeof(DH_SEGMENT *) * tb->nsegments);
+ tb->used_segments = DH_ALLOCATE_ZERO(sizeof(DH_BITMAP_WORD) * tb->used_segment_words);
+ return tb;
+}
+
+/* destroy a previously created hash table */
+DH_SCOPE void
+DH_DESTROY(DH_TYPE * tb)
+{
+ DH_FREE(tb->buckets);
+
+ /* Free each segment one by one */
+ for (uint32 n = 0; n < tb->nsegments; n++)
+ {
+ if (tb->segments[n] != NULL)
+ DH_FREE(tb->segments[n]);
+ }
+
+ DH_FREE(tb->segments);
+ DH_FREE(tb->used_segments);
+
+ pfree(tb);
+}
+
+/* reset the contents of a previously created hash table */
+DH_SCOPE void
+DH_RESET(DH_TYPE * tb)
+{
+ int32 i = -1;
+ uint32 x;
+
+ /* reset each used segment one by one */
+ while ((i = DH_NEXT_ONEBIT(tb->used_segments, tb->used_segment_words,
+ i)) >= 0)
+ {
+ DH_SEGMENT *seg = tb->segments[i];
+
+ Assert(seg != NULL);
+
+ seg->nitems = 0;
+ memset(seg->used_items, 0, sizeof(seg->used_items));
+ }
+
+ /* empty every bucket */
+ for (x = 0; x < tb->size; x++)
+ DH_SET_BUCKET_EMPTY(&tb->buckets[x]);
+
+ /* zero the used segment bits */
+ memset(tb->used_segments, 0, sizeof(DH_BITMAP_WORD) * tb->used_segment_words);
+
+ /* and mark the table as having zero members */
+ tb->members = 0;
+
+ /* ensure we start putting any new items in the first segment */
+ tb->first_free_segment = 0;
+}
+
+/*
+ * Grow a hash table to at least 'newsize' buckets.
+ *
+ * Usually this will automatically be called by insertions/deletions, when
+ * necessary. But resizing to the exact input size can be advantageous
+ * performance-wise, when known at some point.
+ */
+DH_SCOPE void
+DH_GROW(DH_TYPE * tb, uint32 newsize)
+{
+ uint32 oldsize = tb->size;
+ uint32 oldnsegments = tb->nsegments;
+ uint32 oldusedsegmentwords = tb->used_segment_words;
+ DH_BUCKET *oldbuckets = tb->buckets;
+ DH_SEGMENT **oldsegments = tb->segments;
+ DH_BITMAP_WORD *oldusedsegments = tb->used_segments;
+ DH_BUCKET *newbuckets;
+ uint32 i;
+ uint32 startelem = 0;
+ uint32 copyelem;
+
+ Assert(oldsize == pg_nextpower2_32(oldsize));
+
+ /* compute parameters for new table */
+ DH_COMPUTE_PARAMETERS(tb, newsize);
+
+ tb->buckets = DH_ALLOCATE(sizeof(DH_BUCKET) * tb->size);
+
+ /* Ensure all the buckets are set to empty */
+ for (i = 0; i < tb->size; i++)
+ DH_SET_BUCKET_EMPTY(&tb->buckets[i]);
+
+ newbuckets = tb->buckets;
+
+ /*
+ * Copy buckets from the old buckets to newbuckets. We theoretically could
+ * use DH_INSERT here, to avoid code duplication, but that's more general
+ * than we need. We neither want tb->members increased, nor do we need to
+ * deal with deleted elements, nor do we need to compare keys. So a
+ * special-cased implementation is a lot faster. Resizing can be time
+ * consuming and frequent enough that it's worthwhile to optimize.
+ *
+ * To be able to simply move buckets over, we have to start not at the
+ * first bucket (i.e oldbuckets[0]), but find the first bucket that's
+ * either empty or is occupied by an entry at its optimal position. Such a
+ * bucket has to exist in any table with a load factor under 1, as not all
+ * buckets are occupied, i.e. there always has to be an empty bucket. By
+ * starting at such a bucket we can move the entries to the larger table,
+ * without having to deal with conflicts.
+ */
+
+ /* search for the first element in the hash that's not wrapped around */
+ for (i = 0; i < oldsize; i++)
+ {
+ DH_BUCKET *oldbucket = &oldbuckets[i];
+ uint32 hash;
+ uint32 optimal;
+
+ if (!DH_IS_BUCKET_IN_USE(oldbucket))
+ {
+ startelem = i;
+ break;
+ }
+
+ hash = oldbucket->hashvalue;
+ optimal = DH_INITIAL_BUCKET(tb, hash);
+
+ if (optimal == i)
+ {
+ startelem = i;
+ break;
+ }
+ }
+
+ /* and copy all elements in the old table */
+ copyelem = startelem;
+ for (i = 0; i < oldsize; i++)
+ {
+ DH_BUCKET *oldbucket = &oldbuckets[copyelem];
+
+ if (DH_IS_BUCKET_IN_USE(oldbucket))
+ {
+ uint32 hash;
+ uint32 startelem;
+ uint32 curelem;
+ DH_BUCKET *newbucket;
+
+ hash = oldbucket->hashvalue;
+ startelem = DH_INITIAL_BUCKET(tb, hash);
+ curelem = startelem;
+
+ /* find empty element to put data into */
+ for (;;)
+ {
+ newbucket = &newbuckets[curelem];
+
+ if (!DH_IS_BUCKET_IN_USE(newbucket))
+ break;
+
+ curelem = DH_NEXT(tb, curelem, startelem);
+ }
+
+ /* copy entry to new slot */
+ memcpy(newbucket, oldbucket, sizeof(DH_BUCKET));
+ }
+
+ /* can't use DH_NEXT here, would use new size */
+ copyelem++;
+ if (copyelem >= oldsize)
+ copyelem = 0;
+ }
+
+ DH_FREE(oldbuckets);
+
+ /*
+ * Enlarge the segment array so we can store enough segments for the new
+ * hash table capacity.
+ */
+ tb->segments = DH_ALLOCATE(sizeof(DH_SEGMENT *) * tb->nsegments);
+ memcpy(tb->segments, oldsegments, sizeof(DH_SEGMENT *) * oldnsegments);
+ /* zero the newly extended part of the array */
+ memset(&tb->segments[oldnsegments], 0, sizeof(DH_SEGMENT *) *
+ (tb->nsegments - oldnsegments));
+ DH_FREE(oldsegments);
+
+ /*
+ * The majority of tables will only ever need one bitmap word to store
+ * used segments, so we only bother to reallocate the used_segments array
+ * if the number of bitmap words has actually changed.
+ */
+ if (tb->used_segment_words != oldusedsegmentwords)
+ {
+ tb->used_segments = DH_ALLOCATE(sizeof(DH_BITMAP_WORD) *
+ tb->used_segment_words);
+ memcpy(tb->used_segments, oldusedsegments, sizeof(DH_BITMAP_WORD) *
+ oldusedsegmentwords);
+ memset(&tb->used_segments[oldusedsegmentwords], 0,
+ sizeof(DH_BITMAP_WORD) * (tb->used_segment_words -
+ oldusedsegmentwords));
+
+ DH_FREE(oldusedsegments);
+ }
+}
+
+/*
+ * This is a separate static inline function, so it can reliably be inlined
+ * into its wrapper functions even if DH_SCOPE is extern.
+ */
+static inline DH_ELEMENT_TYPE *
+DH_INSERT_HASH_INTERNAL(DH_TYPE * tb, DH_KEY_TYPE key, uint32 hash, bool *found)
+{
+ uint32 startelem;
+ uint32 curelem;
+ DH_BUCKET *buckets;
+ uint32 insertdist;
+
+restart:
+ insertdist = 0;
+
+ /*
+ * To avoid doing the grow check inside the loop, we do the grow check
+ * regardless of whether the key is present. This also lets us avoid having to
+ * re-find our position in the hashtable after resizing.
+ *
+ * Note that this is also reached when resizing the table due to
+ * DH_GROW_MAX_DIB / DH_GROW_MAX_MOVE.
+ */
+ if (unlikely(tb->members >= tb->grow_threshold))
+ {
+ /* this may wrap back to 0 when we're already at DH_MAX_SIZE */
+ DH_GROW(tb, tb->size * 2);
+ }
+
+ /* perform the insert starting the bucket search at optimal location */
+ buckets = tb->buckets;
+ startelem = DH_INITIAL_BUCKET(tb, hash);
+ curelem = startelem;
+ for (;;)
+ {
+ DH_BUCKET *bucket = &buckets[curelem];
+ DH_ELEMENT_TYPE *entry;
+ uint32 curdist;
+ uint32 curhash;
+ uint32 curoptimal;
+
+ /* any empty bucket can directly be used */
+ if (!DH_IS_BUCKET_IN_USE(bucket))
+ {
+ uint32 index;
+
+ /* and add the new entry */
+ tb->members++;
+
+ entry = DH_GET_NEXT_UNUSED_ENTRY(tb, &index);
+ entry->DH_KEY = key;
+ bucket->hashvalue = hash;
+ DH_SET_BUCKET_IN_USE(bucket, index);
+ *found = false;
+ return entry;
+ }
+
+ curhash = bucket->hashvalue;
+
+ if (curhash == hash)
+ {
+ /*
+ * The hash value matches so we just need to ensure the key
+ * matches too. To do that, we need to look up the entry in the
+ * segments using the index stored in the bucket.
+ */
+ entry = DH_INDEX_TO_ELEMENT(tb, bucket->index);
+
+ /* if we find a match, we're done */
+ if (DH_EQUAL(tb, key, entry->DH_KEY))
+ {
+ Assert(DH_IS_BUCKET_IN_USE(bucket));
+ *found = true;
+ return entry;
+ }
+ }
+
+ /*
+ * For non-empty, non-matching buckets we have to decide whether to
+ * skip over or move the colliding entry. When the colliding
+ * element's distance to its optimal position is smaller than the
+ * to-be-inserted entry's, we shift the colliding entry (and its
+ * followers) forward by one bucket.
+ */
+ curoptimal = DH_INITIAL_BUCKET(tb, curhash);
+ curdist = DH_DISTANCE_FROM_OPTIMAL(tb, curoptimal, curelem);
+
+ if (insertdist > curdist)
+ {
+ DH_ELEMENT_TYPE *entry;
+ DH_BUCKET *lastbucket = bucket;
+ uint32 emptyelem = curelem;
+ uint32 moveelem;
+ int32 emptydist = 0;
+ uint32 index;
+
+ /* find next empty bucket */
+ for (;;)
+ {
+ DH_BUCKET *emptybucket;
+
+ emptyelem = DH_NEXT(tb, emptyelem, startelem);
+ emptybucket = &buckets[emptyelem];
+
+ if (!DH_IS_BUCKET_IN_USE(emptybucket))
+ {
+ lastbucket = emptybucket;
+ break;
+ }
+
+ /*
+ * To avoid negative consequences from overly imbalanced
+ * hashtables, grow the hashtable if collisions would require
+ * us to move a lot of entries. The most likely cause of such
+ * imbalance is filling a (currently) small table, from a
+ * currently big one, in hashtable order. Don't grow if the
+ * hashtable would be too empty, to prevent quick space
+ * explosion for some weird edge cases.
+ */
+ if (unlikely(++emptydist > DH_GROW_MAX_MOVE) &&
+ ((double) tb->members / tb->size) >= DH_GROW_MIN_FILLFACTOR)
+ {
+ tb->grow_threshold = 0;
+ goto restart;
+ }
+ }
+
+ /* shift forward, starting at last occupied element */
+
+ /*
+ * TODO: This could be optimized to be one memcpy in many cases,
+ * excepting wrapping around at the end of ->data. Hasn't shown up
+ * in profiles so far though.
+ */
+ moveelem = emptyelem;
+ while (moveelem != curelem)
+ {
+ DH_BUCKET *movebucket;
+
+ moveelem = DH_PREV(tb, moveelem, startelem);
+ movebucket = &buckets[moveelem];
+
+ memcpy(lastbucket, movebucket, sizeof(DH_BUCKET));
+ lastbucket = movebucket;
+ }
+
+ /* and add the new entry */
+ tb->members++;
+
+ entry = DH_GET_NEXT_UNUSED_ENTRY(tb, &index);
+ entry->DH_KEY = key;
+ bucket->hashvalue = hash;
+ DH_SET_BUCKET_IN_USE(bucket, index);
+ *found = false;
+ return entry;
+ }
+
+ curelem = DH_NEXT(tb, curelem, startelem);
+ insertdist++;
+
+ /*
+ * To avoid negative consequences from overly imbalanced hashtables,
+ * grow the hashtable if collisions lead to large runs. The most
+ * likely cause of such imbalance is filling a (currently) small
+ * table, from a currently big one, in hashtable order. Don't grow if
+ * the hashtable would be too empty, to prevent quick space explosion
+ * for some weird edge cases.
+ */
+ if (unlikely(insertdist > DH_GROW_MAX_DIB) &&
+ ((double) tb->members / tb->size) >= DH_GROW_MIN_FILLFACTOR)
+ {
+ tb->grow_threshold = 0;
+ goto restart;
+ }
+ }
+}
+
+/*
+ * Insert the key into the hashtable, set *found to true if the key already
+ * exists, false otherwise. Returns the hashtable entry in either case.
+ */
+DH_SCOPE DH_ELEMENT_TYPE *
+DH_INSERT(DH_TYPE * tb, DH_KEY_TYPE key, bool *found)
+{
+ uint32 hash = DH_HASH_KEY(tb, key);
+
+ return DH_INSERT_HASH_INTERNAL(tb, key, hash, found);
+}
+
+/*
+ * Insert the key into the hashtable using an already-calculated hash. Set
+ * *found to true if the key already exists, false otherwise. Returns the
+ * hashtable entry in either case.
+ */
+DH_SCOPE DH_ELEMENT_TYPE *
+DH_INSERT_HASH(DH_TYPE * tb, DH_KEY_TYPE key, uint32 hash, bool *found)
+{
+ return DH_INSERT_HASH_INTERNAL(tb, key, hash, found);
+}
+
+/*
+ * This is a separate static inline function, so it can reliably be inlined
+ * into its wrapper functions even if DH_SCOPE is extern.
+ */
+static inline DH_ELEMENT_TYPE *
+DH_LOOKUP_HASH_INTERNAL(DH_TYPE * tb, DH_KEY_TYPE key, uint32 hash)
+{
+ const uint32 startelem = DH_INITIAL_BUCKET(tb, hash);
+ uint32 curelem = startelem;
+
+ for (;;)
+ {
+ DH_BUCKET *bucket = &tb->buckets[curelem];
+
+ if (!DH_IS_BUCKET_IN_USE(bucket))
+ return NULL;
+
+ if (bucket->hashvalue == hash)
+ {
+ DH_ELEMENT_TYPE *entry;
+
+ /*
+ * The hash value matches so we just need to ensure the key
+ * matches too. To do that, we need to look up the entry in the
+ * segments using the index stored in the bucket.
+ */
+ entry = DH_INDEX_TO_ELEMENT(tb, bucket->index);
+
+ /* if we find a match, we're done */
+ if (DH_EQUAL(tb, key, entry->DH_KEY))
+ return entry;
+ }
+
+ /*
+ * TODO: we could stop the search based on distance. If the current
+ * bucket's distance-from-optimal is smaller than what we've skipped
+ * already, the entry doesn't exist.
+ */
+
+ curelem = DH_NEXT(tb, curelem, startelem);
+ }
+}
+
+/*
+ * Lookup an entry in the hash table. Returns NULL if key not present.
+ */
+DH_SCOPE DH_ELEMENT_TYPE *
+DH_LOOKUP(DH_TYPE * tb, DH_KEY_TYPE key)
+{
+ uint32 hash = DH_HASH_KEY(tb, key);
+
+ return DH_LOOKUP_HASH_INTERNAL(tb, key, hash);
+}
+
+/*
+ * Lookup an entry in the hash table using an already-calculated hash.
+ *
+ * Returns NULL if key not present.
+ */
+DH_SCOPE DH_ELEMENT_TYPE *
+DH_LOOKUP_HASH(DH_TYPE * tb, DH_KEY_TYPE key, uint32 hash)
+{
+ return DH_LOOKUP_HASH_INTERNAL(tb, key, hash);
+}
+
+/*
+ * Delete an entry from the hash table by key. Returns whether the to-be-deleted key
+ * was present.
+ */
+DH_SCOPE bool
+DH_DELETE(DH_TYPE * tb, DH_KEY_TYPE key)
+{
+ uint32 hash = DH_HASH_KEY(tb, key);
+ uint32 startelem = DH_INITIAL_BUCKET(tb, hash);
+ uint32 curelem = startelem;
+
+ for (;;)
+ {
+ DH_BUCKET *bucket = &tb->buckets[curelem];
+
+ if (!DH_IS_BUCKET_IN_USE(bucket))
+ return false;
+
+ if (bucket->hashvalue == hash)
+ {
+ DH_ELEMENT_TYPE *entry;
+
+ entry = DH_INDEX_TO_ELEMENT(tb, bucket->index);
+
+ if (DH_EQUAL(tb, key, entry->DH_KEY))
+ {
+ DH_BUCKET *lastbucket = bucket;
+
+ /* mark the entry as unused */
+ DH_REMOVE_ENTRY(tb, bucket->index);
+ /* and mark the bucket unused */
+ DH_SET_BUCKET_EMPTY(bucket);
+
+ tb->members--;
+
+ /*
+ * Backward shift following buckets till either an empty
+ * bucket or a bucket at its optimal position is encountered.
+ *
+ * While that sounds expensive, the average chain length is
+ * short, and deletions would otherwise require tombstones.
+ */
+ for (;;)
+ {
+ DH_BUCKET *curbucket;
+ uint32 curhash;
+ uint32 curoptimal;
+
+ curelem = DH_NEXT(tb, curelem, startelem);
+ curbucket = &tb->buckets[curelem];
+
+ if (!DH_IS_BUCKET_IN_USE(curbucket))
+ break;
+
+ curhash = curbucket->hashvalue;
+ curoptimal = DH_INITIAL_BUCKET(tb, curhash);
+
+ /* current is at optimal position, done */
+ if (curoptimal == curelem)
+ {
+ DH_SET_BUCKET_EMPTY(lastbucket);
+ break;
+ }
+
+ /* shift */
+ memcpy(lastbucket, curbucket, sizeof(DH_BUCKET));
+ DH_SET_BUCKET_EMPTY(curbucket);
+
+ lastbucket = curbucket;
+ }
+
+ return true;
+ }
+ }
+ /* TODO: return false; if the distance is too big */
+
+ curelem = DH_NEXT(tb, curelem, startelem);
+ }
+}
+
+/*
+ * Initialize iterator.
+ */
+DH_SCOPE void
+DH_START_ITERATE(DH_TYPE * tb, DH_ITERATOR * iter)
+{
+ iter->cursegidx = -1;
+ iter->curitemidx = -1;
+ iter->found_members = 0;
+ iter->total_members = tb->members;
+}
+
+/*
+ * Iterate over all entries in the hashtable. Return the next occupied entry,
+ * or NULL if there are no more entries.
+ *
+ * During iteration, only the current entry in the hash table and any entry
+ * which was previously visited in the loop may be deleted. Deletion of items
+ * not yet visited is prohibited, as are insertions of new entries.
+ */
+DH_SCOPE DH_ELEMENT_TYPE *
+DH_ITERATE(DH_TYPE * tb, DH_ITERATOR * iter)
+{
+ /*
+ * Bail if we've already visited all members. This check allows us to
+ * exit quickly in cases where the table is large but it only contains a
+ * small number of records. This also means that inserts into the table
+ * are not possible during iteration. If that is done then we may not
+ * visit all items in the table. Rather than ever removing this check to
+ * allow table insertions during iteration, we should add another iterator
+ * where insertions are safe.
+ */
+ if (iter->found_members == iter->total_members)
+ return NULL;
+
+ for (;;)
+ {
+ DH_SEGMENT *seg;
+
+ /* need a new segment? */
+ if (iter->curitemidx == -1)
+ {
+ iter->cursegidx = DH_NEXT_ONEBIT(tb->used_segments,
+ tb->used_segment_words,
+ iter->cursegidx);
+
+ /* no more segments with items? We're done */
+ if (iter->cursegidx == -1)
+ return NULL;
+ }
+
+ seg = tb->segments[iter->cursegidx];
+
+ /* if the segment has items then it certainly shouldn't be NULL */
+ Assert(seg != NULL);
+
+ /*
+ * Advance to the next used item in this segment. For full segments
+ * we bypass the bitmap and just skip to the next item, otherwise we
+ * consult the bitmap to find the next used item.
+ */
+ if (seg->nitems == DH_ITEMS_PER_SEGMENT)
+ {
+ if (iter->curitemidx == DH_ITEMS_PER_SEGMENT - 1)
+ iter->curitemidx = -1;
+ else
+ {
+ iter->curitemidx++;
+ iter->found_members++;
+ return &seg->items[iter->curitemidx];
+ }
+ }
+ else
+ {
+ iter->curitemidx = DH_NEXT_ONEBIT(seg->used_items,
+ DH_BITMAP_WORDS,
+ iter->curitemidx);
+
+ if (iter->curitemidx >= 0)
+ {
+ iter->found_members++;
+ return &seg->items[iter->curitemidx];
+ }
+ }
+
+ /*
+ * DH_NEXT_ONEBIT returns -1 when there are no more bits. We just
+ * loop again to fetch the next segment.
+ */
+ }
+}
+
+#endif /* DH_DEFINE */
+
+/* undefine external parameters, so next hash table can be defined */
+#undef DH_PREFIX
+#undef DH_KEY_TYPE
+#undef DH_KEY
+#undef DH_ELEMENT_TYPE
+#undef DH_HASH_KEY
+#undef DH_SCOPE
+#undef DH_DECLARE
+#undef DH_DEFINE
+#undef DH_EQUAL
+#undef DH_ALLOCATE
+#undef DH_ALLOCATE_ZERO
+#undef DH_FREE
+
+/* undefine locally declared macros */
+#undef DH_MAKE_PREFIX
+#undef DH_MAKE_NAME
+#undef DH_MAKE_NAME_
+#undef DH_ITEMS_PER_SEGMENT
+#undef DH_UNUSED_BUCKET_INDEX
+#undef DH_INDEX_SEGMENT
+#undef DH_INDEX_ITEM
+#undef DH_BITS_PER_WORD
+#undef DH_BITMAP_WORD
+#undef DH_RIGHTMOST_ONE_POS
+#undef DH_BITMAP_WORDS
+#undef DH_WORDNUM
+#undef DH_BITNUM
+#undef DH_RAW_ALLOCATOR
+#undef DH_MAX_SIZE
+#undef DH_FILLFACTOR
+#undef DH_MAX_FILLFACTOR
+#undef DH_GROW_MAX_DIB
+#undef DH_GROW_MAX_MOVE
+#undef DH_GROW_MIN_FILLFACTOR
+
+/* types */
+#undef DH_TYPE
+#undef DH_BUCKET
+#undef DH_SEGMENT
+#undef DH_ITERATOR
+
+/* external function names */
+#undef DH_CREATE
+#undef DH_DESTROY
+#undef DH_RESET
+#undef DH_INSERT
+#undef DH_INSERT_HASH
+#undef DH_DELETE
+#undef DH_LOOKUP
+#undef DH_LOOKUP_HASH
+#undef DH_GROW
+#undef DH_START_ITERATE
+#undef DH_ITERATE
+
+/* internal function names */
+#undef DH_NEXT_ONEBIT
+#undef DH_NEXT_ZEROBIT
+#undef DH_INDEX_TO_ELEMENT
+#undef DH_MARK_SEGMENT_ITEM_USED
+#undef DH_MARK_SEGMENT_ITEM_UNUSED
+#undef DH_GET_NEXT_UNUSED_ENTRY
+#undef DH_REMOVE_ENTRY
+#undef DH_SET_BUCKET_IN_USE
+#undef DH_SET_BUCKET_EMPTY
+#undef DH_IS_BUCKET_IN_USE
+#undef DH_COMPUTE_PARAMETERS
+#undef DH_NEXT
+#undef DH_PREV
+#undef DH_DISTANCE_FROM_OPTIMAL
+#undef DH_INITIAL_BUCKET
+#undef DH_INSERT_HASH_INTERNAL
+#undef DH_LOOKUP_HASH_INTERNAL
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 9b2a421c32..a268879b1c 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -561,7 +561,7 @@ extern void LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks);
extern void LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks);
extern bool LockHeldByMe(const LOCKTAG *locktag, LOCKMODE lockmode);
#ifdef USE_ASSERT_CHECKING
-extern HTAB *GetLockMethodLocalHash(void);
+extern LOCALLOCK **GetLockMethodLocalLocks(uint32 *size);
#endif
extern bool LockHasWaiters(const LOCKTAG *locktag,
LOCKMODE lockmode, bool sessionLock);
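For context, the densehash header above follows the simplehash.h convention of being a template that is parameterized with macros before inclusion. Below is a minimal usage sketch based on the parameter macros visible in the #undef list at the end of the header; the generated function names, the header path, and the hash function used here are assumptions rather than something taken from the patch.

#include "postgres.h"
#include "common/hashfn.h"		/* for murmurhash32() */

typedef struct MyEntry
{
	int32		key;
	int32		value;
} MyEntry;

#define DH_PREFIX myhash
#define DH_ELEMENT_TYPE MyEntry
#define DH_KEY_TYPE int32
#define DH_KEY key
#define DH_HASH_KEY(tb, k) murmurhash32((uint32) (k))
#define DH_EQUAL(tb, a, b) ((a) == (b))
#define DH_SCOPE static inline
#define DH_DECLARE
#define DH_DEFINE
#include "lib/densehash.h"		/* assumed header location */

static void
myhash_example(void)
{
	/* names assumed to be generated from DH_PREFIX, as in simplehash.h */
	myhash_hash *tb = myhash_create(256);
	bool		found;
	MyEntry    *e = myhash_insert(tb, 42, &found);

	e->value = 1;
	if (myhash_lookup(tb, 42) != NULL)
		myhash_delete(tb, 42);
	myhash_destroy(tb);
}

The point of the bucket/segment split is that iteration only has to walk the densely packed segments of entries, not the (mostly empty) bucket array.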
On Mon, 12 Jul 2021 at 19:23, David Rowley <dgrowleyml@gmail.com> wrote:
I also adjusted the hash seq scan code so that it performs better when
faced with a non-sparsely populated table. Previously my benchmark for
that case didn't do well [2].
I was running some select only pgbench tests today on an AMD 3990x
machine with a large number of processes.
I saw that LockReleaseAll was coming up on the profile a bit at:
Master: 0.77% postgres [.] LockReleaseAll
I wondered if this patch would help, so I tried it and got:
dense hash lockrelease all: 0.67% postgres [.] LockReleaseAll
It's a very small difference, which translated to about a 0.62% gain in
tps. It made me think it might be worth doing something about how
LockReleaseAll can show up when releasing only small numbers of locks.
pgbench -T 240 -P 10 -c 132 -j 132 -S -M prepared --random-seed=12345 postgres
Units = tps
Sec     master       dense hash LockReleaseAll    patched/master
10 3758201.2 3713521.5 98.81%
20 3810125.5 3844142.9 100.89%
30 3806505.1 3848458 101.10%
40 3816094.8 3855706.6 101.04%
50 3820317.2 3851717.7 100.82%
60 3827809 3851499.4 100.62%
70 3828757.9 3849312 100.54%
80 3824492.1 3852378.8 100.73%
90 3816502.1 3854793.8 101.00%
100 3819124.1 3860418.6 101.08%
110 3816154.3 3845327.7 100.76%
120 3817070.5 3845842.5 100.75%
130 3815424.7 3847626 100.84%
140 3823631.1 3846760.6 100.60%
150 3820963.8 3840196.6 100.50%
160 3827737 3841149.3 100.35%
170 3827779.2 3840130.9 100.32%
180 3829352 3842814.5 100.35%
190 3825518.3 3841991 100.43%
200 3823477.2 3839390.7 100.42%
210 3809304.3 3836433.5 100.71%
220 3814328.5 3842073.7 100.73%
230 3811399.3 3843780.7 100.85%
avg 3816959.53 3840672.478 100.62%
David
On Tue, Jul 20, 2021 at 05:04:19PM +1200, David Rowley wrote:
On Mon, 12 Jul 2021 at 19:23, David Rowley <dgrowleyml@gmail.com> wrote:
I also adjusted the hash seq scan code so that it performs better when
faced with a non-sparsely populated table. Previously my benchmark for
that case didn't do well [2]. I was running some select-only pgbench tests today on an AMD 3990x
machine with a large number of processes. I saw that LockReleaseAll was coming up on the profile a bit at:
This last update was two months ago, and the patch has not moved
since:
https://commitfest.postgresql.org/34/3220/
Do you have plans to work more on that or perhaps the CF entry should
be withdrawn or RwF'd?
--
Michael
I've made some remarks in a related thread:
/messages/by-id/0A3221C70F24FB45833433255569204D1FB976EF@G01JPEXMBYT05
The new status of this patch is: Waiting on Author
On Fri, Oct 01, 2021 at 04:03:09PM +0900, Michael Paquier wrote:
This last update was two months ago, and the patch has not moved
since:
https://commitfest.postgresql.org/34/3220/
Do you have plans to work more on that or perhaps the CF entry should
be withdrawn or RwF'd?
Two months later, this has been switched to RwF.
--
Michael
On Fri, 3 Dec 2021 at 20:36, Michael Paquier <michael@paquier.xyz> wrote:
Two months later, this has been switched to RwF.
I was discussing this patch with Andres. He's not very keen on my
densehash hash table idea and suggested that, instead of relying on
making the hash table iteration faster, we just ditch the lossiness of
the resource owner code, which only records the first 16 locks, and
instead have it keep a linked list of all locks.
This is a little more along the lines of the original patch, however,
it does not increase the size of the LOCALLOCK struct.
I've attached a patch which does this. This was mostly written
(quickly) by Andres, I just did a cleanup, fixed up a few mistakes and
fixed a few bugs.
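In outline, each (lock, owner) pairing now gets its own LOCALLOCKOWNER allocation which is linked into two intrusive lists. The field names below follow the attached patch, but the struct layout is a sketch reconstructed from the diff rather than a verbatim excerpt:

typedef struct LOCALLOCKOWNER
{
	dlist_node	locallock_node; /* links all owners of one LOCALLOCK, via
								 * LOCALLOCK->locallockowners */
	dlist_node	resowner_node;	/* links all locks of one ResourceOwner, or
								 * sits on the per-lockmethod session_locks[]
								 * list for session-level locks */
	ResourceOwner owner;		/* NULL for session-level locks */
	int64		nLocks;			/* number of times this owner holds the lock */
	struct LOCALLOCK *locallock;	/* back-pointer to the lock entry */
} LOCALLOCKOWNER;

With that in place, releasing a resource owner's locks walks owner->locks directly, and LockReleaseSession() walks session_locks[lockmethodid - 1], so neither depends on how large the LOCALLOCK hash table has previously grown.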
I ran the same performance tests on this patch as I did back in [1]:
-- Test 1. Select 1 record from a 140 partitioned table. Tests
creating a large number of locks with a fast query.
create table hp (a int, b int) partition by hash(a);
select 'create table hp'||x||' partition of hp for values with
(modulus 140, remainder ' || x || ');' from generate_series(0,139)x;
\gexec
create index on hp (b);
insert into hp select x,x from generate_series(1, 140000) x;
analyze hp;
select3.sql: select * from hp where b = 1
select3.sql master
drowley@amd3990x:~$ pgbench -n -f select3.sql -T 60 -M prepared postgres
tps = 2099.708748 (without initial connection time)
tps = 2100.398516 (without initial connection time)
tps = 2094.882341 (without initial connection time)
tps = 2113.218820 (without initial connection time)
tps = 2104.717597 (without initial connection time)
select3.sql patched
drowley@amd3990x:~$ pgbench -n -f select3.sql -T 60 -M prepared postgres
tps = 2010.070738 (without initial connection time)
tps = 1994.963606 (without initial connection time)
tps = 1994.668849 (without initial connection time)
tps = 1995.948168 (without initial connection time)
tps = 1985.650945 (without initial connection time)
You can see that there's a performance regression here. I've not yet
studied why this appears.
-- Test 2. Tests a prepared query which will build a generic plan on
the 6th execution and then fall back on a custom plan due to it pruning all
but one partition. Master suffers from the lock table becoming
bloated after locking all partitions when planning the generic plan.
create table ht (a int primary key, b int, c int) partition by hash (a);
select 'create table ht' || x::text || ' partition of ht for values
with (modulus 8192, remainder ' || (x)::text || ');' from
generate_series(0,8191) x;
\gexec
select.sql:
\set p 1
select * from ht where a = :p
select.sql master
drowley@amd3990x:~$ pgbench -n -f select.sql -T 60 -M prepared postgres
tps = 18014.460090 (without initial connection time)
tps = 17973.358889 (without initial connection time)
tps = 17847.480647 (without initial connection time)
tps = 18038.332507 (without initial connection time)
tps = 17776.143206 (without initial connection time)
select.sql patched
drowley@amd3990x:~$ pgbench -n -f select.sql -T 60 -M prepared postgres
tps = 32393.457106 (without initial connection time)
tps = 32277.204349 (without initial connection time)
tps = 32160.719830 (without initial connection time)
tps = 32530.038130 (without initial connection time)
tps = 32299.019657 (without initial connection time)
You can see that there are some quite good performance gains with this test.
I'm going to add this to the January commitfest.
David
[1]: /messages/by-id/CAKJS1f8Lt0kS4bb5EH=hV+ksqBZNnmVa8jujoYBYu5PVhWbZZg@mail.gmail.com
Attachments:
speedup_releasing_all_locks.patch (application/octet-stream)
diff --git a/src/backend/commands/discard.c b/src/backend/commands/discard.c
index 57d3d7dd9b..8da22b1b12 100644
--- a/src/backend/commands/discard.c
+++ b/src/backend/commands/discard.c
@@ -71,7 +71,7 @@ DiscardAll(bool isTopLevel)
ResetAllOptions();
DropAllPreparedStatements();
Async_UnlistenAll();
- LockReleaseAll(USER_LOCKMETHOD, true);
+ LockReleaseSession(USER_LOCKMETHOD);
ResetPlanCache();
ResetTempTableNamespace();
ResetSequenceCaches();
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index c25af7fe09..cd23396ce4 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -289,6 +289,9 @@ static LOCALLOCK *awaitedLock;
static ResourceOwner awaitedOwner;
+static dlist_head session_locks[lengthof(LockMethods)];
+
+
#ifdef LOCK_DEBUG
/*------
@@ -376,7 +379,8 @@ static void BeginStrongLockAcquire(LOCALLOCK *locallock, uint32 fasthashcode);
static void FinishStrongLockAcquire(void);
static void WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner);
static void ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock);
-static void LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent);
+static void LockReassignOwner(LOCALLOCKOWNER *locallockowner,
+ ResourceOwner parent);
static bool UnGrantLock(LOCK *lock, LOCKMODE lockmode,
PROCLOCK *proclock, LockMethod lockMethodTable);
static void CleanUpLock(LOCK *lock, PROCLOCK *proclock,
@@ -701,7 +705,7 @@ LockHasWaiters(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
{
PROCLOCK_PRINT("LockHasWaiters: WRONGTYPE", proclock);
LWLockRelease(partitionLock);
- elog(WARNING, "you don't own a lock of type %s",
+ elog(PANIC, "you don't own a lock of type %s",
lockMethodTable->lockModeNames[lockmode]);
RemoveLocalLock(locallock);
return false;
@@ -839,26 +843,9 @@ LockAcquireExtended(const LOCKTAG *locktag,
locallock->nLocks = 0;
locallock->holdsStrongLockCount = false;
locallock->lockCleared = false;
- locallock->numLockOwners = 0;
- locallock->maxLockOwners = 8;
- locallock->lockOwners = NULL; /* in case next line fails */
- locallock->lockOwners = (LOCALLOCKOWNER *)
- MemoryContextAlloc(TopMemoryContext,
- locallock->maxLockOwners * sizeof(LOCALLOCKOWNER));
+ dlist_init(&locallock->locallockowners);
}
- else
- {
- /* Make sure there will be room to remember the lock */
- if (locallock->numLockOwners >= locallock->maxLockOwners)
- {
- int newsize = locallock->maxLockOwners * 2;
- locallock->lockOwners = (LOCALLOCKOWNER *)
- repalloc(locallock->lockOwners,
- newsize * sizeof(LOCALLOCKOWNER));
- locallock->maxLockOwners = newsize;
- }
- }
hashcode = locallock->hashcode;
if (locallockp)
@@ -1366,17 +1353,18 @@ CheckAndSetLockHeld(LOCALLOCK *locallock, bool acquired)
static void
RemoveLocalLock(LOCALLOCK *locallock)
{
- int i;
+ dlist_mutable_iter iter;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach_modify(iter, &locallock->locallockowners)
{
- if (locallock->lockOwners[i].owner != NULL)
- ResourceOwnerForgetLock(locallock->lockOwners[i].owner, locallock);
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, locallock_node, iter.cur);
+
+ Assert(locallockowner->owner != NULL);
+ dlist_delete(&locallockowner->locallock_node);
+ ResourceOwnerForgetLock(locallockowner->owner, locallockowner);
}
- locallock->numLockOwners = 0;
- if (locallock->lockOwners != NULL)
- pfree(locallock->lockOwners);
- locallock->lockOwners = NULL;
+
+ Assert(dlist_is_empty(&locallock->locallockowners));
if (locallock->holdsStrongLockCount)
{
@@ -1394,7 +1382,7 @@ RemoveLocalLock(LOCALLOCK *locallock)
if (!hash_search(LockMethodLocalHash,
(void *) &(locallock->tag),
HASH_REMOVE, NULL))
- elog(WARNING, "locallock table corrupted");
+ elog(PANIC, "locallock table corrupted");
/*
* Indicate that the lock is released for certain types of locks
@@ -1688,26 +1676,40 @@ CleanUpLock(LOCK *lock, PROCLOCK *proclock,
static void
GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner)
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
- int i;
+ LOCALLOCKOWNER *locallockowner;
+ dlist_iter iter;
- Assert(locallock->numLockOwners < locallock->maxLockOwners);
/* Count the total */
locallock->nLocks++;
+
/* Count the per-owner lock */
- for (i = 0; i < locallock->numLockOwners; i++)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == owner)
+ locallockowner = dlist_container(LOCALLOCKOWNER, locallock_node, iter.cur);
+
+ if (locallockowner->owner == owner)
{
- lockOwners[i].nLocks++;
+ locallockowner->nLocks++;
return;
}
}
- lockOwners[i].owner = owner;
- lockOwners[i].nLocks = 1;
- locallock->numLockOwners++;
+
+ locallockowner = MemoryContextAlloc(TopMemoryContext, sizeof(LOCALLOCKOWNER));
+ locallockowner->owner = owner;
+ locallockowner->nLocks = 1;
+ locallockowner->locallock = locallock;
+
+ dlist_push_tail(&locallock->locallockowners, &locallockowner->locallock_node);
+
if (owner != NULL)
- ResourceOwnerRememberLock(owner, locallock);
+ ResourceOwnerRememberLock(owner, locallockowner);
+ else
+ {
+ LOCKMETHODID lockmethodid = LOCALLOCK_LOCKMETHOD(*locallockowner->locallock);
+
+ Assert(lockmethodid > 0 && lockmethodid <= 2);
+ dlist_push_tail(&session_locks[lockmethodid - 1], &locallockowner->resowner_node);
+ }
/* Indicate that the lock is acquired for certain types of locks. */
CheckAndSetLockHeld(locallock, true);
@@ -2021,9 +2023,9 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
* Decrease the count for the resource owner.
*/
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
ResourceOwner owner;
- int i;
+ dlist_iter iter;
+ bool found = false;
/* Identify owner for lock */
if (sessionLock)
@@ -2031,24 +2033,29 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
else
owner = CurrentResourceOwner;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == owner)
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, locallock_node, iter.cur);
+
+ if (locallockowner->owner != owner)
+ continue;
+
+ found = true;
+
+ if (--locallockowner->nLocks == 0)
{
- Assert(lockOwners[i].nLocks > 0);
- if (--lockOwners[i].nLocks == 0)
- {
- if (owner != NULL)
- ResourceOwnerForgetLock(owner, locallock);
- /* compact out unused slot */
- locallock->numLockOwners--;
- if (i < locallock->numLockOwners)
- lockOwners[i] = lockOwners[locallock->numLockOwners];
- }
- break;
+ dlist_delete(&locallockowner->locallock_node);
+
+ if (owner != NULL)
+ ResourceOwnerForgetLock(owner, locallockowner);
+ else
+ dlist_delete(&locallockowner->resowner_node);
}
+
+ Assert(locallockowner->nLocks >= 0);
}
- if (i < 0)
+
+ if (!found)
{
/* don't release a lock belonging to another owner */
elog(WARNING, "you don't own a lock of type %s",
@@ -2066,6 +2073,8 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
if (locallock->nLocks > 0)
return true;
+ Assert(locallock->nLocks >= 0);
+
/*
* At this point we can no longer suppose we are clear of invalidation
* messages related to this lock. Although we'll delete the LOCALLOCK
@@ -2147,7 +2156,7 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
{
PROCLOCK_PRINT("LockRelease: WRONGTYPE", proclock);
LWLockRelease(partitionLock);
- elog(WARNING, "you don't own a lock of type %s",
+ elog(PANIC, "you don't own a lock of type %s",
lockMethodTable->lockModeNames[lockmode]);
RemoveLocalLock(locallock);
return false;
@@ -2168,283 +2177,44 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
return true;
}
+#ifdef USE_ASSERT_CHECKING
/*
- * LockReleaseAll -- Release all locks of the specified lock method that
- * are held by the current process.
- *
- * Well, not necessarily *all* locks. The available behaviors are:
- * allLocks == true: release all locks including session locks.
- * allLocks == false: release all non-session locks.
+ * LockAssertNoneHeld -- Assert that we no longer hold any DEFAULT_LOCKMETHOD
+ * locks during an abort.
*/
-void
-LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
+extern void
+LockAssertNoneHeld(bool isCommit)
{
HASH_SEQ_STATUS status;
- LockMethod lockMethodTable;
- int i,
- numLockModes;
LOCALLOCK *locallock;
- LOCK *lock;
- PROCLOCK *proclock;
- int partition;
- bool have_fast_path_lwlock = false;
-
- if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
- elog(ERROR, "unrecognized lock method: %d", lockmethodid);
- lockMethodTable = LockMethods[lockmethodid];
-
-#ifdef LOCK_DEBUG
- if (*(lockMethodTable->trace_flag))
- elog(LOG, "LockReleaseAll: lockmethod=%d", lockmethodid);
-#endif
-
- /*
- * Get rid of our fast-path VXID lock, if appropriate. Note that this is
- * the only way that the lock we hold on our own VXID can ever get
- * released: it is always and only released when a toplevel transaction
- * ends.
- */
- if (lockmethodid == DEFAULT_LOCKMETHOD)
- VirtualXactLockTableCleanup();
-
- numLockModes = lockMethodTable->numLockModes;
-
- /*
- * First we run through the locallock table and get rid of unwanted
- * entries, then we scan the process's proclocks and get rid of those. We
- * do this separately because we may have multiple locallock entries
- * pointing to the same proclock, and we daren't end up with any dangling
- * pointers. Fast-path locks are cleaned up during the locallock table
- * scan, though.
- */
- hash_seq_init(&status, LockMethodLocalHash);
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ if (!isCommit)
{
- /*
- * If the LOCALLOCK entry is unused, we must've run out of shared
- * memory while trying to set up this lock. Just forget the local
- * entry.
- */
- if (locallock->nLocks == 0)
- {
- RemoveLocalLock(locallock);
- continue;
- }
-
- /* Ignore items that are not of the lockmethod to be removed */
- if (LOCALLOCK_LOCKMETHOD(*locallock) != lockmethodid)
- continue;
+ hash_seq_init(&status, LockMethodLocalHash);
- /*
- * If we are asked to release all locks, we can just zap the entry.
- * Otherwise, must scan to see if there are session locks. We assume
- * there is at most one lockOwners entry for session locks.
- */
- if (!allLocks)
+ while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
+ dlist_iter local_iter;
- /* If session lock is above array position 0, move it down to 0 */
- for (i = 0; i < locallock->numLockOwners; i++)
- {
- if (lockOwners[i].owner == NULL)
- lockOwners[0] = lockOwners[i];
- else
- ResourceOwnerForgetLock(lockOwners[i].owner, locallock);
- }
+ Assert(locallock->nLocks >= 0);
- if (locallock->numLockOwners > 0 &&
- lockOwners[0].owner == NULL &&
- lockOwners[0].nLocks > 0)
+ dlist_foreach(local_iter, &locallock->locallockowners)
{
- /* Fix the locallock to show just the session locks */
- locallock->nLocks = lockOwners[0].nLocks;
- locallock->numLockOwners = 1;
- /* We aren't deleting this locallock, so done */
- continue;
- }
- else
- locallock->numLockOwners = 0;
- }
-
- /*
- * If the lock or proclock pointers are NULL, this lock was taken via
- * the relation fast-path (and is not known to have been transferred).
- */
- if (locallock->proclock == NULL || locallock->lock == NULL)
- {
- LOCKMODE lockmode = locallock->tag.mode;
- Oid relid;
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ locallock_node,
+ local_iter.cur);
- /* Verify that a fast-path lock is what we've got. */
- if (!EligibleForRelationFastPath(&locallock->tag.lock, lockmode))
- elog(PANIC, "locallock table corrupted");
-
- /*
- * If we don't currently hold the LWLock that protects our
- * fast-path data structures, we must acquire it before attempting
- * to release the lock via the fast-path. We will continue to
- * hold the LWLock until we're done scanning the locallock table,
- * unless we hit a transferred fast-path lock. (XXX is this
- * really such a good idea? There could be a lot of entries ...)
- */
- if (!have_fast_path_lwlock)
- {
- LWLockAcquire(&MyProc->fpInfoLock, LW_EXCLUSIVE);
- have_fast_path_lwlock = true;
- }
+ Assert(locallockowner->owner == NULL);
- /* Attempt fast-path release. */
- relid = locallock->tag.lock.locktag_field2;
- if (FastPathUnGrantRelationLock(relid, lockmode))
- {
- RemoveLocalLock(locallock);
- continue;
+ if (locallockowner->nLocks > 0 &&
+ LOCALLOCK_LOCKMETHOD(*locallock) == DEFAULT_LOCKMETHOD)
+ Assert(false);
}
-
- /*
- * Our lock, originally taken via the fast path, has been
- * transferred to the main lock table. That's going to require
- * some extra work, so release our fast-path lock before starting.
- */
- LWLockRelease(&MyProc->fpInfoLock);
- have_fast_path_lwlock = false;
-
- /*
- * Now dump the lock. We haven't got a pointer to the LOCK or
- * PROCLOCK in this case, so we have to handle this a bit
- * differently than a normal lock release. Unfortunately, this
- * requires an extra LWLock acquire-and-release cycle on the
- * partitionLock, but hopefully it shouldn't happen often.
- */
- LockRefindAndRelease(lockMethodTable, MyProc,
- &locallock->tag.lock, lockmode, false);
- RemoveLocalLock(locallock);
- continue;
}
-
- /* Mark the proclock to show we need to release this lockmode */
- if (locallock->nLocks > 0)
- locallock->proclock->releaseMask |= LOCKBIT_ON(locallock->tag.mode);
-
- /* And remove the locallock hashtable entry */
- RemoveLocalLock(locallock);
}
-
- /* Done with the fast-path data structures */
- if (have_fast_path_lwlock)
- LWLockRelease(&MyProc->fpInfoLock);
-
- /*
- * Now, scan each lock partition separately.
- */
- for (partition = 0; partition < NUM_LOCK_PARTITIONS; partition++)
- {
- LWLock *partitionLock;
- SHM_QUEUE *procLocks = &(MyProc->myProcLocks[partition]);
- PROCLOCK *nextplock;
-
- partitionLock = LockHashPartitionLockByIndex(partition);
-
- /*
- * If the proclock list for this partition is empty, we can skip
- * acquiring the partition lock. This optimization is trickier than
- * it looks, because another backend could be in process of adding
- * something to our proclock list due to promoting one of our
- * fast-path locks. However, any such lock must be one that we
- * decided not to delete above, so it's okay to skip it again now;
- * we'd just decide not to delete it again. We must, however, be
- * careful to re-fetch the list header once we've acquired the
- * partition lock, to be sure we have a valid, up-to-date pointer.
- * (There is probably no significant risk if pointer fetch/store is
- * atomic, but we don't wish to assume that.)
- *
- * XXX This argument assumes that the locallock table correctly
- * represents all of our fast-path locks. While allLocks mode
- * guarantees to clean up all of our normal locks regardless of the
- * locallock situation, we lose that guarantee for fast-path locks.
- * This is not ideal.
- */
- if (SHMQueueNext(procLocks, procLocks,
- offsetof(PROCLOCK, procLink)) == NULL)
- continue; /* needn't examine this partition */
-
- LWLockAcquire(partitionLock, LW_EXCLUSIVE);
-
- for (proclock = (PROCLOCK *) SHMQueueNext(procLocks, procLocks,
- offsetof(PROCLOCK, procLink));
- proclock;
- proclock = nextplock)
- {
- bool wakeupNeeded = false;
-
- /* Get link first, since we may unlink/delete this proclock */
- nextplock = (PROCLOCK *)
- SHMQueueNext(procLocks, &proclock->procLink,
- offsetof(PROCLOCK, procLink));
-
- Assert(proclock->tag.myProc == MyProc);
-
- lock = proclock->tag.myLock;
-
- /* Ignore items that are not of the lockmethod to be removed */
- if (LOCK_LOCKMETHOD(*lock) != lockmethodid)
- continue;
-
- /*
- * In allLocks mode, force release of all locks even if locallock
- * table had problems
- */
- if (allLocks)
- proclock->releaseMask = proclock->holdMask;
- else
- Assert((proclock->releaseMask & ~proclock->holdMask) == 0);
-
- /*
- * Ignore items that have nothing to be released, unless they have
- * holdMask == 0 and are therefore recyclable
- */
- if (proclock->releaseMask == 0 && proclock->holdMask != 0)
- continue;
-
- PROCLOCK_PRINT("LockReleaseAll", proclock);
- LOCK_PRINT("LockReleaseAll", lock, 0);
- Assert(lock->nRequested >= 0);
- Assert(lock->nGranted >= 0);
- Assert(lock->nGranted <= lock->nRequested);
- Assert((proclock->holdMask & ~lock->grantMask) == 0);
-
- /*
- * Release the previously-marked lock modes
- */
- for (i = 1; i <= numLockModes; i++)
- {
- if (proclock->releaseMask & LOCKBIT_ON(i))
- wakeupNeeded |= UnGrantLock(lock, i, proclock,
- lockMethodTable);
- }
- Assert((lock->nRequested >= 0) && (lock->nGranted >= 0));
- Assert(lock->nGranted <= lock->nRequested);
- LOCK_PRINT("LockReleaseAll: updated", lock, 0);
-
- proclock->releaseMask = 0;
-
- /* CleanUpLock will wake up waiters if needed. */
- CleanUpLock(lock, proclock,
- lockMethodTable,
- LockTagHashCode(&lock->tag),
- wakeupNeeded);
- } /* loop over PROCLOCKs within this partition */
-
- LWLockRelease(partitionLock);
- } /* loop over partitions */
-
-#ifdef LOCK_DEBUG
- if (*(lockMethodTable->trace_flag))
- elog(LOG, "LockReleaseAll done");
-#endif
+ Assert(MyProc->fpLockBits == 0);
}
+#endif
/*
* LockReleaseSession -- Release all session locks of the specified lock method
@@ -2453,22 +2223,21 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
void
LockReleaseSession(LOCKMETHODID lockmethodid)
{
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
+ dlist_mutable_iter iter;
if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
elog(ERROR, "unrecognized lock method: %d", lockmethodid);
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &session_locks[lockmethodid - 1])
{
- /* Ignore items that are not of the specified lock method */
- if (LOCALLOCK_LOCKMETHOD(*locallock) != lockmethodid)
- continue;
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, resowner_node, iter.cur);
+
+ Assert(LOCALLOCK_LOCKMETHOD(*locallockowner->locallock) == lockmethodid);
- ReleaseLockIfHeld(locallock, true);
+ ReleaseLockIfHeld(locallockowner->locallock, true);
}
+
+ Assert(dlist_is_empty(&session_locks[lockmethodid - 1]));
}
/*
@@ -2480,26 +2249,12 @@ LockReleaseSession(LOCKMETHODID lockmethodid)
* Otherwise, pass NULL for locallocks, and we'll traverse through our hash
* table to find them.
*/
-void
-LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks)
+extern void
+LockReleaseCurrentOwner(ResourceOwner owner, dlist_node *resowner_node)
{
- if (locallocks == NULL)
- {
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
-
- hash_seq_init(&status, LockMethodLocalHash);
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, resowner_node, resowner_node);
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
- ReleaseLockIfHeld(locallock, false);
- }
- else
- {
- int i;
-
- for (i = nlocks - 1; i >= 0; i--)
- ReleaseLockIfHeld(locallocks[i], false);
- }
+ ReleaseLockIfHeld(locallockowner->locallock, false);
}
/*
@@ -2519,8 +2274,7 @@ static void
ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock)
{
ResourceOwner owner;
- LOCALLOCKOWNER *lockOwners;
- int i;
+ dlist_iter iter;
/* Identify owner for lock (must match LockRelease!) */
if (sessionLock)
@@ -2529,39 +2283,49 @@ ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock)
owner = CurrentResourceOwner;
/* Scan to see if there are any locks belonging to the target owner */
- lockOwners = locallock->lockOwners;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == owner)
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, locallock_node, iter.cur);
+ LOCALLOCK *locallock = locallockowner->locallock;
+
+ if (locallockowner->owner != owner)
+ continue;
+
+ /* release all references to the lock by this resource owner */
+
+ if (sessionLock)
+ Assert(locallockowner->owner == NULL);
+ else
+ Assert(locallockowner->owner != NULL);
+
+ /*
+ * We will still hold this lock after forgetting this
+ * ResourceOwner.
+ */
+ if (locallockowner->nLocks < locallock->nLocks)
{
- Assert(lockOwners[i].nLocks > 0);
- if (lockOwners[i].nLocks < locallock->nLocks)
- {
- /*
- * We will still hold this lock after forgetting this
- * ResourceOwner.
- */
- locallock->nLocks -= lockOwners[i].nLocks;
- /* compact out unused slot */
- locallock->numLockOwners--;
- if (owner != NULL)
- ResourceOwnerForgetLock(owner, locallock);
- if (i < locallock->numLockOwners)
- lockOwners[i] = lockOwners[locallock->numLockOwners];
- }
+ locallock->nLocks -= locallockowner->nLocks;
+ Assert(locallock->nLocks >= 0);
+ dlist_delete(&locallockowner->locallock_node);
+
+ if (sessionLock)
+ dlist_delete(&locallockowner->resowner_node);
else
- {
- Assert(lockOwners[i].nLocks == locallock->nLocks);
- /* We want to call LockRelease just once */
- lockOwners[i].nLocks = 1;
- locallock->nLocks = 1;
- if (!LockRelease(&locallock->tag.lock,
- locallock->tag.mode,
- sessionLock))
- elog(WARNING, "ReleaseLockIfHeld: failed??");
- }
- break;
+ ResourceOwnerForgetLock(owner, locallockowner);
+ }
+ else
+ {
+ Assert(locallockowner->nLocks == locallock->nLocks);
+ /* We want to call LockRelease just once */
+ locallockowner->nLocks = 1;
+ locallock->nLocks = 1;
+
+ if (!LockRelease(&locallock->tag.lock,
+ locallock->tag.mode,
+ sessionLock))
+ elog(PANIC, "ReleaseLockIfHeld: failed??");
}
+ break;
}
}
@@ -2576,29 +2340,12 @@ ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock)
* and we'll traverse through our hash table to find them.
*/
void
-LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks)
+LockReassignCurrentOwner(ResourceOwner owner, dlist_node *resowner_node)
{
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, resowner_node, resowner_node);
ResourceOwner parent = ResourceOwnerGetParent(CurrentResourceOwner);
- Assert(parent != NULL);
-
- if (locallocks == NULL)
- {
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
-
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
- LockReassignOwner(locallock, parent);
- }
- else
- {
- int i;
-
- for (i = nlocks - 1; i >= 0; i--)
- LockReassignOwner(locallocks[i], parent);
- }
+ LockReassignOwner(locallockowner, parent);
}
/*
@@ -2606,45 +2353,33 @@ LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks)
* CurrentResourceOwner to its parent.
*/
static void
-LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent)
+LockReassignOwner(LOCALLOCKOWNER *locallockowner, ResourceOwner parent)
{
- LOCALLOCKOWNER *lockOwners;
- int i;
- int ic = -1;
- int ip = -1;
+ dlist_iter iter;
+ LOCALLOCK *locallock = locallockowner->locallock;
- /*
- * Scan to see if there are any locks belonging to current owner or its
- * parent
- */
- lockOwners = locallock->lockOwners;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ ResourceOwnerForgetLock(CurrentResourceOwner, locallockowner);
+
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == CurrentResourceOwner)
- ic = i;
- else if (lockOwners[i].owner == parent)
- ip = i;
- }
+ LOCALLOCKOWNER *parentlocalowner = dlist_container(LOCALLOCKOWNER, locallock_node, iter.cur);
- if (ic < 0)
- return; /* no current locks */
+ Assert(parentlocalowner->locallock == locallock);
- if (ip < 0)
- {
- /* Parent has no slot, so just give it the child's slot */
- lockOwners[ic].owner = parent;
- ResourceOwnerRememberLock(parent, locallock);
- }
- else
- {
- /* Merge child's count with parent's */
- lockOwners[ip].nLocks += lockOwners[ic].nLocks;
- /* compact out unused slot */
- locallock->numLockOwners--;
- if (ic < locallock->numLockOwners)
- lockOwners[ic] = lockOwners[locallock->numLockOwners];
+ if (parentlocalowner->owner != parent)
+ continue;
+
+ parentlocalowner->nLocks += locallockowner->nLocks;
+
+ locallockowner->nLocks = 0;
+ dlist_delete(&locallockowner->locallock_node);
+ pfree(locallockowner);
+ return;
}
- ResourceOwnerForgetLock(CurrentResourceOwner, locallock);
+
+ /* reassign locallockowner to parent resowner */
+ locallockowner->owner = parent;
+ ResourceOwnerRememberLock(parent, locallockowner);
}
/*
@@ -3178,7 +2913,7 @@ LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
{
PROCLOCK_PRINT("lock_twophase_postcommit: WRONGTYPE", proclock);
LWLockRelease(partitionLock);
- elog(WARNING, "you don't own a lock of type %s",
+ elog(PANIC, "you don't own a lock of type %s",
lockMethodTable->lockModeNames[lockmode]);
return;
}
@@ -3259,10 +2994,9 @@ CheckForSessionAndXactLocks(void)
while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
PerLockTagEntry *hentry;
bool found;
- int i;
+ dlist_iter iter;
/*
* Ignore VXID locks. We don't want those to be held by prepared
@@ -3283,9 +3017,13 @@ CheckForSessionAndXactLocks(void)
hentry->sessLock = hentry->xactLock = false;
/* Scan to see if we hold lock at session or xact level or both */
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == NULL)
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ locallock_node,
+ iter.cur);
+
+ if (locallockowner->owner == NULL)
hentry->sessLock = true;
else
hentry->xactLock = true;
@@ -3332,10 +3070,9 @@ AtPrepare_Locks(void)
while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
{
TwoPhaseLockRecord record;
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
bool haveSessionLock;
bool haveXactLock;
- int i;
+ dlist_iter iter;
/*
* Ignore VXID locks. We don't want those to be held by prepared
@@ -3350,9 +3087,13 @@ AtPrepare_Locks(void)
/* Scan to see whether we hold it at session or transaction level */
haveSessionLock = haveXactLock = false;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == NULL)
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ locallock_node,
+ iter.cur);
+
+ if (locallockowner->owner == NULL)
haveSessionLock = true;
else
haveXactLock = true;
@@ -3444,10 +3185,9 @@ PostPrepare_Locks(TransactionId xid)
while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
bool haveSessionLock;
bool haveXactLock;
- int i;
+ dlist_iter iter;
if (locallock->proclock == NULL || locallock->lock == NULL)
{
@@ -3466,9 +3206,13 @@ PostPrepare_Locks(TransactionId xid)
/* Scan to see whether we hold it at session or transaction level */
haveSessionLock = haveXactLock = false;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == NULL)
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ locallock_node,
+ iter.cur);
+
+ if (locallockowner->owner == NULL)
haveSessionLock = true;
else
haveXactLock = true;
@@ -3599,6 +3343,7 @@ PostPrepare_Locks(TransactionId xid)
} /* loop over partitions */
END_CRIT_SECTION();
+
}
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index d1d3cd0dc8..498722dfdd 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -778,10 +778,17 @@ ProcReleaseLocks(bool isCommit)
return;
/* If waiting, get off wait queue (should only be needed after error) */
LockErrorCleanup();
- /* Release standard locks, including session-level if aborting */
- LockReleaseAll(DEFAULT_LOCKMETHOD, !isCommit);
- /* Release transaction-level advisory locks */
- LockReleaseAll(USER_LOCKMETHOD, false);
+
+ VirtualXactLockTableCleanup();
+
+ /* Release session-level locks if aborting */
+ if (!isCommit)
+ LockReleaseSession(DEFAULT_LOCKMETHOD);
+
+#ifdef USE_ASSERT_CHECKING
+ /* Ensure all locks were released */
+ LockAssertNoneHeld(isCommit);
+#endif
}
@@ -870,6 +877,8 @@ ProcKill(int code, Datum arg)
LWLockRelease(leader_lwlock);
}
+ Assert(MyProc->fpLockBits == 0);
+
/*
* Reset MyLatch to the process local one. This is so that signal
* handlers et al can continue using the latch after the shared latch
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 7292e51f7d..f9092c65c1 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -1176,7 +1176,7 @@ ShutdownPostgres(int code, Datum arg)
* User locks are not released by transaction end, so be sure to release
* them explicitly.
*/
- LockReleaseAll(USER_LOCKMETHOD, true);
+ LockReleaseSession(USER_LOCKMETHOD);
}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index e24f00f060..0d40d5a7e7 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -33,6 +33,7 @@
#include "utils/resowner_private.h"
#include "utils/snapmgr.h"
+#include "lib/ilist.h"
/*
* All resource IDs managed by this code are required to fit into a Datum,
@@ -133,9 +134,7 @@ typedef struct ResourceOwnerData
ResourceArray cryptohasharr; /* cryptohash contexts */
ResourceArray hmacarr; /* HMAC contexts */
- /* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
- int nlocks; /* number of owned locks */
- LOCALLOCK *locks[MAX_RESOWNER_LOCKS]; /* list of owned locks */
+ dlist_head locks;
} ResourceOwnerData;
@@ -452,6 +451,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
ResourceArrayInit(&(owner->jitarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->cryptohasharr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->hmacarr), PointerGetDatum(NULL));
+ dlist_init(&owner->locks);
return owner;
}
@@ -585,50 +585,39 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
+ dlist_mutable_iter iter;
+
if (isTopLevel)
{
- /*
- * For a top-level xact we are going to release all locks (or at
- * least all non-session locks), so just do a single lmgr call at
- * the top of the recursion.
- */
+ dlist_foreach_modify(iter, &owner->locks)
+ {
+ LockReleaseCurrentOwner(owner, iter.cur);
+ }
+ Assert(dlist_is_empty(&owner->locks));
+
if (owner == TopTransactionResourceOwner)
{
ProcReleaseLocks(isCommit);
+
ReleasePredicateLocks(isCommit, false);
}
}
else
{
- /*
- * Release locks retail. Note that if we are committing a
- * subtransaction, we do NOT release its locks yet, but transfer
- * them to the parent.
- */
- LOCALLOCK **locks;
- int nlocks;
-
- Assert(owner->parent != NULL);
-
- /*
- * Pass the list of locks owned by this resource owner to the lock
- * manager, unless it has overflowed.
- */
- if (owner->nlocks > MAX_RESOWNER_LOCKS)
+ if (isCommit)
{
- locks = NULL;
- nlocks = 0;
+ dlist_foreach_modify(iter, &owner->locks)
+ LockReassignCurrentOwner(owner, iter.cur);
+
+ Assert(dlist_is_empty(&owner->locks));
}
else
{
- locks = owner->locks;
- nlocks = owner->nlocks;
- }
+ dlist_foreach_modify(iter, &owner->locks)
+ LockReleaseCurrentOwner(owner, iter.cur);
- if (isCommit)
- LockReassignCurrentOwner(locks, nlocks);
- else
- LockReleaseCurrentOwner(locks, nlocks);
+ Assert(dlist_is_empty(&owner->locks));
+ }
}
}
else if (phase == RESOURCE_RELEASE_AFTER_LOCKS)
@@ -752,7 +741,7 @@ ResourceOwnerDelete(ResourceOwner owner)
Assert(owner->jitarr.nitems == 0);
Assert(owner->cryptohasharr.nitems == 0);
Assert(owner->hmacarr.nitems == 0);
- Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
+ Assert(dlist_is_empty(&owner->locks));
/*
* Delete children. The recursive call will delink the child from me, so
@@ -983,45 +972,56 @@ ResourceOwnerForgetBuffer(ResourceOwner owner, Buffer buffer)
* the entry.
*/
void
-ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCK *locallock)
+ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCKOWNER *locallockowner)
{
- Assert(locallock != NULL);
-
- if (owner->nlocks > MAX_RESOWNER_LOCKS)
- return; /* we have already overflowed */
+ Assert(owner != NULL);
+ Assert(locallockowner != NULL);
- if (owner->nlocks < MAX_RESOWNER_LOCKS)
- owner->locks[owner->nlocks] = locallock;
- else
+#ifdef USE_ASSERT_CHECKING
{
- /* overflowed */
+ dlist_iter iter;
+
+ dlist_foreach(iter, &owner->locks)
+ {
+ LOCALLOCKOWNER *i = dlist_container(LOCALLOCKOWNER, resowner_node, iter.cur);
+
+ Assert(i->locallock != locallockowner->locallock);
+ }
}
- owner->nlocks++;
+#endif
+
+ dlist_push_tail(&owner->locks, &locallockowner->resowner_node);
}
/*
* Forget that a Local Lock is owned by a ResourceOwner
*/
void
-ResourceOwnerForgetLock(ResourceOwner owner, LOCALLOCK *locallock)
+ResourceOwnerForgetLock(ResourceOwner owner, LOCALLOCKOWNER *locallockowner)
{
- int i;
-
- if (owner->nlocks > MAX_RESOWNER_LOCKS)
- return; /* we have overflowed */
+ Assert(owner != NULL);
+ Assert(locallockowner != NULL);
- Assert(owner->nlocks > 0);
- for (i = owner->nlocks - 1; i >= 0; i--)
+#ifdef USE_ASSERT_CHECKING
{
- if (locallock == owner->locks[i])
+ dlist_iter iter;
+ bool found = false;
+
+ dlist_foreach(iter, &owner->locks)
{
- owner->locks[i] = owner->locks[owner->nlocks - 1];
- owner->nlocks--;
- return;
+ LOCALLOCKOWNER *i = dlist_container(LOCALLOCKOWNER, resowner_node, iter.cur);
+
+ if (locallockowner == i)
+ {
+ Assert(!found);
+ found = true;
+ }
}
+
+ Assert(found);
}
- elog(ERROR, "lock reference %p is not owned by resource owner %s",
- locallock, owner->name);
+#endif
+ dlist_delete(&locallockowner->resowner_node);
}
/*
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index a5286fab89..b656103d26 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -23,6 +23,7 @@
#include "storage/lwlock.h"
#include "storage/shmem.h"
#include "utils/timestamp.h"
+#include "lib/ilist.h"
/* struct PGPROC is declared in proc.h, but must forward-reference it */
typedef struct PGPROC PGPROC;
@@ -412,6 +413,13 @@ typedef struct LOCALLOCKOWNER
* Must use a forward struct reference to avoid circularity.
*/
struct ResourceOwnerData *owner;
+
+ dlist_node resowner_node;
+
+ dlist_node locallock_node;
+
+ struct LOCALLOCK *locallock;
+
int64 nLocks; /* # of times held by this owner */
} LOCALLOCKOWNER;
@@ -425,9 +433,9 @@ typedef struct LOCALLOCK
LOCK *lock; /* associated LOCK object, if any */
PROCLOCK *proclock; /* associated PROCLOCK object, if any */
int64 nLocks; /* total number of times lock is held */
- int numLockOwners; /* # of relevant ResourceOwners */
- int maxLockOwners; /* allocated size of array */
- LOCALLOCKOWNER *lockOwners; /* dynamically resizable array */
+
+ dlist_head locallockowners;
+
bool holdsStrongLockCount; /* bumped FastPathStrongRelationLocks */
bool lockCleared; /* we read all sinval msgs for lock */
} LOCALLOCK;
@@ -556,10 +564,15 @@ extern void AbortStrongLockAcquire(void);
extern void MarkLockClear(LOCALLOCK *locallock);
extern bool LockRelease(const LOCKTAG *locktag,
LOCKMODE lockmode, bool sessionLock);
-extern void LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks);
+
+#ifdef USE_ASSERT_CHECKING
+extern void LockAssertNoneHeld(bool isCommit);
+#endif
+
extern void LockReleaseSession(LOCKMETHODID lockmethodid);
-extern void LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks);
-extern void LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks);
+struct ResourceOwnerData;
+extern void LockReleaseCurrentOwner(struct ResourceOwnerData *owner, dlist_node *resowner_node);
+extern void LockReassignCurrentOwner(struct ResourceOwnerData *owner, dlist_node *resowner_node);
extern bool LockHeldByMe(const LOCKTAG *locktag, LOCKMODE lockmode);
#ifdef USE_ASSERT_CHECKING
extern HTAB *GetLockMethodLocalHash(void);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 6dafc87e28..d23d8c4c7c 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -31,8 +31,8 @@ extern void ResourceOwnerRememberBuffer(ResourceOwner owner, Buffer buffer);
extern void ResourceOwnerForgetBuffer(ResourceOwner owner, Buffer buffer);
/* support for local lock management */
-extern void ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCK *locallock);
-extern void ResourceOwnerForgetLock(ResourceOwner owner, LOCALLOCK *locallock);
+extern void ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCKOWNER *locallock);
+extern void ResourceOwnerForgetLock(ResourceOwner owner, LOCALLOCKOWNER *locallock);
/* support for catcache refcount management */
extern void ResourceOwnerEnlargeCatCacheRefs(ResourceOwner owner);
On Fri, Dec 31, 2021 at 5:45 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 3 Dec 2021 at 20:36, Michael Paquier <michael@paquier.xyz> wrote:
Two months later, this has been switched to RwF.
I was discussing this patch with Andres. He's not very keen on my
densehash hash table idea and suggested that instead of relying on
trying to make the hash table iteration faster, why don't we just
ditch the lossiness of the resource owner code which only records the
first 16 locks, and just make it have a linked list of all locks.

This is a little more along the lines of the original patch; however,
it does not increase the size of the LOCALLOCK struct.

I've attached a patch which does this. This was mostly written
(quickly) by Andres; I just did a cleanup, fixed up a few mistakes and
fixed a few bugs.

I ran the same performance tests on this patch as I did back in [1]:
-- Test 1. Select 1 record from a 140 partitioned table. Tests
creating a large number of locks with a fast query.

create table hp (a int, b int) partition by hash(a);
select 'create table hp'||x||' partition of hp for values with
(modulus 140, remainder ' || x || ');' from generate_series(0,139)x;
create index on hp (b);
insert into hp select x,x from generate_series(1, 140000) x;
analyze hp;

select3.sql: select * from hp where b = 1
select3.sql master
drowley@amd3990x:~$ pgbench -n -f select3.sql -T 60 -M prepared postgres
tps = 2099.708748 (without initial connection time)
tps = 2100.398516 (without initial connection time)
tps = 2094.882341 (without initial connection time)
tps = 2113.218820 (without initial connection time)
tps = 2104.717597 (without initial connection time)

select3.sql patched
drowley@amd3990x:~$ pgbench -n -f select3.sql -T 60 -M prepared postgres
tps = 2010.070738 (without initial connection time)
tps = 1994.963606 (without initial connection time)
tps = 1994.668849 (without initial connection time)
tps = 1995.948168 (without initial connection time)
tps = 1985.650945 (without initial connection time)

You can see that there's a performance regression here. I've not yet
studied why this appears.

-- Test 2. Tests a prepared query which will perform a generic plan on
the 6th execution then fall back to a custom plan due to it pruning all
but one partition. Master suffers from the lock table becoming
bloated after locking all partitions when planning the generic plan.

create table ht (a int primary key, b int, c int) partition by hash (a);
select 'create table ht' || x::text || ' partition of ht for values
with (modulus 8192, remainder ' || (x)::text || ');' from
generate_series(0,8191) x;
\gexec

select.sql:
\set p 1
select * from ht where a = :p

select.sql master
drowley@amd3990x:~$ pgbench -n -f select.sql -T 60 -M prepared postgres
tps = 18014.460090 (without initial connection time)
tps = 17973.358889 (without initial connection time)
tps = 17847.480647 (without initial connection time)
tps = 18038.332507 (without initial connection time)
tps = 17776.143206 (without initial connection time)

select.sql patched
drowley@amd3990x:~$ pgbench -n -f select.sql -T 60 -M prepared postgres
tps = 32393.457106 (without initial connection time)
tps = 32277.204349 (without initial connection time)
tps = 32160.719830 (without initial connection time)
tps = 32530.038130 (without initial connection time)
tps = 32299.019657 (without initial connection time)

You can see that there are some quite good performance gains with this
test.

I'm going to add this to the January commitfest.
David
[1]
/messages/by-id/CAKJS1f8Lt0kS4bb5EH=hV+ksqBZNnmVa8jujoYBYu5PVhWbZZg@mail.gmail.com
Hi,
+ locallock->nLocks -= locallockowner->nLocks;
+ Assert(locallock->nLocks >= 0);
I think the assertion is not needed since the code above is inside the if block:
+ if (locallockowner->nLocks < locallock->nLocks)
the condition, locallock->nLocks >= 0, would always hold after the
subtraction.
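
To make the reasoning concrete, here is a tiny standalone restatement of
that branch (the variable names are invented for the illustration; it is
not the patch code):

#include <assert.h>
#include <stdio.h>

/*
 * Toy restatement of the guarded subtraction: the branch only runs when
 * the per-owner count is strictly less than the total count, so after
 * the subtraction the total stays strictly positive and an
 * Assert(total >= 0) there can never fire.
 */
int main(void)
{
    long total_nlocks = 5;      /* stand-in for locallock->nLocks */
    long owner_nlocks = 2;      /* stand-in for locallockowner->nLocks */

    if (owner_nlocks < total_nlocks)
    {
        total_nlocks -= owner_nlocks;
        assert(total_nlocks > 0);   /* so ">= 0" is implied */
    }
    printf("remaining total: %ld\n", total_nlocks);
    return 0;
}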
Cheers
On Sat, 1 Jan 2022 at 15:40, Zhihong Yu <zyu@yugabyte.com> wrote:
+ locallock->nLocks -= locallockowner->nLocks;
+ Assert(locallock->nLocks >= 0);

I think the assertion is not needed since the code above is inside the if block:
+ if (locallockowner->nLocks < locallock->nLocks)
the condition, locallock->nLocks >= 0, would always hold after the subtraction.
That makes sense. I've removed the Assert in the attached patch.
Thanks for looking at the patch.
I've also spent a bit more time on the patch. There were quite a few
outdated comments remaining. Also, the PROCLOCK releaseMask field
appears to no longer be needed.
I also did a round of benchmarking on the attached patch using a very
recent master. I've attached .sql files and the script I used to
benchmark.
With 1024 partitions, lock1.sql shows about a 4.75% performance
increase. This would become larger with more partitions and less with
fewer partitions.
With the patch, lock2.sql shows about a 10% performance increase over master.
lock3.sql does not seem to have changed much and lock4.sql shows a
small regression with the patch of about 2.5%.
I'm not quite sure how worried we should be about lock4.sql slowing
down slightly. 2.5% is fairly small given how hard I'm exercising the
locking code in that test. There's also nothing much to say that the
slowdown is not just due to code alignment changes.
I also understand that Amit L is working on another patch that will
improve the situation for lock1.sql by not taking the locks on
relations that will be run-time pruned at executor startup. I think
it's still worth solving this regardless of Amit's patch, as with
current master we still have a situation where short, fast queries
which access a small number of tables can become slower once the
backend has obtained a large number of locks concurrently and bloated
the locallock table.
As for the patch itself, I feel it's a pretty invasive change to how we
release locks and the resowner code. I'd be quite happy for some
review of it.
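
To give reviewers a feel for the shape of the change, here is a small
standalone sketch (not the patch itself; all names are invented for the
illustration) of the bookkeeping it moves to: each lock reference is
threaded onto two intrusive lists, one per resource owner and one per
local lock, so releasing an owner's locks walks only that owner's list
instead of scanning the whole locallock hash table:

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* minimal intrusive doubly-linked list, standing in for lib/ilist.h */
typedef struct ListNode { struct ListNode *prev, *next; } ListNode;

static void list_init(ListNode *h) { h->prev = h->next = h; }

static void list_push(ListNode *h, ListNode *n)
{
    n->next = h->next;
    n->prev = h;
    h->next->prev = n;
    h->next = n;
}

static void list_delete(ListNode *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

#define CONTAINER_OF(ptr, type, member) \
    ((type *) ((char *) (ptr) - offsetof(type, member)))

typedef struct LocalLock
{
    const char *tag;        /* stand-in for the LOCKTAG */
    long        nLocks;     /* total times held, all owners */
    ListNode    owners;     /* all LockOwnerLinks for this lock */
} LocalLock;

typedef struct ResOwner
{
    const char *name;
    ListNode    locks;      /* all LockOwnerLinks held by this owner */
} ResOwner;

/* one entry per (resource owner, local lock) pair, threaded on both lists */
typedef struct LockOwnerLink
{
    ResOwner   *owner;
    LocalLock  *lock;
    long        nLocks;     /* times held by this owner */
    ListNode    owner_node; /* link in owner->locks */
    ListNode    lock_node;  /* link in lock->owners */
} LockOwnerLink;

static void remember_lock(ResOwner *owner, LocalLock *lock)
{
    LockOwnerLink *link = malloc(sizeof(LockOwnerLink));

    link->owner = owner;
    link->lock = lock;
    link->nLocks = 1;
    lock->nLocks++;
    list_push(&owner->locks, &link->owner_node);
    list_push(&lock->owners, &link->lock_node);
}

/* Release everything held by 'owner'; cost is O(locks held by owner),
 * not O(size of the backend's whole locallock hash table). */
static void release_owner_locks(ResOwner *owner)
{
    ListNode *cur = owner->locks.next;

    while (cur != &owner->locks)
    {
        ListNode *next = cur->next;
        LockOwnerLink *link = CONTAINER_OF(cur, LockOwnerLink, owner_node);

        link->lock->nLocks -= link->nLocks;
        list_delete(&link->owner_node);
        list_delete(&link->lock_node);
        printf("%s released %s, remaining count %ld\n",
               owner->name, link->lock->tag, link->lock->nLocks);
        free(link);
        cur = next;
    }
}

int main(void)
{
    LocalLock rel = {"relation 16384", 0};
    ResOwner  owner = {"toplevel"};

    list_init(&rel.owners);
    list_init(&owner.locks);
    remember_lock(&owner, &rel);
    release_owner_locks(&owner);
    return 0;
}

In the patch the corresponding pieces are LOCALLOCKOWNER with its
resowner_node and locallock_node dlist_nodes, ResourceOwnerData.locks
and LOCALLOCK.locallockowners; the sketch just illustrates the two-list
idea in isolation.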
Here are the full results as output by the attached script.
drowley@amd3990x:~$ echo master
master
drowley@amd3990x:~$ ./lockbench.sh
lock1.sql
tps = 38078.011433 (without initial connection time)
tps = 38070.016792 (without initial connection time)
tps = 39223.118769 (without initial connection time)
tps = 37510.105287 (without initial connection time)
tps = 38164.394128 (without initial connection time)
lock2.sql
tps = 247.963797 (without initial connection time)
tps = 247.374174 (without initial connection time)
tps = 248.412744 (without initial connection time)
tps = 248.192629 (without initial connection time)
tps = 248.503728 (without initial connection time)
lock3.sql
tps = 1162.937692 (without initial connection time)
tps = 1160.968689 (without initial connection time)
tps = 1166.908643 (without initial connection time)
tps = 1160.288547 (without initial connection time)
tps = 1160.336572 (without initial connection time)
lock4.sql
tps = 282.173560 (without initial connection time)
tps = 284.470330 (without initial connection time)
tps = 286.089644 (without initial connection time)
tps = 285.548487 (without initial connection time)
tps = 284.313505 (without initial connection time)
drowley@amd3990x:~$ echo Patched
Patched
drowley@amd3990x:~$ ./lockbench.sh
lock1.sql
tps = 40338.975219 (without initial connection time)
tps = 39803.433365 (without initial connection time)
tps = 39504.824194 (without initial connection time)
tps = 39843.422438 (without initial connection time)
tps = 40624.483013 (without initial connection time)
lock2.sql
tps = 274.413309 (without initial connection time)
tps = 271.978813 (without initial connection time)
tps = 275.795091 (without initial connection time)
tps = 273.628649 (without initial connection time)
tps = 273.049977 (without initial connection time)
lock3.sql
tps = 1168.557054 (without initial connection time)
tps = 1168.139469 (without initial connection time)
tps = 1166.366440 (without initial connection time)
tps = 1165.464214 (without initial connection time)
tps = 1167.250809 (without initial connection time)
lock4.sql
tps = 274.842298 (without initial connection time)
tps = 277.911394 (without initial connection time)
tps = 278.702620 (without initial connection time)
tps = 275.715606 (without initial connection time)
tps = 278.816060 (without initial connection time)
David
Attachments:
lockreleaseall_speedup3.patch (text/plain; charset=US-ASCII; name=lockreleaseall_speedup3.patch)
diff --git a/src/backend/commands/discard.c b/src/backend/commands/discard.c
index c583539e0c..2c43b9e0c8 100644
--- a/src/backend/commands/discard.c
+++ b/src/backend/commands/discard.c
@@ -71,7 +71,7 @@ DiscardAll(bool isTopLevel)
ResetAllOptions();
DropAllPreparedStatements();
Async_UnlistenAll();
- LockReleaseAll(USER_LOCKMETHOD, true);
+ LockReleaseSession(USER_LOCKMETHOD);
ResetPlanCache();
ResetTempTableNamespace();
ResetSequenceCaches();
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index d08ec6c402..563ba681e5 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -182,12 +182,6 @@ holdMask -
subset of the PGPROC object's heldLocks mask (if the PGPROC is
currently waiting for another lock mode on this lock).
-releaseMask -
- A bitmask for the lock modes due to be released during LockReleaseAll.
- This must be a subset of the holdMask. Note that it is modified without
- taking the partition LWLock, and therefore it is unsafe for any
- backend except the one owning the PROCLOCK to examine/change it.
-
lockLink -
List link for shared memory queue of all the PROCLOCK objects for the
same LOCK.
@@ -321,8 +315,7 @@ and will notice any weak lock we take when it does.
Fast-path VXID locks do not use the FastPathStrongRelationLocks table. The
first lock taken on a VXID is always the ExclusiveLock taken by its owner.
-Any subsequent lockers are share lockers waiting for the VXID to terminate.
-Indeed, the only reason VXID locks use the lock manager at all (rather than
+Any subsequent lockers are share lockers wait
waiting for the VXID to terminate via some other method) is for deadlock
detection. Thus, the initial VXID lock can *always* be taken via the fast
path without checking for conflicts. Any subsequent locker must check
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index ee2e15c17e..7bc33f476e 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -22,8 +22,7 @@
* Interface:
*
* InitLocks(), GetLocksMethodTable(), GetLockTagsMethodTable(),
- * LockAcquire(), LockRelease(), LockReleaseAll(),
- * LockCheckConflicts(), GrantLock()
+ * LockAcquire(), LockRelease(), LockCheckConflicts(), GrantLock()
*
*-------------------------------------------------------------------------
*/
@@ -289,6 +288,9 @@ static LOCALLOCK *awaitedLock;
static ResourceOwner awaitedOwner;
+static dlist_head session_locks[lengthof(LockMethods)];
+
+
#ifdef LOCK_DEBUG
/*------
@@ -375,8 +377,9 @@ static void GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner);
static void BeginStrongLockAcquire(LOCALLOCK *locallock, uint32 fasthashcode);
static void FinishStrongLockAcquire(void);
static void WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner);
-static void ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock);
-static void LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent);
+static void ReleaseLockIfHeld(LOCALLOCKOWNER *locallockowner, bool sessionLock);
+static void LockReassignOwner(LOCALLOCKOWNER *locallockowner,
+ ResourceOwner parent);
static bool UnGrantLock(LOCK *lock, LOCKMODE lockmode,
PROCLOCK *proclock, LockMethod lockMethodTable);
static void CleanUpLock(LOCK *lock, PROCLOCK *proclock,
@@ -477,6 +480,10 @@ InitLocks(void)
16,
&info,
HASH_ELEM | HASH_BLOBS);
+
+ /* Initialize each element of the session_locks array */
+ for (int i = 0; i < lengthof(LockMethods); i++)
+ dlist_init(&session_locks[i]);
}
@@ -701,7 +708,7 @@ LockHasWaiters(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
{
PROCLOCK_PRINT("LockHasWaiters: WRONGTYPE", proclock);
LWLockRelease(partitionLock);
- elog(WARNING, "you don't own a lock of type %s",
+ elog(PANIC, "you don't own a lock of type %s",
lockMethodTable->lockModeNames[lockmode]);
RemoveLocalLock(locallock);
return false;
@@ -839,26 +846,9 @@ LockAcquireExtended(const LOCKTAG *locktag,
locallock->nLocks = 0;
locallock->holdsStrongLockCount = false;
locallock->lockCleared = false;
- locallock->numLockOwners = 0;
- locallock->maxLockOwners = 8;
- locallock->lockOwners = NULL; /* in case next line fails */
- locallock->lockOwners = (LOCALLOCKOWNER *)
- MemoryContextAlloc(TopMemoryContext,
- locallock->maxLockOwners * sizeof(LOCALLOCKOWNER));
+ dlist_init(&locallock->locallockowners);
}
- else
- {
- /* Make sure there will be room to remember the lock */
- if (locallock->numLockOwners >= locallock->maxLockOwners)
- {
- int newsize = locallock->maxLockOwners * 2;
- locallock->lockOwners = (LOCALLOCKOWNER *)
- repalloc(locallock->lockOwners,
- newsize * sizeof(LOCALLOCKOWNER));
- locallock->maxLockOwners = newsize;
- }
- }
hashcode = locallock->hashcode;
if (locallockp)
@@ -1268,7 +1258,6 @@ SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
proclock->groupLeader = proc->lockGroupLeader != NULL ?
proc->lockGroupLeader : proc;
proclock->holdMask = 0;
- proclock->releaseMask = 0;
/* Add proclock to appropriate lists */
SHMQueueInsertBefore(&lock->procLocks, &proclock->lockLink);
SHMQueueInsertBefore(&(proc->myProcLocks[partition]),
@@ -1366,17 +1355,18 @@ CheckAndSetLockHeld(LOCALLOCK *locallock, bool acquired)
static void
RemoveLocalLock(LOCALLOCK *locallock)
{
- int i;
+ dlist_mutable_iter iter;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach_modify(iter, &locallock->locallockowners)
{
- if (locallock->lockOwners[i].owner != NULL)
- ResourceOwnerForgetLock(locallock->lockOwners[i].owner, locallock);
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, locallock_node, iter.cur);
+
+ Assert(locallockowner->owner != NULL);
+ dlist_delete(&locallockowner->locallock_node);
+ ResourceOwnerForgetLock(locallockowner);
}
- locallock->numLockOwners = 0;
- if (locallock->lockOwners != NULL)
- pfree(locallock->lockOwners);
- locallock->lockOwners = NULL;
+
+ Assert(dlist_is_empty(&locallock->locallockowners));
if (locallock->holdsStrongLockCount)
{
@@ -1394,7 +1384,7 @@ RemoveLocalLock(LOCALLOCK *locallock)
if (!hash_search(LockMethodLocalHash,
(void *) &(locallock->tag),
HASH_REMOVE, NULL))
- elog(WARNING, "locallock table corrupted");
+ elog(PANIC, "locallock table corrupted");
/*
* Indicate that the lock is released for certain types of locks
@@ -1688,26 +1678,40 @@ CleanUpLock(LOCK *lock, PROCLOCK *proclock,
static void
GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner)
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
- int i;
+ LOCALLOCKOWNER *locallockowner;
+ dlist_iter iter;
- Assert(locallock->numLockOwners < locallock->maxLockOwners);
/* Count the total */
locallock->nLocks++;
+
/* Count the per-owner lock */
- for (i = 0; i < locallock->numLockOwners; i++)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == owner)
+ locallockowner = dlist_container(LOCALLOCKOWNER, locallock_node, iter.cur);
+
+ if (locallockowner->owner == owner)
{
- lockOwners[i].nLocks++;
+ locallockowner->nLocks++;
return;
}
}
- lockOwners[i].owner = owner;
- lockOwners[i].nLocks = 1;
- locallock->numLockOwners++;
+
+ locallockowner = MemoryContextAlloc(TopMemoryContext, sizeof(LOCALLOCKOWNER));
+ locallockowner->owner = owner;
+ locallockowner->nLocks = 1;
+ locallockowner->locallock = locallock;
+
+ dlist_push_tail(&locallock->locallockowners, &locallockowner->locallock_node);
+
if (owner != NULL)
- ResourceOwnerRememberLock(owner, locallock);
+ ResourceOwnerRememberLock(owner, locallockowner);
+ else
+ {
+ LOCKMETHODID lockmethodid = LOCALLOCK_LOCKMETHOD(*locallockowner->locallock);
+
+ Assert(lockmethodid > 0 && lockmethodid <= 2);
+ dlist_push_tail(&session_locks[lockmethodid - 1], &locallockowner->resowner_node);
+ }
/* Indicate that the lock is acquired for certain types of locks. */
CheckAndSetLockHeld(locallock, true);
@@ -2021,9 +2025,9 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
* Decrease the count for the resource owner.
*/
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
ResourceOwner owner;
- int i;
+ dlist_iter iter;
+ bool found = false;
/* Identify owner for lock */
if (sessionLock)
@@ -2031,24 +2035,29 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
else
owner = CurrentResourceOwner;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == owner)
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, locallock_node, iter.cur);
+
+ if (locallockowner->owner != owner)
+ continue;
+
+ found = true;
+
+ if (--locallockowner->nLocks == 0)
{
- Assert(lockOwners[i].nLocks > 0);
- if (--lockOwners[i].nLocks == 0)
- {
- if (owner != NULL)
- ResourceOwnerForgetLock(owner, locallock);
- /* compact out unused slot */
- locallock->numLockOwners--;
- if (i < locallock->numLockOwners)
- lockOwners[i] = lockOwners[locallock->numLockOwners];
- }
- break;
+ dlist_delete(&locallockowner->locallock_node);
+
+ if (owner != NULL)
+ ResourceOwnerForgetLock(locallockowner);
+ else
+ dlist_delete(&locallockowner->resowner_node);
}
+
+ Assert(locallockowner->nLocks >= 0);
}
- if (i < 0)
+
+ if (!found)
{
/* don't release a lock belonging to another owner */
elog(WARNING, "you don't own a lock of type %s",
@@ -2066,6 +2075,8 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
if (locallock->nLocks > 0)
return true;
+ Assert(locallock->nLocks >= 0);
+
/*
* At this point we can no longer suppose we are clear of invalidation
* messages related to this lock. Although we'll delete the LOCALLOCK
@@ -2147,7 +2158,7 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
{
PROCLOCK_PRINT("LockRelease: WRONGTYPE", proclock);
LWLockRelease(partitionLock);
- elog(WARNING, "you don't own a lock of type %s",
+ elog(PANIC, "you don't own a lock of type %s",
lockMethodTable->lockModeNames[lockmode]);
RemoveLocalLock(locallock);
return false;
@@ -2168,283 +2179,44 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
return true;
}
+#ifdef USE_ASSERT_CHECKING
/*
- * LockReleaseAll -- Release all locks of the specified lock method that
- * are held by the current process.
- *
- * Well, not necessarily *all* locks. The available behaviors are:
- * allLocks == true: release all locks including session locks.
- * allLocks == false: release all non-session locks.
+ * LockAssertNoneHeld -- Assert that we no longer hold any DEFAULT_LOCKMETHOD
+ * locks during an abort.
*/
-void
-LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
+extern void
+LockAssertNoneHeld(bool isCommit)
{
HASH_SEQ_STATUS status;
- LockMethod lockMethodTable;
- int i,
- numLockModes;
LOCALLOCK *locallock;
- LOCK *lock;
- PROCLOCK *proclock;
- int partition;
- bool have_fast_path_lwlock = false;
-
- if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
- elog(ERROR, "unrecognized lock method: %d", lockmethodid);
- lockMethodTable = LockMethods[lockmethodid];
-
-#ifdef LOCK_DEBUG
- if (*(lockMethodTable->trace_flag))
- elog(LOG, "LockReleaseAll: lockmethod=%d", lockmethodid);
-#endif
-
- /*
- * Get rid of our fast-path VXID lock, if appropriate. Note that this is
- * the only way that the lock we hold on our own VXID can ever get
- * released: it is always and only released when a toplevel transaction
- * ends.
- */
- if (lockmethodid == DEFAULT_LOCKMETHOD)
- VirtualXactLockTableCleanup();
-
- numLockModes = lockMethodTable->numLockModes;
-
- /*
- * First we run through the locallock table and get rid of unwanted
- * entries, then we scan the process's proclocks and get rid of those. We
- * do this separately because we may have multiple locallock entries
- * pointing to the same proclock, and we daren't end up with any dangling
- * pointers. Fast-path locks are cleaned up during the locallock table
- * scan, though.
- */
- hash_seq_init(&status, LockMethodLocalHash);
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ if (!isCommit)
{
- /*
- * If the LOCALLOCK entry is unused, we must've run out of shared
- * memory while trying to set up this lock. Just forget the local
- * entry.
- */
- if (locallock->nLocks == 0)
- {
- RemoveLocalLock(locallock);
- continue;
- }
-
- /* Ignore items that are not of the lockmethod to be removed */
- if (LOCALLOCK_LOCKMETHOD(*locallock) != lockmethodid)
- continue;
+ hash_seq_init(&status, LockMethodLocalHash);
- /*
- * If we are asked to release all locks, we can just zap the entry.
- * Otherwise, must scan to see if there are session locks. We assume
- * there is at most one lockOwners entry for session locks.
- */
- if (!allLocks)
+ while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
+ dlist_iter local_iter;
- /* If session lock is above array position 0, move it down to 0 */
- for (i = 0; i < locallock->numLockOwners; i++)
- {
- if (lockOwners[i].owner == NULL)
- lockOwners[0] = lockOwners[i];
- else
- ResourceOwnerForgetLock(lockOwners[i].owner, locallock);
- }
+ Assert(locallock->nLocks >= 0);
- if (locallock->numLockOwners > 0 &&
- lockOwners[0].owner == NULL &&
- lockOwners[0].nLocks > 0)
+ dlist_foreach(local_iter, &locallock->locallockowners)
{
- /* Fix the locallock to show just the session locks */
- locallock->nLocks = lockOwners[0].nLocks;
- locallock->numLockOwners = 1;
- /* We aren't deleting this locallock, so done */
- continue;
- }
- else
- locallock->numLockOwners = 0;
- }
-
- /*
- * If the lock or proclock pointers are NULL, this lock was taken via
- * the relation fast-path (and is not known to have been transferred).
- */
- if (locallock->proclock == NULL || locallock->lock == NULL)
- {
- LOCKMODE lockmode = locallock->tag.mode;
- Oid relid;
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ locallock_node,
+ local_iter.cur);
- /* Verify that a fast-path lock is what we've got. */
- if (!EligibleForRelationFastPath(&locallock->tag.lock, lockmode))
- elog(PANIC, "locallock table corrupted");
-
- /*
- * If we don't currently hold the LWLock that protects our
- * fast-path data structures, we must acquire it before attempting
- * to release the lock via the fast-path. We will continue to
- * hold the LWLock until we're done scanning the locallock table,
- * unless we hit a transferred fast-path lock. (XXX is this
- * really such a good idea? There could be a lot of entries ...)
- */
- if (!have_fast_path_lwlock)
- {
- LWLockAcquire(&MyProc->fpInfoLock, LW_EXCLUSIVE);
- have_fast_path_lwlock = true;
- }
+ Assert(locallockowner->owner == NULL);
- /* Attempt fast-path release. */
- relid = locallock->tag.lock.locktag_field2;
- if (FastPathUnGrantRelationLock(relid, lockmode))
- {
- RemoveLocalLock(locallock);
- continue;
+ if (locallockowner->nLocks > 0 &&
+ LOCALLOCK_LOCKMETHOD(*locallock) == DEFAULT_LOCKMETHOD)
+ Assert(false);
}
-
- /*
- * Our lock, originally taken via the fast path, has been
- * transferred to the main lock table. That's going to require
- * some extra work, so release our fast-path lock before starting.
- */
- LWLockRelease(&MyProc->fpInfoLock);
- have_fast_path_lwlock = false;
-
- /*
- * Now dump the lock. We haven't got a pointer to the LOCK or
- * PROCLOCK in this case, so we have to handle this a bit
- * differently than a normal lock release. Unfortunately, this
- * requires an extra LWLock acquire-and-release cycle on the
- * partitionLock, but hopefully it shouldn't happen often.
- */
- LockRefindAndRelease(lockMethodTable, MyProc,
- &locallock->tag.lock, lockmode, false);
- RemoveLocalLock(locallock);
- continue;
}
-
- /* Mark the proclock to show we need to release this lockmode */
- if (locallock->nLocks > 0)
- locallock->proclock->releaseMask |= LOCKBIT_ON(locallock->tag.mode);
-
- /* And remove the locallock hashtable entry */
- RemoveLocalLock(locallock);
}
-
- /* Done with the fast-path data structures */
- if (have_fast_path_lwlock)
- LWLockRelease(&MyProc->fpInfoLock);
-
- /*
- * Now, scan each lock partition separately.
- */
- for (partition = 0; partition < NUM_LOCK_PARTITIONS; partition++)
- {
- LWLock *partitionLock;
- SHM_QUEUE *procLocks = &(MyProc->myProcLocks[partition]);
- PROCLOCK *nextplock;
-
- partitionLock = LockHashPartitionLockByIndex(partition);
-
- /*
- * If the proclock list for this partition is empty, we can skip
- * acquiring the partition lock. This optimization is trickier than
- * it looks, because another backend could be in process of adding
- * something to our proclock list due to promoting one of our
- * fast-path locks. However, any such lock must be one that we
- * decided not to delete above, so it's okay to skip it again now;
- * we'd just decide not to delete it again. We must, however, be
- * careful to re-fetch the list header once we've acquired the
- * partition lock, to be sure we have a valid, up-to-date pointer.
- * (There is probably no significant risk if pointer fetch/store is
- * atomic, but we don't wish to assume that.)
- *
- * XXX This argument assumes that the locallock table correctly
- * represents all of our fast-path locks. While allLocks mode
- * guarantees to clean up all of our normal locks regardless of the
- * locallock situation, we lose that guarantee for fast-path locks.
- * This is not ideal.
- */
- if (SHMQueueNext(procLocks, procLocks,
- offsetof(PROCLOCK, procLink)) == NULL)
- continue; /* needn't examine this partition */
-
- LWLockAcquire(partitionLock, LW_EXCLUSIVE);
-
- for (proclock = (PROCLOCK *) SHMQueueNext(procLocks, procLocks,
- offsetof(PROCLOCK, procLink));
- proclock;
- proclock = nextplock)
- {
- bool wakeupNeeded = false;
-
- /* Get link first, since we may unlink/delete this proclock */
- nextplock = (PROCLOCK *)
- SHMQueueNext(procLocks, &proclock->procLink,
- offsetof(PROCLOCK, procLink));
-
- Assert(proclock->tag.myProc == MyProc);
-
- lock = proclock->tag.myLock;
-
- /* Ignore items that are not of the lockmethod to be removed */
- if (LOCK_LOCKMETHOD(*lock) != lockmethodid)
- continue;
-
- /*
- * In allLocks mode, force release of all locks even if locallock
- * table had problems
- */
- if (allLocks)
- proclock->releaseMask = proclock->holdMask;
- else
- Assert((proclock->releaseMask & ~proclock->holdMask) == 0);
-
- /*
- * Ignore items that have nothing to be released, unless they have
- * holdMask == 0 and are therefore recyclable
- */
- if (proclock->releaseMask == 0 && proclock->holdMask != 0)
- continue;
-
- PROCLOCK_PRINT("LockReleaseAll", proclock);
- LOCK_PRINT("LockReleaseAll", lock, 0);
- Assert(lock->nRequested >= 0);
- Assert(lock->nGranted >= 0);
- Assert(lock->nGranted <= lock->nRequested);
- Assert((proclock->holdMask & ~lock->grantMask) == 0);
-
- /*
- * Release the previously-marked lock modes
- */
- for (i = 1; i <= numLockModes; i++)
- {
- if (proclock->releaseMask & LOCKBIT_ON(i))
- wakeupNeeded |= UnGrantLock(lock, i, proclock,
- lockMethodTable);
- }
- Assert((lock->nRequested >= 0) && (lock->nGranted >= 0));
- Assert(lock->nGranted <= lock->nRequested);
- LOCK_PRINT("LockReleaseAll: updated", lock, 0);
-
- proclock->releaseMask = 0;
-
- /* CleanUpLock will wake up waiters if needed. */
- CleanUpLock(lock, proclock,
- lockMethodTable,
- LockTagHashCode(&lock->tag),
- wakeupNeeded);
- } /* loop over PROCLOCKs within this partition */
-
- LWLockRelease(partitionLock);
- } /* loop over partitions */
-
-#ifdef LOCK_DEBUG
- if (*(lockMethodTable->trace_flag))
- elog(LOG, "LockReleaseAll done");
-#endif
+ Assert(MyProc->fpLockBits == 0);
}
+#endif
/*
* LockReleaseSession -- Release all session locks of the specified lock method
@@ -2453,59 +2225,41 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
void
LockReleaseSession(LOCKMETHODID lockmethodid)
{
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
+ dlist_mutable_iter iter;
if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
elog(ERROR, "unrecognized lock method: %d", lockmethodid);
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &session_locks[lockmethodid - 1])
{
- /* Ignore items that are not of the specified lock method */
- if (LOCALLOCK_LOCKMETHOD(*locallock) != lockmethodid)
- continue;
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, resowner_node, iter.cur);
- ReleaseLockIfHeld(locallock, true);
+ Assert(LOCALLOCK_LOCKMETHOD(*locallockowner->locallock) == lockmethodid);
+
+ ReleaseLockIfHeld(locallockowner, true);
}
+
+ Assert(dlist_is_empty(&session_locks[lockmethodid - 1]));
}
/*
* LockReleaseCurrentOwner
- * Release all locks belonging to CurrentResourceOwner
- *
- * If the caller knows what those locks are, it can pass them as an array.
- * That speeds up the call significantly, when a lot of locks are held.
- * Otherwise, pass NULL for locallocks, and we'll traverse through our hash
- * table to find them.
+ * Release all locks belonging to 'owner'
*/
-void
-LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks)
+extern void
+LockReleaseCurrentOwner(ResourceOwner owner, dlist_node *resowner_node)
{
- if (locallocks == NULL)
- {
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, resowner_node, resowner_node);
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
- ReleaseLockIfHeld(locallock, false);
- }
- else
- {
- int i;
+ Assert(locallockowner->owner == owner);
- for (i = nlocks - 1; i >= 0; i--)
- ReleaseLockIfHeld(locallocks[i], false);
- }
+ ReleaseLockIfHeld(locallockowner, false);
}
/*
* ReleaseLockIfHeld
- * Release any session-level locks on this lockable object if sessionLock
- * is true; else, release any locks held by CurrentResourceOwner.
+ * Release any session-level locks on this 'locallockowner' if
+ * sessionLock is true; else, release any locks held by 'locallockowner'.
*
* It is tempting to pass this a ResourceOwner pointer (or NULL for session
* locks), but without refactoring LockRelease() we cannot support releasing
@@ -2516,52 +2270,40 @@ LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks)
* convenience.
*/
static void
-ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock)
+ReleaseLockIfHeld(LOCALLOCKOWNER *locallockowner, bool sessionLock)
{
- ResourceOwner owner;
- LOCALLOCKOWNER *lockOwners;
- int i;
+ LOCALLOCK *locallock = locallockowner->locallock;
+
+ /* release all references to the lock by this resource owner */
- /* Identify owner for lock (must match LockRelease!) */
if (sessionLock)
- owner = NULL;
+ Assert(locallockowner->owner == NULL);
else
- owner = CurrentResourceOwner;
+ Assert(locallockowner->owner != NULL);
- /* Scan to see if there are any locks belonging to the target owner */
- lockOwners = locallock->lockOwners;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ /* We will still hold this lock after forgetting this ResourceOwner. */
+ if (locallockowner->nLocks < locallock->nLocks)
{
- if (lockOwners[i].owner == owner)
- {
- Assert(lockOwners[i].nLocks > 0);
- if (lockOwners[i].nLocks < locallock->nLocks)
- {
- /*
- * We will still hold this lock after forgetting this
- * ResourceOwner.
- */
- locallock->nLocks -= lockOwners[i].nLocks;
- /* compact out unused slot */
- locallock->numLockOwners--;
- if (owner != NULL)
- ResourceOwnerForgetLock(owner, locallock);
- if (i < locallock->numLockOwners)
- lockOwners[i] = lockOwners[locallock->numLockOwners];
- }
- else
- {
- Assert(lockOwners[i].nLocks == locallock->nLocks);
- /* We want to call LockRelease just once */
- lockOwners[i].nLocks = 1;
- locallock->nLocks = 1;
- if (!LockRelease(&locallock->tag.lock,
- locallock->tag.mode,
- sessionLock))
- elog(WARNING, "ReleaseLockIfHeld: failed??");
- }
- break;
- }
+ locallock->nLocks -= locallockowner->nLocks;
+ Assert(locallock->nLocks >= 0);
+ dlist_delete(&locallockowner->locallock_node);
+
+ if (sessionLock)
+ dlist_delete(&locallockowner->resowner_node);
+ else
+ ResourceOwnerForgetLock(locallockowner);
+ }
+ else
+ {
+ Assert(locallockowner->nLocks == locallock->nLocks);
+ /* We want to call LockRelease just once */
+ locallockowner->nLocks = 1;
+ locallock->nLocks = 1;
+
+ if (!LockRelease(&locallock->tag.lock,
+ locallock->tag.mode,
+ sessionLock))
+ elog(PANIC, "ReleaseLockIfHeld: failed??");
}
}
@@ -2576,75 +2318,46 @@ ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock)
* and we'll traverse through our hash table to find them.
*/
void
-LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks)
+LockReassignCurrentOwner(ResourceOwner owner, dlist_node *resowner_node)
{
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, resowner_node, resowner_node);
ResourceOwner parent = ResourceOwnerGetParent(CurrentResourceOwner);
- Assert(parent != NULL);
-
- if (locallocks == NULL)
- {
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
-
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
- LockReassignOwner(locallock, parent);
- }
- else
- {
- int i;
-
- for (i = nlocks - 1; i >= 0; i--)
- LockReassignOwner(locallocks[i], parent);
- }
+ LockReassignOwner(locallockowner, parent);
}
/*
- * Subroutine of LockReassignCurrentOwner. Reassigns a given lock belonging to
- * CurrentResourceOwner to its parent.
+ * Subroutine of LockReassignCurrentOwner. Reassigns the given
+ * 'locallockowner' to 'parent'.
*/
static void
-LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent)
+LockReassignOwner(LOCALLOCKOWNER *locallockowner, ResourceOwner parent)
{
- LOCALLOCKOWNER *lockOwners;
- int i;
- int ic = -1;
- int ip = -1;
+ dlist_iter iter;
+ LOCALLOCK *locallock = locallockowner->locallock;
- /*
- * Scan to see if there are any locks belonging to current owner or its
- * parent
- */
- lockOwners = locallock->lockOwners;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ ResourceOwnerForgetLock(locallockowner);
+
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == CurrentResourceOwner)
- ic = i;
- else if (lockOwners[i].owner == parent)
- ip = i;
- }
+ LOCALLOCKOWNER *parentlocalowner = dlist_container(LOCALLOCKOWNER, locallock_node, iter.cur);
- if (ic < 0)
- return; /* no current locks */
+ Assert(parentlocalowner->locallock == locallock);
- if (ip < 0)
- {
- /* Parent has no slot, so just give it the child's slot */
- lockOwners[ic].owner = parent;
- ResourceOwnerRememberLock(parent, locallock);
- }
- else
- {
- /* Merge child's count with parent's */
- lockOwners[ip].nLocks += lockOwners[ic].nLocks;
- /* compact out unused slot */
- locallock->numLockOwners--;
- if (ic < locallock->numLockOwners)
- lockOwners[ic] = lockOwners[locallock->numLockOwners];
+ if (parentlocalowner->owner != parent)
+ continue;
+
+ parentlocalowner->nLocks += locallockowner->nLocks;
+
+ locallockowner->nLocks = 0;
+ dlist_delete(&locallockowner->locallock_node);
+ pfree(locallockowner);
+ return;
}
- ResourceOwnerForgetLock(CurrentResourceOwner, locallock);
+
+ /* reassign locallockowner to parent resowner */
+ locallockowner->owner = parent;
+ ResourceOwnerRememberLock(parent, locallockowner);
}
/*
@@ -3124,7 +2837,7 @@ GetLockConflicts(const LOCKTAG *locktag, LOCKMODE lockmode, int *countp)
* We currently use this in two situations: first, to release locks held by
* prepared transactions on commit (see lock_twophase_postcommit); and second,
* to release locks taken via the fast-path, transferred to the main hash
- * table, and then released (see LockReleaseAll).
+ * table, and then released (see ResourceOwnerRelease).
*/
static void
LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
@@ -3179,7 +2892,7 @@ LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
{
PROCLOCK_PRINT("lock_twophase_postcommit: WRONGTYPE", proclock);
LWLockRelease(partitionLock);
- elog(WARNING, "you don't own a lock of type %s",
+ elog(PANIC, "you don't own a lock of type %s",
lockMethodTable->lockModeNames[lockmode]);
return;
}
@@ -3260,10 +2973,9 @@ CheckForSessionAndXactLocks(void)
while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
PerLockTagEntry *hentry;
bool found;
- int i;
+ dlist_iter iter;
/*
* Ignore VXID locks. We don't want those to be held by prepared
@@ -3284,9 +2996,13 @@ CheckForSessionAndXactLocks(void)
hentry->sessLock = hentry->xactLock = false;
/* Scan to see if we hold lock at session or xact level or both */
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == NULL)
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ locallock_node,
+ iter.cur);
+
+ if (locallockowner->owner == NULL)
hentry->sessLock = true;
else
hentry->xactLock = true;
@@ -3333,10 +3049,9 @@ AtPrepare_Locks(void)
while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
{
TwoPhaseLockRecord record;
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
bool haveSessionLock;
bool haveXactLock;
- int i;
+ dlist_iter iter;
/*
* Ignore VXID locks. We don't want those to be held by prepared
@@ -3351,9 +3066,13 @@ AtPrepare_Locks(void)
/* Scan to see whether we hold it at session or transaction level */
haveSessionLock = haveXactLock = false;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == NULL)
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ locallock_node,
+ iter.cur);
+
+ if (locallockowner->owner == NULL)
haveSessionLock = true;
else
haveXactLock = true;
@@ -3411,8 +3130,8 @@ AtPrepare_Locks(void)
* pointers in the transaction's resource owner. This is OK at the
* moment since resowner.c doesn't try to free locks retail at a toplevel
* transaction commit or abort. We could alternatively zero out nLocks
- * and leave the LOCALLOCK entries to be garbage-collected by LockReleaseAll,
- * but that probably costs more cycles.
+ * and leave the LOCALLOCK entries to be garbage-collected by
+ * ResourceOwnerRelease, but that probably costs more cycles.
*/
void
PostPrepare_Locks(TransactionId xid)
@@ -3445,10 +3164,9 @@ PostPrepare_Locks(TransactionId xid)
while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
bool haveSessionLock;
bool haveXactLock;
- int i;
+ dlist_iter iter;
if (locallock->proclock == NULL || locallock->lock == NULL)
{
@@ -3467,9 +3185,13 @@ PostPrepare_Locks(TransactionId xid)
/* Scan to see whether we hold it at session or transaction level */
haveSessionLock = haveXactLock = false;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == NULL)
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ locallock_node,
+ iter.cur);
+
+ if (locallockowner->owner == NULL)
haveSessionLock = true;
else
haveXactLock = true;
@@ -3485,10 +3207,6 @@ PostPrepare_Locks(TransactionId xid)
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot PREPARE while holding both session-level and transaction-level locks on the same object")));
- /* Mark the proclock to show we need to release this lockmode */
- if (locallock->nLocks > 0)
- locallock->proclock->releaseMask |= LOCKBIT_ON(locallock->tag.mode);
-
/* And remove the locallock hashtable entry */
RemoveLocalLock(locallock);
}
@@ -3506,11 +3224,7 @@ PostPrepare_Locks(TransactionId xid)
/*
* If the proclock list for this partition is empty, we can skip
- * acquiring the partition lock. This optimization is safer than the
- * situation in LockReleaseAll, because we got rid of any fast-path
- * locks during AtPrepare_Locks, so there cannot be any case where
- * another backend is adding something to our lists now. For safety,
- * though, we code this the same way as in LockReleaseAll.
+ * acquiring the partition lock.
*/
if (SHMQueueNext(procLocks, procLocks,
offsetof(PROCLOCK, procLink)) == NULL)
@@ -3543,14 +3257,6 @@ PostPrepare_Locks(TransactionId xid)
Assert(lock->nGranted <= lock->nRequested);
Assert((proclock->holdMask & ~lock->grantMask) == 0);
- /* Ignore it if nothing to release (must be a session lock) */
- if (proclock->releaseMask == 0)
- continue;
-
- /* Else we should be releasing all locks */
- if (proclock->releaseMask != proclock->holdMask)
- elog(PANIC, "we seem to have dropped a bit somewhere");
-
/*
* We cannot simply modify proclock->tag.myProc to reassign
* ownership of the lock, because that's part of the hash key and
@@ -3600,6 +3306,7 @@ PostPrepare_Locks(TransactionId xid)
} /* loop over partitions */
END_CRIT_SECTION();
+
}
@@ -4336,7 +4043,6 @@ lock_twophase_recover(TransactionId xid, uint16 info,
Assert(proc->lockGroupLeader == NULL);
proclock->groupLeader = proc;
proclock->holdMask = 0;
- proclock->releaseMask = 0;
/* Add proclock to appropriate lists */
SHMQueueInsertBefore(&lock->procLocks, &proclock->lockLink);
SHMQueueInsertBefore(&(proc->myProcLocks[partition]),
@@ -4473,7 +4179,7 @@ lock_twophase_postabort(TransactionId xid, uint16 info,
*
* We don't bother recording this lock in the local lock table, since it's
* only ever released at the end of a transaction. Instead,
- * LockReleaseAll() calls VirtualXactLockTableCleanup().
+ * ProcReleaseLocks() calls VirtualXactLockTableCleanup().
*/
void
VirtualXactLockTableInsert(VirtualTransactionId vxid)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 90283f8a9f..5c2e1cbd06 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -779,10 +779,17 @@ ProcReleaseLocks(bool isCommit)
return;
/* If waiting, get off wait queue (should only be needed after error) */
LockErrorCleanup();
- /* Release standard locks, including session-level if aborting */
- LockReleaseAll(DEFAULT_LOCKMETHOD, !isCommit);
- /* Release transaction-level advisory locks */
- LockReleaseAll(USER_LOCKMETHOD, false);
+
+ VirtualXactLockTableCleanup();
+
+ /* Release session-level locks if aborting */
+ if (!isCommit)
+ LockReleaseSession(DEFAULT_LOCKMETHOD);
+
+#ifdef USE_ASSERT_CHECKING
+ /* Ensure all locks were released */
+ LockAssertNoneHeld(isCommit);
+#endif
}
@@ -864,6 +871,8 @@ ProcKill(int code, Datum arg)
LWLockRelease(leader_lwlock);
}
+ Assert(MyProc->fpLockBits == 0);
+
/*
* Reset MyLatch to the process local one. This is so that signal
* handlers et al can continue using the latch after the shared latch
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 86d193c89f..1cf3d77f92 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -1261,7 +1261,7 @@ ShutdownPostgres(int code, Datum arg)
* User locks are not released by transaction end, so be sure to release
* them explicitly.
*/
- LockReleaseAll(USER_LOCKMETHOD, true);
+ LockReleaseSession(USER_LOCKMETHOD);
/*
* temp debugging aid to analyze 019_replslot_limit failures
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 3236b1b919..52f015b7f8 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -33,6 +33,7 @@
#include "utils/resowner_private.h"
#include "utils/snapmgr.h"
+#include "lib/ilist.h"
/*
* All resource IDs managed by this code are required to fit into a Datum,
@@ -133,9 +134,7 @@ typedef struct ResourceOwnerData
ResourceArray cryptohasharr; /* cryptohash contexts */
ResourceArray hmacarr; /* HMAC contexts */
- /* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
- int nlocks; /* number of owned locks */
- LOCALLOCK *locks[MAX_RESOWNER_LOCKS]; /* list of owned locks */
+ dlist_head locks; /* dlist of owned locks */
} ResourceOwnerData;
@@ -452,6 +451,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
ResourceArrayInit(&(owner->jitarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->cryptohasharr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->hmacarr), PointerGetDatum(NULL));
+ dlist_init(&owner->locks);
return owner;
}
@@ -585,8 +585,15 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
+ dlist_mutable_iter iter;
+
if (isTopLevel)
{
+ dlist_foreach_modify(iter, &owner->locks)
+ LockReleaseCurrentOwner(owner, iter.cur);
+
+ Assert(dlist_is_empty(&owner->locks));
+
/*
* For a top-level xact we are going to release all locks (or at
* least all non-session locks), so just do a single lmgr call at
@@ -605,30 +612,20 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
* subtransaction, we do NOT release its locks yet, but transfer
* them to the parent.
*/
- LOCALLOCK **locks;
- int nlocks;
-
- Assert(owner->parent != NULL);
-
- /*
- * Pass the list of locks owned by this resource owner to the lock
- * manager, unless it has overflowed.
- */
- if (owner->nlocks > MAX_RESOWNER_LOCKS)
+ if (isCommit)
{
- locks = NULL;
- nlocks = 0;
+ dlist_foreach_modify(iter, &owner->locks)
+ LockReassignCurrentOwner(owner, iter.cur);
+
+ Assert(dlist_is_empty(&owner->locks));
}
else
{
- locks = owner->locks;
- nlocks = owner->nlocks;
- }
+ dlist_foreach_modify(iter, &owner->locks)
+ LockReleaseCurrentOwner(owner, iter.cur);
- if (isCommit)
- LockReassignCurrentOwner(locks, nlocks);
- else
- LockReleaseCurrentOwner(locks, nlocks);
+ Assert(dlist_is_empty(&owner->locks));
+ }
}
}
else if (phase == RESOURCE_RELEASE_AFTER_LOCKS)
@@ -752,7 +749,7 @@ ResourceOwnerDelete(ResourceOwner owner)
Assert(owner->jitarr.nitems == 0);
Assert(owner->cryptohasharr.nitems == 0);
Assert(owner->hmacarr.nitems == 0);
- Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
+ Assert(dlist_is_empty(&owner->locks));
/*
* Delete children. The recursive call will delink the child from me, so
@@ -974,54 +971,61 @@ ResourceOwnerForgetBuffer(ResourceOwner owner, Buffer buffer)
/*
* Remember that a Local Lock is owned by a ResourceOwner
- *
- * This is different from the other Remember functions in that the list of
- * locks is only a lossy cache. It can hold up to MAX_RESOWNER_LOCKS entries,
- * and when it overflows, we stop tracking locks. The point of only remembering
- * only up to MAX_RESOWNER_LOCKS entries is that if a lot of locks are held,
- * ResourceOwnerForgetLock doesn't need to scan through a large array to find
- * the entry.
*/
void
-ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCK *locallock)
+ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCKOWNER *locallockowner)
{
- Assert(locallock != NULL);
-
- if (owner->nlocks > MAX_RESOWNER_LOCKS)
- return; /* we have already overflowed */
+ Assert(owner != NULL);
+ Assert(locallockowner != NULL);
- if (owner->nlocks < MAX_RESOWNER_LOCKS)
- owner->locks[owner->nlocks] = locallock;
- else
+#ifdef USE_ASSERT_CHECKING
{
- /* overflowed */
+ dlist_iter iter;
+
+ dlist_foreach(iter, &owner->locks)
+ {
+ LOCALLOCKOWNER *i = dlist_container(LOCALLOCKOWNER, resowner_node, iter.cur);
+
+ Assert(i->locallock != locallockowner->locallock);
+ }
}
- owner->nlocks++;
+#endif
+
+ dlist_push_tail(&owner->locks, &locallockowner->resowner_node);
}
/*
- * Forget that a Local Lock is owned by a ResourceOwner
+ * Forget that a Local Lock is owned by the given LOCALLOCKOWNER.
*/
void
-ResourceOwnerForgetLock(ResourceOwner owner, LOCALLOCK *locallock)
+ResourceOwnerForgetLock(LOCALLOCKOWNER *locallockowner)
{
- int i;
+#ifdef USE_ASSERT_CHECKING
+ ResourceOwner owner;
+
+ Assert(locallockowner != NULL);
- if (owner->nlocks > MAX_RESOWNER_LOCKS)
- return; /* we have overflowed */
+ owner = locallockowner->owner;
- Assert(owner->nlocks > 0);
- for (i = owner->nlocks - 1; i >= 0; i--)
{
- if (locallock == owner->locks[i])
+ dlist_iter iter;
+ bool found = false;
+
+ dlist_foreach(iter, &owner->locks)
{
- owner->locks[i] = owner->locks[owner->nlocks - 1];
- owner->nlocks--;
- return;
+ LOCALLOCKOWNER *owner = dlist_container(LOCALLOCKOWNER, resowner_node, iter.cur);
+
+ if (locallockowner == owner)
+ {
+ Assert(!found);
+ found = true;
+ }
}
+
+ Assert(found);
}
- elog(ERROR, "lock reference %p is not owned by resource owner %s",
- locallock, owner->name);
+#endif
+ dlist_delete(&locallockowner->resowner_node);
}
/*
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index dc537e20f2..f664bd5136 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -23,6 +23,7 @@
#include "storage/lwlock.h"
#include "storage/shmem.h"
#include "utils/timestamp.h"
+#include "lib/ilist.h"
/* struct PGPROC is declared in proc.h, but must forward-reference it */
typedef struct PGPROC PGPROC;
@@ -341,10 +342,6 @@ typedef struct LOCK
* Otherwise, proclock objects whose holdMasks are zero are recycled
* as soon as convenient.
*
- * releaseMask is workspace for LockReleaseAll(): it shows the locks due
- * to be released during the current call. This must only be examined or
- * set by the backend owning the PROCLOCK.
- *
* Each PROCLOCK object is linked into lists for both the associated LOCK
* object and the owning PGPROC object. Note that the PROCLOCK is entered
* into these lists as soon as it is created, even if no lock has yet been
@@ -366,7 +363,6 @@ typedef struct PROCLOCK
/* data */
PGPROC *groupLeader; /* proc's lock group leader, or proc itself */
LOCKMASK holdMask; /* bitmask for lock types currently held */
- LOCKMASK releaseMask; /* bitmask for lock types to be released */
SHM_QUEUE lockLink; /* list link in LOCK's list of proclocks */
SHM_QUEUE procLink; /* list link in PGPROC's list of proclocks */
} PROCLOCK;
@@ -412,6 +408,13 @@ typedef struct LOCALLOCKOWNER
* Must use a forward struct reference to avoid circularity.
*/
struct ResourceOwnerData *owner;
+
+ dlist_node resowner_node;
+
+ dlist_node locallock_node;
+
+ struct LOCALLOCK *locallock;
+
int64 nLocks; /* # of times held by this owner */
} LOCALLOCKOWNER;
@@ -425,9 +428,9 @@ typedef struct LOCALLOCK
LOCK *lock; /* associated LOCK object, if any */
PROCLOCK *proclock; /* associated PROCLOCK object, if any */
int64 nLocks; /* total number of times lock is held */
- int numLockOwners; /* # of relevant ResourceOwners */
- int maxLockOwners; /* allocated size of array */
- LOCALLOCKOWNER *lockOwners; /* dynamically resizable array */
+
+ dlist_head locallockowners; /* dlist of LOCALLOCKOWNER */
+
bool holdsStrongLockCount; /* bumped FastPathStrongRelationLocks */
bool lockCleared; /* we read all sinval msgs for lock */
} LOCALLOCK;
@@ -556,10 +559,17 @@ extern void AbortStrongLockAcquire(void);
extern void MarkLockClear(LOCALLOCK *locallock);
extern bool LockRelease(const LOCKTAG *locktag,
LOCKMODE lockmode, bool sessionLock);
-extern void LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks);
+
+#ifdef USE_ASSERT_CHECKING
+extern void LockAssertNoneHeld(bool isCommit);
+#endif
+
extern void LockReleaseSession(LOCKMETHODID lockmethodid);
-extern void LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks);
-extern void LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks);
+struct ResourceOwnerData;
+extern void LockReleaseCurrentOwner(struct ResourceOwnerData *owner,
+ dlist_node *resowner_node);
+extern void LockReassignCurrentOwner(struct ResourceOwnerData *owner,
+ dlist_node *resowner_node);
extern bool LockHeldByMe(const LOCKTAG *locktag, LOCKMODE lockmode);
#ifdef USE_ASSERT_CHECKING
extern HTAB *GetLockMethodLocalHash(void);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index d01cccc27c..ab4796d287 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -31,8 +31,9 @@ extern void ResourceOwnerRememberBuffer(ResourceOwner owner, Buffer buffer);
extern void ResourceOwnerForgetBuffer(ResourceOwner owner, Buffer buffer);
/* support for local lock management */
-extern void ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCK *locallock);
-extern void ResourceOwnerForgetLock(ResourceOwner owner, LOCALLOCK *locallock);
+extern void ResourceOwnerRememberLock(ResourceOwner owner,
+ LOCALLOCKOWNER *locallock);
+extern void ResourceOwnerForgetLock(LOCALLOCKOWNER *locallock);
/* support for catcache refcount management */
extern void ResourceOwnerEnlargeCatCacheRefs(ResourceOwner owner);
diff --git a/src/tools/make_mkid b/src/tools/make_mkid
deleted file mode 100755
index 6f160614cd..0000000000
--- a/src/tools/make_mkid
+++ /dev/null
@@ -1,11 +0,0 @@
-#!/bin/sh
-
-# src/tools/make_mkid
-
-mkid `find \`pwd\`/ \( -name _deadcode -a -prune \) -o \
- -type f -name '*.[chyl]' -print|sed 's;//;/;g'`
-
-find . \( -name .git -a -prune \) -o -type d -print |while read DIR
-do
- [ "$DIR" != "." ] && ln -f -s `echo "$DIR" | sed 's;/[^/]*;/..;g'`/ID $DIR/ID
-done
Good day, David.
I'm looking at the patch and there are a few points I don't understand.
`GrantLockLocal` allocates a `LOCALLOCKOWNER` and links it into
`locallock->locallockowners`. It links it regardless of the fact that `owner` could be
NULL. But then `RemoveLocalLock` does `Assert(locallockowner->owner != NULL);`.
Why shouldn't that fail?
`GrantLockLocal` allocates the `LOCALLOCKOWNER` in `TopMemoryContext`,
but there is only a single `pfree(locallockowner)`, in `LockReassignOwner`.
It looks like there should be more `pfree` calls. Shouldn't there?
`GrantLockLocal` does `dlist_push_tail`, but isn't it better to
do `dlist_push_head`? Resource owners usually form a stack, so when an owner
searches for itself it is usually the one most recently added to the list.
`dlist_foreach` would then find it sooner if it were added at the head.
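A minimal sketch of the lookup I mean (illustrative only; the helper name is
made up here, while the struct fields are the ones added by the patch):

#include "postgres.h"
#include "lib/ilist.h"
#include "storage/lock.h"
#include "utils/resowner.h"

/*
 * With dlist_push_head, the most recently granted owner -- typically
 * CurrentResourceOwner -- would be the first node this loop visits.
 * With dlist_push_tail it is the last one visited.
 */
static LOCALLOCKOWNER *
find_owner_sketch(LOCALLOCK *locallock, ResourceOwner owner)
{
    dlist_iter  iter;

    dlist_foreach(iter, &locallock->locallockowners)
    {
        LOCALLOCKOWNER *llo = dlist_container(LOCALLOCKOWNER,
                                              locallock_node, iter.cur);

        if (llo->owner == owner)
            return llo;
    }
    return NULL;
}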
regards
---------
Yura Sokolov
Postgres Professional
y.sokolov@postgrespro.ru
funny.falcon@gmail.com
On Wed, 6 Apr 2022 at 03:40, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
I'm looking at the patch and there are a few points I don't understand.
`GrantLockLocal` allocates a `LOCALLOCKOWNER` and links it into
`locallock->locallockowners`. It links it regardless of the fact that `owner` could be
NULL. But then `RemoveLocalLock` does `Assert(locallockowner->owner != NULL);`.
Why shouldn't that fail?
`GrantLockLocal` allocates the `LOCALLOCKOWNER` in `TopMemoryContext`,
but there is only a single `pfree(locallockowner)`, in `LockReassignOwner`.
It looks like there should be more `pfree` calls. Shouldn't there?
`GrantLockLocal` does `dlist_push_tail`, but isn't it better to
do `dlist_push_head`? Resource owners usually form a stack, so when an owner
searches for itself it is usually the one most recently added to the list.
`dlist_foreach` would then find it sooner if it were added at the head.
Thanks for having a look at this. It's a bit unrealistic for me to
find time to address these for v15, so I've pushed this one out to
the next CF.
David
This entry has been waiting on author input for a while (our current
threshold is roughly two weeks), so I've marked it Returned with
Feedback.
Once you think the patchset is ready for review again, you (or any
interested party) can resurrect the patch entry by visiting
https://commitfest.postgresql.org/38/3501/
and changing the status to "Needs Review", and then changing the
status again to "Move to next CF". (Don't forget the second step;
hopefully we will have streamlined this in the near future!)
Thanks,
--Jacob
On Wed, 3 Aug 2022 at 07:04, Jacob Champion <jchampion@timescale.com> wrote:
This entry has been waiting on author input for a while (our current
threshold is roughly two weeks), so I've marked it Returned with
Feedback.
Thanks for taking care of this. You dealt with this correctly based on
the fact that I'd failed to rebase before or during the entire July
CF.
I'm still interested in having the LockReleaseAll slowness fixed, so
here's a rebased patch.
David
Attachments:
lockreleaseall_speedup4.patchtext/plain; charset=US-ASCII; name=lockreleaseall_speedup4.patchDownload
diff --git a/src/backend/commands/discard.c b/src/backend/commands/discard.c
index c583539e0c..2c43b9e0c8 100644
--- a/src/backend/commands/discard.c
+++ b/src/backend/commands/discard.c
@@ -71,7 +71,7 @@ DiscardAll(bool isTopLevel)
ResetAllOptions();
DropAllPreparedStatements();
Async_UnlistenAll();
- LockReleaseAll(USER_LOCKMETHOD, true);
+ LockReleaseSession(USER_LOCKMETHOD);
ResetPlanCache();
ResetTempTableNamespace();
ResetSequenceCaches();
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index d08ec6c402..563ba681e5 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -182,12 +182,6 @@ holdMask -
subset of the PGPROC object's heldLocks mask (if the PGPROC is
currently waiting for another lock mode on this lock).
-releaseMask -
- A bitmask for the lock modes due to be released during LockReleaseAll.
- This must be a subset of the holdMask. Note that it is modified without
- taking the partition LWLock, and therefore it is unsafe for any
- backend except the one owning the PROCLOCK to examine/change it.
-
lockLink -
List link for shared memory queue of all the PROCLOCK objects for the
same LOCK.
@@ -321,8 +315,7 @@ and will notice any weak lock we take when it does.
Fast-path VXID locks do not use the FastPathStrongRelationLocks table. The
first lock taken on a VXID is always the ExclusiveLock taken by its owner.
-Any subsequent lockers are share lockers waiting for the VXID to terminate.
-Indeed, the only reason VXID locks use the lock manager at all (rather than
+Any subsequent lockers are share lockers wait
waiting for the VXID to terminate via some other method) is for deadlock
detection. Thus, the initial VXID lock can *always* be taken via the fast
path without checking for conflicts. Any subsequent locker must check
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 5f5803f681..381a2527f7 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -22,8 +22,7 @@
* Interface:
*
* InitLocks(), GetLocksMethodTable(), GetLockTagsMethodTable(),
- * LockAcquire(), LockRelease(), LockReleaseAll(),
- * LockCheckConflicts(), GrantLock()
+ * LockAcquire(), LockRelease(), LockCheckConflicts(), GrantLock()
*
*-------------------------------------------------------------------------
*/
@@ -289,6 +288,9 @@ static LOCALLOCK *awaitedLock;
static ResourceOwner awaitedOwner;
+static dlist_head session_locks[lengthof(LockMethods)];
+
+
#ifdef LOCK_DEBUG
/*------
@@ -375,8 +377,9 @@ static void GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner);
static void BeginStrongLockAcquire(LOCALLOCK *locallock, uint32 fasthashcode);
static void FinishStrongLockAcquire(void);
static void WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner);
-static void ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock);
-static void LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent);
+static void ReleaseLockIfHeld(LOCALLOCKOWNER *locallockowner, bool sessionLock);
+static void LockReassignOwner(LOCALLOCKOWNER *locallockowner,
+ ResourceOwner parent);
static bool UnGrantLock(LOCK *lock, LOCKMODE lockmode,
PROCLOCK *proclock, LockMethod lockMethodTable);
static void CleanUpLock(LOCK *lock, PROCLOCK *proclock,
@@ -477,6 +480,10 @@ InitLocks(void)
16,
&info,
HASH_ELEM | HASH_BLOBS);
+
+ /* Initialize each element of the session_locks array */
+ for (int i = 0; i < lengthof(LockMethods); i++)
+ dlist_init(&session_locks[i]);
}
@@ -701,7 +708,7 @@ LockHasWaiters(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
{
PROCLOCK_PRINT("LockHasWaiters: WRONGTYPE", proclock);
LWLockRelease(partitionLock);
- elog(WARNING, "you don't own a lock of type %s",
+ elog(PANIC, "you don't own a lock of type %s",
lockMethodTable->lockModeNames[lockmode]);
RemoveLocalLock(locallock);
return false;
@@ -839,26 +846,9 @@ LockAcquireExtended(const LOCKTAG *locktag,
locallock->nLocks = 0;
locallock->holdsStrongLockCount = false;
locallock->lockCleared = false;
- locallock->numLockOwners = 0;
- locallock->maxLockOwners = 8;
- locallock->lockOwners = NULL; /* in case next line fails */
- locallock->lockOwners = (LOCALLOCKOWNER *)
- MemoryContextAlloc(TopMemoryContext,
- locallock->maxLockOwners * sizeof(LOCALLOCKOWNER));
+ dlist_init(&locallock->locallockowners);
}
- else
- {
- /* Make sure there will be room to remember the lock */
- if (locallock->numLockOwners >= locallock->maxLockOwners)
- {
- int newsize = locallock->maxLockOwners * 2;
- locallock->lockOwners = (LOCALLOCKOWNER *)
- repalloc(locallock->lockOwners,
- newsize * sizeof(LOCALLOCKOWNER));
- locallock->maxLockOwners = newsize;
- }
- }
hashcode = locallock->hashcode;
if (locallockp)
@@ -1268,7 +1258,6 @@ SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
proclock->groupLeader = proc->lockGroupLeader != NULL ?
proc->lockGroupLeader : proc;
proclock->holdMask = 0;
- proclock->releaseMask = 0;
/* Add proclock to appropriate lists */
SHMQueueInsertBefore(&lock->procLocks, &proclock->lockLink);
SHMQueueInsertBefore(&(proc->myProcLocks[partition]),
@@ -1366,17 +1355,18 @@ CheckAndSetLockHeld(LOCALLOCK *locallock, bool acquired)
static void
RemoveLocalLock(LOCALLOCK *locallock)
{
- int i;
+ dlist_mutable_iter iter;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach_modify(iter, &locallock->locallockowners)
{
- if (locallock->lockOwners[i].owner != NULL)
- ResourceOwnerForgetLock(locallock->lockOwners[i].owner, locallock);
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, locallock_node, iter.cur);
+
+ Assert(locallockowner->owner != NULL);
+ dlist_delete(&locallockowner->locallock_node);
+ ResourceOwnerForgetLock(locallockowner);
}
- locallock->numLockOwners = 0;
- if (locallock->lockOwners != NULL)
- pfree(locallock->lockOwners);
- locallock->lockOwners = NULL;
+
+ Assert(dlist_is_empty(&locallock->locallockowners));
if (locallock->holdsStrongLockCount)
{
@@ -1394,7 +1384,7 @@ RemoveLocalLock(LOCALLOCK *locallock)
if (!hash_search(LockMethodLocalHash,
(void *) &(locallock->tag),
HASH_REMOVE, NULL))
- elog(WARNING, "locallock table corrupted");
+ elog(PANIC, "locallock table corrupted");
/*
* Indicate that the lock is released for certain types of locks
@@ -1688,26 +1678,40 @@ CleanUpLock(LOCK *lock, PROCLOCK *proclock,
static void
GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner)
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
- int i;
+ LOCALLOCKOWNER *locallockowner;
+ dlist_iter iter;
- Assert(locallock->numLockOwners < locallock->maxLockOwners);
/* Count the total */
locallock->nLocks++;
+
/* Count the per-owner lock */
- for (i = 0; i < locallock->numLockOwners; i++)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == owner)
+ locallockowner = dlist_container(LOCALLOCKOWNER, locallock_node, iter.cur);
+
+ if (locallockowner->owner == owner)
{
- lockOwners[i].nLocks++;
+ locallockowner->nLocks++;
return;
}
}
- lockOwners[i].owner = owner;
- lockOwners[i].nLocks = 1;
- locallock->numLockOwners++;
+
+ locallockowner = MemoryContextAlloc(TopMemoryContext, sizeof(LOCALLOCKOWNER));
+ locallockowner->owner = owner;
+ locallockowner->nLocks = 1;
+ locallockowner->locallock = locallock;
+
+ dlist_push_tail(&locallock->locallockowners, &locallockowner->locallock_node);
+
if (owner != NULL)
- ResourceOwnerRememberLock(owner, locallock);
+ ResourceOwnerRememberLock(owner, locallockowner);
+ else
+ {
+ LOCKMETHODID lockmethodid = LOCALLOCK_LOCKMETHOD(*locallockowner->locallock);
+
+ Assert(lockmethodid > 0 && lockmethodid <= 2);
+ dlist_push_tail(&session_locks[lockmethodid - 1], &locallockowner->resowner_node);
+ }
/* Indicate that the lock is acquired for certain types of locks. */
CheckAndSetLockHeld(locallock, true);
@@ -2021,9 +2025,9 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
* Decrease the count for the resource owner.
*/
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
ResourceOwner owner;
- int i;
+ dlist_iter iter;
+ bool found = false;
/* Identify owner for lock */
if (sessionLock)
@@ -2031,24 +2035,29 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
else
owner = CurrentResourceOwner;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == owner)
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, locallock_node, iter.cur);
+
+ if (locallockowner->owner != owner)
+ continue;
+
+ found = true;
+
+ if (--locallockowner->nLocks == 0)
{
- Assert(lockOwners[i].nLocks > 0);
- if (--lockOwners[i].nLocks == 0)
- {
- if (owner != NULL)
- ResourceOwnerForgetLock(owner, locallock);
- /* compact out unused slot */
- locallock->numLockOwners--;
- if (i < locallock->numLockOwners)
- lockOwners[i] = lockOwners[locallock->numLockOwners];
- }
- break;
+ dlist_delete(&locallockowner->locallock_node);
+
+ if (owner != NULL)
+ ResourceOwnerForgetLock(locallockowner);
+ else
+ dlist_delete(&locallockowner->resowner_node);
}
+
+ Assert(locallockowner->nLocks >= 0);
}
- if (i < 0)
+
+ if (!found)
{
/* don't release a lock belonging to another owner */
elog(WARNING, "you don't own a lock of type %s",
@@ -2066,6 +2075,8 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
if (locallock->nLocks > 0)
return true;
+ Assert(locallock->nLocks >= 0);
+
/*
* At this point we can no longer suppose we are clear of invalidation
* messages related to this lock. Although we'll delete the LOCALLOCK
@@ -2147,7 +2158,7 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
{
PROCLOCK_PRINT("LockRelease: WRONGTYPE", proclock);
LWLockRelease(partitionLock);
- elog(WARNING, "you don't own a lock of type %s",
+ elog(PANIC, "you don't own a lock of type %s",
lockMethodTable->lockModeNames[lockmode]);
RemoveLocalLock(locallock);
return false;
@@ -2168,283 +2179,44 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
return true;
}
+#ifdef USE_ASSERT_CHECKING
/*
- * LockReleaseAll -- Release all locks of the specified lock method that
- * are held by the current process.
- *
- * Well, not necessarily *all* locks. The available behaviors are:
- * allLocks == true: release all locks including session locks.
- * allLocks == false: release all non-session locks.
+ * LockAssertNoneHeld -- Assert that we no longer hold any DEFAULT_LOCKMETHOD
+ * locks during an abort.
*/
-void
-LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
+extern void
+LockAssertNoneHeld(bool isCommit)
{
HASH_SEQ_STATUS status;
- LockMethod lockMethodTable;
- int i,
- numLockModes;
LOCALLOCK *locallock;
- LOCK *lock;
- PROCLOCK *proclock;
- int partition;
- bool have_fast_path_lwlock = false;
-
- if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
- elog(ERROR, "unrecognized lock method: %d", lockmethodid);
- lockMethodTable = LockMethods[lockmethodid];
-
-#ifdef LOCK_DEBUG
- if (*(lockMethodTable->trace_flag))
- elog(LOG, "LockReleaseAll: lockmethod=%d", lockmethodid);
-#endif
-
- /*
- * Get rid of our fast-path VXID lock, if appropriate. Note that this is
- * the only way that the lock we hold on our own VXID can ever get
- * released: it is always and only released when a toplevel transaction
- * ends.
- */
- if (lockmethodid == DEFAULT_LOCKMETHOD)
- VirtualXactLockTableCleanup();
-
- numLockModes = lockMethodTable->numLockModes;
-
- /*
- * First we run through the locallock table and get rid of unwanted
- * entries, then we scan the process's proclocks and get rid of those. We
- * do this separately because we may have multiple locallock entries
- * pointing to the same proclock, and we daren't end up with any dangling
- * pointers. Fast-path locks are cleaned up during the locallock table
- * scan, though.
- */
- hash_seq_init(&status, LockMethodLocalHash);
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ if (!isCommit)
{
- /*
- * If the LOCALLOCK entry is unused, we must've run out of shared
- * memory while trying to set up this lock. Just forget the local
- * entry.
- */
- if (locallock->nLocks == 0)
- {
- RemoveLocalLock(locallock);
- continue;
- }
-
- /* Ignore items that are not of the lockmethod to be removed */
- if (LOCALLOCK_LOCKMETHOD(*locallock) != lockmethodid)
- continue;
+ hash_seq_init(&status, LockMethodLocalHash);
- /*
- * If we are asked to release all locks, we can just zap the entry.
- * Otherwise, must scan to see if there are session locks. We assume
- * there is at most one lockOwners entry for session locks.
- */
- if (!allLocks)
+ while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
+ dlist_iter local_iter;
- /* If session lock is above array position 0, move it down to 0 */
- for (i = 0; i < locallock->numLockOwners; i++)
- {
- if (lockOwners[i].owner == NULL)
- lockOwners[0] = lockOwners[i];
- else
- ResourceOwnerForgetLock(lockOwners[i].owner, locallock);
- }
+ Assert(locallock->nLocks >= 0);
- if (locallock->numLockOwners > 0 &&
- lockOwners[0].owner == NULL &&
- lockOwners[0].nLocks > 0)
+ dlist_foreach(local_iter, &locallock->locallockowners)
{
- /* Fix the locallock to show just the session locks */
- locallock->nLocks = lockOwners[0].nLocks;
- locallock->numLockOwners = 1;
- /* We aren't deleting this locallock, so done */
- continue;
- }
- else
- locallock->numLockOwners = 0;
- }
-
- /*
- * If the lock or proclock pointers are NULL, this lock was taken via
- * the relation fast-path (and is not known to have been transferred).
- */
- if (locallock->proclock == NULL || locallock->lock == NULL)
- {
- LOCKMODE lockmode = locallock->tag.mode;
- Oid relid;
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ locallock_node,
+ local_iter.cur);
- /* Verify that a fast-path lock is what we've got. */
- if (!EligibleForRelationFastPath(&locallock->tag.lock, lockmode))
- elog(PANIC, "locallock table corrupted");
-
- /*
- * If we don't currently hold the LWLock that protects our
- * fast-path data structures, we must acquire it before attempting
- * to release the lock via the fast-path. We will continue to
- * hold the LWLock until we're done scanning the locallock table,
- * unless we hit a transferred fast-path lock. (XXX is this
- * really such a good idea? There could be a lot of entries ...)
- */
- if (!have_fast_path_lwlock)
- {
- LWLockAcquire(&MyProc->fpInfoLock, LW_EXCLUSIVE);
- have_fast_path_lwlock = true;
- }
+ Assert(locallockowner->owner == NULL);
- /* Attempt fast-path release. */
- relid = locallock->tag.lock.locktag_field2;
- if (FastPathUnGrantRelationLock(relid, lockmode))
- {
- RemoveLocalLock(locallock);
- continue;
+ if (locallockowner->nLocks > 0 &&
+ LOCALLOCK_LOCKMETHOD(*locallock) == DEFAULT_LOCKMETHOD)
+ Assert(false);
}
-
- /*
- * Our lock, originally taken via the fast path, has been
- * transferred to the main lock table. That's going to require
- * some extra work, so release our fast-path lock before starting.
- */
- LWLockRelease(&MyProc->fpInfoLock);
- have_fast_path_lwlock = false;
-
- /*
- * Now dump the lock. We haven't got a pointer to the LOCK or
- * PROCLOCK in this case, so we have to handle this a bit
- * differently than a normal lock release. Unfortunately, this
- * requires an extra LWLock acquire-and-release cycle on the
- * partitionLock, but hopefully it shouldn't happen often.
- */
- LockRefindAndRelease(lockMethodTable, MyProc,
- &locallock->tag.lock, lockmode, false);
- RemoveLocalLock(locallock);
- continue;
}
-
- /* Mark the proclock to show we need to release this lockmode */
- if (locallock->nLocks > 0)
- locallock->proclock->releaseMask |= LOCKBIT_ON(locallock->tag.mode);
-
- /* And remove the locallock hashtable entry */
- RemoveLocalLock(locallock);
}
-
- /* Done with the fast-path data structures */
- if (have_fast_path_lwlock)
- LWLockRelease(&MyProc->fpInfoLock);
-
- /*
- * Now, scan each lock partition separately.
- */
- for (partition = 0; partition < NUM_LOCK_PARTITIONS; partition++)
- {
- LWLock *partitionLock;
- SHM_QUEUE *procLocks = &(MyProc->myProcLocks[partition]);
- PROCLOCK *nextplock;
-
- partitionLock = LockHashPartitionLockByIndex(partition);
-
- /*
- * If the proclock list for this partition is empty, we can skip
- * acquiring the partition lock. This optimization is trickier than
- * it looks, because another backend could be in process of adding
- * something to our proclock list due to promoting one of our
- * fast-path locks. However, any such lock must be one that we
- * decided not to delete above, so it's okay to skip it again now;
- * we'd just decide not to delete it again. We must, however, be
- * careful to re-fetch the list header once we've acquired the
- * partition lock, to be sure we have a valid, up-to-date pointer.
- * (There is probably no significant risk if pointer fetch/store is
- * atomic, but we don't wish to assume that.)
- *
- * XXX This argument assumes that the locallock table correctly
- * represents all of our fast-path locks. While allLocks mode
- * guarantees to clean up all of our normal locks regardless of the
- * locallock situation, we lose that guarantee for fast-path locks.
- * This is not ideal.
- */
- if (SHMQueueNext(procLocks, procLocks,
- offsetof(PROCLOCK, procLink)) == NULL)
- continue; /* needn't examine this partition */
-
- LWLockAcquire(partitionLock, LW_EXCLUSIVE);
-
- for (proclock = (PROCLOCK *) SHMQueueNext(procLocks, procLocks,
- offsetof(PROCLOCK, procLink));
- proclock;
- proclock = nextplock)
- {
- bool wakeupNeeded = false;
-
- /* Get link first, since we may unlink/delete this proclock */
- nextplock = (PROCLOCK *)
- SHMQueueNext(procLocks, &proclock->procLink,
- offsetof(PROCLOCK, procLink));
-
- Assert(proclock->tag.myProc == MyProc);
-
- lock = proclock->tag.myLock;
-
- /* Ignore items that are not of the lockmethod to be removed */
- if (LOCK_LOCKMETHOD(*lock) != lockmethodid)
- continue;
-
- /*
- * In allLocks mode, force release of all locks even if locallock
- * table had problems
- */
- if (allLocks)
- proclock->releaseMask = proclock->holdMask;
- else
- Assert((proclock->releaseMask & ~proclock->holdMask) == 0);
-
- /*
- * Ignore items that have nothing to be released, unless they have
- * holdMask == 0 and are therefore recyclable
- */
- if (proclock->releaseMask == 0 && proclock->holdMask != 0)
- continue;
-
- PROCLOCK_PRINT("LockReleaseAll", proclock);
- LOCK_PRINT("LockReleaseAll", lock, 0);
- Assert(lock->nRequested >= 0);
- Assert(lock->nGranted >= 0);
- Assert(lock->nGranted <= lock->nRequested);
- Assert((proclock->holdMask & ~lock->grantMask) == 0);
-
- /*
- * Release the previously-marked lock modes
- */
- for (i = 1; i <= numLockModes; i++)
- {
- if (proclock->releaseMask & LOCKBIT_ON(i))
- wakeupNeeded |= UnGrantLock(lock, i, proclock,
- lockMethodTable);
- }
- Assert((lock->nRequested >= 0) && (lock->nGranted >= 0));
- Assert(lock->nGranted <= lock->nRequested);
- LOCK_PRINT("LockReleaseAll: updated", lock, 0);
-
- proclock->releaseMask = 0;
-
- /* CleanUpLock will wake up waiters if needed. */
- CleanUpLock(lock, proclock,
- lockMethodTable,
- LockTagHashCode(&lock->tag),
- wakeupNeeded);
- } /* loop over PROCLOCKs within this partition */
-
- LWLockRelease(partitionLock);
- } /* loop over partitions */
-
-#ifdef LOCK_DEBUG
- if (*(lockMethodTable->trace_flag))
- elog(LOG, "LockReleaseAll done");
-#endif
+ Assert(MyProc->fpLockBits == 0);
}
+#endif
/*
* LockReleaseSession -- Release all session locks of the specified lock method
@@ -2453,59 +2225,41 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
void
LockReleaseSession(LOCKMETHODID lockmethodid)
{
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
+ dlist_mutable_iter iter;
if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
elog(ERROR, "unrecognized lock method: %d", lockmethodid);
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &session_locks[lockmethodid - 1])
{
- /* Ignore items that are not of the specified lock method */
- if (LOCALLOCK_LOCKMETHOD(*locallock) != lockmethodid)
- continue;
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, resowner_node, iter.cur);
- ReleaseLockIfHeld(locallock, true);
+ Assert(LOCALLOCK_LOCKMETHOD(*locallockowner->locallock) == lockmethodid);
+
+ ReleaseLockIfHeld(locallockowner, true);
}
+
+ Assert(dlist_is_empty(&session_locks[lockmethodid - 1]));
}
/*
* LockReleaseCurrentOwner
- * Release all locks belonging to CurrentResourceOwner
- *
- * If the caller knows what those locks are, it can pass them as an array.
- * That speeds up the call significantly, when a lot of locks are held.
- * Otherwise, pass NULL for locallocks, and we'll traverse through our hash
- * table to find them.
+ * Release all locks belonging to 'owner'
*/
-void
-LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks)
+extern void
+LockReleaseCurrentOwner(ResourceOwner owner, dlist_node *resowner_node)
{
- if (locallocks == NULL)
- {
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, resowner_node, resowner_node);
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
- ReleaseLockIfHeld(locallock, false);
- }
- else
- {
- int i;
+ Assert(locallockowner->owner == owner);
- for (i = nlocks - 1; i >= 0; i--)
- ReleaseLockIfHeld(locallocks[i], false);
- }
+ ReleaseLockIfHeld(locallockowner, false);
}
/*
* ReleaseLockIfHeld
- * Release any session-level locks on this lockable object if sessionLock
- * is true; else, release any locks held by CurrentResourceOwner.
+ * Release any session-level locks on this 'locallockowner' if
+ * sessionLock is true; else, release any locks held by 'locallockowner'.
*
* It is tempting to pass this a ResourceOwner pointer (or NULL for session
* locks), but without refactoring LockRelease() we cannot support releasing
@@ -2516,52 +2270,39 @@ LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks)
* convenience.
*/
static void
-ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock)
+ReleaseLockIfHeld(LOCALLOCKOWNER *locallockowner, bool sessionLock)
{
- ResourceOwner owner;
- LOCALLOCKOWNER *lockOwners;
- int i;
+ LOCALLOCK *locallock = locallockowner->locallock;
+
+ /* release all references to the lock by this resource owner */
- /* Identify owner for lock (must match LockRelease!) */
if (sessionLock)
- owner = NULL;
+ Assert(locallockowner->owner == NULL);
else
- owner = CurrentResourceOwner;
+ Assert(locallockowner->owner != NULL);
- /* Scan to see if there are any locks belonging to the target owner */
- lockOwners = locallock->lockOwners;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ /* We will still hold this lock after forgetting this ResourceOwner. */
+ if (locallockowner->nLocks < locallock->nLocks)
{
- if (lockOwners[i].owner == owner)
- {
- Assert(lockOwners[i].nLocks > 0);
- if (lockOwners[i].nLocks < locallock->nLocks)
- {
- /*
- * We will still hold this lock after forgetting this
- * ResourceOwner.
- */
- locallock->nLocks -= lockOwners[i].nLocks;
- /* compact out unused slot */
- locallock->numLockOwners--;
- if (owner != NULL)
- ResourceOwnerForgetLock(owner, locallock);
- if (i < locallock->numLockOwners)
- lockOwners[i] = lockOwners[locallock->numLockOwners];
- }
- else
- {
- Assert(lockOwners[i].nLocks == locallock->nLocks);
- /* We want to call LockRelease just once */
- lockOwners[i].nLocks = 1;
- locallock->nLocks = 1;
- if (!LockRelease(&locallock->tag.lock,
- locallock->tag.mode,
- sessionLock))
- elog(WARNING, "ReleaseLockIfHeld: failed??");
- }
- break;
- }
+ locallock->nLocks -= locallockowner->nLocks;
+ dlist_delete(&locallockowner->locallock_node);
+
+ if (sessionLock)
+ dlist_delete(&locallockowner->resowner_node);
+ else
+ ResourceOwnerForgetLock(locallockowner);
+ }
+ else
+ {
+ Assert(locallockowner->nLocks == locallock->nLocks);
+ /* We want to call LockRelease just once */
+ locallockowner->nLocks = 1;
+ locallock->nLocks = 1;
+
+ if (!LockRelease(&locallock->tag.lock,
+ locallock->tag.mode,
+ sessionLock))
+ elog(PANIC, "ReleaseLockIfHeld: failed??");
}
}
@@ -2576,75 +2317,46 @@ ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock)
* and we'll traverse through our hash table to find them.
*/
void
-LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks)
+LockReassignCurrentOwner(ResourceOwner owner, dlist_node *resowner_node)
{
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, resowner_node, resowner_node);
ResourceOwner parent = ResourceOwnerGetParent(CurrentResourceOwner);
- Assert(parent != NULL);
-
- if (locallocks == NULL)
- {
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
-
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
- LockReassignOwner(locallock, parent);
- }
- else
- {
- int i;
-
- for (i = nlocks - 1; i >= 0; i--)
- LockReassignOwner(locallocks[i], parent);
- }
+ LockReassignOwner(locallockowner, parent);
}
/*
- * Subroutine of LockReassignCurrentOwner. Reassigns a given lock belonging to
- * CurrentResourceOwner to its parent.
+ * Subroutine of LockReassignCurrentOwner. Reassigns the given
+ * 'locallockowner' to 'parent'.
*/
static void
-LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent)
+LockReassignOwner(LOCALLOCKOWNER *locallockowner, ResourceOwner parent)
{
- LOCALLOCKOWNER *lockOwners;
- int i;
- int ic = -1;
- int ip = -1;
+ dlist_iter iter;
+ LOCALLOCK *locallock = locallockowner->locallock;
- /*
- * Scan to see if there are any locks belonging to current owner or its
- * parent
- */
- lockOwners = locallock->lockOwners;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ ResourceOwnerForgetLock(locallockowner);
+
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == CurrentResourceOwner)
- ic = i;
- else if (lockOwners[i].owner == parent)
- ip = i;
- }
+ LOCALLOCKOWNER *parentlocalowner = dlist_container(LOCALLOCKOWNER, locallock_node, iter.cur);
- if (ic < 0)
- return; /* no current locks */
+ Assert(parentlocalowner->locallock == locallock);
- if (ip < 0)
- {
- /* Parent has no slot, so just give it the child's slot */
- lockOwners[ic].owner = parent;
- ResourceOwnerRememberLock(parent, locallock);
- }
- else
- {
- /* Merge child's count with parent's */
- lockOwners[ip].nLocks += lockOwners[ic].nLocks;
- /* compact out unused slot */
- locallock->numLockOwners--;
- if (ic < locallock->numLockOwners)
- lockOwners[ic] = lockOwners[locallock->numLockOwners];
+ if (parentlocalowner->owner != parent)
+ continue;
+
+ parentlocalowner->nLocks += locallockowner->nLocks;
+
+ locallockowner->nLocks = 0;
+ dlist_delete(&locallockowner->locallock_node);
+ pfree(locallockowner);
+ return;
}
- ResourceOwnerForgetLock(CurrentResourceOwner, locallock);
+
+ /* reassign locallockowner to parent resowner */
+ locallockowner->owner = parent;
+ ResourceOwnerRememberLock(parent, locallockowner);
}
/*
@@ -3123,7 +2835,7 @@ GetLockConflicts(const LOCKTAG *locktag, LOCKMODE lockmode, int *countp)
* We currently use this in two situations: first, to release locks held by
* prepared transactions on commit (see lock_twophase_postcommit); and second,
* to release locks taken via the fast-path, transferred to the main hash
- * table, and then released (see LockReleaseAll).
+ * table, and then released (see ResourceOwnerRelease).
*/
static void
LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
@@ -3178,7 +2890,7 @@ LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
{
PROCLOCK_PRINT("lock_twophase_postcommit: WRONGTYPE", proclock);
LWLockRelease(partitionLock);
- elog(WARNING, "you don't own a lock of type %s",
+ elog(PANIC, "you don't own a lock of type %s",
lockMethodTable->lockModeNames[lockmode]);
return;
}
@@ -3259,10 +2971,9 @@ CheckForSessionAndXactLocks(void)
while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
PerLockTagEntry *hentry;
bool found;
- int i;
+ dlist_iter iter;
/*
* Ignore VXID locks. We don't want those to be held by prepared
@@ -3283,9 +2994,13 @@ CheckForSessionAndXactLocks(void)
hentry->sessLock = hentry->xactLock = false;
/* Scan to see if we hold lock at session or xact level or both */
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == NULL)
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ locallock_node,
+ iter.cur);
+
+ if (locallockowner->owner == NULL)
hentry->sessLock = true;
else
hentry->xactLock = true;
@@ -3332,10 +3047,9 @@ AtPrepare_Locks(void)
while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
{
TwoPhaseLockRecord record;
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
bool haveSessionLock;
bool haveXactLock;
- int i;
+ dlist_iter iter;
/*
* Ignore VXID locks. We don't want those to be held by prepared
@@ -3350,9 +3064,13 @@ AtPrepare_Locks(void)
/* Scan to see whether we hold it at session or transaction level */
haveSessionLock = haveXactLock = false;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == NULL)
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ locallock_node,
+ iter.cur);
+
+ if (locallockowner->owner == NULL)
haveSessionLock = true;
else
haveXactLock = true;
@@ -3410,8 +3128,8 @@ AtPrepare_Locks(void)
* pointers in the transaction's resource owner. This is OK at the
* moment since resowner.c doesn't try to free locks retail at a toplevel
* transaction commit or abort. We could alternatively zero out nLocks
- * and leave the LOCALLOCK entries to be garbage-collected by LockReleaseAll,
- * but that probably costs more cycles.
+ * and leave the LOCALLOCK entries to be garbage-collected by
+ * ResourceOwnerRelease, but that probably costs more cycles.
*/
void
PostPrepare_Locks(TransactionId xid)
@@ -3444,10 +3162,9 @@ PostPrepare_Locks(TransactionId xid)
while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
bool haveSessionLock;
bool haveXactLock;
- int i;
+ dlist_iter iter;
if (locallock->proclock == NULL || locallock->lock == NULL)
{
@@ -3466,9 +3183,13 @@ PostPrepare_Locks(TransactionId xid)
/* Scan to see whether we hold it at session or transaction level */
haveSessionLock = haveXactLock = false;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == NULL)
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ locallock_node,
+ iter.cur);
+
+ if (locallockowner->owner == NULL)
haveSessionLock = true;
else
haveXactLock = true;
@@ -3484,10 +3205,6 @@ PostPrepare_Locks(TransactionId xid)
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot PREPARE while holding both session-level and transaction-level locks on the same object")));
- /* Mark the proclock to show we need to release this lockmode */
- if (locallock->nLocks > 0)
- locallock->proclock->releaseMask |= LOCKBIT_ON(locallock->tag.mode);
-
/* And remove the locallock hashtable entry */
RemoveLocalLock(locallock);
}
@@ -3505,11 +3222,7 @@ PostPrepare_Locks(TransactionId xid)
/*
* If the proclock list for this partition is empty, we can skip
- * acquiring the partition lock. This optimization is safer than the
- * situation in LockReleaseAll, because we got rid of any fast-path
- * locks during AtPrepare_Locks, so there cannot be any case where
- * another backend is adding something to our lists now. For safety,
- * though, we code this the same way as in LockReleaseAll.
+ * acquiring the partition lock.
*/
if (SHMQueueNext(procLocks, procLocks,
offsetof(PROCLOCK, procLink)) == NULL)
@@ -3542,14 +3255,6 @@ PostPrepare_Locks(TransactionId xid)
Assert(lock->nGranted <= lock->nRequested);
Assert((proclock->holdMask & ~lock->grantMask) == 0);
- /* Ignore it if nothing to release (must be a session lock) */
- if (proclock->releaseMask == 0)
- continue;
-
- /* Else we should be releasing all locks */
- if (proclock->releaseMask != proclock->holdMask)
- elog(PANIC, "we seem to have dropped a bit somewhere");
-
/*
* We cannot simply modify proclock->tag.myProc to reassign
* ownership of the lock, because that's part of the hash key and
@@ -3599,6 +3304,7 @@ PostPrepare_Locks(TransactionId xid)
} /* loop over partitions */
END_CRIT_SECTION();
+
}
@@ -4333,7 +4039,6 @@ lock_twophase_recover(TransactionId xid, uint16 info,
Assert(proc->lockGroupLeader == NULL);
proclock->groupLeader = proc;
proclock->holdMask = 0;
- proclock->releaseMask = 0;
/* Add proclock to appropriate lists */
SHMQueueInsertBefore(&lock->procLocks, &proclock->lockLink);
SHMQueueInsertBefore(&(proc->myProcLocks[partition]),
@@ -4470,7 +4175,7 @@ lock_twophase_postabort(TransactionId xid, uint16 info,
*
* We don't bother recording this lock in the local lock table, since it's
* only ever released at the end of a transaction. Instead,
- * LockReleaseAll() calls VirtualXactLockTableCleanup().
+ * ProcReleaseLocks() calls VirtualXactLockTableCleanup().
*/
void
VirtualXactLockTableInsert(VirtualTransactionId vxid)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 37aaab1338..0160f4ef77 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -778,10 +778,17 @@ ProcReleaseLocks(bool isCommit)
return;
/* If waiting, get off wait queue (should only be needed after error) */
LockErrorCleanup();
- /* Release standard locks, including session-level if aborting */
- LockReleaseAll(DEFAULT_LOCKMETHOD, !isCommit);
- /* Release transaction-level advisory locks */
- LockReleaseAll(USER_LOCKMETHOD, false);
+
+ VirtualXactLockTableCleanup();
+
+ /* Release session-level locks if aborting */
+ if (!isCommit)
+ LockReleaseSession(DEFAULT_LOCKMETHOD);
+
+#ifdef USE_ASSERT_CHECKING
+ /* Ensure all locks were released */
+ LockAssertNoneHeld(isCommit);
+#endif
}
@@ -863,6 +870,8 @@ ProcKill(int code, Datum arg)
LWLockRelease(leader_lwlock);
}
+ Assert(MyProc->fpLockBits == 0);
+
/*
* Reset MyLatch to the process local one. This is so that signal
* handlers et al can continue using the latch after the shared latch
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 29f70accb2..a46ca739c8 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -1259,7 +1259,7 @@ ShutdownPostgres(int code, Datum arg)
* User locks are not released by transaction end, so be sure to release
* them explicitly.
*/
- LockReleaseAll(USER_LOCKMETHOD, true);
+ LockReleaseSession(USER_LOCKMETHOD);
}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index ceb4b0e3f7..f01f4d6223 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -33,6 +33,7 @@
#include "utils/resowner_private.h"
#include "utils/snapmgr.h"
+#include "lib/ilist.h"
/*
* All resource IDs managed by this code are required to fit into a Datum,
@@ -133,9 +134,7 @@ typedef struct ResourceOwnerData
ResourceArray cryptohasharr; /* cryptohash contexts */
ResourceArray hmacarr; /* HMAC contexts */
- /* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
- int nlocks; /* number of owned locks */
- LOCALLOCK *locks[MAX_RESOWNER_LOCKS]; /* list of owned locks */
+ dlist_head locks; /* dlist of owned locks */
} ResourceOwnerData;
@@ -452,6 +451,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
ResourceArrayInit(&(owner->jitarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->cryptohasharr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->hmacarr), PointerGetDatum(NULL));
+ dlist_init(&owner->locks);
return owner;
}
@@ -585,8 +585,15 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
+ dlist_mutable_iter iter;
+
if (isTopLevel)
{
+ dlist_foreach_modify(iter, &owner->locks)
+ LockReleaseCurrentOwner(owner, iter.cur);
+
+ Assert(dlist_is_empty(&owner->locks));
+
/*
* For a top-level xact we are going to release all locks (or at
* least all non-session locks), so just do a single lmgr call at
@@ -605,30 +612,20 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
* subtransaction, we do NOT release its locks yet, but transfer
* them to the parent.
*/
- LOCALLOCK **locks;
- int nlocks;
-
- Assert(owner->parent != NULL);
-
- /*
- * Pass the list of locks owned by this resource owner to the lock
- * manager, unless it has overflowed.
- */
- if (owner->nlocks > MAX_RESOWNER_LOCKS)
+ if (isCommit)
{
- locks = NULL;
- nlocks = 0;
+ dlist_foreach_modify(iter, &owner->locks)
+ LockReassignCurrentOwner(owner, iter.cur);
+
+ Assert(dlist_is_empty(&owner->locks));
}
else
{
- locks = owner->locks;
- nlocks = owner->nlocks;
- }
+ dlist_foreach_modify(iter, &owner->locks)
+ LockReleaseCurrentOwner(owner, iter.cur);
- if (isCommit)
- LockReassignCurrentOwner(locks, nlocks);
- else
- LockReleaseCurrentOwner(locks, nlocks);
+ Assert(dlist_is_empty(&owner->locks));
+ }
}
}
else if (phase == RESOURCE_RELEASE_AFTER_LOCKS)
@@ -752,7 +749,7 @@ ResourceOwnerDelete(ResourceOwner owner)
Assert(owner->jitarr.nitems == 0);
Assert(owner->cryptohasharr.nitems == 0);
Assert(owner->hmacarr.nitems == 0);
- Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
+ Assert(dlist_is_empty(&owner->locks));
/*
* Delete children. The recursive call will delink the child from me, so
@@ -973,54 +970,61 @@ ResourceOwnerForgetBuffer(ResourceOwner owner, Buffer buffer)
/*
* Remember that a Local Lock is owned by a ResourceOwner
- *
- * This is different from the other Remember functions in that the list of
- * locks is only a lossy cache. It can hold up to MAX_RESOWNER_LOCKS entries,
- * and when it overflows, we stop tracking locks. The point of only remembering
- * only up to MAX_RESOWNER_LOCKS entries is that if a lot of locks are held,
- * ResourceOwnerForgetLock doesn't need to scan through a large array to find
- * the entry.
*/
void
-ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCK *locallock)
+ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCKOWNER *locallockowner)
{
- Assert(locallock != NULL);
-
- if (owner->nlocks > MAX_RESOWNER_LOCKS)
- return; /* we have already overflowed */
+ Assert(owner != NULL);
+ Assert(locallockowner != NULL);
- if (owner->nlocks < MAX_RESOWNER_LOCKS)
- owner->locks[owner->nlocks] = locallock;
- else
+#ifdef USE_ASSERT_CHECKING
{
- /* overflowed */
+ dlist_iter iter;
+
+ dlist_foreach(iter, &owner->locks)
+ {
+ LOCALLOCKOWNER *i = dlist_container(LOCALLOCKOWNER, resowner_node, iter.cur);
+
+ Assert(i->locallock != locallockowner->locallock);
+ }
}
- owner->nlocks++;
+#endif
+
+ dlist_push_tail(&owner->locks, &locallockowner->resowner_node);
}
/*
- * Forget that a Local Lock is owned by a ResourceOwner
+ * Forget that a Local Lock is owned by the given LOCALLOCKOWNER.
*/
void
-ResourceOwnerForgetLock(ResourceOwner owner, LOCALLOCK *locallock)
+ResourceOwnerForgetLock(LOCALLOCKOWNER *locallockowner)
{
- int i;
+#ifdef USE_ASSERT_CHECKING
+ ResourceOwner owner;
+
+ Assert(locallockowner != NULL);
- if (owner->nlocks > MAX_RESOWNER_LOCKS)
- return; /* we have overflowed */
+ owner = locallockowner->owner;
- Assert(owner->nlocks > 0);
- for (i = owner->nlocks - 1; i >= 0; i--)
{
- if (locallock == owner->locks[i])
+ dlist_iter iter;
+ bool found = false;
+
+ dlist_foreach(iter, &owner->locks)
{
- owner->locks[i] = owner->locks[owner->nlocks - 1];
- owner->nlocks--;
- return;
+ LOCALLOCKOWNER *owner = dlist_container(LOCALLOCKOWNER, resowner_node, iter.cur);
+
+ if (locallockowner == owner)
+ {
+ Assert(!found);
+ found = true;
+ }
}
+
+ Assert(found);
}
- elog(ERROR, "lock reference %p is not owned by resource owner %s",
- locallock, owner->name);
+#endif
+ dlist_delete(&locallockowner->resowner_node);
}
/*
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index e4e1495b24..b2808f629f 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -23,6 +23,7 @@
#include "storage/lwlock.h"
#include "storage/shmem.h"
#include "utils/timestamp.h"
+#include "lib/ilist.h"
/* struct PGPROC is declared in proc.h, but must forward-reference it */
typedef struct PGPROC PGPROC;
@@ -341,10 +342,6 @@ typedef struct LOCK
* Otherwise, proclock objects whose holdMasks are zero are recycled
* as soon as convenient.
*
- * releaseMask is workspace for LockReleaseAll(): it shows the locks due
- * to be released during the current call. This must only be examined or
- * set by the backend owning the PROCLOCK.
- *
* Each PROCLOCK object is linked into lists for both the associated LOCK
* object and the owning PGPROC object. Note that the PROCLOCK is entered
* into these lists as soon as it is created, even if no lock has yet been
@@ -366,7 +363,6 @@ typedef struct PROCLOCK
/* data */
PGPROC *groupLeader; /* proc's lock group leader, or proc itself */
LOCKMASK holdMask; /* bitmask for lock types currently held */
- LOCKMASK releaseMask; /* bitmask for lock types to be released */
SHM_QUEUE lockLink; /* list link in LOCK's list of proclocks */
SHM_QUEUE procLink; /* list link in PGPROC's list of proclocks */
} PROCLOCK;
@@ -412,6 +408,13 @@ typedef struct LOCALLOCKOWNER
* Must use a forward struct reference to avoid circularity.
*/
struct ResourceOwnerData *owner;
+
+ dlist_node resowner_node;
+
+ dlist_node locallock_node;
+
+ struct LOCALLOCK *locallock;
+
int64 nLocks; /* # of times held by this owner */
} LOCALLOCKOWNER;
@@ -425,9 +428,9 @@ typedef struct LOCALLOCK
LOCK *lock; /* associated LOCK object, if any */
PROCLOCK *proclock; /* associated PROCLOCK object, if any */
int64 nLocks; /* total number of times lock is held */
- int numLockOwners; /* # of relevant ResourceOwners */
- int maxLockOwners; /* allocated size of array */
- LOCALLOCKOWNER *lockOwners; /* dynamically resizable array */
+
+ dlist_head locallockowners; /* dlist of LOCALLOCKOWNER */
+
bool holdsStrongLockCount; /* bumped FastPathStrongRelationLocks */
bool lockCleared; /* we read all sinval msgs for lock */
} LOCALLOCK;
@@ -556,10 +559,17 @@ extern void AbortStrongLockAcquire(void);
extern void MarkLockClear(LOCALLOCK *locallock);
extern bool LockRelease(const LOCKTAG *locktag,
LOCKMODE lockmode, bool sessionLock);
-extern void LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks);
+
+#ifdef USE_ASSERT_CHECKING
+extern void LockAssertNoneHeld(bool isCommit);
+#endif
+
extern void LockReleaseSession(LOCKMETHODID lockmethodid);
-extern void LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks);
-extern void LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks);
+struct ResourceOwnerData;
+extern void LockReleaseCurrentOwner(struct ResourceOwnerData *owner,
+ dlist_node *resowner_node);
+extern void LockReassignCurrentOwner(struct ResourceOwnerData *owner,
+ dlist_node *resowner_node);
extern bool LockHeldByMe(const LOCKTAG *locktag, LOCKMODE lockmode);
#ifdef USE_ASSERT_CHECKING
extern HTAB *GetLockMethodLocalHash(void);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index d01cccc27c..ab4796d287 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -31,8 +31,9 @@ extern void ResourceOwnerRememberBuffer(ResourceOwner owner, Buffer buffer);
extern void ResourceOwnerForgetBuffer(ResourceOwner owner, Buffer buffer);
/* support for local lock management */
-extern void ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCK *locallock);
-extern void ResourceOwnerForgetLock(ResourceOwner owner, LOCALLOCK *locallock);
+extern void ResourceOwnerRememberLock(ResourceOwner owner,
+ LOCALLOCKOWNER *locallock);
+extern void ResourceOwnerForgetLock(LOCALLOCKOWNER *locallock);
/* support for catcache refcount management */
extern void ResourceOwnerEnlargeCatCacheRefs(ResourceOwner owner);
diff --git a/src/tools/make_mkid b/src/tools/make_mkid
deleted file mode 100755
index 6f160614cd..0000000000
--- a/src/tools/make_mkid
+++ /dev/null
@@ -1,11 +0,0 @@
-#!/bin/sh
-
-# src/tools/make_mkid
-
-mkid `find \`pwd\`/ \( -name _deadcode -a -prune \) -o \
- -type f -name '*.[chyl]' -print|sed 's;//;/;g'`
-
-find . \( -name .git -a -prune \) -o -type d -print |while read DIR
-do
- [ "$DIR" != "." ] && ln -f -s `echo "$DIR" | sed 's;/[^/]*;/..;g'`/ID $DIR/ID
-done
Hi David,
This is a review of the "speed up releasing of locks" patch.
Contents & Purpose:
A subject/commit message is missing from the patch; it would have been easier to understand its purpose had one been included.
The patch includes a change to the README, but no new tests are included.
Initial Run:
The patch applies cleanly to HEAD. The regression tests all pass
successfully against the new patch.
Nitpicking & conclusion:
I don't see any performance improvement in tests. Lots of comments
were removed and not fully replaced. The change of log level for the
"ReleaseLockIfHeld: failed" message from WARNING to PANIC is a mystery.
The change in the README doesn't look right:
`Any subsequent lockers are share lockers wait
waiting for the VXID to terminate via some other method) is for deadlock`. This sentence could be rewritten.
Also, more comments could be added to explain the newly added functions.
Thanks,
Ankit
Thank you for looking at the patch.
On Fri, 4 Nov 2022 at 04:43, Ankit Kumar Pandey <itsankitkp@gmail.com> wrote:
I don't see any performance improvement in tests.
Are you able to share what your test was?
In order to see a performance improvement you're likely going to have
to obtain a large number of locks in the session so that the local
lock table becomes bloated, then continue to run some fast query and
observe that LockReleaseAll has become slower as a result of the hash
table becoming bloated. Try running pgbench with -M prepared, using a
SELECT on a hash partitioned table with a good number of partitions to
look up a single row. The reason this becomes slow is that the
planner will try a generic plan on the 6th execution which will lock
every partition and bloat the local lock table. From then on it will
use a custom plan which only locks a single leaf partition.
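Roughly, the difference can be sketched like this (not code from the patch;
the wrapper function is made up purely for illustration):

#include "postgres.h"
#include "lib/ilist.h"
#include "storage/lock.h"
#include "utils/hsearch.h"

/*
 * Unpatched, releasing locks means a hash_seq_search() over every bucket of
 * the local lock hash table, whose size never shrinks once it has been
 * bloated.  Patched, release only walks the dlist of locks the resource
 * owner actually holds.
 */
static void
release_cost_sketch(HTAB *local_lock_hash, dlist_head *owner_locks)
{
    HASH_SEQ_STATUS status;
    LOCALLOCK  *locallock;
    dlist_mutable_iter iter;

    hash_seq_init(&status, local_lock_hash);
    while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
    {
        (void) locallock;   /* cost proportional to the table's high-water mark */
    }

    dlist_foreach_modify(iter, owner_locks)
    {
        /* cost proportional to the number of locks actually held */
    }
}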
I just tried the following:
$ pgbench -i --partition-method=hash --partitions=1000 postgres
Master:
$ pgbench -T 60 -S -M prepared postgres | grep tps
tps = 21286.172326 (without initial connection time)
Patched:
$ pgbench -T 60 -S -M prepared postgres | grep tps
tps = 23034.063261 (without initial connection time)
If I try again with 10,000 partitions, I get:
Master:
$ pgbench -T 60 -S -M prepared postgres | grep tps
tps = 13044.290903 (without initial connection time)
Patched:
$ pgbench -T 60 -S -M prepared postgres | grep tps
tps = 22683.545686 (without initial connection time)
David
On Wed, 3 Aug 2022 at 09:04, David Rowley <dgrowleyml@gmail.com> wrote:
On Wed, 3 Aug 2022 at 07:04, Jacob Champion <jchampion@timescale.com> wrote:
This entry has been waiting on author input for a while (our current
threshold is roughly two weeks), so I've marked it Returned with
Feedback.
Thanks for taking care of this. You dealt with this correctly based on
the fact that I'd failed to rebase before or during the entire July
CF.
I'm still interested in having the LockReleaseAll slowness fixed, so
here's a rebased patch.
CFBot shows some compilation errors as in [1], please post an updated
version for the same:
[15:40:00.287] [1239/1809] Linking target src/backend/postgres
[15:40:00.287] FAILED: src/backend/postgres
[15:40:00.287] cc @src/backend/postgres.rsp
[15:40:00.287] /usr/bin/ld:
src/backend/postgres_lib.a.p/replication_logical_launcher.c.o: in
function `logicalrep_worker_onexit':
[15:40:00.287] /tmp/cirrus-ci-build/build/../src/backend/replication/logical/launcher.c:773:
undefined reference to `LockReleaseAll'
[15:40:00.287] collect2: error: ld returned 1 exit status
[1]: https://cirrus-ci.com/task/4562493863886848
Regards,
Vignesh
On Fri, 20 Jan 2023 at 00:26, vignesh C <vignesh21@gmail.com> wrote:
CFBot shows some compilation errors as in [1], please post an updated
version for the same:
I've attached a rebased patch.
While reading over this again, I wondered if instead of allocating the
memory for the LOCALLOCKOWNER in TopMemoryContext, maybe we should
create a Slab context as a child of TopMemoryContext and perform the
allocations there. I feel like slab might be a better option here as
it'll use slightly less memory due to it not rounding up allocations
to the next power of 2. sizeof(LOCALLOCKOWNER) == 56, so it's not a
great deal of memory, but more than nothing. The primary reason that I
think this might be a good idea is mostly around better handling of
chunk on block fragmentation in slab.c than aset.c. If we have
transactions which create a large number of locks then we may end up
growing the TopMemoryContext and never releasing the AllocBlocks and
just having a high number of 64-byte chunks left on the freelist
that'll maybe never be used again. I'm thinking slab.c might handle
that better as it'll only keep around 10 completely empty SlabBlocks
before it'll start free'ing them. The slab allocator is quite a bit
faster now as a result of d21ded75f.
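For illustration, here's a minimal sketch of what I mean (the lazily-created
context and the alloc_locallockowner() helper are placeholders I've made up
for this sketch; they're not part of the attached patch):

static MemoryContext LockOwnerContext = NULL;

static LOCALLOCKOWNER *
alloc_locallockowner(void)
{
	/* Create the slab context on first use, parented to TopMemoryContext */
	if (LockOwnerContext == NULL)
		LockOwnerContext = SlabContextCreate(TopMemoryContext,
											 "LOCALLOCKOWNER context",
											 SLAB_DEFAULT_BLOCK_SIZE,
											 sizeof(LOCALLOCKOWNER));

	/* Fixed-size chunks, so slab.c never rounds the request up */
	return (LOCALLOCKOWNER *) MemoryContextAlloc(LockOwnerContext,
												 sizeof(LOCALLOCKOWNER));
}

GrantLockLocal() would then call that instead of doing
MemoryContextAlloc(TopMemoryContext, ...), and the existing pfree() of the
entries would work unchanged.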
I would like to get this LockReleaseAll problem finally fixed in PG16,
but I'd feel much better about this patch if it had some review from
someone who has more in-depth knowledge of the locking code.
I've also gone and adjusted all the places that upgraded the
elog(WARNING)s of locallock table corruption to PANIC and put them back to
use WARNING again. While I think it might be a good idea to do that,
it seems to be adding a bit more resistance to this patch which I
don't think it really needs. Maybe we can consider that in a separate
effort.
David
Attachments:
lockreleaseall_speedup5.patchtext/plain; charset=US-ASCII; name=lockreleaseall_speedup5.patchDownload
diff --git a/src/backend/commands/discard.c b/src/backend/commands/discard.c
index 296dc82d2e..edb8b6026e 100644
--- a/src/backend/commands/discard.c
+++ b/src/backend/commands/discard.c
@@ -71,7 +71,7 @@ DiscardAll(bool isTopLevel)
ResetAllOptions();
DropAllPreparedStatements();
Async_UnlistenAll();
- LockReleaseAll(USER_LOCKMETHOD, true);
+ LockReleaseSession(USER_LOCKMETHOD);
ResetPlanCache();
ResetTempTableNamespace();
ResetSequenceCaches();
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 564bffe5ca..20b2e3497e 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -798,7 +798,7 @@ logicalrep_worker_onexit(int code, Datum arg)
* parallel apply mode and will not be released when the worker
* terminates, so manually release all locks before the worker exits.
*/
- LockReleaseAll(DEFAULT_LOCKMETHOD, true);
+ LockReleaseSession(DEFAULT_LOCKMETHOD);
ApplyLauncherWakeup();
}
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index d08ec6c402..9603cc8959 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -182,12 +182,6 @@ holdMask -
subset of the PGPROC object's heldLocks mask (if the PGPROC is
currently waiting for another lock mode on this lock).
-releaseMask -
- A bitmask for the lock modes due to be released during LockReleaseAll.
- This must be a subset of the holdMask. Note that it is modified without
- taking the partition LWLock, and therefore it is unsafe for any
- backend except the one owning the PROCLOCK to examine/change it.
-
lockLink -
List link for shared memory queue of all the PROCLOCK objects for the
same LOCK.
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 49d62a0dc7..9d9d27e0c9 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -22,8 +22,7 @@
* Interface:
*
* InitLocks(), GetLocksMethodTable(), GetLockTagsMethodTable(),
- * LockAcquire(), LockRelease(), LockReleaseAll(),
- * LockCheckConflicts(), GrantLock()
+ * LockAcquire(), LockRelease(), LockCheckConflicts(), GrantLock()
*
*-------------------------------------------------------------------------
*/
@@ -289,6 +288,9 @@ static LOCALLOCK *awaitedLock;
static ResourceOwner awaitedOwner;
+static dlist_head session_locks[lengthof(LockMethods)];
+
+
#ifdef LOCK_DEBUG
/*------
@@ -375,8 +377,8 @@ static void GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner);
static void BeginStrongLockAcquire(LOCALLOCK *locallock, uint32 fasthashcode);
static void FinishStrongLockAcquire(void);
static void WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner);
-static void ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock);
-static void LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent);
+static void ReleaseLockIfHeld(LOCALLOCKOWNER *locallockowner, bool sessionLock);
+static void LockReassignOwner(LOCALLOCKOWNER *locallockowner, ResourceOwner parent);
static bool UnGrantLock(LOCK *lock, LOCKMODE lockmode,
PROCLOCK *proclock, LockMethod lockMethodTable);
static void CleanUpLock(LOCK *lock, PROCLOCK *proclock,
@@ -477,6 +479,10 @@ InitLocks(void)
16,
&info,
HASH_ELEM | HASH_BLOBS);
+
+ /* Initialize each element of the session_locks array */
+ for (int i = 0; i < lengthof(LockMethods); i++)
+ dlist_init(&session_locks[i]);
}
@@ -839,26 +845,9 @@ LockAcquireExtended(const LOCKTAG *locktag,
locallock->nLocks = 0;
locallock->holdsStrongLockCount = false;
locallock->lockCleared = false;
- locallock->numLockOwners = 0;
- locallock->maxLockOwners = 8;
- locallock->lockOwners = NULL; /* in case next line fails */
- locallock->lockOwners = (LOCALLOCKOWNER *)
- MemoryContextAlloc(TopMemoryContext,
- locallock->maxLockOwners * sizeof(LOCALLOCKOWNER));
+ dlist_init(&locallock->locallockowners);
}
- else
- {
- /* Make sure there will be room to remember the lock */
- if (locallock->numLockOwners >= locallock->maxLockOwners)
- {
- int newsize = locallock->maxLockOwners * 2;
- locallock->lockOwners = (LOCALLOCKOWNER *)
- repalloc(locallock->lockOwners,
- newsize * sizeof(LOCALLOCKOWNER));
- locallock->maxLockOwners = newsize;
- }
- }
hashcode = locallock->hashcode;
if (locallockp)
@@ -1268,7 +1257,6 @@ SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
proclock->groupLeader = proc->lockGroupLeader != NULL ?
proc->lockGroupLeader : proc;
proclock->holdMask = 0;
- proclock->releaseMask = 0;
/* Add proclock to appropriate lists */
dlist_push_tail(&lock->procLocks, &proclock->lockLink);
dlist_push_tail(&proc->myProcLocks[partition], &proclock->procLink);
@@ -1365,17 +1353,18 @@ CheckAndSetLockHeld(LOCALLOCK *locallock, bool acquired)
static void
RemoveLocalLock(LOCALLOCK *locallock)
{
- int i;
+ dlist_mutable_iter iter;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach_modify(iter, &locallock->locallockowners)
{
- if (locallock->lockOwners[i].owner != NULL)
- ResourceOwnerForgetLock(locallock->lockOwners[i].owner, locallock);
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, locallock_node, iter.cur);
+
+ Assert(locallockowner->owner != NULL);
+ dlist_delete(&locallockowner->locallock_node);
+ ResourceOwnerForgetLock(locallockowner);
}
- locallock->numLockOwners = 0;
- if (locallock->lockOwners != NULL)
- pfree(locallock->lockOwners);
- locallock->lockOwners = NULL;
+
+ Assert(dlist_is_empty(&locallock->locallockowners));
if (locallock->holdsStrongLockCount)
{
@@ -1683,26 +1672,38 @@ CleanUpLock(LOCK *lock, PROCLOCK *proclock,
static void
GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner)
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
- int i;
+ LOCALLOCKOWNER *locallockowner;
+ dlist_iter iter;
- Assert(locallock->numLockOwners < locallock->maxLockOwners);
/* Count the total */
locallock->nLocks++;
/* Count the per-owner lock */
- for (i = 0; i < locallock->numLockOwners; i++)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == owner)
+ locallockowner = dlist_container(LOCALLOCKOWNER, locallock_node, iter.cur);
+
+ if (locallockowner->owner == owner)
{
- lockOwners[i].nLocks++;
+ locallockowner->nLocks++;
return;
}
}
- lockOwners[i].owner = owner;
- lockOwners[i].nLocks = 1;
- locallock->numLockOwners++;
+ locallockowner = MemoryContextAlloc(TopMemoryContext, sizeof(LOCALLOCKOWNER));
+ locallockowner->owner = owner;
+ locallockowner->nLocks = 1;
+ locallockowner->locallock = locallock;
+
+ dlist_push_tail(&locallock->locallockowners, &locallockowner->locallock_node);
+
if (owner != NULL)
- ResourceOwnerRememberLock(owner, locallock);
+ ResourceOwnerRememberLock(owner, locallockowner);
+ else
+ {
+ LOCKMETHODID lockmethodid = LOCALLOCK_LOCKMETHOD(*locallockowner->locallock);
+
+ Assert(lockmethodid > 0 && lockmethodid <= 2);
+ dlist_push_tail(&session_locks[lockmethodid - 1], &locallockowner->resowner_node);
+ }
/* Indicate that the lock is acquired for certain types of locks. */
CheckAndSetLockHeld(locallock, true);
@@ -2015,9 +2016,9 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
* Decrease the count for the resource owner.
*/
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
ResourceOwner owner;
- int i;
+ dlist_iter iter;
+ bool found = false;
/* Identify owner for lock */
if (sessionLock)
@@ -2025,24 +2026,29 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
else
owner = CurrentResourceOwner;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == owner)
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, locallock_node, iter.cur);
+
+ if (locallockowner->owner != owner)
+ continue;
+
+ found = true;
+
+ if (--locallockowner->nLocks == 0)
{
- Assert(lockOwners[i].nLocks > 0);
- if (--lockOwners[i].nLocks == 0)
- {
- if (owner != NULL)
- ResourceOwnerForgetLock(owner, locallock);
- /* compact out unused slot */
- locallock->numLockOwners--;
- if (i < locallock->numLockOwners)
- lockOwners[i] = lockOwners[locallock->numLockOwners];
- }
- break;
+ dlist_delete(&locallockowner->locallock_node);
+
+ if (owner != NULL)
+ ResourceOwnerForgetLock(locallockowner);
+ else
+ dlist_delete(&locallockowner->resowner_node);
}
+
+ Assert(locallockowner->nLocks >= 0);
}
- if (i < 0)
+
+ if (!found)
{
/* don't release a lock belonging to another owner */
elog(WARNING, "you don't own a lock of type %s",
@@ -2060,6 +2066,8 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
if (locallock->nLocks > 0)
return true;
+ Assert(locallock->nLocks >= 0);
+
/*
* At this point we can no longer suppose we are clear of invalidation
* messages related to this lock. Although we'll delete the LOCALLOCK
@@ -2162,274 +2170,44 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
return true;
}
+#ifdef USE_ASSERT_CHECKING
/*
- * LockReleaseAll -- Release all locks of the specified lock method that
- * are held by the current process.
- *
- * Well, not necessarily *all* locks. The available behaviors are:
- * allLocks == true: release all locks including session locks.
- * allLocks == false: release all non-session locks.
+ * LockAssertNoneHeld -- Assert that we no longer hold any DEFAULT_LOCKMETHOD
+ * locks during an abort.
*/
-void
-LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
+extern void
+LockAssertNoneHeld(bool isCommit)
{
HASH_SEQ_STATUS status;
- LockMethod lockMethodTable;
- int i,
- numLockModes;
LOCALLOCK *locallock;
- LOCK *lock;
- int partition;
- bool have_fast_path_lwlock = false;
-
- if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
- elog(ERROR, "unrecognized lock method: %d", lockmethodid);
- lockMethodTable = LockMethods[lockmethodid];
-
-#ifdef LOCK_DEBUG
- if (*(lockMethodTable->trace_flag))
- elog(LOG, "LockReleaseAll: lockmethod=%d", lockmethodid);
-#endif
-
- /*
- * Get rid of our fast-path VXID lock, if appropriate. Note that this is
- * the only way that the lock we hold on our own VXID can ever get
- * released: it is always and only released when a toplevel transaction
- * ends.
- */
- if (lockmethodid == DEFAULT_LOCKMETHOD)
- VirtualXactLockTableCleanup();
- numLockModes = lockMethodTable->numLockModes;
-
- /*
- * First we run through the locallock table and get rid of unwanted
- * entries, then we scan the process's proclocks and get rid of those. We
- * do this separately because we may have multiple locallock entries
- * pointing to the same proclock, and we daren't end up with any dangling
- * pointers. Fast-path locks are cleaned up during the locallock table
- * scan, though.
- */
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ if (!isCommit)
{
- /*
- * If the LOCALLOCK entry is unused, we must've run out of shared
- * memory while trying to set up this lock. Just forget the local
- * entry.
- */
- if (locallock->nLocks == 0)
- {
- RemoveLocalLock(locallock);
- continue;
- }
-
- /* Ignore items that are not of the lockmethod to be removed */
- if (LOCALLOCK_LOCKMETHOD(*locallock) != lockmethodid)
- continue;
+ hash_seq_init(&status, LockMethodLocalHash);
- /*
- * If we are asked to release all locks, we can just zap the entry.
- * Otherwise, must scan to see if there are session locks. We assume
- * there is at most one lockOwners entry for session locks.
- */
- if (!allLocks)
+ while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
+ dlist_iter local_iter;
- /* If session lock is above array position 0, move it down to 0 */
- for (i = 0; i < locallock->numLockOwners; i++)
- {
- if (lockOwners[i].owner == NULL)
- lockOwners[0] = lockOwners[i];
- else
- ResourceOwnerForgetLock(lockOwners[i].owner, locallock);
- }
+ Assert(locallock->nLocks >= 0);
- if (locallock->numLockOwners > 0 &&
- lockOwners[0].owner == NULL &&
- lockOwners[0].nLocks > 0)
+ dlist_foreach(local_iter, &locallock->locallockowners)
{
- /* Fix the locallock to show just the session locks */
- locallock->nLocks = lockOwners[0].nLocks;
- locallock->numLockOwners = 1;
- /* We aren't deleting this locallock, so done */
- continue;
- }
- else
- locallock->numLockOwners = 0;
- }
-
- /*
- * If the lock or proclock pointers are NULL, this lock was taken via
- * the relation fast-path (and is not known to have been transferred).
- */
- if (locallock->proclock == NULL || locallock->lock == NULL)
- {
- LOCKMODE lockmode = locallock->tag.mode;
- Oid relid;
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ locallock_node,
+ local_iter.cur);
- /* Verify that a fast-path lock is what we've got. */
- if (!EligibleForRelationFastPath(&locallock->tag.lock, lockmode))
- elog(PANIC, "locallock table corrupted");
+ Assert(locallockowner->owner == NULL);
- /*
- * If we don't currently hold the LWLock that protects our
- * fast-path data structures, we must acquire it before attempting
- * to release the lock via the fast-path. We will continue to
- * hold the LWLock until we're done scanning the locallock table,
- * unless we hit a transferred fast-path lock. (XXX is this
- * really such a good idea? There could be a lot of entries ...)
- */
- if (!have_fast_path_lwlock)
- {
- LWLockAcquire(&MyProc->fpInfoLock, LW_EXCLUSIVE);
- have_fast_path_lwlock = true;
- }
-
- /* Attempt fast-path release. */
- relid = locallock->tag.lock.locktag_field2;
- if (FastPathUnGrantRelationLock(relid, lockmode))
- {
- RemoveLocalLock(locallock);
- continue;
+ if (locallockowner->nLocks > 0 &&
+ LOCALLOCK_LOCKMETHOD(*locallock) == DEFAULT_LOCKMETHOD)
+ Assert(false);
}
-
- /*
- * Our lock, originally taken via the fast path, has been
- * transferred to the main lock table. That's going to require
- * some extra work, so release our fast-path lock before starting.
- */
- LWLockRelease(&MyProc->fpInfoLock);
- have_fast_path_lwlock = false;
-
- /*
- * Now dump the lock. We haven't got a pointer to the LOCK or
- * PROCLOCK in this case, so we have to handle this a bit
- * differently than a normal lock release. Unfortunately, this
- * requires an extra LWLock acquire-and-release cycle on the
- * partitionLock, but hopefully it shouldn't happen often.
- */
- LockRefindAndRelease(lockMethodTable, MyProc,
- &locallock->tag.lock, lockmode, false);
- RemoveLocalLock(locallock);
- continue;
}
-
- /* Mark the proclock to show we need to release this lockmode */
- if (locallock->nLocks > 0)
- locallock->proclock->releaseMask |= LOCKBIT_ON(locallock->tag.mode);
-
- /* And remove the locallock hashtable entry */
- RemoveLocalLock(locallock);
}
-
- /* Done with the fast-path data structures */
- if (have_fast_path_lwlock)
- LWLockRelease(&MyProc->fpInfoLock);
-
- /*
- * Now, scan each lock partition separately.
- */
- for (partition = 0; partition < NUM_LOCK_PARTITIONS; partition++)
- {
- LWLock *partitionLock;
- dlist_head *procLocks = &MyProc->myProcLocks[partition];
- dlist_mutable_iter proclock_iter;
-
- partitionLock = LockHashPartitionLockByIndex(partition);
-
- /*
- * If the proclock list for this partition is empty, we can skip
- * acquiring the partition lock. This optimization is trickier than
- * it looks, because another backend could be in process of adding
- * something to our proclock list due to promoting one of our
- * fast-path locks. However, any such lock must be one that we
- * decided not to delete above, so it's okay to skip it again now;
- * we'd just decide not to delete it again. We must, however, be
- * careful to re-fetch the list header once we've acquired the
- * partition lock, to be sure we have a valid, up-to-date pointer.
- * (There is probably no significant risk if pointer fetch/store is
- * atomic, but we don't wish to assume that.)
- *
- * XXX This argument assumes that the locallock table correctly
- * represents all of our fast-path locks. While allLocks mode
- * guarantees to clean up all of our normal locks regardless of the
- * locallock situation, we lose that guarantee for fast-path locks.
- * This is not ideal.
- */
- if (dlist_is_empty(procLocks))
- continue; /* needn't examine this partition */
-
- LWLockAcquire(partitionLock, LW_EXCLUSIVE);
-
- dlist_foreach_modify(proclock_iter, procLocks)
- {
- PROCLOCK *proclock = dlist_container(PROCLOCK, procLink, proclock_iter.cur);
- bool wakeupNeeded = false;
-
- Assert(proclock->tag.myProc == MyProc);
-
- lock = proclock->tag.myLock;
-
- /* Ignore items that are not of the lockmethod to be removed */
- if (LOCK_LOCKMETHOD(*lock) != lockmethodid)
- continue;
-
- /*
- * In allLocks mode, force release of all locks even if locallock
- * table had problems
- */
- if (allLocks)
- proclock->releaseMask = proclock->holdMask;
- else
- Assert((proclock->releaseMask & ~proclock->holdMask) == 0);
-
- /*
- * Ignore items that have nothing to be released, unless they have
- * holdMask == 0 and are therefore recyclable
- */
- if (proclock->releaseMask == 0 && proclock->holdMask != 0)
- continue;
-
- PROCLOCK_PRINT("LockReleaseAll", proclock);
- LOCK_PRINT("LockReleaseAll", lock, 0);
- Assert(lock->nRequested >= 0);
- Assert(lock->nGranted >= 0);
- Assert(lock->nGranted <= lock->nRequested);
- Assert((proclock->holdMask & ~lock->grantMask) == 0);
-
- /*
- * Release the previously-marked lock modes
- */
- for (i = 1; i <= numLockModes; i++)
- {
- if (proclock->releaseMask & LOCKBIT_ON(i))
- wakeupNeeded |= UnGrantLock(lock, i, proclock,
- lockMethodTable);
- }
- Assert((lock->nRequested >= 0) && (lock->nGranted >= 0));
- Assert(lock->nGranted <= lock->nRequested);
- LOCK_PRINT("LockReleaseAll: updated", lock, 0);
-
- proclock->releaseMask = 0;
-
- /* CleanUpLock will wake up waiters if needed. */
- CleanUpLock(lock, proclock,
- lockMethodTable,
- LockTagHashCode(&lock->tag),
- wakeupNeeded);
- } /* loop over PROCLOCKs within this partition */
-
- LWLockRelease(partitionLock);
- } /* loop over partitions */
-
-#ifdef LOCK_DEBUG
- if (*(lockMethodTable->trace_flag))
- elog(LOG, "LockReleaseAll done");
-#endif
+ Assert(MyProc->fpLockBits == 0);
}
+#endif
/*
* LockReleaseSession -- Release all session locks of the specified lock method
@@ -2438,59 +2216,41 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
void
LockReleaseSession(LOCKMETHODID lockmethodid)
{
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
+ dlist_mutable_iter iter;
if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
elog(ERROR, "unrecognized lock method: %d", lockmethodid);
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &session_locks[lockmethodid - 1])
{
- /* Ignore items that are not of the specified lock method */
- if (LOCALLOCK_LOCKMETHOD(*locallock) != lockmethodid)
- continue;
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, resowner_node, iter.cur);
- ReleaseLockIfHeld(locallock, true);
+ Assert(LOCALLOCK_LOCKMETHOD(*locallockowner->locallock) == lockmethodid);
+
+ ReleaseLockIfHeld(locallockowner, true);
}
+
+ Assert(dlist_is_empty(&session_locks[lockmethodid - 1]));
}
/*
* LockReleaseCurrentOwner
- * Release all locks belonging to CurrentResourceOwner
- *
- * If the caller knows what those locks are, it can pass them as an array.
- * That speeds up the call significantly, when a lot of locks are held.
- * Otherwise, pass NULL for locallocks, and we'll traverse through our hash
- * table to find them.
+ * Release all locks belonging to 'owner'
*/
void
-LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks)
+LockReleaseCurrentOwner(ResourceOwner owner, dlist_node *resowner_node)
{
- if (locallocks == NULL)
- {
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
-
- hash_seq_init(&status, LockMethodLocalHash);
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, resowner_node, resowner_node);
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
- ReleaseLockIfHeld(locallock, false);
- }
- else
- {
- int i;
+ Assert(locallockowner->owner == owner);
- for (i = nlocks - 1; i >= 0; i--)
- ReleaseLockIfHeld(locallocks[i], false);
- }
+ ReleaseLockIfHeld(locallockowner, false);
}
/*
* ReleaseLockIfHeld
- * Release any session-level locks on this lockable object if sessionLock
- * is true; else, release any locks held by CurrentResourceOwner.
+ * Release any session-level locks on this 'locallockowner' if
+ * sessionLock is true; else, release any locks held by 'locallockowner'.
*
* It is tempting to pass this a ResourceOwner pointer (or NULL for session
* locks), but without refactoring LockRelease() we cannot support releasing
@@ -2501,52 +2261,39 @@ LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks)
* convenience.
*/
static void
-ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock)
+ReleaseLockIfHeld(LOCALLOCKOWNER *locallockowner, bool sessionLock)
{
- ResourceOwner owner;
- LOCALLOCKOWNER *lockOwners;
- int i;
+ LOCALLOCK *locallock = locallockowner->locallock;
+
+ /* release all references to the lock by this resource owner */
- /* Identify owner for lock (must match LockRelease!) */
if (sessionLock)
- owner = NULL;
+ Assert(locallockowner->owner == NULL);
else
- owner = CurrentResourceOwner;
+ Assert(locallockowner->owner != NULL);
- /* Scan to see if there are any locks belonging to the target owner */
- lockOwners = locallock->lockOwners;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ /* We will still hold this lock after forgetting this ResourceOwner. */
+ if (locallockowner->nLocks < locallock->nLocks)
{
- if (lockOwners[i].owner == owner)
- {
- Assert(lockOwners[i].nLocks > 0);
- if (lockOwners[i].nLocks < locallock->nLocks)
- {
- /*
- * We will still hold this lock after forgetting this
- * ResourceOwner.
- */
- locallock->nLocks -= lockOwners[i].nLocks;
- /* compact out unused slot */
- locallock->numLockOwners--;
- if (owner != NULL)
- ResourceOwnerForgetLock(owner, locallock);
- if (i < locallock->numLockOwners)
- lockOwners[i] = lockOwners[locallock->numLockOwners];
- }
- else
- {
- Assert(lockOwners[i].nLocks == locallock->nLocks);
- /* We want to call LockRelease just once */
- lockOwners[i].nLocks = 1;
- locallock->nLocks = 1;
- if (!LockRelease(&locallock->tag.lock,
- locallock->tag.mode,
- sessionLock))
- elog(WARNING, "ReleaseLockIfHeld: failed??");
- }
- break;
- }
+ locallock->nLocks -= locallockowner->nLocks;
+ dlist_delete(&locallockowner->locallock_node);
+
+ if (sessionLock)
+ dlist_delete(&locallockowner->resowner_node);
+ else
+ ResourceOwnerForgetLock(locallockowner);
+ }
+ else
+ {
+ Assert(locallockowner->nLocks == locallock->nLocks);
+ /* We want to call LockRelease just once */
+ locallockowner->nLocks = 1;
+ locallock->nLocks = 1;
+
+ if (!LockRelease(&locallock->tag.lock,
+ locallock->tag.mode,
+ sessionLock))
+ elog(WARNING, "ReleaseLockIfHeld: failed??");
}
}
@@ -2561,75 +2308,48 @@ ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock)
* and we'll traverse through our hash table to find them.
*/
void
-LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks)
+LockReassignCurrentOwner(ResourceOwner owner, dlist_node *resowner_node)
{
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ resowner_node,
+ resowner_node);
ResourceOwner parent = ResourceOwnerGetParent(CurrentResourceOwner);
- Assert(parent != NULL);
-
- if (locallocks == NULL)
- {
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
-
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
- LockReassignOwner(locallock, parent);
- }
- else
- {
- int i;
-
- for (i = nlocks - 1; i >= 0; i--)
- LockReassignOwner(locallocks[i], parent);
- }
+ LockReassignOwner(locallockowner, parent);
}
/*
- * Subroutine of LockReassignCurrentOwner. Reassigns a given lock belonging to
- * CurrentResourceOwner to its parent.
+ * Subroutine of LockReassignCurrentOwner. Reassigns the given
+ *'locallockowner' to 'parent'.
*/
static void
-LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent)
+LockReassignOwner(LOCALLOCKOWNER *locallockowner, ResourceOwner parent)
{
- LOCALLOCKOWNER *lockOwners;
- int i;
- int ic = -1;
- int ip = -1;
+ dlist_iter iter;
+ LOCALLOCK *locallock = locallockowner->locallock;
- /*
- * Scan to see if there are any locks belonging to current owner or its
- * parent
- */
- lockOwners = locallock->lockOwners;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ ResourceOwnerForgetLock(locallockowner);
+
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == CurrentResourceOwner)
- ic = i;
- else if (lockOwners[i].owner == parent)
- ip = i;
- }
+ LOCALLOCKOWNER *parentlocalowner = dlist_container(LOCALLOCKOWNER, locallock_node, iter.cur);
- if (ic < 0)
- return; /* no current locks */
+ Assert(parentlocalowner->locallock == locallock);
- if (ip < 0)
- {
- /* Parent has no slot, so just give it the child's slot */
- lockOwners[ic].owner = parent;
- ResourceOwnerRememberLock(parent, locallock);
- }
- else
- {
- /* Merge child's count with parent's */
- lockOwners[ip].nLocks += lockOwners[ic].nLocks;
- /* compact out unused slot */
- locallock->numLockOwners--;
- if (ic < locallock->numLockOwners)
- lockOwners[ic] = lockOwners[locallock->numLockOwners];
+ if (parentlocalowner->owner != parent)
+ continue;
+
+ parentlocalowner->nLocks += locallockowner->nLocks;
+
+ locallockowner->nLocks = 0;
+ dlist_delete(&locallockowner->locallock_node);
+ pfree(locallockowner);
+ return;
}
- ResourceOwnerForgetLock(CurrentResourceOwner, locallock);
+
+ /* reassign locallockowner to parent resowner */
+ locallockowner->owner = parent;
+ ResourceOwnerRememberLock(parent, locallockowner);
}
/*
@@ -3101,7 +2821,7 @@ GetLockConflicts(const LOCKTAG *locktag, LOCKMODE lockmode, int *countp)
* We currently use this in two situations: first, to release locks held by
* prepared transactions on commit (see lock_twophase_postcommit); and second,
* to release locks taken via the fast-path, transferred to the main hash
- * table, and then released (see LockReleaseAll).
+ * table, and then released (see ResourceOwnerRelease).
*/
static void
LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
@@ -3237,10 +2957,9 @@ CheckForSessionAndXactLocks(void)
while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
PerLockTagEntry *hentry;
bool found;
- int i;
+ dlist_iter iter;
/*
* Ignore VXID locks. We don't want those to be held by prepared
@@ -3261,9 +2980,13 @@ CheckForSessionAndXactLocks(void)
hentry->sessLock = hentry->xactLock = false;
/* Scan to see if we hold lock at session or xact level or both */
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == NULL)
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ locallock_node,
+ iter.cur);
+
+ if (locallockowner->owner == NULL)
hentry->sessLock = true;
else
hentry->xactLock = true;
@@ -3310,10 +3033,9 @@ AtPrepare_Locks(void)
while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
{
TwoPhaseLockRecord record;
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
bool haveSessionLock;
bool haveXactLock;
- int i;
+ dlist_iter iter;
/*
* Ignore VXID locks. We don't want those to be held by prepared
@@ -3328,9 +3050,13 @@ AtPrepare_Locks(void)
/* Scan to see whether we hold it at session or transaction level */
haveSessionLock = haveXactLock = false;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == NULL)
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ locallock_node,
+ iter.cur);
+
+ if (locallockowner->owner == NULL)
haveSessionLock = true;
else
haveXactLock = true;
@@ -3388,8 +3114,8 @@ AtPrepare_Locks(void)
* pointers in the transaction's resource owner. This is OK at the
* moment since resowner.c doesn't try to free locks retail at a toplevel
* transaction commit or abort. We could alternatively zero out nLocks
- * and leave the LOCALLOCK entries to be garbage-collected by LockReleaseAll,
- * but that probably costs more cycles.
+ * and leave the LOCALLOCK entries to be garbage-collected by
+ * ResourceOwnerRelease, but that probably costs more cycles.
*/
void
PostPrepare_Locks(TransactionId xid)
@@ -3422,10 +3148,9 @@ PostPrepare_Locks(TransactionId xid)
while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
bool haveSessionLock;
bool haveXactLock;
- int i;
+ dlist_iter iter;
if (locallock->proclock == NULL || locallock->lock == NULL)
{
@@ -3444,9 +3169,13 @@ PostPrepare_Locks(TransactionId xid)
/* Scan to see whether we hold it at session or transaction level */
haveSessionLock = haveXactLock = false;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == NULL)
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ locallock_node,
+ iter.cur);
+
+ if (locallockowner->owner == NULL)
haveSessionLock = true;
else
haveXactLock = true;
@@ -3462,10 +3191,6 @@ PostPrepare_Locks(TransactionId xid)
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot PREPARE while holding both session-level and transaction-level locks on the same object")));
- /* Mark the proclock to show we need to release this lockmode */
- if (locallock->nLocks > 0)
- locallock->proclock->releaseMask |= LOCKBIT_ON(locallock->tag.mode);
-
/* And remove the locallock hashtable entry */
RemoveLocalLock(locallock);
}
@@ -3483,11 +3208,7 @@ PostPrepare_Locks(TransactionId xid)
/*
* If the proclock list for this partition is empty, we can skip
- * acquiring the partition lock. This optimization is safer than the
- * situation in LockReleaseAll, because we got rid of any fast-path
- * locks during AtPrepare_Locks, so there cannot be any case where
- * another backend is adding something to our lists now. For safety,
- * though, we code this the same way as in LockReleaseAll.
+ * acquiring the partition lock.
*/
if (dlist_is_empty(procLocks))
continue; /* needn't examine this partition */
@@ -3513,14 +3234,6 @@ PostPrepare_Locks(TransactionId xid)
Assert(lock->nGranted <= lock->nRequested);
Assert((proclock->holdMask & ~lock->grantMask) == 0);
- /* Ignore it if nothing to release (must be a session lock) */
- if (proclock->releaseMask == 0)
- continue;
-
- /* Else we should be releasing all locks */
- if (proclock->releaseMask != proclock->holdMask)
- elog(PANIC, "we seem to have dropped a bit somewhere");
-
/*
* We cannot simply modify proclock->tag.myProc to reassign
* ownership of the lock, because that's part of the hash key and
@@ -4288,7 +4001,6 @@ lock_twophase_recover(TransactionId xid, uint16 info,
Assert(proc->lockGroupLeader == NULL);
proclock->groupLeader = proc;
proclock->holdMask = 0;
- proclock->releaseMask = 0;
/* Add proclock to appropriate lists */
dlist_push_tail(&lock->procLocks, &proclock->lockLink);
dlist_push_tail(&proc->myProcLocks[partition],
@@ -4425,7 +4137,7 @@ lock_twophase_postabort(TransactionId xid, uint16 info,
*
* We don't bother recording this lock in the local lock table, since it's
* only ever released at the end of a transaction. Instead,
- * LockReleaseAll() calls VirtualXactLockTableCleanup().
+ * ProcReleaseLocks() calls VirtualXactLockTableCleanup().
*/
void
VirtualXactLockTableInsert(VirtualTransactionId vxid)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 22b4278610..1addef790a 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -777,10 +777,17 @@ ProcReleaseLocks(bool isCommit)
return;
/* If waiting, get off wait queue (should only be needed after error) */
LockErrorCleanup();
- /* Release standard locks, including session-level if aborting */
- LockReleaseAll(DEFAULT_LOCKMETHOD, !isCommit);
- /* Release transaction-level advisory locks */
- LockReleaseAll(USER_LOCKMETHOD, false);
+
+ VirtualXactLockTableCleanup();
+
+ /* Release session-level locks if aborting */
+ if (!isCommit)
+ LockReleaseSession(DEFAULT_LOCKMETHOD);
+
+#ifdef USE_ASSERT_CHECKING
+ /* Ensure all locks were released */
+ LockAssertNoneHeld(isCommit);
+#endif
}
@@ -861,6 +868,8 @@ ProcKill(int code, Datum arg)
LWLockRelease(leader_lwlock);
}
+ Assert(MyProc->fpLockBits == 0);
+
/*
* Reset MyLatch to the process local one. This is so that signal
* handlers et al can continue using the latch after the shared latch
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 2f07ca7a0e..0547e3d076 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -1325,10 +1325,10 @@ ShutdownPostgres(int code, Datum arg)
AbortOutOfAnyTransaction();
/*
- * User locks are not released by transaction end, so be sure to release
- * them explicitly.
+ * Session locks are not released by transaction end, so be sure to
+ * release them explicitly.
*/
- LockReleaseAll(USER_LOCKMETHOD, true);
+ LockReleaseSession(USER_LOCKMETHOD);
}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 19b6241e45..ecd2312e27 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -33,6 +33,7 @@
#include "utils/resowner_private.h"
#include "utils/snapmgr.h"
+#include "lib/ilist.h"
/*
* All resource IDs managed by this code are required to fit into a Datum,
@@ -91,24 +92,6 @@ typedef struct ResourceArray
#define RESARRAY_MAX_ITEMS(capacity) \
((capacity) <= RESARRAY_MAX_ARRAY ? (capacity) : (capacity)/4 * 3)
-/*
- * To speed up bulk releasing or reassigning locks from a resource owner to
- * its parent, each resource owner has a small cache of locks it owns. The
- * lock manager has the same information in its local lock hash table, and
- * we fall back on that if cache overflows, but traversing the hash table
- * is slower when there are a lot of locks belonging to other resource owners.
- *
- * MAX_RESOWNER_LOCKS is the size of the per-resource owner cache. It's
- * chosen based on some testing with pg_dump with a large schema. When the
- * tests were done (on 9.2), resource owners in a pg_dump run contained up
- * to 9 locks, regardless of the schema size, except for the top resource
- * owner which contained much more (overflowing the cache). 15 seems like a
- * nice round number that's somewhat higher than what pg_dump needs. Note that
- * making this number larger is not free - the bigger the cache, the slower
- * it is to release locks (in retail), when a resource owner holds many locks.
- */
-#define MAX_RESOWNER_LOCKS 15
-
/*
* ResourceOwner objects look like this
*/
@@ -133,9 +116,7 @@ typedef struct ResourceOwnerData
ResourceArray cryptohasharr; /* cryptohash contexts */
ResourceArray hmacarr; /* HMAC contexts */
- /* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
- int nlocks; /* number of owned locks */
- LOCALLOCK *locks[MAX_RESOWNER_LOCKS]; /* list of owned locks */
+ dlist_head locks; /* dlist of owned locks */
} ResourceOwnerData;
@@ -452,6 +433,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
ResourceArrayInit(&(owner->jitarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->cryptohasharr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->hmacarr), PointerGetDatum(NULL));
+ dlist_init(&owner->locks);
return owner;
}
@@ -586,8 +568,15 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
+ dlist_mutable_iter iter;
+
if (isTopLevel)
{
+ dlist_foreach_modify(iter, &owner->locks)
+ LockReleaseCurrentOwner(owner, iter.cur);
+
+ Assert(dlist_is_empty(&owner->locks));
+
/*
* For a top-level xact we are going to release all locks (or at
* least all non-session locks), so just do a single lmgr call at
@@ -606,30 +595,20 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
* subtransaction, we do NOT release its locks yet, but transfer
* them to the parent.
*/
- LOCALLOCK **locks;
- int nlocks;
-
- Assert(owner->parent != NULL);
-
- /*
- * Pass the list of locks owned by this resource owner to the lock
- * manager, unless it has overflowed.
- */
- if (owner->nlocks > MAX_RESOWNER_LOCKS)
+ if (isCommit)
{
- locks = NULL;
- nlocks = 0;
+ dlist_foreach_modify(iter, &owner->locks)
+ LockReassignCurrentOwner(owner, iter.cur);
+
+ Assert(dlist_is_empty(&owner->locks));
}
else
{
- locks = owner->locks;
- nlocks = owner->nlocks;
- }
+ dlist_foreach_modify(iter, &owner->locks)
+ LockReleaseCurrentOwner(owner, iter.cur);
- if (isCommit)
- LockReassignCurrentOwner(locks, nlocks);
- else
- LockReleaseCurrentOwner(locks, nlocks);
+ Assert(dlist_is_empty(&owner->locks));
+ }
}
}
else if (phase == RESOURCE_RELEASE_AFTER_LOCKS)
@@ -757,7 +736,7 @@ ResourceOwnerDelete(ResourceOwner owner)
Assert(owner->jitarr.nitems == 0);
Assert(owner->cryptohasharr.nitems == 0);
Assert(owner->hmacarr.nitems == 0);
- Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
+ Assert(dlist_is_empty(&owner->locks));
/*
* Delete children. The recursive call will delink the child from me, so
@@ -978,54 +957,61 @@ ResourceOwnerForgetBuffer(ResourceOwner owner, Buffer buffer)
/*
* Remember that a Local Lock is owned by a ResourceOwner
- *
- * This is different from the other Remember functions in that the list of
- * locks is only a lossy cache. It can hold up to MAX_RESOWNER_LOCKS entries,
- * and when it overflows, we stop tracking locks. The point of only remembering
- * only up to MAX_RESOWNER_LOCKS entries is that if a lot of locks are held,
- * ResourceOwnerForgetLock doesn't need to scan through a large array to find
- * the entry.
*/
void
-ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCK *locallock)
+ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCKOWNER *locallockowner)
{
- Assert(locallock != NULL);
-
- if (owner->nlocks > MAX_RESOWNER_LOCKS)
- return; /* we have already overflowed */
+ Assert(owner != NULL);
+ Assert(locallockowner != NULL);
- if (owner->nlocks < MAX_RESOWNER_LOCKS)
- owner->locks[owner->nlocks] = locallock;
- else
+#ifdef USE_ASSERT_CHECKING
{
- /* overflowed */
+ dlist_iter iter;
+
+ dlist_foreach(iter, &owner->locks)
+ {
+ LOCALLOCKOWNER *i = dlist_container(LOCALLOCKOWNER, resowner_node, iter.cur);
+
+ Assert(i->locallock != locallockowner->locallock);
+ }
}
- owner->nlocks++;
+#endif
+
+ dlist_push_tail(&owner->locks, &locallockowner->resowner_node);
}
/*
- * Forget that a Local Lock is owned by a ResourceOwner
+ * Forget that a Local Lock is owned by the given LOCALLOCKOWNER.
*/
void
-ResourceOwnerForgetLock(ResourceOwner owner, LOCALLOCK *locallock)
+ResourceOwnerForgetLock(LOCALLOCKOWNER *locallockowner)
{
- int i;
+#ifdef USE_ASSERT_CHECKING
+ ResourceOwner owner;
- if (owner->nlocks > MAX_RESOWNER_LOCKS)
- return; /* we have overflowed */
+ Assert(locallockowner != NULL);
+
+ owner = locallockowner->owner;
- Assert(owner->nlocks > 0);
- for (i = owner->nlocks - 1; i >= 0; i--)
{
- if (locallock == owner->locks[i])
+ dlist_iter iter;
+ bool found = false;
+
+ dlist_foreach(iter, &owner->locks)
{
- owner->locks[i] = owner->locks[owner->nlocks - 1];
- owner->nlocks--;
- return;
+ LOCALLOCKOWNER *owner = dlist_container(LOCALLOCKOWNER, resowner_node, iter.cur);
+
+ if (locallockowner == owner)
+ {
+ Assert(!found);
+ found = true;
+ }
}
+
+ Assert(found);
}
- elog(ERROR, "lock reference %p is not owned by resource owner %s",
- locallock, owner->name);
+#endif
+ dlist_delete(&locallockowner->resowner_node);
}
/*
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 6ae434596a..ee800ca693 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -24,6 +24,7 @@
#include "storage/lwlock.h"
#include "storage/shmem.h"
#include "utils/timestamp.h"
+#include "lib/ilist.h"
/* struct PGPROC is declared in proc.h, but must forward-reference it */
typedef struct PGPROC PGPROC;
@@ -349,10 +350,6 @@ typedef struct LOCK
* Otherwise, proclock objects whose holdMasks are zero are recycled
* as soon as convenient.
*
- * releaseMask is workspace for LockReleaseAll(): it shows the locks due
- * to be released during the current call. This must only be examined or
- * set by the backend owning the PROCLOCK.
- *
* Each PROCLOCK object is linked into lists for both the associated LOCK
* object and the owning PGPROC object. Note that the PROCLOCK is entered
* into these lists as soon as it is created, even if no lock has yet been
@@ -374,7 +371,6 @@ typedef struct PROCLOCK
/* data */
PGPROC *groupLeader; /* proc's lock group leader, or proc itself */
LOCKMASK holdMask; /* bitmask for lock types currently held */
- LOCKMASK releaseMask; /* bitmask for lock types to be released */
dlist_node lockLink; /* list link in LOCK's list of proclocks */
dlist_node procLink; /* list link in PGPROC's list of proclocks */
} PROCLOCK;
@@ -420,6 +416,13 @@ typedef struct LOCALLOCKOWNER
* Must use a forward struct reference to avoid circularity.
*/
struct ResourceOwnerData *owner;
+
+ dlist_node resowner_node;
+
+ dlist_node locallock_node;
+
+ struct LOCALLOCK *locallock;
+
int64 nLocks; /* # of times held by this owner */
} LOCALLOCKOWNER;
@@ -433,9 +436,9 @@ typedef struct LOCALLOCK
LOCK *lock; /* associated LOCK object, if any */
PROCLOCK *proclock; /* associated PROCLOCK object, if any */
int64 nLocks; /* total number of times lock is held */
- int numLockOwners; /* # of relevant ResourceOwners */
- int maxLockOwners; /* allocated size of array */
- LOCALLOCKOWNER *lockOwners; /* dynamically resizable array */
+
+ dlist_head locallockowners; /* dlist of LOCALLOCKOWNER */
+
bool holdsStrongLockCount; /* bumped FastPathStrongRelationLocks */
bool lockCleared; /* we read all sinval msgs for lock */
} LOCALLOCK;
@@ -564,10 +567,17 @@ extern void AbortStrongLockAcquire(void);
extern void MarkLockClear(LOCALLOCK *locallock);
extern bool LockRelease(const LOCKTAG *locktag,
LOCKMODE lockmode, bool sessionLock);
-extern void LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks);
+
+#ifdef USE_ASSERT_CHECKING
+extern void LockAssertNoneHeld(bool isCommit);
+#endif
+
extern void LockReleaseSession(LOCKMETHODID lockmethodid);
-extern void LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks);
-extern void LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks);
+struct ResourceOwnerData;
+extern void LockReleaseCurrentOwner(struct ResourceOwnerData *owner,
+ dlist_node *resowner_node);
+extern void LockReassignCurrentOwner(struct ResourceOwnerData *owner,
+ dlist_node *resowner_node);
extern bool LockHeldByMe(const LOCKTAG *locktag, LOCKMODE lockmode);
#ifdef USE_ASSERT_CHECKING
extern HTAB *GetLockMethodLocalHash(void);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 1b1f3181b5..ad13f13401 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -31,8 +31,9 @@ extern void ResourceOwnerRememberBuffer(ResourceOwner owner, Buffer buffer);
extern void ResourceOwnerForgetBuffer(ResourceOwner owner, Buffer buffer);
/* support for local lock management */
-extern void ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCK *locallock);
-extern void ResourceOwnerForgetLock(ResourceOwner owner, LOCALLOCK *locallock);
+extern void ResourceOwnerRememberLock(ResourceOwner owner,
+ LOCALLOCKOWNER *locallock);
+extern void ResourceOwnerForgetLock(LOCALLOCKOWNER *locallock);
/* support for catcache refcount management */
extern void ResourceOwnerEnlargeCatCacheRefs(ResourceOwner owner);
Hi David,
On Tue, Jan 24, 2023 at 12:58 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 20 Jan 2023 at 00:26, vignesh C <vignesh21@gmail.com> wrote:
CFBot shows some compilation errors as in [1], please post an updated
version for the same:
I've attached a rebased patch.
Thanks for the new patch.
Maybe you're planning to do it once this patch is post the PoC phase
(isn't it?), but it would be helpful to have commentary on all the new
dlist fields.
Especially, I think the following warrants a bit more explanation than the others:
- /* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
- int nlocks; /* number of owned locks */
- LOCALLOCK *locks[MAX_RESOWNER_LOCKS]; /* list of owned locks */
+ dlist_head locks; /* dlist of owned locks */
This seems to be replacing what is a cache with an upper limit on the
number of cached locks with something that has no limit on how many
per-owner locks are remembered. I wonder whether we'd be doing
additional work in some cases with the new no-limit implementation
that wasn't being done before (where the owner's locks array is
overflowed) or maybe not much because the new implementation of
ResourceOwner{Remember|Forget}Lock() is simple push/delete of a dlist
node from the owner's dlist?
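(For reference, going by the resowner.c hunk above, the new implementations
boil down to the following once the USE_ASSERT_CHECKING cross-checks are
omitted:

void
ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCKOWNER *locallockowner)
{
	/* O(1): just link the entry into the owner's dlist */
	dlist_push_tail(&owner->locks, &locallockowner->resowner_node);
}

void
ResourceOwnerForgetLock(LOCALLOCKOWNER *locallockowner)
{
	/* O(1): unlink the entry, no scan of the owner's other locks */
	dlist_delete(&locallockowner->resowner_node);
}

Both are constant-time list-link updates regardless of how many locks the
owner holds, so perhaps the overflow case is the only one that could differ.)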
The following comment is now obsolete:
/*
* LockReassignCurrentOwner
* Reassign all locks belonging to CurrentResourceOwner to belong
* to its parent resource owner.
*
* If the caller knows what those locks are, it can pass them as an array.
* That speeds up the call significantly, when a lot of locks are held
* (e.g pg_dump with a large schema). Otherwise, pass NULL for locallocks,
* and we'll traverse through our hash table to find them.
*/
--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com
Hi,
On 2023-01-24 16:57:37 +1300, David Rowley wrote:
I've attached a rebased patch.
Looks like there's some issue causing tests to fail probabilistically:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3501
Several of the failures are when testing a 32-bit build.
While reading over this again, I wondered if instead of allocating the
memory for the LOCALLOCKOWNER in TopMemoryContext, maybe we should
create a Slab context as a child of TopMemoryContext and perform the
allocations there.
Yes, that does make sense.
I would like to get this LockReleaseAll problem finally fixed in PG16,
but I'd feel much better about this patch if it had some review from
someone who has more in-depth knowledge of the locking code.
I feel my review wouldn't be independent, but I'll give it a shot if nobody
else does.
Greetings,
Andres Freund
Thanks for having a look at this.
On Wed, 1 Feb 2023 at 03:07, Amit Langote <amitlangote09@gmail.com> wrote:
Maybe you're planning to do it once this patch is post the PoC phase
(isn't it?), but it would be helpful to have commentary on all the new
dlist fields.
I've added comments on the new fields. Maybe we can say the patch is "wip".
This seems to be replacing what is a cache with an upper limit on the
number of cached locks with something that has no limit on how many
per-owner locks are remembered. I wonder whether we'd be doing
additional work in some cases with the new no-limit implementation
that wasn't being done before (where the owner's locks array is
overflowed) or maybe not much because the new implementation of
ResourceOwner{Remember|Forget}Lock() is simple push/delete of a dlist
node from the owner's dlist?
It's a good question. The problem is I don't really have a good test
to find out; we'd need to benchmark taking fewer than
16 locks. On trying that, I find that there's just too much
variability in the performance between runs to determine if there's
any slowdown.
$ cat 10_locks.sql
select count(pg_advisory_lock(x)) from generate_series(1,10) x;
$ pgbench -f 10_locks.sql@1000 -M prepared -T 10 -n postgres | grep -E "(tps)"
tps = 47809.306088 (without initial connection time)
tps = 66859.789072 (without initial connection time)
tps = 37885.924616 (without initial connection time)
On trying with more locks, I see there are good wins from the patched version.
$ cat 100_locks.sql
select count(pg_advisory_lock(x)) from generate_series(1,100) x;
$ cat 1k_locks.sql
select count(pg_advisory_lock(x)) from generate_series(1,1000) x;
$ cat 10k_locks.sql
select count(pg_advisory_lock(x)) from generate_series(1,10000) x;
Test 1: Take 100 locks but periodically take 10k locks to bloat the
local lock table.
master:
$ pgbench -f 100_locks.sql@1000 -f 10k_locks.sql@1 -M prepared -T 10
-n postgres | grep -E "(tps|script)"
transaction type: multiple scripts
tps = 2726.197037 (without initial connection time)
SQL script 1: 100_locks.sql
- 27219 transactions (99.9% of total, tps = 2722.496227)
SQL script 2: 10k_locks.sql
- 37 transactions (0.1% of total, tps = 3.700810)
patched:
$ pgbench -f 100_locks.sql@1000 -f 10k_locks.sql@1 -M prepared -T 10
-n postgres | grep -E "(tps|script)"
transaction type: multiple scripts
tps = 34047.297822 (without initial connection time)
SQL script 1: 100_locks.sql
- 340039 transactions (99.9% of total, tps = 34012.688879)
SQL script 2: 10k_locks.sql
- 346 transactions (0.1% of total, tps = 34.608943)
patched without slab context:
$ pgbench -f 100_locks.sql@1000 -f 10k_locks.sql@1 -M prepared -T 10
-n postgres | grep -E "(tps|script)"
transaction type: multiple scripts
tps = 34851.770846 (without initial connection time)
SQL script 1: 100_locks.sql
- 348097 transactions (99.9% of total, tps = 34818.662324)
SQL script 2: 10k_locks.sql
- 331 transactions (0.1% of total, tps = 33.108522)
Test 2: Always take just 100 locks and don't bloat the local lock table.
master:
$ pgbench -f 100_locks.sql@1000 -M prepared -T 10 -n postgres | grep
-E "(tps|script)"
tps = 32682.491548 (without initial connection time)
patched:
$ pgbench -f 100_locks.sql@1000 -M prepared -T 10 -n postgres | grep
-E "(tps|script)"
tps = 35637.241815 (without initial connection time)
patched without slab context:
$ pgbench -f 100_locks.sql@1000 -M prepared -T 10 -n postgres | grep
-E "(tps|script)"
tps = 36192.185181 (without initial connection time)
The attached 0003 patch is an experiment to see if using a slab memory
context has any advantages for storing the LOCALLOCKOWNER structs.
There seems to be a small performance hit from doing this.
The following comment is now obsolete:
/*
* LockReassignCurrentOwner
* Reassign all locks belonging to CurrentResourceOwner to belong
* to its parent resource owner.
*
* If the caller knows what those locks are, it can pass them as an array.
* That speeds up the call significantly, when a lot of locks are held
* (e.g pg_dump with a large schema). Otherwise, pass NULL for locallocks,
* and we'll traverse through our hash table to find them.
*/
I've removed the obsolete part.
I've attached another set of patches. I do need to spend longer
looking at this. I'm mainly attaching these as CI seems to be
highlighting a problem that I'm unable to recreate locally and I
wanted to see if the attached fixes it.
David
Attachments:
v6-0001-wip-resowner-lock-release-all.patchtext/plain; charset=US-ASCII; name=v6-0001-wip-resowner-lock-release-all.patchDownload
From 4a546ad6d33e544dd872b23a94925a262088cd9a Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Tue, 24 Jan 2023 16:00:58 +1300
Subject: [PATCH v6 1/3] wip-resowner-lock-release-all.
---
src/backend/commands/discard.c | 2 +-
src/backend/replication/logical/launcher.c | 2 +-
src/backend/storage/lmgr/README | 6 -
src/backend/storage/lmgr/lock.c | 650 ++++++---------------
src/backend/storage/lmgr/proc.c | 17 +-
src/backend/utils/init/postinit.c | 6 +-
src/backend/utils/resowner/resowner.c | 128 ++--
src/include/storage/lock.h | 32 +-
src/include/utils/resowner_private.h | 5 +-
9 files changed, 280 insertions(+), 568 deletions(-)
diff --git a/src/backend/commands/discard.c b/src/backend/commands/discard.c
index 296dc82d2e..edb8b6026e 100644
--- a/src/backend/commands/discard.c
+++ b/src/backend/commands/discard.c
@@ -71,7 +71,7 @@ DiscardAll(bool isTopLevel)
ResetAllOptions();
DropAllPreparedStatements();
Async_UnlistenAll();
- LockReleaseAll(USER_LOCKMETHOD, true);
+ LockReleaseSession(USER_LOCKMETHOD);
ResetPlanCache();
ResetTempTableNamespace();
ResetSequenceCaches();
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 970d170e73..8998b55f62 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -798,7 +798,7 @@ logicalrep_worker_onexit(int code, Datum arg)
* parallel apply mode and will not be released when the worker
* terminates, so manually release all locks before the worker exits.
*/
- LockReleaseAll(DEFAULT_LOCKMETHOD, true);
+ LockReleaseSession(DEFAULT_LOCKMETHOD);
ApplyLauncherWakeup();
}
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index d08ec6c402..9603cc8959 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -182,12 +182,6 @@ holdMask -
subset of the PGPROC object's heldLocks mask (if the PGPROC is
currently waiting for another lock mode on this lock).
-releaseMask -
- A bitmask for the lock modes due to be released during LockReleaseAll.
- This must be a subset of the holdMask. Note that it is modified without
- taking the partition LWLock, and therefore it is unsafe for any
- backend except the one owning the PROCLOCK to examine/change it.
-
lockLink -
List link for shared memory queue of all the PROCLOCK objects for the
same LOCK.
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index a87372f33f..7e2dd3a7af 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -22,8 +22,7 @@
* Interface:
*
* InitLocks(), GetLocksMethodTable(), GetLockTagsMethodTable(),
- * LockAcquire(), LockRelease(), LockReleaseAll(),
- * LockCheckConflicts(), GrantLock()
+ * LockAcquire(), LockRelease(), LockCheckConflicts(), GrantLock()
*
*-------------------------------------------------------------------------
*/
@@ -289,6 +288,9 @@ static LOCALLOCK *awaitedLock;
static ResourceOwner awaitedOwner;
+static dlist_head session_locks[lengthof(LockMethods)];
+
+
#ifdef LOCK_DEBUG
/*------
@@ -375,8 +377,8 @@ static void GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner);
static void BeginStrongLockAcquire(LOCALLOCK *locallock, uint32 fasthashcode);
static void FinishStrongLockAcquire(void);
static void WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner);
-static void ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock);
-static void LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent);
+static void ReleaseLockIfHeld(LOCALLOCKOWNER *locallockowner, bool sessionLock);
+static void LockReassignOwner(LOCALLOCKOWNER *locallockowner, ResourceOwner parent);
static bool UnGrantLock(LOCK *lock, LOCKMODE lockmode,
PROCLOCK *proclock, LockMethod lockMethodTable);
static void CleanUpLock(LOCK *lock, PROCLOCK *proclock,
@@ -477,6 +479,10 @@ InitLocks(void)
16,
&info,
HASH_ELEM | HASH_BLOBS);
+
+ /* Initialize each element of the session_locks array */
+ for (int i = 0; i < lengthof(LockMethods); i++)
+ dlist_init(&session_locks[i]);
}
@@ -839,26 +845,9 @@ LockAcquireExtended(const LOCKTAG *locktag,
locallock->nLocks = 0;
locallock->holdsStrongLockCount = false;
locallock->lockCleared = false;
- locallock->numLockOwners = 0;
- locallock->maxLockOwners = 8;
- locallock->lockOwners = NULL; /* in case next line fails */
- locallock->lockOwners = (LOCALLOCKOWNER *)
- MemoryContextAlloc(TopMemoryContext,
- locallock->maxLockOwners * sizeof(LOCALLOCKOWNER));
+ dlist_init(&locallock->locallockowners);
}
- else
- {
- /* Make sure there will be room to remember the lock */
- if (locallock->numLockOwners >= locallock->maxLockOwners)
- {
- int newsize = locallock->maxLockOwners * 2;
- locallock->lockOwners = (LOCALLOCKOWNER *)
- repalloc(locallock->lockOwners,
- newsize * sizeof(LOCALLOCKOWNER));
- locallock->maxLockOwners = newsize;
- }
- }
hashcode = locallock->hashcode;
if (locallockp)
@@ -1268,7 +1257,6 @@ SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
proclock->groupLeader = proc->lockGroupLeader != NULL ?
proc->lockGroupLeader : proc;
proclock->holdMask = 0;
- proclock->releaseMask = 0;
/* Add proclock to appropriate lists */
dlist_push_tail(&lock->procLocks, &proclock->lockLink);
dlist_push_tail(&proc->myProcLocks[partition], &proclock->procLink);
@@ -1365,17 +1353,18 @@ CheckAndSetLockHeld(LOCALLOCK *locallock, bool acquired)
static void
RemoveLocalLock(LOCALLOCK *locallock)
{
- int i;
+ dlist_mutable_iter iter;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach_modify(iter, &locallock->locallockowners)
{
- if (locallock->lockOwners[i].owner != NULL)
- ResourceOwnerForgetLock(locallock->lockOwners[i].owner, locallock);
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, locallock_node, iter.cur);
+
+ Assert(locallockowner->owner != NULL);
+ dlist_delete(&locallockowner->locallock_node);
+ ResourceOwnerForgetLock(locallockowner);
}
- locallock->numLockOwners = 0;
- if (locallock->lockOwners != NULL)
- pfree(locallock->lockOwners);
- locallock->lockOwners = NULL;
+
+ Assert(dlist_is_empty(&locallock->locallockowners));
if (locallock->holdsStrongLockCount)
{
@@ -1683,26 +1672,38 @@ CleanUpLock(LOCK *lock, PROCLOCK *proclock,
static void
GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner)
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
- int i;
+ LOCALLOCKOWNER *locallockowner;
+ dlist_iter iter;
- Assert(locallock->numLockOwners < locallock->maxLockOwners);
/* Count the total */
locallock->nLocks++;
/* Count the per-owner lock */
- for (i = 0; i < locallock->numLockOwners; i++)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == owner)
+ locallockowner = dlist_container(LOCALLOCKOWNER, locallock_node, iter.cur);
+
+ if (locallockowner->owner == owner)
{
- lockOwners[i].nLocks++;
+ locallockowner->nLocks++;
return;
}
}
- lockOwners[i].owner = owner;
- lockOwners[i].nLocks = 1;
- locallock->numLockOwners++;
+ locallockowner = MemoryContextAlloc(TopMemoryContext, sizeof(LOCALLOCKOWNER));
+ locallockowner->owner = owner;
+ locallockowner->nLocks = 1;
+ locallockowner->locallock = locallock;
+
+ dlist_push_tail(&locallock->locallockowners, &locallockowner->locallock_node);
+
if (owner != NULL)
- ResourceOwnerRememberLock(owner, locallock);
+ ResourceOwnerRememberLock(owner, locallockowner);
+ else
+ {
+ LOCKMETHODID lockmethodid = LOCALLOCK_LOCKMETHOD(*locallockowner->locallock);
+
+ Assert(lockmethodid > 0 && lockmethodid <= 2);
+ dlist_push_tail(&session_locks[lockmethodid - 1], &locallockowner->resowner_node);
+ }
/* Indicate that the lock is acquired for certain types of locks. */
CheckAndSetLockHeld(locallock, true);
@@ -2015,9 +2016,9 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
* Decrease the count for the resource owner.
*/
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
ResourceOwner owner;
- int i;
+ dlist_iter iter;
+ bool found = false;
/* Identify owner for lock */
if (sessionLock)
@@ -2025,24 +2026,29 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
else
owner = CurrentResourceOwner;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == owner)
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, locallock_node, iter.cur);
+
+ if (locallockowner->owner != owner)
+ continue;
+
+ found = true;
+
+ if (--locallockowner->nLocks == 0)
{
- Assert(lockOwners[i].nLocks > 0);
- if (--lockOwners[i].nLocks == 0)
- {
- if (owner != NULL)
- ResourceOwnerForgetLock(owner, locallock);
- /* compact out unused slot */
- locallock->numLockOwners--;
- if (i < locallock->numLockOwners)
- lockOwners[i] = lockOwners[locallock->numLockOwners];
- }
- break;
+ dlist_delete(&locallockowner->locallock_node);
+
+ if (owner != NULL)
+ ResourceOwnerForgetLock(locallockowner);
+ else
+ dlist_delete(&locallockowner->resowner_node);
}
+
+ Assert(locallockowner->nLocks >= 0);
}
- if (i < 0)
+
+ if (!found)
{
/* don't release a lock belonging to another owner */
elog(WARNING, "you don't own a lock of type %s",
@@ -2060,6 +2066,8 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
if (locallock->nLocks > 0)
return true;
+ Assert(locallock->nLocks >= 0);
+
/*
* At this point we can no longer suppose we are clear of invalidation
* messages related to this lock. Although we'll delete the LOCALLOCK
@@ -2162,274 +2170,44 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
return true;
}
+#ifdef USE_ASSERT_CHECKING
/*
- * LockReleaseAll -- Release all locks of the specified lock method that
- * are held by the current process.
- *
- * Well, not necessarily *all* locks. The available behaviors are:
- * allLocks == true: release all locks including session locks.
- * allLocks == false: release all non-session locks.
+ * LockAssertNoneHeld -- Assert that we no longer hold any DEFAULT_LOCKMETHOD
+ * locks during an abort.
*/
-void
-LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
+extern void
+LockAssertNoneHeld(bool isCommit)
{
HASH_SEQ_STATUS status;
- LockMethod lockMethodTable;
- int i,
- numLockModes;
LOCALLOCK *locallock;
- LOCK *lock;
- int partition;
- bool have_fast_path_lwlock = false;
-
- if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
- elog(ERROR, "unrecognized lock method: %d", lockmethodid);
- lockMethodTable = LockMethods[lockmethodid];
-
-#ifdef LOCK_DEBUG
- if (*(lockMethodTable->trace_flag))
- elog(LOG, "LockReleaseAll: lockmethod=%d", lockmethodid);
-#endif
-
- /*
- * Get rid of our fast-path VXID lock, if appropriate. Note that this is
- * the only way that the lock we hold on our own VXID can ever get
- * released: it is always and only released when a toplevel transaction
- * ends.
- */
- if (lockmethodid == DEFAULT_LOCKMETHOD)
- VirtualXactLockTableCleanup();
- numLockModes = lockMethodTable->numLockModes;
-
- /*
- * First we run through the locallock table and get rid of unwanted
- * entries, then we scan the process's proclocks and get rid of those. We
- * do this separately because we may have multiple locallock entries
- * pointing to the same proclock, and we daren't end up with any dangling
- * pointers. Fast-path locks are cleaned up during the locallock table
- * scan, though.
- */
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ if (!isCommit)
{
- /*
- * If the LOCALLOCK entry is unused, we must've run out of shared
- * memory while trying to set up this lock. Just forget the local
- * entry.
- */
- if (locallock->nLocks == 0)
- {
- RemoveLocalLock(locallock);
- continue;
- }
-
- /* Ignore items that are not of the lockmethod to be removed */
- if (LOCALLOCK_LOCKMETHOD(*locallock) != lockmethodid)
- continue;
+ hash_seq_init(&status, LockMethodLocalHash);
- /*
- * If we are asked to release all locks, we can just zap the entry.
- * Otherwise, must scan to see if there are session locks. We assume
- * there is at most one lockOwners entry for session locks.
- */
- if (!allLocks)
+ while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
+ dlist_iter local_iter;
- /* If session lock is above array position 0, move it down to 0 */
- for (i = 0; i < locallock->numLockOwners; i++)
- {
- if (lockOwners[i].owner == NULL)
- lockOwners[0] = lockOwners[i];
- else
- ResourceOwnerForgetLock(lockOwners[i].owner, locallock);
- }
+ Assert(locallock->nLocks >= 0);
- if (locallock->numLockOwners > 0 &&
- lockOwners[0].owner == NULL &&
- lockOwners[0].nLocks > 0)
+ dlist_foreach(local_iter, &locallock->locallockowners)
{
- /* Fix the locallock to show just the session locks */
- locallock->nLocks = lockOwners[0].nLocks;
- locallock->numLockOwners = 1;
- /* We aren't deleting this locallock, so done */
- continue;
- }
- else
- locallock->numLockOwners = 0;
- }
-
- /*
- * If the lock or proclock pointers are NULL, this lock was taken via
- * the relation fast-path (and is not known to have been transferred).
- */
- if (locallock->proclock == NULL || locallock->lock == NULL)
- {
- LOCKMODE lockmode = locallock->tag.mode;
- Oid relid;
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ locallock_node,
+ local_iter.cur);
- /* Verify that a fast-path lock is what we've got. */
- if (!EligibleForRelationFastPath(&locallock->tag.lock, lockmode))
- elog(PANIC, "locallock table corrupted");
+ Assert(locallockowner->owner == NULL);
- /*
- * If we don't currently hold the LWLock that protects our
- * fast-path data structures, we must acquire it before attempting
- * to release the lock via the fast-path. We will continue to
- * hold the LWLock until we're done scanning the locallock table,
- * unless we hit a transferred fast-path lock. (XXX is this
- * really such a good idea? There could be a lot of entries ...)
- */
- if (!have_fast_path_lwlock)
- {
- LWLockAcquire(&MyProc->fpInfoLock, LW_EXCLUSIVE);
- have_fast_path_lwlock = true;
+ if (locallockowner->nLocks > 0 &&
+ LOCALLOCK_LOCKMETHOD(*locallock) == DEFAULT_LOCKMETHOD)
+ Assert(false);
}
-
- /* Attempt fast-path release. */
- relid = locallock->tag.lock.locktag_field2;
- if (FastPathUnGrantRelationLock(relid, lockmode))
- {
- RemoveLocalLock(locallock);
- continue;
- }
-
- /*
- * Our lock, originally taken via the fast path, has been
- * transferred to the main lock table. That's going to require
- * some extra work, so release our fast-path lock before starting.
- */
- LWLockRelease(&MyProc->fpInfoLock);
- have_fast_path_lwlock = false;
-
- /*
- * Now dump the lock. We haven't got a pointer to the LOCK or
- * PROCLOCK in this case, so we have to handle this a bit
- * differently than a normal lock release. Unfortunately, this
- * requires an extra LWLock acquire-and-release cycle on the
- * partitionLock, but hopefully it shouldn't happen often.
- */
- LockRefindAndRelease(lockMethodTable, MyProc,
- &locallock->tag.lock, lockmode, false);
- RemoveLocalLock(locallock);
- continue;
}
-
- /* Mark the proclock to show we need to release this lockmode */
- if (locallock->nLocks > 0)
- locallock->proclock->releaseMask |= LOCKBIT_ON(locallock->tag.mode);
-
- /* And remove the locallock hashtable entry */
- RemoveLocalLock(locallock);
}
-
- /* Done with the fast-path data structures */
- if (have_fast_path_lwlock)
- LWLockRelease(&MyProc->fpInfoLock);
-
- /*
- * Now, scan each lock partition separately.
- */
- for (partition = 0; partition < NUM_LOCK_PARTITIONS; partition++)
- {
- LWLock *partitionLock;
- dlist_head *procLocks = &MyProc->myProcLocks[partition];
- dlist_mutable_iter proclock_iter;
-
- partitionLock = LockHashPartitionLockByIndex(partition);
-
- /*
- * If the proclock list for this partition is empty, we can skip
- * acquiring the partition lock. This optimization is trickier than
- * it looks, because another backend could be in process of adding
- * something to our proclock list due to promoting one of our
- * fast-path locks. However, any such lock must be one that we
- * decided not to delete above, so it's okay to skip it again now;
- * we'd just decide not to delete it again. We must, however, be
- * careful to re-fetch the list header once we've acquired the
- * partition lock, to be sure we have a valid, up-to-date pointer.
- * (There is probably no significant risk if pointer fetch/store is
- * atomic, but we don't wish to assume that.)
- *
- * XXX This argument assumes that the locallock table correctly
- * represents all of our fast-path locks. While allLocks mode
- * guarantees to clean up all of our normal locks regardless of the
- * locallock situation, we lose that guarantee for fast-path locks.
- * This is not ideal.
- */
- if (dlist_is_empty(procLocks))
- continue; /* needn't examine this partition */
-
- LWLockAcquire(partitionLock, LW_EXCLUSIVE);
-
- dlist_foreach_modify(proclock_iter, procLocks)
- {
- PROCLOCK *proclock = dlist_container(PROCLOCK, procLink, proclock_iter.cur);
- bool wakeupNeeded = false;
-
- Assert(proclock->tag.myProc == MyProc);
-
- lock = proclock->tag.myLock;
-
- /* Ignore items that are not of the lockmethod to be removed */
- if (LOCK_LOCKMETHOD(*lock) != lockmethodid)
- continue;
-
- /*
- * In allLocks mode, force release of all locks even if locallock
- * table had problems
- */
- if (allLocks)
- proclock->releaseMask = proclock->holdMask;
- else
- Assert((proclock->releaseMask & ~proclock->holdMask) == 0);
-
- /*
- * Ignore items that have nothing to be released, unless they have
- * holdMask == 0 and are therefore recyclable
- */
- if (proclock->releaseMask == 0 && proclock->holdMask != 0)
- continue;
-
- PROCLOCK_PRINT("LockReleaseAll", proclock);
- LOCK_PRINT("LockReleaseAll", lock, 0);
- Assert(lock->nRequested >= 0);
- Assert(lock->nGranted >= 0);
- Assert(lock->nGranted <= lock->nRequested);
- Assert((proclock->holdMask & ~lock->grantMask) == 0);
-
- /*
- * Release the previously-marked lock modes
- */
- for (i = 1; i <= numLockModes; i++)
- {
- if (proclock->releaseMask & LOCKBIT_ON(i))
- wakeupNeeded |= UnGrantLock(lock, i, proclock,
- lockMethodTable);
- }
- Assert((lock->nRequested >= 0) && (lock->nGranted >= 0));
- Assert(lock->nGranted <= lock->nRequested);
- LOCK_PRINT("LockReleaseAll: updated", lock, 0);
-
- proclock->releaseMask = 0;
-
- /* CleanUpLock will wake up waiters if needed. */
- CleanUpLock(lock, proclock,
- lockMethodTable,
- LockTagHashCode(&lock->tag),
- wakeupNeeded);
- } /* loop over PROCLOCKs within this partition */
-
- LWLockRelease(partitionLock);
- } /* loop over partitions */
-
-#ifdef LOCK_DEBUG
- if (*(lockMethodTable->trace_flag))
- elog(LOG, "LockReleaseAll done");
-#endif
+ Assert(MyProc->fpLockBits == 0);
}
+#endif
/*
* LockReleaseSession -- Release all session locks of the specified lock method
@@ -2438,59 +2216,41 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
void
LockReleaseSession(LOCKMETHODID lockmethodid)
{
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
+ dlist_mutable_iter iter;
if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
elog(ERROR, "unrecognized lock method: %d", lockmethodid);
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &session_locks[lockmethodid - 1])
{
- /* Ignore items that are not of the specified lock method */
- if (LOCALLOCK_LOCKMETHOD(*locallock) != lockmethodid)
- continue;
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, resowner_node, iter.cur);
- ReleaseLockIfHeld(locallock, true);
+ Assert(LOCALLOCK_LOCKMETHOD(*locallockowner->locallock) == lockmethodid);
+
+ ReleaseLockIfHeld(locallockowner, true);
}
+
+ Assert(dlist_is_empty(&session_locks[lockmethodid - 1]));
}
/*
* LockReleaseCurrentOwner
- * Release all locks belonging to CurrentResourceOwner
- *
- * If the caller knows what those locks are, it can pass them as an array.
- * That speeds up the call significantly, when a lot of locks are held.
- * Otherwise, pass NULL for locallocks, and we'll traverse through our hash
- * table to find them.
+ * Release all locks belonging to 'owner'
*/
void
-LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks)
+LockReleaseCurrentOwner(ResourceOwner owner, dlist_node *resowner_node)
{
- if (locallocks == NULL)
- {
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, resowner_node, resowner_node);
- hash_seq_init(&status, LockMethodLocalHash);
+ Assert(locallockowner->owner == owner);
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
- ReleaseLockIfHeld(locallock, false);
- }
- else
- {
- int i;
-
- for (i = nlocks - 1; i >= 0; i--)
- ReleaseLockIfHeld(locallocks[i], false);
- }
+ ReleaseLockIfHeld(locallockowner, false);
}
/*
* ReleaseLockIfHeld
- * Release any session-level locks on this lockable object if sessionLock
- * is true; else, release any locks held by CurrentResourceOwner.
+ * Release any session-level locks on this 'locallockowner' if
+ * sessionLock is true; else, release any locks held by 'locallockowner'.
*
* It is tempting to pass this a ResourceOwner pointer (or NULL for session
* locks), but without refactoring LockRelease() we cannot support releasing
@@ -2501,52 +2261,39 @@ LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks)
* convenience.
*/
static void
-ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock)
+ReleaseLockIfHeld(LOCALLOCKOWNER *locallockowner, bool sessionLock)
{
- ResourceOwner owner;
- LOCALLOCKOWNER *lockOwners;
- int i;
+ LOCALLOCK *locallock = locallockowner->locallock;
+
+ /* release all references to the lock by this resource owner */
- /* Identify owner for lock (must match LockRelease!) */
if (sessionLock)
- owner = NULL;
+ Assert(locallockowner->owner == NULL);
else
- owner = CurrentResourceOwner;
+ Assert(locallockowner->owner != NULL);
- /* Scan to see if there are any locks belonging to the target owner */
- lockOwners = locallock->lockOwners;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ /* We will still hold this lock after forgetting this ResourceOwner. */
+ if (locallockowner->nLocks < locallock->nLocks)
{
- if (lockOwners[i].owner == owner)
- {
- Assert(lockOwners[i].nLocks > 0);
- if (lockOwners[i].nLocks < locallock->nLocks)
- {
- /*
- * We will still hold this lock after forgetting this
- * ResourceOwner.
- */
- locallock->nLocks -= lockOwners[i].nLocks;
- /* compact out unused slot */
- locallock->numLockOwners--;
- if (owner != NULL)
- ResourceOwnerForgetLock(owner, locallock);
- if (i < locallock->numLockOwners)
- lockOwners[i] = lockOwners[locallock->numLockOwners];
- }
- else
- {
- Assert(lockOwners[i].nLocks == locallock->nLocks);
- /* We want to call LockRelease just once */
- lockOwners[i].nLocks = 1;
- locallock->nLocks = 1;
- if (!LockRelease(&locallock->tag.lock,
- locallock->tag.mode,
- sessionLock))
- elog(WARNING, "ReleaseLockIfHeld: failed??");
- }
- break;
- }
+ locallock->nLocks -= locallockowner->nLocks;
+ dlist_delete(&locallockowner->locallock_node);
+
+ if (sessionLock)
+ dlist_delete(&locallockowner->resowner_node);
+ else
+ ResourceOwnerForgetLock(locallockowner);
+ }
+ else
+ {
+ Assert(locallockowner->nLocks == locallock->nLocks);
+ /* We want to call LockRelease just once */
+ locallockowner->nLocks = 1;
+ locallock->nLocks = 1;
+
+ if (!LockRelease(&locallock->tag.lock,
+ locallock->tag.mode,
+ sessionLock))
+ elog(WARNING, "ReleaseLockIfHeld: failed??");
}
}
@@ -2561,75 +2308,48 @@ ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock)
* and we'll traverse through our hash table to find them.
*/
void
-LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks)
+LockReassignCurrentOwner(ResourceOwner owner, dlist_node *resowner_node)
{
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ resowner_node,
+ resowner_node);
ResourceOwner parent = ResourceOwnerGetParent(CurrentResourceOwner);
- Assert(parent != NULL);
-
- if (locallocks == NULL)
- {
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
-
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
- LockReassignOwner(locallock, parent);
- }
- else
- {
- int i;
-
- for (i = nlocks - 1; i >= 0; i--)
- LockReassignOwner(locallocks[i], parent);
- }
+ LockReassignOwner(locallockowner, parent);
}
/*
- * Subroutine of LockReassignCurrentOwner. Reassigns a given lock belonging to
- * CurrentResourceOwner to its parent.
+ * Subroutine of LockReassignCurrentOwner. Reassigns the given
+ *'locallockowner' to 'parent'.
*/
static void
-LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent)
+LockReassignOwner(LOCALLOCKOWNER *locallockowner, ResourceOwner parent)
{
- LOCALLOCKOWNER *lockOwners;
- int i;
- int ic = -1;
- int ip = -1;
+ dlist_iter iter;
+ LOCALLOCK *locallock = locallockowner->locallock;
- /*
- * Scan to see if there are any locks belonging to current owner or its
- * parent
- */
- lockOwners = locallock->lockOwners;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ ResourceOwnerForgetLock(locallockowner);
+
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == CurrentResourceOwner)
- ic = i;
- else if (lockOwners[i].owner == parent)
- ip = i;
- }
+ LOCALLOCKOWNER *parentlocalowner = dlist_container(LOCALLOCKOWNER, locallock_node, iter.cur);
- if (ic < 0)
- return; /* no current locks */
+ Assert(parentlocalowner->locallock == locallock);
- if (ip < 0)
- {
- /* Parent has no slot, so just give it the child's slot */
- lockOwners[ic].owner = parent;
- ResourceOwnerRememberLock(parent, locallock);
- }
- else
- {
- /* Merge child's count with parent's */
- lockOwners[ip].nLocks += lockOwners[ic].nLocks;
- /* compact out unused slot */
- locallock->numLockOwners--;
- if (ic < locallock->numLockOwners)
- lockOwners[ic] = lockOwners[locallock->numLockOwners];
+ if (parentlocalowner->owner != parent)
+ continue;
+
+ parentlocalowner->nLocks += locallockowner->nLocks;
+
+ locallockowner->nLocks = 0;
+ dlist_delete(&locallockowner->locallock_node);
+ pfree(locallockowner);
+ return;
}
- ResourceOwnerForgetLock(CurrentResourceOwner, locallock);
+
+ /* reassign locallockowner to parent resowner */
+ locallockowner->owner = parent;
+ ResourceOwnerRememberLock(parent, locallockowner);
}
/*
@@ -3101,7 +2821,7 @@ GetLockConflicts(const LOCKTAG *locktag, LOCKMODE lockmode, int *countp)
* We currently use this in two situations: first, to release locks held by
* prepared transactions on commit (see lock_twophase_postcommit); and second,
* to release locks taken via the fast-path, transferred to the main hash
- * table, and then released (see LockReleaseAll).
+ * table, and then released (see ResourceOwnerRelease).
*/
static void
LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
@@ -3237,10 +2957,9 @@ CheckForSessionAndXactLocks(void)
while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
PerLockTagEntry *hentry;
bool found;
- int i;
+ dlist_iter iter;
/*
* Ignore VXID locks. We don't want those to be held by prepared
@@ -3261,9 +2980,13 @@ CheckForSessionAndXactLocks(void)
hentry->sessLock = hentry->xactLock = false;
/* Scan to see if we hold lock at session or xact level or both */
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == NULL)
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ locallock_node,
+ iter.cur);
+
+ if (locallockowner->owner == NULL)
hentry->sessLock = true;
else
hentry->xactLock = true;
@@ -3310,10 +3033,9 @@ AtPrepare_Locks(void)
while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
{
TwoPhaseLockRecord record;
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
bool haveSessionLock;
bool haveXactLock;
- int i;
+ dlist_iter iter;
/*
* Ignore VXID locks. We don't want those to be held by prepared
@@ -3328,9 +3050,13 @@ AtPrepare_Locks(void)
/* Scan to see whether we hold it at session or transaction level */
haveSessionLock = haveXactLock = false;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == NULL)
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ locallock_node,
+ iter.cur);
+
+ if (locallockowner->owner == NULL)
haveSessionLock = true;
else
haveXactLock = true;
@@ -3388,8 +3114,8 @@ AtPrepare_Locks(void)
* pointers in the transaction's resource owner. This is OK at the
* moment since resowner.c doesn't try to free locks retail at a toplevel
* transaction commit or abort. We could alternatively zero out nLocks
- * and leave the LOCALLOCK entries to be garbage-collected by LockReleaseAll,
- * but that probably costs more cycles.
+ * and leave the LOCALLOCK entries to be garbage-collected by
+ * ResourceOwnerRelease, but that probably costs more cycles.
*/
void
PostPrepare_Locks(TransactionId xid)
@@ -3422,10 +3148,9 @@ PostPrepare_Locks(TransactionId xid)
while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
bool haveSessionLock;
bool haveXactLock;
- int i;
+ dlist_iter iter;
if (locallock->proclock == NULL || locallock->lock == NULL)
{
@@ -3444,9 +3169,13 @@ PostPrepare_Locks(TransactionId xid)
/* Scan to see whether we hold it at session or transaction level */
haveSessionLock = haveXactLock = false;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == NULL)
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ locallock_node,
+ iter.cur);
+
+ if (locallockowner->owner == NULL)
haveSessionLock = true;
else
haveXactLock = true;
@@ -3462,10 +3191,6 @@ PostPrepare_Locks(TransactionId xid)
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot PREPARE while holding both session-level and transaction-level locks on the same object")));
- /* Mark the proclock to show we need to release this lockmode */
- if (locallock->nLocks > 0)
- locallock->proclock->releaseMask |= LOCKBIT_ON(locallock->tag.mode);
-
/* And remove the locallock hashtable entry */
RemoveLocalLock(locallock);
}
@@ -3483,11 +3208,7 @@ PostPrepare_Locks(TransactionId xid)
/*
* If the proclock list for this partition is empty, we can skip
- * acquiring the partition lock. This optimization is safer than the
- * situation in LockReleaseAll, because we got rid of any fast-path
- * locks during AtPrepare_Locks, so there cannot be any case where
- * another backend is adding something to our lists now. For safety,
- * though, we code this the same way as in LockReleaseAll.
+ * acquiring the partition lock.
*/
if (dlist_is_empty(procLocks))
continue; /* needn't examine this partition */
@@ -3513,14 +3234,6 @@ PostPrepare_Locks(TransactionId xid)
Assert(lock->nGranted <= lock->nRequested);
Assert((proclock->holdMask & ~lock->grantMask) == 0);
- /* Ignore it if nothing to release (must be a session lock) */
- if (proclock->releaseMask == 0)
- continue;
-
- /* Else we should be releasing all locks */
- if (proclock->releaseMask != proclock->holdMask)
- elog(PANIC, "we seem to have dropped a bit somewhere");
-
/*
* We cannot simply modify proclock->tag.myProc to reassign
* ownership of the lock, because that's part of the hash key and
@@ -4288,7 +4001,6 @@ lock_twophase_recover(TransactionId xid, uint16 info,
Assert(proc->lockGroupLeader == NULL);
proclock->groupLeader = proc;
proclock->holdMask = 0;
- proclock->releaseMask = 0;
/* Add proclock to appropriate lists */
dlist_push_tail(&lock->procLocks, &proclock->lockLink);
dlist_push_tail(&proc->myProcLocks[partition],
@@ -4425,7 +4137,7 @@ lock_twophase_postabort(TransactionId xid, uint16 info,
*
* We don't bother recording this lock in the local lock table, since it's
* only ever released at the end of a transaction. Instead,
- * LockReleaseAll() calls VirtualXactLockTableCleanup().
+ * ProcReleaseLocks() calls VirtualXactLockTableCleanup().
*/
void
VirtualXactLockTableInsert(VirtualTransactionId vxid)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 22b4278610..1addef790a 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -777,10 +777,17 @@ ProcReleaseLocks(bool isCommit)
return;
/* If waiting, get off wait queue (should only be needed after error) */
LockErrorCleanup();
- /* Release standard locks, including session-level if aborting */
- LockReleaseAll(DEFAULT_LOCKMETHOD, !isCommit);
- /* Release transaction-level advisory locks */
- LockReleaseAll(USER_LOCKMETHOD, false);
+
+ VirtualXactLockTableCleanup();
+
+ /* Release session-level locks if aborting */
+ if (!isCommit)
+ LockReleaseSession(DEFAULT_LOCKMETHOD);
+
+#ifdef USE_ASSERT_CHECKING
+ /* Ensure all locks were released */
+ LockAssertNoneHeld(isCommit);
+#endif
}
@@ -861,6 +868,8 @@ ProcKill(int code, Datum arg)
LWLockRelease(leader_lwlock);
}
+ Assert(MyProc->fpLockBits == 0);
+
/*
* Reset MyLatch to the process local one. This is so that signal
* handlers et al can continue using the latch after the shared latch
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 2f07ca7a0e..0547e3d076 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -1325,10 +1325,10 @@ ShutdownPostgres(int code, Datum arg)
AbortOutOfAnyTransaction();
/*
- * User locks are not released by transaction end, so be sure to release
- * them explicitly.
+ * Session locks are not released by transaction end, so be sure to
+ * release them explicitly.
*/
- LockReleaseAll(USER_LOCKMETHOD, true);
+ LockReleaseSession(USER_LOCKMETHOD);
}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 19b6241e45..321ea15c78 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -33,6 +33,7 @@
#include "utils/resowner_private.h"
#include "utils/snapmgr.h"
+#include "lib/ilist.h"
/*
* All resource IDs managed by this code are required to fit into a Datum,
@@ -91,24 +92,6 @@ typedef struct ResourceArray
#define RESARRAY_MAX_ITEMS(capacity) \
((capacity) <= RESARRAY_MAX_ARRAY ? (capacity) : (capacity)/4 * 3)
-/*
- * To speed up bulk releasing or reassigning locks from a resource owner to
- * its parent, each resource owner has a small cache of locks it owns. The
- * lock manager has the same information in its local lock hash table, and
- * we fall back on that if cache overflows, but traversing the hash table
- * is slower when there are a lot of locks belonging to other resource owners.
- *
- * MAX_RESOWNER_LOCKS is the size of the per-resource owner cache. It's
- * chosen based on some testing with pg_dump with a large schema. When the
- * tests were done (on 9.2), resource owners in a pg_dump run contained up
- * to 9 locks, regardless of the schema size, except for the top resource
- * owner which contained much more (overflowing the cache). 15 seems like a
- * nice round number that's somewhat higher than what pg_dump needs. Note that
- * making this number larger is not free - the bigger the cache, the slower
- * it is to release locks (in retail), when a resource owner holds many locks.
- */
-#define MAX_RESOWNER_LOCKS 15
-
/*
* ResourceOwner objects look like this
*/
@@ -133,9 +116,7 @@ typedef struct ResourceOwnerData
ResourceArray cryptohasharr; /* cryptohash contexts */
ResourceArray hmacarr; /* HMAC contexts */
- /* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
- int nlocks; /* number of owned locks */
- LOCALLOCK *locks[MAX_RESOWNER_LOCKS]; /* list of owned locks */
+ dlist_head locks; /* dlist of owned locks */
} ResourceOwnerData;
@@ -452,6 +433,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
ResourceArrayInit(&(owner->jitarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->cryptohasharr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->hmacarr), PointerGetDatum(NULL));
+ dlist_init(&owner->locks);
return owner;
}
@@ -586,8 +568,15 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
+ dlist_mutable_iter iter;
+
if (isTopLevel)
{
+ dlist_foreach_modify(iter, &owner->locks)
+ LockReleaseCurrentOwner(owner, iter.cur);
+
+ Assert(dlist_is_empty(&owner->locks));
+
/*
* For a top-level xact we are going to release all locks (or at
* least all non-session locks), so just do a single lmgr call at
@@ -606,30 +595,20 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
* subtransaction, we do NOT release its locks yet, but transfer
* them to the parent.
*/
- LOCALLOCK **locks;
- int nlocks;
-
- Assert(owner->parent != NULL);
-
- /*
- * Pass the list of locks owned by this resource owner to the lock
- * manager, unless it has overflowed.
- */
- if (owner->nlocks > MAX_RESOWNER_LOCKS)
+ if (isCommit)
{
- locks = NULL;
- nlocks = 0;
+ dlist_foreach_modify(iter, &owner->locks)
+ LockReassignCurrentOwner(owner, iter.cur);
+
+ Assert(dlist_is_empty(&owner->locks));
}
else
{
- locks = owner->locks;
- nlocks = owner->nlocks;
- }
+ dlist_foreach_modify(iter, &owner->locks)
+ LockReleaseCurrentOwner(owner, iter.cur);
- if (isCommit)
- LockReassignCurrentOwner(locks, nlocks);
- else
- LockReleaseCurrentOwner(locks, nlocks);
+ Assert(dlist_is_empty(&owner->locks));
+ }
}
}
else if (phase == RESOURCE_RELEASE_AFTER_LOCKS)
@@ -757,7 +736,7 @@ ResourceOwnerDelete(ResourceOwner owner)
Assert(owner->jitarr.nitems == 0);
Assert(owner->cryptohasharr.nitems == 0);
Assert(owner->hmacarr.nitems == 0);
- Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
+ Assert(dlist_is_empty(&owner->locks));
/*
* Delete children. The recursive call will delink the child from me, so
@@ -978,54 +957,61 @@ ResourceOwnerForgetBuffer(ResourceOwner owner, Buffer buffer)
/*
* Remember that a Local Lock is owned by a ResourceOwner
- *
- * This is different from the other Remember functions in that the list of
- * locks is only a lossy cache. It can hold up to MAX_RESOWNER_LOCKS entries,
- * and when it overflows, we stop tracking locks. The point of only remembering
- * only up to MAX_RESOWNER_LOCKS entries is that if a lot of locks are held,
- * ResourceOwnerForgetLock doesn't need to scan through a large array to find
- * the entry.
*/
void
-ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCK *locallock)
+ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCKOWNER *locallockowner)
{
- Assert(locallock != NULL);
-
- if (owner->nlocks > MAX_RESOWNER_LOCKS)
- return; /* we have already overflowed */
+ Assert(owner != NULL);
+ Assert(locallockowner != NULL);
- if (owner->nlocks < MAX_RESOWNER_LOCKS)
- owner->locks[owner->nlocks] = locallock;
- else
+#ifdef USE_ASSERT_CHECKING
{
- /* overflowed */
+ dlist_iter iter;
+
+ dlist_foreach(iter, &owner->locks)
+ {
+ LOCALLOCKOWNER *i = dlist_container(LOCALLOCKOWNER, resowner_node, iter.cur);
+
+ Assert(i->locallock != locallockowner->locallock);
+ }
}
- owner->nlocks++;
+#endif
+
+ dlist_push_tail(&owner->locks, &locallockowner->resowner_node);
}
/*
- * Forget that a Local Lock is owned by a ResourceOwner
+ * Forget that a Local Lock is owned by the given LOCALLOCKOWNER.
*/
void
-ResourceOwnerForgetLock(ResourceOwner owner, LOCALLOCK *locallock)
+ResourceOwnerForgetLock(LOCALLOCKOWNER *locallockowner)
{
- int i;
+#ifdef USE_ASSERT_CHECKING
+ ResourceOwner owner;
- if (owner->nlocks > MAX_RESOWNER_LOCKS)
- return; /* we have overflowed */
+ Assert(locallockowner != NULL);
+
+ owner = locallockowner->owner;
- Assert(owner->nlocks > 0);
- for (i = owner->nlocks - 1; i >= 0; i--)
{
- if (locallock == owner->locks[i])
+ dlist_iter iter;
+ bool found = false;
+
+ dlist_foreach(iter, &owner->locks)
{
- owner->locks[i] = owner->locks[owner->nlocks - 1];
- owner->nlocks--;
- return;
+ LOCALLOCKOWNER *owner = dlist_container(LOCALLOCKOWNER, resowner_node, iter.cur);
+
+ if (locallockowner == owner)
+ {
+ Assert(!found);
+ found = true;
+ }
}
+
+ Assert(found);
}
- elog(ERROR, "lock reference %p is not owned by resource owner %s",
- locallock, owner->name);
+#endif
+ dlist_delete(&locallockowner->resowner_node);
}
/*
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 6ae434596a..f2617f805e 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -24,6 +24,7 @@
#include "storage/lwlock.h"
#include "storage/shmem.h"
#include "utils/timestamp.h"
+#include "lib/ilist.h"
/* struct PGPROC is declared in proc.h, but must forward-reference it */
typedef struct PGPROC PGPROC;
@@ -349,10 +350,6 @@ typedef struct LOCK
* Otherwise, proclock objects whose holdMasks are zero are recycled
* as soon as convenient.
*
- * releaseMask is workspace for LockReleaseAll(): it shows the locks due
- * to be released during the current call. This must only be examined or
- * set by the backend owning the PROCLOCK.
- *
* Each PROCLOCK object is linked into lists for both the associated LOCK
* object and the owning PGPROC object. Note that the PROCLOCK is entered
* into these lists as soon as it is created, even if no lock has yet been
@@ -374,7 +371,6 @@ typedef struct PROCLOCK
/* data */
PGPROC *groupLeader; /* proc's lock group leader, or proc itself */
LOCKMASK holdMask; /* bitmask for lock types currently held */
- LOCKMASK releaseMask; /* bitmask for lock types to be released */
dlist_node lockLink; /* list link in LOCK's list of proclocks */
dlist_node procLink; /* list link in PGPROC's list of proclocks */
} PROCLOCK;
@@ -420,6 +416,13 @@ typedef struct LOCALLOCKOWNER
* Must use a forward struct reference to avoid circularity.
*/
struct ResourceOwnerData *owner;
+
+ dlist_node resowner_node;
+
+ dlist_node locallock_node;
+
+ struct LOCALLOCK *locallock;
+
int64 nLocks; /* # of times held by this owner */
} LOCALLOCKOWNER;
@@ -433,9 +436,9 @@ typedef struct LOCALLOCK
LOCK *lock; /* associated LOCK object, if any */
PROCLOCK *proclock; /* associated PROCLOCK object, if any */
int64 nLocks; /* total number of times lock is held */
- int numLockOwners; /* # of relevant ResourceOwners */
- int maxLockOwners; /* allocated size of array */
- LOCALLOCKOWNER *lockOwners; /* dynamically resizable array */
+
+ dlist_head locallockowners; /* dlist of LOCALLOCKOWNER */
+
bool holdsStrongLockCount; /* bumped FastPathStrongRelationLocks */
bool lockCleared; /* we read all sinval msgs for lock */
} LOCALLOCK;
@@ -564,10 +567,17 @@ extern void AbortStrongLockAcquire(void);
extern void MarkLockClear(LOCALLOCK *locallock);
extern bool LockRelease(const LOCKTAG *locktag,
LOCKMODE lockmode, bool sessionLock);
-extern void LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks);
+
+#ifdef USE_ASSERT_CHECKING
+extern void LockAssertNoneHeld(bool isCommit);
+#endif
+
extern void LockReleaseSession(LOCKMETHODID lockmethodid);
-extern void LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks);
-extern void LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks);
+struct ResourceOwnerData;
+extern void LockReleaseCurrentOwner(struct ResourceOwnerData *owner,
+ dlist_node *resowner_node);
+extern void LockReassignCurrentOwner(struct ResourceOwnerData *owner,
+ dlist_node *resowner_node);
extern bool LockHeldByMe(const LOCKTAG *locktag, LOCKMODE lockmode);
#ifdef USE_ASSERT_CHECKING
extern HTAB *GetLockMethodLocalHash(void);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 1b1f3181b5..ad13f13401 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -31,8 +31,9 @@ extern void ResourceOwnerRememberBuffer(ResourceOwner owner, Buffer buffer);
extern void ResourceOwnerForgetBuffer(ResourceOwner owner, Buffer buffer);
/* support for local lock management */
-extern void ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCK *locallock);
-extern void ResourceOwnerForgetLock(ResourceOwner owner, LOCALLOCK *locallock);
+extern void ResourceOwnerRememberLock(ResourceOwner owner,
+ LOCALLOCKOWNER *locallock);
+extern void ResourceOwnerForgetLock(LOCALLOCKOWNER *locallock);
/* support for catcache refcount management */
extern void ResourceOwnerEnlargeCatCacheRefs(ResourceOwner owner);
--
2.37.2
v6-0002-fixup-wip-resowner-lock-release-all.patchtext/plain; charset=US-ASCII; name=v6-0002-fixup-wip-resowner-lock-release-all.patchDownload
From 4bb7647efef51e53f35bb3a08a361f057ef27707 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Fri, 10 Feb 2023 14:29:23 +1300
Subject: [PATCH v6 2/3] fixup! wip-resowner-lock-release-all.
---
src/backend/replication/logical/launcher.c | 2 +-
src/backend/storage/lmgr/lock.c | 19 +++++--------------
src/backend/utils/resowner/resowner.c | 20 +++++++++++++++++---
src/include/storage/lock.h | 11 +++++------
4 files changed, 28 insertions(+), 24 deletions(-)
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 8998b55f62..8ba6a51945 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -798,7 +798,7 @@ logicalrep_worker_onexit(int code, Datum arg)
* parallel apply mode and will not be released when the worker
* terminates, so manually release all locks before the worker exits.
*/
- LockReleaseSession(DEFAULT_LOCKMETHOD);
+ ProcReleaseLocks(false);
ApplyLauncherWakeup();
}
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 7e2dd3a7af..1033d57b88 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -2045,6 +2045,7 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
dlist_delete(&locallockowner->resowner_node);
}
+ /* ensure nLocks didn't go negative */
Assert(locallockowner->nLocks >= 0);
}
@@ -2066,7 +2067,7 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
if (locallock->nLocks > 0)
return true;
- Assert(locallock->nLocks >= 0);
+ Assert(locallock->nLocks == 0);
/*
* At this point we can no longer suppose we are clear of invalidation
@@ -2175,7 +2176,7 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
* LockAssertNoneHeld -- Assert that we no longer hold any DEFAULT_LOCKMETHOD
* locks during an abort.
*/
-extern void
+void
LockAssertNoneHeld(bool isCommit)
{
HASH_SEQ_STATUS status;
@@ -2238,10 +2239,8 @@ LockReleaseSession(LOCKMETHODID lockmethodid)
* Release all locks belonging to 'owner'
*/
void
-LockReleaseCurrentOwner(ResourceOwner owner, dlist_node *resowner_node)
+LockReleaseCurrentOwner(ResourceOwner owner, LOCALLOCKOWNER *locallockowner)
{
- LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, resowner_node, resowner_node);
-
Assert(locallockowner->owner == owner);
ReleaseLockIfHeld(locallockowner, false);
@@ -2301,18 +2300,10 @@ ReleaseLockIfHeld(LOCALLOCKOWNER *locallockowner, bool sessionLock)
* LockReassignCurrentOwner
* Reassign all locks belonging to CurrentResourceOwner to belong
* to its parent resource owner.
- *
- * If the caller knows what those locks are, it can pass them as an array.
- * That speeds up the call significantly, when a lot of locks are held
- * (e.g pg_dump with a large schema). Otherwise, pass NULL for locallocks,
- * and we'll traverse through our hash table to find them.
*/
void
-LockReassignCurrentOwner(ResourceOwner owner, dlist_node *resowner_node)
+LockReassignCurrentOwner(LOCALLOCKOWNER *locallockowner)
{
- LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
- resowner_node,
- resowner_node);
ResourceOwner parent = ResourceOwnerGetParent(CurrentResourceOwner);
LockReassignOwner(locallockowner, parent);
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 321ea15c78..46a9a3ca42 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -573,7 +573,11 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
if (isTopLevel)
{
dlist_foreach_modify(iter, &owner->locks)
- LockReleaseCurrentOwner(owner, iter.cur);
+ {
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, resowner_node, iter.cur);
+
+ LockReleaseCurrentOwner(owner, locallockowner);
+ }
Assert(dlist_is_empty(&owner->locks));
@@ -598,14 +602,24 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
if (isCommit)
{
dlist_foreach_modify(iter, &owner->locks)
- LockReassignCurrentOwner(owner, iter.cur);
+ {
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ resowner_node,
+ iter.cur);
+
+ LockReassignCurrentOwner(locallockowner);
+ }
Assert(dlist_is_empty(&owner->locks));
}
else
{
dlist_foreach_modify(iter, &owner->locks)
- LockReleaseCurrentOwner(owner, iter.cur);
+ {
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, resowner_node, iter.cur);
+
+ LockReleaseCurrentOwner(owner, locallockowner);
+ }
Assert(dlist_is_empty(&owner->locks));
}
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index f2617f805e..e3861a8ea5 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -417,11 +417,11 @@ typedef struct LOCALLOCKOWNER
*/
struct ResourceOwnerData *owner;
- dlist_node resowner_node;
+ dlist_node resowner_node; /* dlist link for ResourceOwner.locks */
- dlist_node locallock_node;
+ dlist_node locallock_node; /* dlist link for LOCALLOCK.locallockowners */
- struct LOCALLOCK *locallock;
+ struct LOCALLOCK *locallock; /* pointer to the corresponding LOCALLOCK */
int64 nLocks; /* # of times held by this owner */
} LOCALLOCKOWNER;
@@ -575,9 +575,8 @@ extern void LockAssertNoneHeld(bool isCommit);
extern void LockReleaseSession(LOCKMETHODID lockmethodid);
struct ResourceOwnerData;
extern void LockReleaseCurrentOwner(struct ResourceOwnerData *owner,
- dlist_node *resowner_node);
-extern void LockReassignCurrentOwner(struct ResourceOwnerData *owner,
- dlist_node *resowner_node);
+ LOCALLOCKOWNER *locallockowner);
+extern void LockReassignCurrentOwner(LOCALLOCKOWNER *locallockowner);
extern bool LockHeldByMe(const LOCKTAG *locktag, LOCKMODE lockmode);
#ifdef USE_ASSERT_CHECKING
extern HTAB *GetLockMethodLocalHash(void);
--
2.37.2
v6-0003-Use-a-slab-context-type-for-storage-of-LOCALLOCKO.patchtext/plain; charset=US-ASCII; name=v6-0003-Use-a-slab-context-type-for-storage-of-LOCALLOCKO.patchDownload
From c00e0f64278bc35745cb0bb11e95eb5349ba50c6 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Fri, 10 Feb 2023 15:10:06 +1300
Subject: [PATCH v6 3/3] Use a slab context type for storage of LOCALLOCKOWNERs
---
src/backend/storage/lmgr/lock.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 1033d57b88..803a5bb482 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -281,6 +281,8 @@ static HTAB *LockMethodLockHash;
static HTAB *LockMethodProcLockHash;
static HTAB *LockMethodLocalHash;
+/* A memory context for storing LOCALLOCKOWNER structs */
+MemoryContext LocalLockOwnerContext;
/* private state for error cleanup */
static LOCALLOCK *StrongLockInProgress;
@@ -483,6 +485,17 @@ InitLocks(void)
/* Initialize each element of the session_locks array */
for (int i = 0; i < lengthof(LockMethods); i++)
dlist_init(&session_locks[i]);
+
+ /*
+ * Create a slab context for storing LOCALLOCKOWNERs. Slab seems like a
+ * good context type for this as it will manage fragmentation better than
+ * aset.c contexts and it will free() excess memory rather than maintain
+ * excessively long freelists after a large surge in locking requirements.
+ */
+ LocalLockOwnerContext = SlabContextCreate(TopMemoryContext,
+ "LOCALLOCKOWNER context",
+ SLAB_DEFAULT_BLOCK_SIZE,
+ sizeof(LOCALLOCKOWNER));
}
@@ -1688,7 +1701,7 @@ GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner)
return;
}
}
- locallockowner = MemoryContextAlloc(TopMemoryContext, sizeof(LOCALLOCKOWNER));
+ locallockowner = MemoryContextAlloc(LocalLockOwnerContext, sizeof(LOCALLOCKOWNER));
locallockowner->owner = owner;
locallockowner->nLocks = 1;
locallockowner->locallock = locallock;
--
2.37.2
On 10/02/2023 04:51, David Rowley wrote:
I've attached another set of patches. I do need to spend longer
looking at this. I'm mainly attaching these as CI seems to be
highlighting a problem that I'm unable to recreate locally and I
wanted to see if the attached fixes it.
I like this patch's approach.
index 296dc82d2ee..edb8b6026e5 100644
--- a/src/backend/commands/discard.c
+++ b/src/backend/commands/discard.c
@@ -71,7 +71,7 @@ DiscardAll(bool isTopLevel)
Async_UnlistenAll();
- LockReleaseAll(USER_LOCKMETHOD, true);
+ LockReleaseSession(USER_LOCKMETHOD);
ResetPlanCache();
This assumes that there are no transaction-level advisory locks. I think
that's OK. It took me a while to convince myself of that, though. I
think we need a high level comment somewhere that explains what
assumptions we make on which locks can be held in session mode and which
in transaction mode.
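As a minimal sketch of where that distinction comes from (condensed from lockfuncs.c; "key" is just a placeholder here), the only difference between the two advisory-lock flavours is the sessionLock flag handed to LockAcquire():

LOCKTAG		tag;

SET_LOCKTAG_INT64(tag, key);

/* pg_advisory_lock(): kept until explicit unlock or session end */
(void) LockAcquire(&tag, ExclusiveLock, true, false);

/* pg_advisory_xact_lock(): released automatically at transaction end */
(void) LockAcquire(&tag, ExclusiveLock, false, false);

So the question is really just whether any transaction-level USER_LOCKMETHOD locks can still be live by the time DISCARD ALL runs.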
@@ -3224,14 +3206,6 @@ PostPrepare_Locks(TransactionId xid)
Assert(lock->nGranted <= lock->nRequested);
Assert((proclock->holdMask & ~lock->grantMask) == 0);
- /* Ignore it if nothing to release (must be a session lock) */
- if (proclock->releaseMask == 0)
- continue;
-
- /* Else we should be releasing all locks */
- if (proclock->releaseMask != proclock->holdMask)
- elog(PANIC, "we seem to have dropped a bit somewhere");
-
/*
* We cannot simply modify proclock->tag.myProc to reassign
* ownership of the lock, because that's part of the hash key and
This looks wrong. If you prepare a transaction that is holding any
session locks, we will now transfer them to the prepared transaction.
And its locallock entry will be out of sync. To fix, I think we could
keep around the hash table that CheckForSessionAndXactLocks() builds,
and use that here.
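Roughly what I have in mind, as a sketch only (the hash-table variable name is a placeholder; PerLockTagEntry is the entry type that function already uses):

PerLockTagEntry *hentry;

hentry = (PerLockTagEntry *) hash_search(sessionandxactlocks,
										 &lock->tag,
										 HASH_FIND, NULL);

/* Session-level lock: keep it with this backend, don't transfer it */
if (hentry != NULL && hentry->sessLock)
	continue;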
--
Heikki Linnakangas
Neon (https://neon.tech)
Thank you for having a look at this. Apologies for not getting back to
you sooner.
On Wed, 5 Jul 2023 at 21:44, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 10/02/2023 04:51, David Rowley wrote:
I've attached another set of patches. I do need to spend longer
looking at this. I'm mainly attaching these as CI seems to be
highlighting a problem that I'm unable to recreate locally and I
wanted to see if the attached fixes it.
I like this patch's approach.
index 296dc82d2ee..edb8b6026e5 100644
--- a/src/backend/commands/discard.c
+++ b/src/backend/commands/discard.c
@@ -71,7 +71,7 @@ DiscardAll(bool isTopLevel)
Async_UnlistenAll();
- LockReleaseAll(USER_LOCKMETHOD, true);
+ LockReleaseSession(USER_LOCKMETHOD);
ResetPlanCache();
This assumes that there are no transaction-level advisory locks. I think
that's OK. It took me a while to convince myself of that, though. I
think we need a high level comment somewhere that explains what
assumptions we make on which locks can be held in session mode and which
in transaction mode.
Isn't it OK because DISCARD ALL cannot run inside a transaction block,
so there should be no locks held apart from possibly some session-level
locks?
I've added a call to LockAssertNoneHeld(false) in there.
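To spell out what that checks: with isCommit == false, LockAssertNoneHeld() insists that every remaining LOCALLOCK owner is a session-level one (owner == NULL), that no DEFAULT_LOCKMETHOD locks are held at all, and that no fast-path lock bits remain set. A sketch of the call in assert-enabled builds (the exact call site in the attached patch may differ):

	LockReleaseSession(USER_LOCKMETHOD);

#ifdef USE_ASSERT_CHECKING
	/* whatever is still tracked locally must be session-level only */
	LockAssertNoneHeld(false);
#endif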
@@ -3224,14 +3206,6 @@ PostPrepare_Locks(TransactionId xid)
Assert(lock->nGranted <= lock->nRequested);
Assert((proclock->holdMask & ~lock->grantMask) == 0);
- /* Ignore it if nothing to release (must be a session lock) */
- if (proclock->releaseMask == 0)
- continue;
-
- /* Else we should be releasing all locks */
- if (proclock->releaseMask != proclock->holdMask)
- elog(PANIC, "we seem to have dropped a bit somewhere");
-
/*
* We cannot simply modify proclock->tag.myProc to reassign
* ownership of the lock, because that's part of the hash key and
This looks wrong. If you prepare a transaction that is holding any
session locks, we will now transfer them to the prepared transaction.
And its locallock entry will be out of sync. To fix, I think we could
keep around the hash table that CheckForSessionAndXactLocks() builds,
and use that here.
Good catch. I've modified the patch to keep the hash table built by
CheckForSessionAndXactLocks() around for longer so that we can check for
session locks.
I've attached an updated patch mainly to get CI checking this. I
suspect something is wrong as subscription/015_stream is timing out.
I've not gotten to the bottom of that yet.
David
Attachments:
v7-0001-wip-resowner-lock-release-all.patchapplication/octet-stream; name=v7-0001-wip-resowner-lock-release-all.patchDownload
From ed8d8c9d46bc4b530ab462e006bd3d911320cc52 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Tue, 24 Jan 2023 16:00:58 +1300
Subject: [PATCH v7] wip-resowner-lock-release-all.
---
src/backend/access/transam/xact.c | 18 +-
src/backend/commands/discard.c | 2 +-
src/backend/replication/logical/launcher.c | 2 +-
src/backend/storage/lmgr/README | 6 -
src/backend/storage/lmgr/lock.c | 711 +++++++--------------
src/backend/storage/lmgr/proc.c | 17 +-
src/backend/utils/init/postinit.c | 6 +-
src/backend/utils/resowner/resowner.c | 140 ++--
src/include/storage/lock.h | 35 +-
src/include/utils/resowner_private.h | 5 +-
10 files changed, 355 insertions(+), 587 deletions(-)
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 8daaa535ed..fdfb6cd02f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2415,6 +2415,7 @@ PrepareTransaction(void)
TransactionId xid = GetCurrentTransactionId();
GlobalTransaction gxact;
TimestampTz prepared_at;
+ HTAB *sessionandxactlocks;
Assert(!IsInParallelMode());
@@ -2560,7 +2561,17 @@ PrepareTransaction(void)
StartPrepare(gxact);
AtPrepare_Notify();
- AtPrepare_Locks();
+
+ /*
+ * Prepare the locks and save the returned hash table that describes if
+ * the lock is held at the session and/or transaction level. We need to
+ * know if we're dealing with session locks inside PostPrepare_Locks(),
+ * but we're unable to build the hash table there due to that function
+ * only discovering if we're dealing with a session lock while we're in a
+ * critical section, in which case we can't allocate memory for the hash
+ * table.
+ */
+ sessionandxactlocks = AtPrepare_Locks();
AtPrepare_PredicateLocks();
AtPrepare_PgStat();
AtPrepare_MultiXact();
@@ -2587,7 +2598,10 @@ PrepareTransaction(void)
* ProcArrayClearTransaction(). Otherwise, a GetLockConflicts() would
* conclude "xact already committed or aborted" for our locks.
*/
- PostPrepare_Locks(xid);
+ PostPrepare_Locks(xid, sessionandxactlocks);
+
+ /* We no longer need this hash table */
+ hash_destroy(sessionandxactlocks);
/*
* Let others know about no transaction in progress by me. This has to be
diff --git a/src/backend/commands/discard.c b/src/backend/commands/discard.c
index 296dc82d2e..edb8b6026e 100644
--- a/src/backend/commands/discard.c
+++ b/src/backend/commands/discard.c
@@ -71,7 +71,7 @@ DiscardAll(bool isTopLevel)
ResetAllOptions();
DropAllPreparedStatements();
Async_UnlistenAll();
- LockReleaseAll(USER_LOCKMETHOD, true);
+ LockReleaseSession(USER_LOCKMETHOD);
ResetPlanCache();
ResetTempTableNamespace();
ResetSequenceCaches();
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 7882fc91ce..d8a452de32 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -841,7 +841,7 @@ logicalrep_worker_onexit(int code, Datum arg)
* The locks will be acquired once the worker is initialized.
*/
if (!InitializingApplyWorker)
- LockReleaseAll(DEFAULT_LOCKMETHOD, true);
+ LockReleaseSession(DEFAULT_LOCKMETHOD);
ApplyLauncherWakeup();
}
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 45de0fd2bd..e7e0f29347 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -182,12 +182,6 @@ holdMask -
subset of the PGPROC object's heldLocks mask (if the PGPROC is
currently waiting for another lock mode on this lock).
-releaseMask -
- A bitmask for the lock modes due to be released during LockReleaseAll.
- This must be a subset of the holdMask. Note that it is modified without
- taking the partition LWLock, and therefore it is unsafe for any
- backend except the one owning the PROCLOCK to examine/change it.
-
lockLink -
List link for shared memory queue of all the PROCLOCK objects for the
same LOCK.
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index ec6240fbae..5049952875 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -22,8 +22,7 @@
* Interface:
*
* InitLocks(), GetLocksMethodTable(), GetLockTagsMethodTable(),
- * LockAcquire(), LockRelease(), LockReleaseAll(),
- * LockCheckConflicts(), GrantLock()
+ * LockAcquire(), LockRelease(), LockCheckConflicts(), GrantLock()
*
*-------------------------------------------------------------------------
*/
@@ -162,6 +161,16 @@ typedef struct TwoPhaseLockRecord
LOCKMODE lockmode;
} TwoPhaseLockRecord;
+/*
+ * Used as the hash table entry type in AtPrepare_Locks() to communicate the
+ * session/xact lock status of each held lock for use in PostPrepare_Locks().
+ */
+typedef struct PerLockTagEntry
+{
+ LOCKTAG lock; /* identifies the lockable object */
+ bool sessLock; /* is any lockmode held at session level? */
+ bool xactLock; /* is any lockmode held at xact level? */
+} PerLockTagEntry;
/*
* Count of the number of fast path lock slots we believe to be used. This
@@ -270,6 +279,8 @@ static HTAB *LockMethodLockHash;
static HTAB *LockMethodProcLockHash;
static HTAB *LockMethodLocalHash;
+/* A memory context for storing LOCALLOCKOWNER structs */
+MemoryContext LocalLockOwnerContext;
/* private state for error cleanup */
static LOCALLOCK *StrongLockInProgress;
@@ -277,6 +288,9 @@ static LOCALLOCK *awaitedLock;
static ResourceOwner awaitedOwner;
+static dlist_head session_locks[lengthof(LockMethods)];
+
+
#ifdef LOCK_DEBUG
/*------
@@ -363,8 +377,8 @@ static void GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner);
static void BeginStrongLockAcquire(LOCALLOCK *locallock, uint32 fasthashcode);
static void FinishStrongLockAcquire(void);
static void WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner);
-static void ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock);
-static void LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent);
+static void ReleaseLockIfHeld(LOCALLOCKOWNER *locallockowner, bool sessionLock);
+static void LockReassignOwner(LOCALLOCKOWNER *locallockowner, ResourceOwner parent);
static bool UnGrantLock(LOCK *lock, LOCKMODE lockmode,
PROCLOCK *proclock, LockMethod lockMethodTable);
static void CleanUpLock(LOCK *lock, PROCLOCK *proclock,
@@ -465,6 +479,21 @@ InitLocks(void)
16,
&info,
HASH_ELEM | HASH_BLOBS);
+
+ /* Initialize each element of the session_locks array */
+ for (int i = 0; i < lengthof(LockMethods); i++)
+ dlist_init(&session_locks[i]);
+
+ /*
+ * Create a slab context for storing LOCALLOCKOWNERs. Slab seems like a
+ * good context type for this as it will manage fragmentation better than
+ * aset.c contexts and it will free() excess memory rather than maintain
+ * excessively long freelists after a large surge in locking requirements.
+ */
+ LocalLockOwnerContext = SlabContextCreate(TopMemoryContext,
+ "LOCALLOCKOWNER context",
+ SLAB_DEFAULT_BLOCK_SIZE,
+ sizeof(LOCALLOCKOWNER));
}
@@ -827,26 +856,9 @@ LockAcquireExtended(const LOCKTAG *locktag,
locallock->nLocks = 0;
locallock->holdsStrongLockCount = false;
locallock->lockCleared = false;
- locallock->numLockOwners = 0;
- locallock->maxLockOwners = 8;
- locallock->lockOwners = NULL; /* in case next line fails */
- locallock->lockOwners = (LOCALLOCKOWNER *)
- MemoryContextAlloc(TopMemoryContext,
- locallock->maxLockOwners * sizeof(LOCALLOCKOWNER));
+ dlist_init(&locallock->locallockowners);
}
- else
- {
- /* Make sure there will be room to remember the lock */
- if (locallock->numLockOwners >= locallock->maxLockOwners)
- {
- int newsize = locallock->maxLockOwners * 2;
- locallock->lockOwners = (LOCALLOCKOWNER *)
- repalloc(locallock->lockOwners,
- newsize * sizeof(LOCALLOCKOWNER));
- locallock->maxLockOwners = newsize;
- }
- }
hashcode = locallock->hashcode;
if (locallockp)
@@ -1249,7 +1261,6 @@ SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
proclock->groupLeader = proc->lockGroupLeader != NULL ?
proc->lockGroupLeader : proc;
proclock->holdMask = 0;
- proclock->releaseMask = 0;
/* Add proclock to appropriate lists */
dlist_push_tail(&lock->procLocks, &proclock->lockLink);
dlist_push_tail(&proc->myProcLocks[partition], &proclock->procLink);
@@ -1343,17 +1354,19 @@ CheckAndSetLockHeld(LOCALLOCK *locallock, bool acquired)
static void
RemoveLocalLock(LOCALLOCK *locallock)
{
- int i;
+ dlist_mutable_iter iter;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach_modify(iter, &locallock->locallockowners)
{
- if (locallock->lockOwners[i].owner != NULL)
- ResourceOwnerForgetLock(locallock->lockOwners[i].owner, locallock);
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, locallock_node, iter.cur);
+
+ Assert(locallockowner->owner != NULL);
+ dlist_delete(&locallockowner->locallock_node);
+ ResourceOwnerForgetLock(locallockowner);
+ pfree(locallockowner);
}
- locallock->numLockOwners = 0;
- if (locallock->lockOwners != NULL)
- pfree(locallock->lockOwners);
- locallock->lockOwners = NULL;
+
+ Assert(dlist_is_empty(&locallock->locallockowners));
if (locallock->holdsStrongLockCount)
{
@@ -1659,26 +1672,38 @@ CleanUpLock(LOCK *lock, PROCLOCK *proclock,
static void
GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner)
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
- int i;
+ LOCALLOCKOWNER *locallockowner;
+ dlist_iter iter;
- Assert(locallock->numLockOwners < locallock->maxLockOwners);
/* Count the total */
locallock->nLocks++;
/* Count the per-owner lock */
- for (i = 0; i < locallock->numLockOwners; i++)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == owner)
+ locallockowner = dlist_container(LOCALLOCKOWNER, locallock_node, iter.cur);
+
+ if (locallockowner->owner == owner)
{
- lockOwners[i].nLocks++;
+ locallockowner->nLocks++;
return;
}
}
- lockOwners[i].owner = owner;
- lockOwners[i].nLocks = 1;
- locallock->numLockOwners++;
+ locallockowner = MemoryContextAlloc(LocalLockOwnerContext, sizeof(LOCALLOCKOWNER));
+ locallockowner->owner = owner;
+ locallockowner->nLocks = 1;
+ locallockowner->locallock = locallock;
+
+ dlist_push_tail(&locallock->locallockowners, &locallockowner->locallock_node);
+
if (owner != NULL)
- ResourceOwnerRememberLock(owner, locallock);
+ ResourceOwnerRememberLock(owner, locallockowner);
+ else
+ {
+ LOCKMETHODID lockmethodid = LOCALLOCK_LOCKMETHOD(*locallockowner->locallock);
+
+ Assert(lockmethodid > 0 && lockmethodid <= 2);
+ dlist_push_tail(&session_locks[lockmethodid - 1], &locallockowner->resowner_node);
+ }
/* Indicate that the lock is acquired for certain types of locks. */
CheckAndSetLockHeld(locallock, true);
@@ -1971,9 +1996,9 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
* Decrease the count for the resource owner.
*/
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
ResourceOwner owner;
- int i;
+ dlist_mutable_iter iter;
+ bool found = false;
/* Identify owner for lock */
if (sessionLock)
@@ -1981,24 +2006,33 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
else
owner = CurrentResourceOwner;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach_modify(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == owner)
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, locallock_node, iter.cur);
+
+ if (locallockowner->owner != owner)
+ continue;
+
+ found = true;
+
+ if (--locallockowner->nLocks == 0)
{
- Assert(lockOwners[i].nLocks > 0);
- if (--lockOwners[i].nLocks == 0)
- {
- if (owner != NULL)
- ResourceOwnerForgetLock(owner, locallock);
- /* compact out unused slot */
- locallock->numLockOwners--;
- if (i < locallock->numLockOwners)
- lockOwners[i] = lockOwners[locallock->numLockOwners];
- }
- break;
+ dlist_delete(&locallockowner->locallock_node);
+
+ if (owner != NULL)
+ ResourceOwnerForgetLock(locallockowner);
+ else
+ dlist_delete(&locallockowner->resowner_node);
+ pfree(locallockowner);
+ }
+ else
+ {
+ /* ensure nLocks didn't go negative */
+ Assert(locallockowner->nLocks >= 0);
}
}
- if (i < 0)
+
+ if (!found)
{
/* don't release a lock belonging to another owner */
elog(WARNING, "you don't own a lock of type %s",
@@ -2016,6 +2050,8 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
if (locallock->nLocks > 0)
return true;
+ Assert(locallock->nLocks == 0);
+
/*
* At this point we can no longer suppose we are clear of invalidation
* messages related to this lock. Although we'll delete the LOCALLOCK
@@ -2118,274 +2154,44 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
return true;
}
+#ifdef USE_ASSERT_CHECKING
/*
- * LockReleaseAll -- Release all locks of the specified lock method that
- * are held by the current process.
- *
- * Well, not necessarily *all* locks. The available behaviors are:
- * allLocks == true: release all locks including session locks.
- * allLocks == false: release all non-session locks.
+ * LockAssertNoneHeld -- Assert that we no longer hold any DEFAULT_LOCKMETHOD
+ * locks during an abort.
*/
void
-LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
+LockAssertNoneHeld(bool isCommit)
{
HASH_SEQ_STATUS status;
- LockMethod lockMethodTable;
- int i,
- numLockModes;
LOCALLOCK *locallock;
- LOCK *lock;
- int partition;
- bool have_fast_path_lwlock = false;
-
- if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
- elog(ERROR, "unrecognized lock method: %d", lockmethodid);
- lockMethodTable = LockMethods[lockmethodid];
-
-#ifdef LOCK_DEBUG
- if (*(lockMethodTable->trace_flag))
- elog(LOG, "LockReleaseAll: lockmethod=%d", lockmethodid);
-#endif
-
- /*
- * Get rid of our fast-path VXID lock, if appropriate. Note that this is
- * the only way that the lock we hold on our own VXID can ever get
- * released: it is always and only released when a toplevel transaction
- * ends.
- */
- if (lockmethodid == DEFAULT_LOCKMETHOD)
- VirtualXactLockTableCleanup();
-
- numLockModes = lockMethodTable->numLockModes;
- /*
- * First we run through the locallock table and get rid of unwanted
- * entries, then we scan the process's proclocks and get rid of those. We
- * do this separately because we may have multiple locallock entries
- * pointing to the same proclock, and we daren't end up with any dangling
- * pointers. Fast-path locks are cleaned up during the locallock table
- * scan, though.
- */
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ if (!isCommit)
{
- /*
- * If the LOCALLOCK entry is unused, we must've run out of shared
- * memory while trying to set up this lock. Just forget the local
- * entry.
- */
- if (locallock->nLocks == 0)
- {
- RemoveLocalLock(locallock);
- continue;
- }
-
- /* Ignore items that are not of the lockmethod to be removed */
- if (LOCALLOCK_LOCKMETHOD(*locallock) != lockmethodid)
- continue;
+ hash_seq_init(&status, LockMethodLocalHash);
- /*
- * If we are asked to release all locks, we can just zap the entry.
- * Otherwise, must scan to see if there are session locks. We assume
- * there is at most one lockOwners entry for session locks.
- */
- if (!allLocks)
+ while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
+ dlist_iter local_iter;
- /* If session lock is above array position 0, move it down to 0 */
- for (i = 0; i < locallock->numLockOwners; i++)
- {
- if (lockOwners[i].owner == NULL)
- lockOwners[0] = lockOwners[i];
- else
- ResourceOwnerForgetLock(lockOwners[i].owner, locallock);
- }
+ Assert(locallock->nLocks >= 0);
- if (locallock->numLockOwners > 0 &&
- lockOwners[0].owner == NULL &&
- lockOwners[0].nLocks > 0)
+ dlist_foreach(local_iter, &locallock->locallockowners)
{
- /* Fix the locallock to show just the session locks */
- locallock->nLocks = lockOwners[0].nLocks;
- locallock->numLockOwners = 1;
- /* We aren't deleting this locallock, so done */
- continue;
- }
- else
- locallock->numLockOwners = 0;
- }
-
- /*
- * If the lock or proclock pointers are NULL, this lock was taken via
- * the relation fast-path (and is not known to have been transferred).
- */
- if (locallock->proclock == NULL || locallock->lock == NULL)
- {
- LOCKMODE lockmode = locallock->tag.mode;
- Oid relid;
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ locallock_node,
+ local_iter.cur);
- /* Verify that a fast-path lock is what we've got. */
- if (!EligibleForRelationFastPath(&locallock->tag.lock, lockmode))
- elog(PANIC, "locallock table corrupted");
+ Assert(locallockowner->owner == NULL);
- /*
- * If we don't currently hold the LWLock that protects our
- * fast-path data structures, we must acquire it before attempting
- * to release the lock via the fast-path. We will continue to
- * hold the LWLock until we're done scanning the locallock table,
- * unless we hit a transferred fast-path lock. (XXX is this
- * really such a good idea? There could be a lot of entries ...)
- */
- if (!have_fast_path_lwlock)
- {
- LWLockAcquire(&MyProc->fpInfoLock, LW_EXCLUSIVE);
- have_fast_path_lwlock = true;
- }
-
- /* Attempt fast-path release. */
- relid = locallock->tag.lock.locktag_field2;
- if (FastPathUnGrantRelationLock(relid, lockmode))
- {
- RemoveLocalLock(locallock);
- continue;
+ if (locallockowner->nLocks > 0 &&
+ LOCALLOCK_LOCKMETHOD(*locallock) == DEFAULT_LOCKMETHOD)
+ Assert(false);
}
-
- /*
- * Our lock, originally taken via the fast path, has been
- * transferred to the main lock table. That's going to require
- * some extra work, so release our fast-path lock before starting.
- */
- LWLockRelease(&MyProc->fpInfoLock);
- have_fast_path_lwlock = false;
-
- /*
- * Now dump the lock. We haven't got a pointer to the LOCK or
- * PROCLOCK in this case, so we have to handle this a bit
- * differently than a normal lock release. Unfortunately, this
- * requires an extra LWLock acquire-and-release cycle on the
- * partitionLock, but hopefully it shouldn't happen often.
- */
- LockRefindAndRelease(lockMethodTable, MyProc,
- &locallock->tag.lock, lockmode, false);
- RemoveLocalLock(locallock);
- continue;
}
-
- /* Mark the proclock to show we need to release this lockmode */
- if (locallock->nLocks > 0)
- locallock->proclock->releaseMask |= LOCKBIT_ON(locallock->tag.mode);
-
- /* And remove the locallock hashtable entry */
- RemoveLocalLock(locallock);
}
-
- /* Done with the fast-path data structures */
- if (have_fast_path_lwlock)
- LWLockRelease(&MyProc->fpInfoLock);
-
- /*
- * Now, scan each lock partition separately.
- */
- for (partition = 0; partition < NUM_LOCK_PARTITIONS; partition++)
- {
- LWLock *partitionLock;
- dlist_head *procLocks = &MyProc->myProcLocks[partition];
- dlist_mutable_iter proclock_iter;
-
- partitionLock = LockHashPartitionLockByIndex(partition);
-
- /*
- * If the proclock list for this partition is empty, we can skip
- * acquiring the partition lock. This optimization is trickier than
- * it looks, because another backend could be in process of adding
- * something to our proclock list due to promoting one of our
- * fast-path locks. However, any such lock must be one that we
- * decided not to delete above, so it's okay to skip it again now;
- * we'd just decide not to delete it again. We must, however, be
- * careful to re-fetch the list header once we've acquired the
- * partition lock, to be sure we have a valid, up-to-date pointer.
- * (There is probably no significant risk if pointer fetch/store is
- * atomic, but we don't wish to assume that.)
- *
- * XXX This argument assumes that the locallock table correctly
- * represents all of our fast-path locks. While allLocks mode
- * guarantees to clean up all of our normal locks regardless of the
- * locallock situation, we lose that guarantee for fast-path locks.
- * This is not ideal.
- */
- if (dlist_is_empty(procLocks))
- continue; /* needn't examine this partition */
-
- LWLockAcquire(partitionLock, LW_EXCLUSIVE);
-
- dlist_foreach_modify(proclock_iter, procLocks)
- {
- PROCLOCK *proclock = dlist_container(PROCLOCK, procLink, proclock_iter.cur);
- bool wakeupNeeded = false;
-
- Assert(proclock->tag.myProc == MyProc);
-
- lock = proclock->tag.myLock;
-
- /* Ignore items that are not of the lockmethod to be removed */
- if (LOCK_LOCKMETHOD(*lock) != lockmethodid)
- continue;
-
- /*
- * In allLocks mode, force release of all locks even if locallock
- * table had problems
- */
- if (allLocks)
- proclock->releaseMask = proclock->holdMask;
- else
- Assert((proclock->releaseMask & ~proclock->holdMask) == 0);
-
- /*
- * Ignore items that have nothing to be released, unless they have
- * holdMask == 0 and are therefore recyclable
- */
- if (proclock->releaseMask == 0 && proclock->holdMask != 0)
- continue;
-
- PROCLOCK_PRINT("LockReleaseAll", proclock);
- LOCK_PRINT("LockReleaseAll", lock, 0);
- Assert(lock->nRequested >= 0);
- Assert(lock->nGranted >= 0);
- Assert(lock->nGranted <= lock->nRequested);
- Assert((proclock->holdMask & ~lock->grantMask) == 0);
-
- /*
- * Release the previously-marked lock modes
- */
- for (i = 1; i <= numLockModes; i++)
- {
- if (proclock->releaseMask & LOCKBIT_ON(i))
- wakeupNeeded |= UnGrantLock(lock, i, proclock,
- lockMethodTable);
- }
- Assert((lock->nRequested >= 0) && (lock->nGranted >= 0));
- Assert(lock->nGranted <= lock->nRequested);
- LOCK_PRINT("LockReleaseAll: updated", lock, 0);
-
- proclock->releaseMask = 0;
-
- /* CleanUpLock will wake up waiters if needed. */
- CleanUpLock(lock, proclock,
- lockMethodTable,
- LockTagHashCode(&lock->tag),
- wakeupNeeded);
- } /* loop over PROCLOCKs within this partition */
-
- LWLockRelease(partitionLock);
- } /* loop over partitions */
-
-#ifdef LOCK_DEBUG
- if (*(lockMethodTable->trace_flag))
- elog(LOG, "LockReleaseAll done");
-#endif
+ Assert(MyProc->fpLockBits == 0);
}
+#endif
/*
* LockReleaseSession -- Release all session locks of the specified lock method
@@ -2394,59 +2200,39 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
void
LockReleaseSession(LOCKMETHODID lockmethodid)
{
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
+ dlist_mutable_iter iter;
if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
elog(ERROR, "unrecognized lock method: %d", lockmethodid);
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &session_locks[lockmethodid - 1])
{
- /* Ignore items that are not of the specified lock method */
- if (LOCALLOCK_LOCKMETHOD(*locallock) != lockmethodid)
- continue;
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, resowner_node, iter.cur);
- ReleaseLockIfHeld(locallock, true);
+ Assert(LOCALLOCK_LOCKMETHOD(*locallockowner->locallock) == lockmethodid);
+
+ ReleaseLockIfHeld(locallockowner, true);
}
+
+ Assert(dlist_is_empty(&session_locks[lockmethodid - 1]));
}
/*
* LockReleaseCurrentOwner
- * Release all locks belonging to CurrentResourceOwner
- *
- * If the caller knows what those locks are, it can pass them as an array.
- * That speeds up the call significantly, when a lot of locks are held.
- * Otherwise, pass NULL for locallocks, and we'll traverse through our hash
- * table to find them.
+ * Release all locks belonging to 'owner'
*/
void
-LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks)
+LockReleaseCurrentOwner(ResourceOwner owner, LOCALLOCKOWNER *locallockowner)
{
- if (locallocks == NULL)
- {
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
-
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
- ReleaseLockIfHeld(locallock, false);
- }
- else
- {
- int i;
+ Assert(locallockowner->owner == owner);
- for (i = nlocks - 1; i >= 0; i--)
- ReleaseLockIfHeld(locallocks[i], false);
- }
+ ReleaseLockIfHeld(locallockowner, false);
}
/*
* ReleaseLockIfHeld
- * Release any session-level locks on this lockable object if sessionLock
- * is true; else, release any locks held by CurrentResourceOwner.
+ * Release any session-level locks on this 'locallockowner' if
+ * sessionLock is true; else, release any locks held by 'locallockowner'.
*
* It is tempting to pass this a ResourceOwner pointer (or NULL for session
* locks), but without refactoring LockRelease() we cannot support releasing
@@ -2457,52 +2243,39 @@ LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks)
* convenience.
*/
static void
-ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock)
+ReleaseLockIfHeld(LOCALLOCKOWNER *locallockowner, bool sessionLock)
{
- ResourceOwner owner;
- LOCALLOCKOWNER *lockOwners;
- int i;
+ LOCALLOCK *locallock = locallockowner->locallock;
+
+ /* release all references to the lock by this resource owner */
- /* Identify owner for lock (must match LockRelease!) */
if (sessionLock)
- owner = NULL;
+ Assert(locallockowner->owner == NULL);
else
- owner = CurrentResourceOwner;
+ Assert(locallockowner->owner != NULL);
- /* Scan to see if there are any locks belonging to the target owner */
- lockOwners = locallock->lockOwners;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ /* We will still hold this lock after forgetting this ResourceOwner. */
+ if (locallockowner->nLocks < locallock->nLocks)
{
- if (lockOwners[i].owner == owner)
- {
- Assert(lockOwners[i].nLocks > 0);
- if (lockOwners[i].nLocks < locallock->nLocks)
- {
- /*
- * We will still hold this lock after forgetting this
- * ResourceOwner.
- */
- locallock->nLocks -= lockOwners[i].nLocks;
- /* compact out unused slot */
- locallock->numLockOwners--;
- if (owner != NULL)
- ResourceOwnerForgetLock(owner, locallock);
- if (i < locallock->numLockOwners)
- lockOwners[i] = lockOwners[locallock->numLockOwners];
- }
- else
- {
- Assert(lockOwners[i].nLocks == locallock->nLocks);
- /* We want to call LockRelease just once */
- lockOwners[i].nLocks = 1;
- locallock->nLocks = 1;
- if (!LockRelease(&locallock->tag.lock,
- locallock->tag.mode,
- sessionLock))
- elog(WARNING, "ReleaseLockIfHeld: failed??");
- }
- break;
- }
+ locallock->nLocks -= locallockowner->nLocks;
+ dlist_delete(&locallockowner->locallock_node);
+
+ if (sessionLock)
+ dlist_delete(&locallockowner->resowner_node);
+ else
+ ResourceOwnerForgetLock(locallockowner);
+ }
+ else
+ {
+ Assert(locallockowner->nLocks == locallock->nLocks);
+ /* We want to call LockRelease just once */
+ locallockowner->nLocks = 1;
+ locallock->nLocks = 1;
+
+ if (!LockRelease(&locallock->tag.lock,
+ locallock->tag.mode,
+ sessionLock))
+ elog(WARNING, "ReleaseLockIfHeld: failed??");
}
}
@@ -2510,82 +2283,47 @@ ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock)
* LockReassignCurrentOwner
* Reassign all locks belonging to CurrentResourceOwner to belong
* to its parent resource owner.
- *
- * If the caller knows what those locks are, it can pass them as an array.
- * That speeds up the call significantly, when a lot of locks are held
- * (e.g pg_dump with a large schema). Otherwise, pass NULL for locallocks,
- * and we'll traverse through our hash table to find them.
*/
void
-LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks)
+LockReassignCurrentOwner(LOCALLOCKOWNER *locallockowner)
{
ResourceOwner parent = ResourceOwnerGetParent(CurrentResourceOwner);
- Assert(parent != NULL);
-
- if (locallocks == NULL)
- {
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
-
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
- LockReassignOwner(locallock, parent);
- }
- else
- {
- int i;
-
- for (i = nlocks - 1; i >= 0; i--)
- LockReassignOwner(locallocks[i], parent);
- }
+ LockReassignOwner(locallockowner, parent);
}
/*
- * Subroutine of LockReassignCurrentOwner. Reassigns a given lock belonging to
- * CurrentResourceOwner to its parent.
+ * Subroutine of LockReassignCurrentOwner. Reassigns the given
+ *'locallockowner' to 'parent'.
*/
static void
-LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent)
+LockReassignOwner(LOCALLOCKOWNER *locallockowner, ResourceOwner parent)
{
- LOCALLOCKOWNER *lockOwners;
- int i;
- int ic = -1;
- int ip = -1;
+ dlist_iter iter;
+ LOCALLOCK *locallock = locallockowner->locallock;
- /*
- * Scan to see if there are any locks belonging to current owner or its
- * parent
- */
- lockOwners = locallock->lockOwners;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ ResourceOwnerForgetLock(locallockowner);
+
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == CurrentResourceOwner)
- ic = i;
- else if (lockOwners[i].owner == parent)
- ip = i;
- }
+ LOCALLOCKOWNER *parentlocalowner = dlist_container(LOCALLOCKOWNER, locallock_node, iter.cur);
- if (ic < 0)
- return; /* no current locks */
+ Assert(parentlocalowner->locallock == locallock);
- if (ip < 0)
- {
- /* Parent has no slot, so just give it the child's slot */
- lockOwners[ic].owner = parent;
- ResourceOwnerRememberLock(parent, locallock);
- }
- else
- {
- /* Merge child's count with parent's */
- lockOwners[ip].nLocks += lockOwners[ic].nLocks;
- /* compact out unused slot */
- locallock->numLockOwners--;
- if (ic < locallock->numLockOwners)
- lockOwners[ic] = lockOwners[locallock->numLockOwners];
+ if (parentlocalowner->owner != parent)
+ continue;
+
+ parentlocalowner->nLocks += locallockowner->nLocks;
+
+ locallockowner->nLocks = 0;
+ dlist_delete(&locallockowner->locallock_node);
+ pfree(locallockowner);
+ return;
}
- ResourceOwnerForgetLock(CurrentResourceOwner, locallock);
+
+ /* reassign locallockowner to parent resowner */
+ locallockowner->owner = parent;
+ ResourceOwnerRememberLock(parent, locallockowner);
}
/*
@@ -3057,7 +2795,7 @@ GetLockConflicts(const LOCKTAG *locktag, LOCKMODE lockmode, int *countp)
* We currently use this in two situations: first, to release locks held by
* prepared transactions on commit (see lock_twophase_postcommit); and second,
* to release locks taken via the fast-path, transferred to the main hash
- * table, and then released (see LockReleaseAll).
+ * table, and then released (see ResourceOwnerRelease).
*/
static void
LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
@@ -3163,16 +2901,9 @@ LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
* we can't implement this check by examining LOCALLOCK entries in isolation.
* We must build a transient hashtable that is indexed by locktag only.
*/
-static void
+static HTAB *
CheckForSessionAndXactLocks(void)
{
- typedef struct
- {
- LOCKTAG lock; /* identifies the lockable object */
- bool sessLock; /* is any lockmode held at session level? */
- bool xactLock; /* is any lockmode held at xact level? */
- } PerLockTagEntry;
-
HASHCTL hash_ctl;
HTAB *lockhtab;
HASH_SEQ_STATUS status;
@@ -3193,10 +2924,9 @@ CheckForSessionAndXactLocks(void)
while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
PerLockTagEntry *hentry;
bool found;
- int i;
+ dlist_iter iter;
/*
* Ignore VXID locks. We don't want those to be held by prepared
@@ -3217,9 +2947,13 @@ CheckForSessionAndXactLocks(void)
hentry->sessLock = hentry->xactLock = false;
/* Scan to see if we hold lock at session or xact level or both */
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == NULL)
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ locallock_node,
+ iter.cur);
+
+ if (locallockowner->owner == NULL)
hentry->sessLock = true;
else
hentry->xactLock = true;
@@ -3235,8 +2969,7 @@ CheckForSessionAndXactLocks(void)
errmsg("cannot PREPARE while holding both session-level and transaction-level locks on the same object")));
}
- /* Success, so clean up */
- hash_destroy(lockhtab);
+ return lockhtab;
}
/*
@@ -3244,6 +2977,11 @@ CheckForSessionAndXactLocks(void)
* Do the preparatory work for a PREPARE: make 2PC state file records
* for all locks currently held.
*
+ * Returns a hash table of PerLockTagEntry structs with an entry for each
+ * lock held by this backend marking if the lock is held at the session or
+ * xact level, or both. It is up to the calling function to call
+ * hash_destroy() on this table to free the memory used by it.
+ *
* Session-level locks are ignored, as are VXID locks.
*
* For the most part, we don't need to touch shared memory for this ---
@@ -3251,14 +2989,15 @@ CheckForSessionAndXactLocks(void)
* Fast-path locks are an exception, however: we move any such locks to
* the main table before allowing PREPARE TRANSACTION to succeed.
*/
-void
+HTAB *
AtPrepare_Locks(void)
{
HASH_SEQ_STATUS status;
LOCALLOCK *locallock;
+ HTAB *sessionandxactlocks;
/* First, verify there aren't locks of both xact and session level */
- CheckForSessionAndXactLocks();
+ sessionandxactlocks = CheckForSessionAndXactLocks();
/* Now do the per-locallock cleanup work */
hash_seq_init(&status, LockMethodLocalHash);
@@ -3266,10 +3005,9 @@ AtPrepare_Locks(void)
while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
{
TwoPhaseLockRecord record;
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
bool haveSessionLock;
bool haveXactLock;
- int i;
+ dlist_iter iter;
/*
* Ignore VXID locks. We don't want those to be held by prepared
@@ -3284,9 +3022,13 @@ AtPrepare_Locks(void)
/* Scan to see whether we hold it at session or transaction level */
haveSessionLock = haveXactLock = false;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == NULL)
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ locallock_node,
+ iter.cur);
+
+ if (locallockowner->owner == NULL)
haveSessionLock = true;
else
haveXactLock = true;
@@ -3330,6 +3072,8 @@ AtPrepare_Locks(void)
RegisterTwoPhaseRecord(TWOPHASE_RM_LOCK_ID, 0,
&record, sizeof(TwoPhaseLockRecord));
}
+
+ return sessionandxactlocks;
}
/*
@@ -3344,11 +3088,11 @@ AtPrepare_Locks(void)
* pointers in the transaction's resource owner. This is OK at the
* moment since resowner.c doesn't try to free locks retail at a toplevel
* transaction commit or abort. We could alternatively zero out nLocks
- * and leave the LOCALLOCK entries to be garbage-collected by LockReleaseAll,
- * but that probably costs more cycles.
+ * and leave the LOCALLOCK entries to be garbage-collected by
+ * ResourceOwnerRelease, but that probably costs more cycles.
*/
void
-PostPrepare_Locks(TransactionId xid)
+PostPrepare_Locks(TransactionId xid, HTAB *sessionandxactlocks)
{
PGPROC *newproc = TwoPhaseGetDummyProc(xid, false);
HASH_SEQ_STATUS status;
@@ -3378,10 +3122,9 @@ PostPrepare_Locks(TransactionId xid)
while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
bool haveSessionLock;
bool haveXactLock;
- int i;
+ dlist_iter iter;
if (locallock->proclock == NULL || locallock->lock == NULL)
{
@@ -3400,9 +3143,13 @@ PostPrepare_Locks(TransactionId xid)
/* Scan to see whether we hold it at session or transaction level */
haveSessionLock = haveXactLock = false;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == NULL)
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ locallock_node,
+ iter.cur);
+
+ if (locallockowner->owner == NULL)
haveSessionLock = true;
else
haveXactLock = true;
@@ -3418,10 +3165,6 @@ PostPrepare_Locks(TransactionId xid)
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot PREPARE while holding both session-level and transaction-level locks on the same object")));
- /* Mark the proclock to show we need to release this lockmode */
- if (locallock->nLocks > 0)
- locallock->proclock->releaseMask |= LOCKBIT_ON(locallock->tag.mode);
-
/* And remove the locallock hashtable entry */
RemoveLocalLock(locallock);
}
@@ -3439,11 +3182,7 @@ PostPrepare_Locks(TransactionId xid)
/*
* If the proclock list for this partition is empty, we can skip
- * acquiring the partition lock. This optimization is safer than the
- * situation in LockReleaseAll, because we got rid of any fast-path
- * locks during AtPrepare_Locks, so there cannot be any case where
- * another backend is adding something to our lists now. For safety,
- * though, we code this the same way as in LockReleaseAll.
+ * acquiring the partition lock.
*/
if (dlist_is_empty(procLocks))
continue; /* needn't examine this partition */
@@ -3452,6 +3191,8 @@ PostPrepare_Locks(TransactionId xid)
dlist_foreach_modify(proclock_iter, procLocks)
{
+ PerLockTagEntry *locktagentry;
+
proclock = dlist_container(PROCLOCK, procLink, proclock_iter.cur);
Assert(proclock->tag.myProc == MyProc);
@@ -3469,13 +3210,14 @@ PostPrepare_Locks(TransactionId xid)
Assert(lock->nGranted <= lock->nRequested);
Assert((proclock->holdMask & ~lock->grantMask) == 0);
- /* Ignore it if nothing to release (must be a session lock) */
- if (proclock->releaseMask == 0)
- continue;
+ locktagentry = hash_search(sessionandxactlocks,
+ &lock->tag,
+ HASH_FIND,
+ NULL);
- /* Else we should be releasing all locks */
- if (proclock->releaseMask != proclock->holdMask)
- elog(PANIC, "we seem to have dropped a bit somewhere");
+ /* skip session locks */
+ if (locktagentry != NULL && locktagentry->sessLock)
+ continue;
/*
* We cannot simply modify proclock->tag.myProc to reassign
@@ -4245,7 +3987,6 @@ lock_twophase_recover(TransactionId xid, uint16 info,
Assert(proc->lockGroupLeader == NULL);
proclock->groupLeader = proc;
proclock->holdMask = 0;
- proclock->releaseMask = 0;
/* Add proclock to appropriate lists */
dlist_push_tail(&lock->procLocks, &proclock->lockLink);
dlist_push_tail(&proc->myProcLocks[partition],
@@ -4382,7 +4123,7 @@ lock_twophase_postabort(TransactionId xid, uint16 info,
*
* We don't bother recording this lock in the local lock table, since it's
* only ever released at the end of a transaction. Instead,
- * LockReleaseAll() calls VirtualXactLockTableCleanup().
+ * ProcReleaseLocks() calls VirtualXactLockTableCleanup().
*/
void
VirtualXactLockTableInsert(VirtualTransactionId vxid)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 5b663a2997..def857479d 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -777,10 +777,17 @@ ProcReleaseLocks(bool isCommit)
return;
/* If waiting, get off wait queue (should only be needed after error) */
LockErrorCleanup();
- /* Release standard locks, including session-level if aborting */
- LockReleaseAll(DEFAULT_LOCKMETHOD, !isCommit);
- /* Release transaction-level advisory locks */
- LockReleaseAll(USER_LOCKMETHOD, false);
+
+ VirtualXactLockTableCleanup();
+
+ /* Release session-level locks if aborting */
+ if (!isCommit)
+ LockReleaseSession(DEFAULT_LOCKMETHOD);
+
+#ifdef USE_ASSERT_CHECKING
+ /* Ensure all locks were released */
+ LockAssertNoneHeld(isCommit);
+#endif
}
@@ -861,6 +868,8 @@ ProcKill(int code, Datum arg)
LWLockRelease(leader_lwlock);
}
+ Assert(MyProc->fpLockBits == 0);
+
/*
* Reset MyLatch to the process local one. This is so that signal
* handlers et al can continue using the latch after the shared latch
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index df4d15a50f..e404cc8ff0 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -1349,10 +1349,10 @@ ShutdownPostgres(int code, Datum arg)
AbortOutOfAnyTransaction();
/*
- * User locks are not released by transaction end, so be sure to release
- * them explicitly.
+ * Session locks are not released by transaction end, so be sure to
+ * release them explicitly.
*/
- LockReleaseAll(USER_LOCKMETHOD, true);
+ LockReleaseSession(USER_LOCKMETHOD);
}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index f926f1faad..d57f8657d7 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -33,6 +33,7 @@
#include "utils/resowner_private.h"
#include "utils/snapmgr.h"
+#include "lib/ilist.h"
/*
* All resource IDs managed by this code are required to fit into a Datum,
@@ -91,24 +92,6 @@ typedef struct ResourceArray
#define RESARRAY_MAX_ITEMS(capacity) \
((capacity) <= RESARRAY_MAX_ARRAY ? (capacity) : (capacity)/4 * 3)
-/*
- * To speed up bulk releasing or reassigning locks from a resource owner to
- * its parent, each resource owner has a small cache of locks it owns. The
- * lock manager has the same information in its local lock hash table, and
- * we fall back on that if cache overflows, but traversing the hash table
- * is slower when there are a lot of locks belonging to other resource owners.
- *
- * MAX_RESOWNER_LOCKS is the size of the per-resource owner cache. It's
- * chosen based on some testing with pg_dump with a large schema. When the
- * tests were done (on 9.2), resource owners in a pg_dump run contained up
- * to 9 locks, regardless of the schema size, except for the top resource
- * owner which contained much more (overflowing the cache). 15 seems like a
- * nice round number that's somewhat higher than what pg_dump needs. Note that
- * making this number larger is not free - the bigger the cache, the slower
- * it is to release locks (in retail), when a resource owner holds many locks.
- */
-#define MAX_RESOWNER_LOCKS 15
-
/*
* ResourceOwner objects look like this
*/
@@ -134,9 +117,7 @@ typedef struct ResourceOwnerData
ResourceArray cryptohasharr; /* cryptohash contexts */
ResourceArray hmacarr; /* HMAC contexts */
- /* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
- int nlocks; /* number of owned locks */
- LOCALLOCK *locks[MAX_RESOWNER_LOCKS]; /* list of owned locks */
+ dlist_head locks; /* dlist of owned locks */
} ResourceOwnerData;
@@ -454,6 +435,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
ResourceArrayInit(&(owner->jitarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->cryptohasharr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->hmacarr), PointerGetDatum(NULL));
+ dlist_init(&owner->locks);
return owner;
}
@@ -606,8 +588,19 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
+ dlist_mutable_iter iter;
+
if (isTopLevel)
{
+ dlist_foreach_modify(iter, &owner->locks)
+ {
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, resowner_node, iter.cur);
+
+ LockReleaseCurrentOwner(owner, locallockowner);
+ }
+
+ Assert(dlist_is_empty(&owner->locks));
+
/*
* For a top-level xact we are going to release all locks (or at
* least all non-session locks), so just do a single lmgr call at
@@ -626,30 +619,30 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
* subtransaction, we do NOT release its locks yet, but transfer
* them to the parent.
*/
- LOCALLOCK **locks;
- int nlocks;
+ if (isCommit)
+ {
+ dlist_foreach_modify(iter, &owner->locks)
+ {
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ resowner_node,
+ iter.cur);
- Assert(owner->parent != NULL);
+ LockReassignCurrentOwner(locallockowner);
+ }
- /*
- * Pass the list of locks owned by this resource owner to the lock
- * manager, unless it has overflowed.
- */
- if (owner->nlocks > MAX_RESOWNER_LOCKS)
- {
- locks = NULL;
- nlocks = 0;
+ Assert(dlist_is_empty(&owner->locks));
}
else
{
- locks = owner->locks;
- nlocks = owner->nlocks;
- }
+ dlist_foreach_modify(iter, &owner->locks)
+ {
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, resowner_node, iter.cur);
- if (isCommit)
- LockReassignCurrentOwner(locks, nlocks);
- else
- LockReleaseCurrentOwner(locks, nlocks);
+ LockReleaseCurrentOwner(owner, locallockowner);
+ }
+
+ Assert(dlist_is_empty(&owner->locks));
+ }
}
}
else if (phase == RESOURCE_RELEASE_AFTER_LOCKS)
@@ -778,7 +771,7 @@ ResourceOwnerDelete(ResourceOwner owner)
Assert(owner->jitarr.nitems == 0);
Assert(owner->cryptohasharr.nitems == 0);
Assert(owner->hmacarr.nitems == 0);
- Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
+ Assert(dlist_is_empty(&owner->locks));
/*
* Delete children. The recursive call will delink the child from me, so
@@ -1038,54 +1031,61 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
/*
* Remember that a Local Lock is owned by a ResourceOwner
- *
- * This is different from the other Remember functions in that the list of
- * locks is only a lossy cache. It can hold up to MAX_RESOWNER_LOCKS entries,
- * and when it overflows, we stop tracking locks. The point of only remembering
- * only up to MAX_RESOWNER_LOCKS entries is that if a lot of locks are held,
- * ResourceOwnerForgetLock doesn't need to scan through a large array to find
- * the entry.
*/
void
-ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCK *locallock)
+ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCKOWNER *locallockowner)
{
- Assert(locallock != NULL);
-
- if (owner->nlocks > MAX_RESOWNER_LOCKS)
- return; /* we have already overflowed */
+ Assert(owner != NULL);
+ Assert(locallockowner != NULL);
- if (owner->nlocks < MAX_RESOWNER_LOCKS)
- owner->locks[owner->nlocks] = locallock;
- else
+#ifdef USE_ASSERT_CHECKING
{
- /* overflowed */
+ dlist_iter iter;
+
+ dlist_foreach(iter, &owner->locks)
+ {
+ LOCALLOCKOWNER *i = dlist_container(LOCALLOCKOWNER, resowner_node, iter.cur);
+
+ Assert(i->locallock != locallockowner->locallock);
+ }
}
- owner->nlocks++;
+#endif
+
+ dlist_push_tail(&owner->locks, &locallockowner->resowner_node);
}
/*
- * Forget that a Local Lock is owned by a ResourceOwner
+ * Forget that a Local Lock is owned by the given LOCALLOCKOWNER.
*/
void
-ResourceOwnerForgetLock(ResourceOwner owner, LOCALLOCK *locallock)
+ResourceOwnerForgetLock(LOCALLOCKOWNER *locallockowner)
{
- int i;
+#ifdef USE_ASSERT_CHECKING
+ ResourceOwner owner;
+
+ Assert(locallockowner != NULL);
- if (owner->nlocks > MAX_RESOWNER_LOCKS)
- return; /* we have overflowed */
+ owner = locallockowner->owner;
- Assert(owner->nlocks > 0);
- for (i = owner->nlocks - 1; i >= 0; i--)
{
- if (locallock == owner->locks[i])
+ dlist_iter iter;
+ bool found = false;
+
+ dlist_foreach(iter, &owner->locks)
{
- owner->locks[i] = owner->locks[owner->nlocks - 1];
- owner->nlocks--;
- return;
+ LOCALLOCKOWNER *owner = dlist_container(LOCALLOCKOWNER, resowner_node, iter.cur);
+
+ if (locallockowner == owner)
+ {
+ Assert(!found);
+ found = true;
+ }
}
+
+ Assert(found);
}
- elog(ERROR, "lock reference %p is not owned by resource owner %s",
- locallock, owner->name);
+#endif
+ dlist_delete(&locallockowner->resowner_node);
}
/*
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 8575bea25c..c7ff792201 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -24,6 +24,7 @@
#include "storage/lwlock.h"
#include "storage/shmem.h"
#include "utils/timestamp.h"
+#include "lib/ilist.h"
/* struct PGPROC is declared in proc.h, but must forward-reference it */
typedef struct PGPROC PGPROC;
@@ -349,10 +350,6 @@ typedef struct LOCK
* Otherwise, proclock objects whose holdMasks are zero are recycled
* as soon as convenient.
*
- * releaseMask is workspace for LockReleaseAll(): it shows the locks due
- * to be released during the current call. This must only be examined or
- * set by the backend owning the PROCLOCK.
- *
* Each PROCLOCK object is linked into lists for both the associated LOCK
* object and the owning PGPROC object. Note that the PROCLOCK is entered
* into these lists as soon as it is created, even if no lock has yet been
@@ -374,7 +371,6 @@ typedef struct PROCLOCK
/* data */
PGPROC *groupLeader; /* proc's lock group leader, or proc itself */
LOCKMASK holdMask; /* bitmask for lock types currently held */
- LOCKMASK releaseMask; /* bitmask for lock types to be released */
dlist_node lockLink; /* list link in LOCK's list of proclocks */
dlist_node procLink; /* list link in PGPROC's list of proclocks */
} PROCLOCK;
@@ -420,6 +416,13 @@ typedef struct LOCALLOCKOWNER
* Must use a forward struct reference to avoid circularity.
*/
struct ResourceOwnerData *owner;
+
+ dlist_node resowner_node; /* dlist link for ResourceOwner.locks */
+
+ dlist_node locallock_node; /* dlist link for LOCALLOCK.locallockowners */
+
+ struct LOCALLOCK *locallock; /* pointer to the corresponding LOCALLOCK */
+
int64 nLocks; /* # of times held by this owner */
} LOCALLOCKOWNER;
@@ -433,9 +436,9 @@ typedef struct LOCALLOCK
LOCK *lock; /* associated LOCK object, if any */
PROCLOCK *proclock; /* associated PROCLOCK object, if any */
int64 nLocks; /* total number of times lock is held */
- int numLockOwners; /* # of relevant ResourceOwners */
- int maxLockOwners; /* allocated size of array */
- LOCALLOCKOWNER *lockOwners; /* dynamically resizable array */
+
+ dlist_head locallockowners; /* dlist of LOCALLOCKOWNER */
+
bool holdsStrongLockCount; /* bumped FastPathStrongRelationLocks */
bool lockCleared; /* we read all sinval msgs for lock */
} LOCALLOCK;
@@ -564,10 +567,16 @@ extern void AbortStrongLockAcquire(void);
extern void MarkLockClear(LOCALLOCK *locallock);
extern bool LockRelease(const LOCKTAG *locktag,
LOCKMODE lockmode, bool sessionLock);
-extern void LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks);
+
+#ifdef USE_ASSERT_CHECKING
+extern void LockAssertNoneHeld(bool isCommit);
+#endif
+
extern void LockReleaseSession(LOCKMETHODID lockmethodid);
-extern void LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks);
-extern void LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks);
+struct ResourceOwnerData;
+extern void LockReleaseCurrentOwner(struct ResourceOwnerData *owner,
+ LOCALLOCKOWNER *locallockowner);
+extern void LockReassignCurrentOwner(LOCALLOCKOWNER *locallockowner);
extern bool LockHeldByMe(const LOCKTAG *locktag, LOCKMODE lockmode);
#ifdef USE_ASSERT_CHECKING
extern HTAB *GetLockMethodLocalHash(void);
@@ -576,8 +585,8 @@ extern bool LockHasWaiters(const LOCKTAG *locktag,
LOCKMODE lockmode, bool sessionLock);
extern VirtualTransactionId *GetLockConflicts(const LOCKTAG *locktag,
LOCKMODE lockmode, int *countp);
-extern void AtPrepare_Locks(void);
-extern void PostPrepare_Locks(TransactionId xid);
+extern HTAB *AtPrepare_Locks(void);
+extern void PostPrepare_Locks(TransactionId xid, HTAB *sessionandxactlocks);
extern bool LockCheckConflicts(LockMethod lockMethodTable,
LOCKMODE lockmode,
LOCK *lock, PROCLOCK *proclock);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index ae58438ec7..29b1cce54b 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -36,8 +36,9 @@ extern void ResourceOwnerRememberBufferIO(ResourceOwner owner, Buffer buffer);
extern void ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer);
/* support for local lock management */
-extern void ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCK *locallock);
-extern void ResourceOwnerForgetLock(ResourceOwner owner, LOCALLOCK *locallock);
+extern void ResourceOwnerRememberLock(ResourceOwner owner,
+ LOCALLOCKOWNER *locallock);
+extern void ResourceOwnerForgetLock(LOCALLOCKOWNER *locallock);
/* support for catcache refcount management */
extern void ResourceOwnerEnlargeCatCacheRefs(ResourceOwner owner);
--
2.40.1.windows.1
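To make the data-structure change above easier to follow outside a 700-line diff, here is a minimal, self-contained C sketch of the bookkeeping the patch introduces. It is not PostgreSQL code: the struct and field names (LOCALLOCK, LOCALLOCKOWNER, resowner_node, locallock_node) mirror the patch, but the intrusive list is a simplified stand-in for lib/ilist.h and the lock manager itself is reduced to plain reference counting. The point it illustrates is that releasing a resource owner's locks walks only that owner's own list, so the cost is proportional to the locks that owner actually acquired rather than to the size of the backend-local lock hash table.

/* sketch.c - standalone illustration; compile with: cc -o sketch sketch.c */
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>

/* Tiny intrusive doubly-linked list with a sentinel node (stand-in for ilist.h) */
typedef struct dnode { struct dnode *prev, *next; } dnode;
typedef struct dlist { dnode head; } dlist;

static void dlist_init(dlist *l) { l->head.prev = l->head.next = &l->head; }
static void dlist_push_tail(dlist *l, dnode *n)
{
	n->prev = l->head.prev;
	n->next = &l->head;
	l->head.prev->next = n;
	l->head.prev = n;
}
static void dlist_delete(dnode *n) { n->prev->next = n->next; n->next->prev = n->prev; }
static int dlist_is_empty(dlist *l) { return l->head.next == &l->head; }

#define containerof(ptr, type, member) \
	((type *) ((char *) (ptr) - offsetof(type, member)))

typedef struct LOCALLOCK
{
	const char *tag;				/* stand-in for the real LOCKTAG */
	long		nLocks;				/* total number of times lock is held */
	dlist		locallockowners;	/* all LOCALLOCKOWNERs for this lock */
} LOCALLOCK;

typedef struct ResourceOwner
{
	const char *name;
	dlist		locks;				/* all LOCALLOCKOWNERs owned by me */
} ResourceOwner;

typedef struct LOCALLOCKOWNER
{
	ResourceOwner *owner;
	LOCALLOCK  *locallock;
	long		nLocks;				/* times held by this owner */
	dnode		resowner_node;		/* link in owner->locks */
	dnode		locallock_node;		/* link in locallock->locallockowners */
} LOCALLOCKOWNER;

/*
 * Record that 'owner' acquired 'lock' once.  (The real GrantLockLocal first
 * searches the lock's owner list for an existing entry; omitted here.)
 */
static void grant_lock(LOCALLOCK *lock, ResourceOwner *owner)
{
	LOCALLOCKOWNER *llo = malloc(sizeof(*llo));

	llo->owner = owner;
	llo->locallock = lock;
	llo->nLocks = 1;
	lock->nLocks++;
	dlist_push_tail(&lock->locallockowners, &llo->locallock_node);
	dlist_push_tail(&owner->locks, &llo->resowner_node);
}

/*
 * Release everything 'owner' holds by walking only owner->locks.  No scan of
 * a backend-wide hash table, so the cost tracks what this owner acquired.
 */
static void release_owner_locks(ResourceOwner *owner)
{
	while (!dlist_is_empty(&owner->locks))
	{
		dnode	   *n = owner->locks.head.next;
		LOCALLOCKOWNER *llo = containerof(n, LOCALLOCKOWNER, resowner_node);

		llo->locallock->nLocks -= llo->nLocks;
		printf("%s released %s (remaining holds: %ld)\n",
			   owner->name, llo->locallock->tag, llo->locallock->nLocks);
		dlist_delete(&llo->resowner_node);
		dlist_delete(&llo->locallock_node);
		free(llo);
	}
}

int main(void)
{
	LOCALLOCK	a = {"relation A", 0}, b = {"relation B", 0};
	ResourceOwner xact = {"xact resowner"};

	dlist_init(&a.locallockowners);
	dlist_init(&b.locallockowners);
	dlist_init(&xact.locks);

	grant_lock(&a, &xact);
	grant_lock(&b, &xact);
	release_owner_locks(&xact);
	return 0;
}

The same idea is what lets the patch delete LockReleaseAll() entirely: once every LOCALLOCKOWNER is reachable either from its owning ResourceOwner or, for owner == NULL, from the session_locks[] array, end-of-transaction cleanup no longer needs a hash_seq_search() over LockMethodLocalHash.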
On 11/09/2023 15:00, David Rowley wrote:
On Wed, 5 Jul 2023 at 21:44, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
index 296dc82d2ee..edb8b6026e5 100644
--- a/src/backend/commands/discard.c
+++ b/src/backend/commands/discard.c
@@ -71,7 +71,7 @@ DiscardAll(bool isTopLevel)
Async_UnlistenAll();
- LockReleaseAll(USER_LOCKMETHOD, true);
+ LockReleaseSession(USER_LOCKMETHOD);
ResetPlanCache();

This assumes that there are no transaction-level advisory locks. I think
that's OK. It took me a while to convince myself of that, though. I
think we need a high level comment somewhere that explains what
assumptions we make on which locks can be held in session mode and which
in transaction mode.

Isn't it ok because DISCARD ALL cannot run inside a transaction block,
so there should be no locks taken apart from possibly session-level
locks?
Hmm, sounds valid. I think I convinced myself that it's OK through some
other reasoning, but I don't remember it now.
I've added a call to LockAssertNoneHeld(false) in there.
I don't see it in the patch?
--
Heikki Linnakangas
Neon (https://neon.tech)
On Fri, 15 Sept 2023 at 22:37, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I've added a call to LockAssertNoneHeld(false) in there.
I don't see it in the patch?
hmm. I must've run git format-patch before committing that part.
I'll try that again... see attached.
David
Attachments:
v8-0001-wip-resowner-lock-release-all.patch (application/octet-stream)
From 5f89dfa19789a2ca40ccee7cc81f69b3e53f1325 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Tue, 24 Jan 2023 16:00:58 +1300
Subject: [PATCH v8] wip-resowner-lock-release-all.
---
src/backend/access/transam/xact.c | 18 +-
src/backend/commands/discard.c | 7 +-
src/backend/replication/logical/launcher.c | 6 +-
src/backend/storage/lmgr/README | 6 -
src/backend/storage/lmgr/lock.c | 711 +++++++--------------
src/backend/storage/lmgr/proc.c | 17 +-
src/backend/utils/init/postinit.c | 6 +-
src/backend/utils/resowner/resowner.c | 140 ++--
src/include/storage/lock.h | 35 +-
src/include/utils/resowner_private.h | 5 +-
10 files changed, 364 insertions(+), 587 deletions(-)
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 8daaa535ed..fdfb6cd02f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2415,6 +2415,7 @@ PrepareTransaction(void)
TransactionId xid = GetCurrentTransactionId();
GlobalTransaction gxact;
TimestampTz prepared_at;
+ HTAB *sessionandxactlocks;
Assert(!IsInParallelMode());
@@ -2560,7 +2561,17 @@ PrepareTransaction(void)
StartPrepare(gxact);
AtPrepare_Notify();
- AtPrepare_Locks();
+
+ /*
+ * Prepare the locks and save the returned hash table that describes if
+ * the lock is held at the session and/or transaction level. We need to
+ * know if we're dealing with session locks inside PostPrepare_Locks(),
+ * but we're unable to build the hash table there due to that function
+ * only discovering if we're dealing with a session lock while we're in a
+ * critical section, in which case we can't allocate memory for the hash
+ * table.
+ */
+ sessionandxactlocks = AtPrepare_Locks();
AtPrepare_PredicateLocks();
AtPrepare_PgStat();
AtPrepare_MultiXact();
@@ -2587,7 +2598,10 @@ PrepareTransaction(void)
* ProcArrayClearTransaction(). Otherwise, a GetLockConflicts() would
* conclude "xact already committed or aborted" for our locks.
*/
- PostPrepare_Locks(xid);
+ PostPrepare_Locks(xid, sessionandxactlocks);
+
+ /* We no longer need this hash table */
+ hash_destroy(sessionandxactlocks);
/*
* Let others know about no transaction in progress by me. This has to be
diff --git a/src/backend/commands/discard.c b/src/backend/commands/discard.c
index 296dc82d2e..5baf83ac6c 100644
--- a/src/backend/commands/discard.c
+++ b/src/backend/commands/discard.c
@@ -71,7 +71,12 @@ DiscardAll(bool isTopLevel)
ResetAllOptions();
DropAllPreparedStatements();
Async_UnlistenAll();
- LockReleaseAll(USER_LOCKMETHOD, true);
+ LockReleaseSession(USER_LOCKMETHOD);
+
+#ifdef USE_ASSERT_CHECKING
+ LockAssertNoneHeld(false);
+#endif
+
ResetPlanCache();
ResetTempTableNamespace();
ResetSequenceCaches();
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 7882fc91ce..cc8bce09bd 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -841,7 +841,11 @@ logicalrep_worker_onexit(int code, Datum arg)
* The locks will be acquired once the worker is initialized.
*/
if (!InitializingApplyWorker)
- LockReleaseAll(DEFAULT_LOCKMETHOD, true);
+ LockReleaseSession(DEFAULT_LOCKMETHOD);
+
+#ifdef USE_ASSERT_CHECKING
+ LockAssertNoneHeld(false);
+#endif
ApplyLauncherWakeup();
}
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 45de0fd2bd..e7e0f29347 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -182,12 +182,6 @@ holdMask -
subset of the PGPROC object's heldLocks mask (if the PGPROC is
currently waiting for another lock mode on this lock).
-releaseMask -
- A bitmask for the lock modes due to be released during LockReleaseAll.
- This must be a subset of the holdMask. Note that it is modified without
- taking the partition LWLock, and therefore it is unsafe for any
- backend except the one owning the PROCLOCK to examine/change it.
-
lockLink -
List link for shared memory queue of all the PROCLOCK objects for the
same LOCK.
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index ec6240fbae..5049952875 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -22,8 +22,7 @@
* Interface:
*
* InitLocks(), GetLocksMethodTable(), GetLockTagsMethodTable(),
- * LockAcquire(), LockRelease(), LockReleaseAll(),
- * LockCheckConflicts(), GrantLock()
+ * LockAcquire(), LockRelease(), LockCheckConflicts(), GrantLock()
*
*-------------------------------------------------------------------------
*/
@@ -162,6 +161,16 @@ typedef struct TwoPhaseLockRecord
LOCKMODE lockmode;
} TwoPhaseLockRecord;
+/*
+ * Used as the hash table entry type in AtPrepare_Locks() to communicate the
+ * session/xact lock status of each held lock for use in PostPrepare_Locks().
+ */
+typedef struct PerLockTagEntry
+{
+ LOCKTAG lock; /* identifies the lockable object */
+ bool sessLock; /* is any lockmode held at session level? */
+ bool xactLock; /* is any lockmode held at xact level? */
+} PerLockTagEntry;
/*
* Count of the number of fast path lock slots we believe to be used. This
@@ -270,6 +279,8 @@ static HTAB *LockMethodLockHash;
static HTAB *LockMethodProcLockHash;
static HTAB *LockMethodLocalHash;
+/* A memory context for storing LOCALLOCKOWNER structs */
+MemoryContext LocalLockOwnerContext;
/* private state for error cleanup */
static LOCALLOCK *StrongLockInProgress;
@@ -277,6 +288,9 @@ static LOCALLOCK *awaitedLock;
static ResourceOwner awaitedOwner;
+static dlist_head session_locks[lengthof(LockMethods)];
+
+
#ifdef LOCK_DEBUG
/*------
@@ -363,8 +377,8 @@ static void GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner);
static void BeginStrongLockAcquire(LOCALLOCK *locallock, uint32 fasthashcode);
static void FinishStrongLockAcquire(void);
static void WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner);
-static void ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock);
-static void LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent);
+static void ReleaseLockIfHeld(LOCALLOCKOWNER *locallockowner, bool sessionLock);
+static void LockReassignOwner(LOCALLOCKOWNER *locallockowner, ResourceOwner parent);
static bool UnGrantLock(LOCK *lock, LOCKMODE lockmode,
PROCLOCK *proclock, LockMethod lockMethodTable);
static void CleanUpLock(LOCK *lock, PROCLOCK *proclock,
@@ -465,6 +479,21 @@ InitLocks(void)
16,
&info,
HASH_ELEM | HASH_BLOBS);
+
+ /* Initialize each element of the session_locks array */
+ for (int i = 0; i < lengthof(LockMethods); i++)
+ dlist_init(&session_locks[i]);
+
+ /*
+ * Create a slab context for storing LOCALLOCKOWNERs. Slab seems like a
+ * good context type for this as it will manage fragmentation better than
+ * aset.c contexts and it will free() excess memory rather than maintain
+ * excessively long freelists after a large surge in locking requirements.
+ */
+ LocalLockOwnerContext = SlabContextCreate(TopMemoryContext,
+ "LOCALLOCKOWNER context",
+ SLAB_DEFAULT_BLOCK_SIZE,
+ sizeof(LOCALLOCKOWNER));
}
@@ -827,26 +856,9 @@ LockAcquireExtended(const LOCKTAG *locktag,
locallock->nLocks = 0;
locallock->holdsStrongLockCount = false;
locallock->lockCleared = false;
- locallock->numLockOwners = 0;
- locallock->maxLockOwners = 8;
- locallock->lockOwners = NULL; /* in case next line fails */
- locallock->lockOwners = (LOCALLOCKOWNER *)
- MemoryContextAlloc(TopMemoryContext,
- locallock->maxLockOwners * sizeof(LOCALLOCKOWNER));
+ dlist_init(&locallock->locallockowners);
}
- else
- {
- /* Make sure there will be room to remember the lock */
- if (locallock->numLockOwners >= locallock->maxLockOwners)
- {
- int newsize = locallock->maxLockOwners * 2;
- locallock->lockOwners = (LOCALLOCKOWNER *)
- repalloc(locallock->lockOwners,
- newsize * sizeof(LOCALLOCKOWNER));
- locallock->maxLockOwners = newsize;
- }
- }
hashcode = locallock->hashcode;
if (locallockp)
@@ -1249,7 +1261,6 @@ SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
proclock->groupLeader = proc->lockGroupLeader != NULL ?
proc->lockGroupLeader : proc;
proclock->holdMask = 0;
- proclock->releaseMask = 0;
/* Add proclock to appropriate lists */
dlist_push_tail(&lock->procLocks, &proclock->lockLink);
dlist_push_tail(&proc->myProcLocks[partition], &proclock->procLink);
@@ -1343,17 +1354,19 @@ CheckAndSetLockHeld(LOCALLOCK *locallock, bool acquired)
static void
RemoveLocalLock(LOCALLOCK *locallock)
{
- int i;
+ dlist_mutable_iter iter;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach_modify(iter, &locallock->locallockowners)
{
- if (locallock->lockOwners[i].owner != NULL)
- ResourceOwnerForgetLock(locallock->lockOwners[i].owner, locallock);
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, locallock_node, iter.cur);
+
+ Assert(locallockowner->owner != NULL);
+ dlist_delete(&locallockowner->locallock_node);
+ ResourceOwnerForgetLock(locallockowner);
+ pfree(locallockowner);
}
- locallock->numLockOwners = 0;
- if (locallock->lockOwners != NULL)
- pfree(locallock->lockOwners);
- locallock->lockOwners = NULL;
+
+ Assert(dlist_is_empty(&locallock->locallockowners));
if (locallock->holdsStrongLockCount)
{
@@ -1659,26 +1672,38 @@ CleanUpLock(LOCK *lock, PROCLOCK *proclock,
static void
GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner)
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
- int i;
+ LOCALLOCKOWNER *locallockowner;
+ dlist_iter iter;
- Assert(locallock->numLockOwners < locallock->maxLockOwners);
/* Count the total */
locallock->nLocks++;
/* Count the per-owner lock */
- for (i = 0; i < locallock->numLockOwners; i++)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == owner)
+ locallockowner = dlist_container(LOCALLOCKOWNER, locallock_node, iter.cur);
+
+ if (locallockowner->owner == owner)
{
- lockOwners[i].nLocks++;
+ locallockowner->nLocks++;
return;
}
}
- lockOwners[i].owner = owner;
- lockOwners[i].nLocks = 1;
- locallock->numLockOwners++;
+ locallockowner = MemoryContextAlloc(LocalLockOwnerContext, sizeof(LOCALLOCKOWNER));
+ locallockowner->owner = owner;
+ locallockowner->nLocks = 1;
+ locallockowner->locallock = locallock;
+
+ dlist_push_tail(&locallock->locallockowners, &locallockowner->locallock_node);
+
if (owner != NULL)
- ResourceOwnerRememberLock(owner, locallock);
+ ResourceOwnerRememberLock(owner, locallockowner);
+ else
+ {
+ LOCKMETHODID lockmethodid = LOCALLOCK_LOCKMETHOD(*locallockowner->locallock);
+
+ Assert(lockmethodid > 0 && lockmethodid <= 2);
+ dlist_push_tail(&session_locks[lockmethodid - 1], &locallockowner->resowner_node);
+ }
/* Indicate that the lock is acquired for certain types of locks. */
CheckAndSetLockHeld(locallock, true);
@@ -1971,9 +1996,9 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
* Decrease the count for the resource owner.
*/
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
ResourceOwner owner;
- int i;
+ dlist_mutable_iter iter;
+ bool found = false;
/* Identify owner for lock */
if (sessionLock)
@@ -1981,24 +2006,33 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
else
owner = CurrentResourceOwner;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach_modify(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == owner)
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, locallock_node, iter.cur);
+
+ if (locallockowner->owner != owner)
+ continue;
+
+ found = true;
+
+ if (--locallockowner->nLocks == 0)
{
- Assert(lockOwners[i].nLocks > 0);
- if (--lockOwners[i].nLocks == 0)
- {
- if (owner != NULL)
- ResourceOwnerForgetLock(owner, locallock);
- /* compact out unused slot */
- locallock->numLockOwners--;
- if (i < locallock->numLockOwners)
- lockOwners[i] = lockOwners[locallock->numLockOwners];
- }
- break;
+ dlist_delete(&locallockowner->locallock_node);
+
+ if (owner != NULL)
+ ResourceOwnerForgetLock(locallockowner);
+ else
+ dlist_delete(&locallockowner->resowner_node);
+ pfree(locallockowner);
+ }
+ else
+ {
+ /* ensure nLocks didn't go negative */
+ Assert(locallockowner->nLocks >= 0);
}
}
- if (i < 0)
+
+ if (!found)
{
/* don't release a lock belonging to another owner */
elog(WARNING, "you don't own a lock of type %s",
@@ -2016,6 +2050,8 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
if (locallock->nLocks > 0)
return true;
+ Assert(locallock->nLocks == 0);
+
/*
* At this point we can no longer suppose we are clear of invalidation
* messages related to this lock. Although we'll delete the LOCALLOCK
@@ -2118,274 +2154,44 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
return true;
}
+#ifdef USE_ASSERT_CHECKING
/*
- * LockReleaseAll -- Release all locks of the specified lock method that
- * are held by the current process.
- *
- * Well, not necessarily *all* locks. The available behaviors are:
- * allLocks == true: release all locks including session locks.
- * allLocks == false: release all non-session locks.
+ * LockAssertNoneHeld -- Assert that we no longer hold any DEFAULT_LOCKMETHOD
+ * locks during an abort.
*/
void
-LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
+LockAssertNoneHeld(bool isCommit)
{
HASH_SEQ_STATUS status;
- LockMethod lockMethodTable;
- int i,
- numLockModes;
LOCALLOCK *locallock;
- LOCK *lock;
- int partition;
- bool have_fast_path_lwlock = false;
-
- if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
- elog(ERROR, "unrecognized lock method: %d", lockmethodid);
- lockMethodTable = LockMethods[lockmethodid];
-
-#ifdef LOCK_DEBUG
- if (*(lockMethodTable->trace_flag))
- elog(LOG, "LockReleaseAll: lockmethod=%d", lockmethodid);
-#endif
-
- /*
- * Get rid of our fast-path VXID lock, if appropriate. Note that this is
- * the only way that the lock we hold on our own VXID can ever get
- * released: it is always and only released when a toplevel transaction
- * ends.
- */
- if (lockmethodid == DEFAULT_LOCKMETHOD)
- VirtualXactLockTableCleanup();
-
- numLockModes = lockMethodTable->numLockModes;
- /*
- * First we run through the locallock table and get rid of unwanted
- * entries, then we scan the process's proclocks and get rid of those. We
- * do this separately because we may have multiple locallock entries
- * pointing to the same proclock, and we daren't end up with any dangling
- * pointers. Fast-path locks are cleaned up during the locallock table
- * scan, though.
- */
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ if (!isCommit)
{
- /*
- * If the LOCALLOCK entry is unused, we must've run out of shared
- * memory while trying to set up this lock. Just forget the local
- * entry.
- */
- if (locallock->nLocks == 0)
- {
- RemoveLocalLock(locallock);
- continue;
- }
-
- /* Ignore items that are not of the lockmethod to be removed */
- if (LOCALLOCK_LOCKMETHOD(*locallock) != lockmethodid)
- continue;
+ hash_seq_init(&status, LockMethodLocalHash);
- /*
- * If we are asked to release all locks, we can just zap the entry.
- * Otherwise, must scan to see if there are session locks. We assume
- * there is at most one lockOwners entry for session locks.
- */
- if (!allLocks)
+ while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
+ dlist_iter local_iter;
- /* If session lock is above array position 0, move it down to 0 */
- for (i = 0; i < locallock->numLockOwners; i++)
- {
- if (lockOwners[i].owner == NULL)
- lockOwners[0] = lockOwners[i];
- else
- ResourceOwnerForgetLock(lockOwners[i].owner, locallock);
- }
+ Assert(locallock->nLocks >= 0);
- if (locallock->numLockOwners > 0 &&
- lockOwners[0].owner == NULL &&
- lockOwners[0].nLocks > 0)
+ dlist_foreach(local_iter, &locallock->locallockowners)
{
- /* Fix the locallock to show just the session locks */
- locallock->nLocks = lockOwners[0].nLocks;
- locallock->numLockOwners = 1;
- /* We aren't deleting this locallock, so done */
- continue;
- }
- else
- locallock->numLockOwners = 0;
- }
-
- /*
- * If the lock or proclock pointers are NULL, this lock was taken via
- * the relation fast-path (and is not known to have been transferred).
- */
- if (locallock->proclock == NULL || locallock->lock == NULL)
- {
- LOCKMODE lockmode = locallock->tag.mode;
- Oid relid;
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ locallock_node,
+ local_iter.cur);
- /* Verify that a fast-path lock is what we've got. */
- if (!EligibleForRelationFastPath(&locallock->tag.lock, lockmode))
- elog(PANIC, "locallock table corrupted");
+ Assert(locallockowner->owner == NULL);
- /*
- * If we don't currently hold the LWLock that protects our
- * fast-path data structures, we must acquire it before attempting
- * to release the lock via the fast-path. We will continue to
- * hold the LWLock until we're done scanning the locallock table,
- * unless we hit a transferred fast-path lock. (XXX is this
- * really such a good idea? There could be a lot of entries ...)
- */
- if (!have_fast_path_lwlock)
- {
- LWLockAcquire(&MyProc->fpInfoLock, LW_EXCLUSIVE);
- have_fast_path_lwlock = true;
- }
-
- /* Attempt fast-path release. */
- relid = locallock->tag.lock.locktag_field2;
- if (FastPathUnGrantRelationLock(relid, lockmode))
- {
- RemoveLocalLock(locallock);
- continue;
+ if (locallockowner->nLocks > 0 &&
+ LOCALLOCK_LOCKMETHOD(*locallock) == DEFAULT_LOCKMETHOD)
+ Assert(false);
}
-
- /*
- * Our lock, originally taken via the fast path, has been
- * transferred to the main lock table. That's going to require
- * some extra work, so release our fast-path lock before starting.
- */
- LWLockRelease(&MyProc->fpInfoLock);
- have_fast_path_lwlock = false;
-
- /*
- * Now dump the lock. We haven't got a pointer to the LOCK or
- * PROCLOCK in this case, so we have to handle this a bit
- * differently than a normal lock release. Unfortunately, this
- * requires an extra LWLock acquire-and-release cycle on the
- * partitionLock, but hopefully it shouldn't happen often.
- */
- LockRefindAndRelease(lockMethodTable, MyProc,
- &locallock->tag.lock, lockmode, false);
- RemoveLocalLock(locallock);
- continue;
}
-
- /* Mark the proclock to show we need to release this lockmode */
- if (locallock->nLocks > 0)
- locallock->proclock->releaseMask |= LOCKBIT_ON(locallock->tag.mode);
-
- /* And remove the locallock hashtable entry */
- RemoveLocalLock(locallock);
}
-
- /* Done with the fast-path data structures */
- if (have_fast_path_lwlock)
- LWLockRelease(&MyProc->fpInfoLock);
-
- /*
- * Now, scan each lock partition separately.
- */
- for (partition = 0; partition < NUM_LOCK_PARTITIONS; partition++)
- {
- LWLock *partitionLock;
- dlist_head *procLocks = &MyProc->myProcLocks[partition];
- dlist_mutable_iter proclock_iter;
-
- partitionLock = LockHashPartitionLockByIndex(partition);
-
- /*
- * If the proclock list for this partition is empty, we can skip
- * acquiring the partition lock. This optimization is trickier than
- * it looks, because another backend could be in process of adding
- * something to our proclock list due to promoting one of our
- * fast-path locks. However, any such lock must be one that we
- * decided not to delete above, so it's okay to skip it again now;
- * we'd just decide not to delete it again. We must, however, be
- * careful to re-fetch the list header once we've acquired the
- * partition lock, to be sure we have a valid, up-to-date pointer.
- * (There is probably no significant risk if pointer fetch/store is
- * atomic, but we don't wish to assume that.)
- *
- * XXX This argument assumes that the locallock table correctly
- * represents all of our fast-path locks. While allLocks mode
- * guarantees to clean up all of our normal locks regardless of the
- * locallock situation, we lose that guarantee for fast-path locks.
- * This is not ideal.
- */
- if (dlist_is_empty(procLocks))
- continue; /* needn't examine this partition */
-
- LWLockAcquire(partitionLock, LW_EXCLUSIVE);
-
- dlist_foreach_modify(proclock_iter, procLocks)
- {
- PROCLOCK *proclock = dlist_container(PROCLOCK, procLink, proclock_iter.cur);
- bool wakeupNeeded = false;
-
- Assert(proclock->tag.myProc == MyProc);
-
- lock = proclock->tag.myLock;
-
- /* Ignore items that are not of the lockmethod to be removed */
- if (LOCK_LOCKMETHOD(*lock) != lockmethodid)
- continue;
-
- /*
- * In allLocks mode, force release of all locks even if locallock
- * table had problems
- */
- if (allLocks)
- proclock->releaseMask = proclock->holdMask;
- else
- Assert((proclock->releaseMask & ~proclock->holdMask) == 0);
-
- /*
- * Ignore items that have nothing to be released, unless they have
- * holdMask == 0 and are therefore recyclable
- */
- if (proclock->releaseMask == 0 && proclock->holdMask != 0)
- continue;
-
- PROCLOCK_PRINT("LockReleaseAll", proclock);
- LOCK_PRINT("LockReleaseAll", lock, 0);
- Assert(lock->nRequested >= 0);
- Assert(lock->nGranted >= 0);
- Assert(lock->nGranted <= lock->nRequested);
- Assert((proclock->holdMask & ~lock->grantMask) == 0);
-
- /*
- * Release the previously-marked lock modes
- */
- for (i = 1; i <= numLockModes; i++)
- {
- if (proclock->releaseMask & LOCKBIT_ON(i))
- wakeupNeeded |= UnGrantLock(lock, i, proclock,
- lockMethodTable);
- }
- Assert((lock->nRequested >= 0) && (lock->nGranted >= 0));
- Assert(lock->nGranted <= lock->nRequested);
- LOCK_PRINT("LockReleaseAll: updated", lock, 0);
-
- proclock->releaseMask = 0;
-
- /* CleanUpLock will wake up waiters if needed. */
- CleanUpLock(lock, proclock,
- lockMethodTable,
- LockTagHashCode(&lock->tag),
- wakeupNeeded);
- } /* loop over PROCLOCKs within this partition */
-
- LWLockRelease(partitionLock);
- } /* loop over partitions */
-
-#ifdef LOCK_DEBUG
- if (*(lockMethodTable->trace_flag))
- elog(LOG, "LockReleaseAll done");
-#endif
+ Assert(MyProc->fpLockBits == 0);
}
+#endif
/*
* LockReleaseSession -- Release all session locks of the specified lock method
@@ -2394,59 +2200,39 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
void
LockReleaseSession(LOCKMETHODID lockmethodid)
{
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
+ dlist_mutable_iter iter;
if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
elog(ERROR, "unrecognized lock method: %d", lockmethodid);
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &session_locks[lockmethodid - 1])
{
- /* Ignore items that are not of the specified lock method */
- if (LOCALLOCK_LOCKMETHOD(*locallock) != lockmethodid)
- continue;
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, resowner_node, iter.cur);
- ReleaseLockIfHeld(locallock, true);
+ Assert(LOCALLOCK_LOCKMETHOD(*locallockowner->locallock) == lockmethodid);
+
+ ReleaseLockIfHeld(locallockowner, true);
}
+
+ Assert(dlist_is_empty(&session_locks[lockmethodid - 1]));
}
/*
* LockReleaseCurrentOwner
- * Release all locks belonging to CurrentResourceOwner
- *
- * If the caller knows what those locks are, it can pass them as an array.
- * That speeds up the call significantly, when a lot of locks are held.
- * Otherwise, pass NULL for locallocks, and we'll traverse through our hash
- * table to find them.
+ * Release all locks belonging to 'owner'
*/
void
-LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks)
+LockReleaseCurrentOwner(ResourceOwner owner, LOCALLOCKOWNER *locallockowner)
{
- if (locallocks == NULL)
- {
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
-
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
- ReleaseLockIfHeld(locallock, false);
- }
- else
- {
- int i;
+ Assert(locallockowner->owner == owner);
- for (i = nlocks - 1; i >= 0; i--)
- ReleaseLockIfHeld(locallocks[i], false);
- }
+ ReleaseLockIfHeld(locallockowner, false);
}
/*
* ReleaseLockIfHeld
- * Release any session-level locks on this lockable object if sessionLock
- * is true; else, release any locks held by CurrentResourceOwner.
+ * Release any session-level locks on this 'locallockowner' if
+ * sessionLock is true; else, release any locks held by 'locallockowner'.
*
* It is tempting to pass this a ResourceOwner pointer (or NULL for session
* locks), but without refactoring LockRelease() we cannot support releasing
@@ -2457,52 +2243,39 @@ LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks)
* convenience.
*/
static void
-ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock)
+ReleaseLockIfHeld(LOCALLOCKOWNER *locallockowner, bool sessionLock)
{
- ResourceOwner owner;
- LOCALLOCKOWNER *lockOwners;
- int i;
+ LOCALLOCK *locallock = locallockowner->locallock;
+
+ /* release all references to the lock by this resource owner */
- /* Identify owner for lock (must match LockRelease!) */
if (sessionLock)
- owner = NULL;
+ Assert(locallockowner->owner == NULL);
else
- owner = CurrentResourceOwner;
+ Assert(locallockowner->owner != NULL);
- /* Scan to see if there are any locks belonging to the target owner */
- lockOwners = locallock->lockOwners;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ /* We will still hold this lock after forgetting this ResourceOwner. */
+ if (locallockowner->nLocks < locallock->nLocks)
{
- if (lockOwners[i].owner == owner)
- {
- Assert(lockOwners[i].nLocks > 0);
- if (lockOwners[i].nLocks < locallock->nLocks)
- {
- /*
- * We will still hold this lock after forgetting this
- * ResourceOwner.
- */
- locallock->nLocks -= lockOwners[i].nLocks;
- /* compact out unused slot */
- locallock->numLockOwners--;
- if (owner != NULL)
- ResourceOwnerForgetLock(owner, locallock);
- if (i < locallock->numLockOwners)
- lockOwners[i] = lockOwners[locallock->numLockOwners];
- }
- else
- {
- Assert(lockOwners[i].nLocks == locallock->nLocks);
- /* We want to call LockRelease just once */
- lockOwners[i].nLocks = 1;
- locallock->nLocks = 1;
- if (!LockRelease(&locallock->tag.lock,
- locallock->tag.mode,
- sessionLock))
- elog(WARNING, "ReleaseLockIfHeld: failed??");
- }
- break;
- }
+ locallock->nLocks -= locallockowner->nLocks;
+ dlist_delete(&locallockowner->locallock_node);
+
+ if (sessionLock)
+ dlist_delete(&locallockowner->resowner_node);
+ else
+ ResourceOwnerForgetLock(locallockowner);
+ }
+ else
+ {
+ Assert(locallockowner->nLocks == locallock->nLocks);
+ /* We want to call LockRelease just once */
+ locallockowner->nLocks = 1;
+ locallock->nLocks = 1;
+
+ if (!LockRelease(&locallock->tag.lock,
+ locallock->tag.mode,
+ sessionLock))
+ elog(WARNING, "ReleaseLockIfHeld: failed??");
}
}
@@ -2510,82 +2283,47 @@ ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock)
* LockReassignCurrentOwner
* Reassign all locks belonging to CurrentResourceOwner to belong
* to its parent resource owner.
- *
- * If the caller knows what those locks are, it can pass them as an array.
- * That speeds up the call significantly, when a lot of locks are held
- * (e.g pg_dump with a large schema). Otherwise, pass NULL for locallocks,
- * and we'll traverse through our hash table to find them.
*/
void
-LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks)
+LockReassignCurrentOwner(LOCALLOCKOWNER *locallockowner)
{
ResourceOwner parent = ResourceOwnerGetParent(CurrentResourceOwner);
- Assert(parent != NULL);
-
- if (locallocks == NULL)
- {
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
-
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
- LockReassignOwner(locallock, parent);
- }
- else
- {
- int i;
-
- for (i = nlocks - 1; i >= 0; i--)
- LockReassignOwner(locallocks[i], parent);
- }
+ LockReassignOwner(locallockowner, parent);
}
/*
- * Subroutine of LockReassignCurrentOwner. Reassigns a given lock belonging to
- * CurrentResourceOwner to its parent.
+ * Subroutine of LockReassignCurrentOwner. Reassigns the given
+ * 'locallockowner' to 'parent'.
*/
static void
-LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent)
+LockReassignOwner(LOCALLOCKOWNER *locallockowner, ResourceOwner parent)
{
- LOCALLOCKOWNER *lockOwners;
- int i;
- int ic = -1;
- int ip = -1;
+ dlist_iter iter;
+ LOCALLOCK *locallock = locallockowner->locallock;
- /*
- * Scan to see if there are any locks belonging to current owner or its
- * parent
- */
- lockOwners = locallock->lockOwners;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ ResourceOwnerForgetLock(locallockowner);
+
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == CurrentResourceOwner)
- ic = i;
- else if (lockOwners[i].owner == parent)
- ip = i;
- }
+ LOCALLOCKOWNER *parentlocalowner = dlist_container(LOCALLOCKOWNER, locallock_node, iter.cur);
- if (ic < 0)
- return; /* no current locks */
+ Assert(parentlocalowner->locallock == locallock);
- if (ip < 0)
- {
- /* Parent has no slot, so just give it the child's slot */
- lockOwners[ic].owner = parent;
- ResourceOwnerRememberLock(parent, locallock);
- }
- else
- {
- /* Merge child's count with parent's */
- lockOwners[ip].nLocks += lockOwners[ic].nLocks;
- /* compact out unused slot */
- locallock->numLockOwners--;
- if (ic < locallock->numLockOwners)
- lockOwners[ic] = lockOwners[locallock->numLockOwners];
+ if (parentlocalowner->owner != parent)
+ continue;
+
+ parentlocalowner->nLocks += locallockowner->nLocks;
+
+ locallockowner->nLocks = 0;
+ dlist_delete(&locallockowner->locallock_node);
+ pfree(locallockowner);
+ return;
}
- ResourceOwnerForgetLock(CurrentResourceOwner, locallock);
+
+ /* reassign locallockowner to parent resowner */
+ locallockowner->owner = parent;
+ ResourceOwnerRememberLock(parent, locallockowner);
}
/*
@@ -3057,7 +2795,7 @@ GetLockConflicts(const LOCKTAG *locktag, LOCKMODE lockmode, int *countp)
* We currently use this in two situations: first, to release locks held by
* prepared transactions on commit (see lock_twophase_postcommit); and second,
* to release locks taken via the fast-path, transferred to the main hash
- * table, and then released (see LockReleaseAll).
+ * table, and then released (see ResourceOwnerRelease).
*/
static void
LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
@@ -3163,16 +2901,9 @@ LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
* we can't implement this check by examining LOCALLOCK entries in isolation.
* We must build a transient hashtable that is indexed by locktag only.
*/
-static void
+static HTAB *
CheckForSessionAndXactLocks(void)
{
- typedef struct
- {
- LOCKTAG lock; /* identifies the lockable object */
- bool sessLock; /* is any lockmode held at session level? */
- bool xactLock; /* is any lockmode held at xact level? */
- } PerLockTagEntry;
-
HASHCTL hash_ctl;
HTAB *lockhtab;
HASH_SEQ_STATUS status;
@@ -3193,10 +2924,9 @@ CheckForSessionAndXactLocks(void)
while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
PerLockTagEntry *hentry;
bool found;
- int i;
+ dlist_iter iter;
/*
* Ignore VXID locks. We don't want those to be held by prepared
@@ -3217,9 +2947,13 @@ CheckForSessionAndXactLocks(void)
hentry->sessLock = hentry->xactLock = false;
/* Scan to see if we hold lock at session or xact level or both */
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == NULL)
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ locallock_node,
+ iter.cur);
+
+ if (locallockowner->owner == NULL)
hentry->sessLock = true;
else
hentry->xactLock = true;
@@ -3235,8 +2969,7 @@ CheckForSessionAndXactLocks(void)
errmsg("cannot PREPARE while holding both session-level and transaction-level locks on the same object")));
}
- /* Success, so clean up */
- hash_destroy(lockhtab);
+ return lockhtab;
}
/*
@@ -3244,6 +2977,11 @@ CheckForSessionAndXactLocks(void)
* Do the preparatory work for a PREPARE: make 2PC state file records
* for all locks currently held.
*
+ * Returns a hash table of PerLockTagEntry structs with an entry for each
+ * lock held by this backend marking if the lock is held at the session or
+ * xact level, or both. It is up to the calling function to call
+ * hash_destroy() on this table to free the memory used by it.
+ *
* Session-level locks are ignored, as are VXID locks.
*
* For the most part, we don't need to touch shared memory for this ---
@@ -3251,14 +2989,15 @@ CheckForSessionAndXactLocks(void)
* Fast-path locks are an exception, however: we move any such locks to
* the main table before allowing PREPARE TRANSACTION to succeed.
*/
-void
+HTAB *
AtPrepare_Locks(void)
{
HASH_SEQ_STATUS status;
LOCALLOCK *locallock;
+ HTAB *sessionandxactlocks;
/* First, verify there aren't locks of both xact and session level */
- CheckForSessionAndXactLocks();
+ sessionandxactlocks = CheckForSessionAndXactLocks();
/* Now do the per-locallock cleanup work */
hash_seq_init(&status, LockMethodLocalHash);
@@ -3266,10 +3005,9 @@ AtPrepare_Locks(void)
while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
{
TwoPhaseLockRecord record;
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
bool haveSessionLock;
bool haveXactLock;
- int i;
+ dlist_iter iter;
/*
* Ignore VXID locks. We don't want those to be held by prepared
@@ -3284,9 +3022,13 @@ AtPrepare_Locks(void)
/* Scan to see whether we hold it at session or transaction level */
haveSessionLock = haveXactLock = false;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == NULL)
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ locallock_node,
+ iter.cur);
+
+ if (locallockowner->owner == NULL)
haveSessionLock = true;
else
haveXactLock = true;
@@ -3330,6 +3072,8 @@ AtPrepare_Locks(void)
RegisterTwoPhaseRecord(TWOPHASE_RM_LOCK_ID, 0,
&record, sizeof(TwoPhaseLockRecord));
}
+
+ return sessionandxactlocks;
}
/*
@@ -3344,11 +3088,11 @@ AtPrepare_Locks(void)
* pointers in the transaction's resource owner. This is OK at the
* moment since resowner.c doesn't try to free locks retail at a toplevel
* transaction commit or abort. We could alternatively zero out nLocks
- * and leave the LOCALLOCK entries to be garbage-collected by LockReleaseAll,
- * but that probably costs more cycles.
+ * and leave the LOCALLOCK entries to be garbage-collected by
+ * ResourceOwnerRelease, but that probably costs more cycles.
*/
void
-PostPrepare_Locks(TransactionId xid)
+PostPrepare_Locks(TransactionId xid, HTAB *sessionandxactlocks)
{
PGPROC *newproc = TwoPhaseGetDummyProc(xid, false);
HASH_SEQ_STATUS status;
@@ -3378,10 +3122,9 @@ PostPrepare_Locks(TransactionId xid)
while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
bool haveSessionLock;
bool haveXactLock;
- int i;
+ dlist_iter iter;
if (locallock->proclock == NULL || locallock->lock == NULL)
{
@@ -3400,9 +3143,13 @@ PostPrepare_Locks(TransactionId xid)
/* Scan to see whether we hold it at session or transaction level */
haveSessionLock = haveXactLock = false;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == NULL)
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ locallock_node,
+ iter.cur);
+
+ if (locallockowner->owner == NULL)
haveSessionLock = true;
else
haveXactLock = true;
@@ -3418,10 +3165,6 @@ PostPrepare_Locks(TransactionId xid)
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot PREPARE while holding both session-level and transaction-level locks on the same object")));
- /* Mark the proclock to show we need to release this lockmode */
- if (locallock->nLocks > 0)
- locallock->proclock->releaseMask |= LOCKBIT_ON(locallock->tag.mode);
-
/* And remove the locallock hashtable entry */
RemoveLocalLock(locallock);
}
@@ -3439,11 +3182,7 @@ PostPrepare_Locks(TransactionId xid)
/*
* If the proclock list for this partition is empty, we can skip
- * acquiring the partition lock. This optimization is safer than the
- * situation in LockReleaseAll, because we got rid of any fast-path
- * locks during AtPrepare_Locks, so there cannot be any case where
- * another backend is adding something to our lists now. For safety,
- * though, we code this the same way as in LockReleaseAll.
+ * acquiring the partition lock.
*/
if (dlist_is_empty(procLocks))
continue; /* needn't examine this partition */
@@ -3452,6 +3191,8 @@ PostPrepare_Locks(TransactionId xid)
dlist_foreach_modify(proclock_iter, procLocks)
{
+ PerLockTagEntry *locktagentry;
+
proclock = dlist_container(PROCLOCK, procLink, proclock_iter.cur);
Assert(proclock->tag.myProc == MyProc);
@@ -3469,13 +3210,14 @@ PostPrepare_Locks(TransactionId xid)
Assert(lock->nGranted <= lock->nRequested);
Assert((proclock->holdMask & ~lock->grantMask) == 0);
- /* Ignore it if nothing to release (must be a session lock) */
- if (proclock->releaseMask == 0)
- continue;
+ locktagentry = hash_search(sessionandxactlocks,
+ &lock->tag,
+ HASH_FIND,
+ NULL);
- /* Else we should be releasing all locks */
- if (proclock->releaseMask != proclock->holdMask)
- elog(PANIC, "we seem to have dropped a bit somewhere");
+ /* skip session locks */
+ if (locktagentry != NULL && locktagentry->sessLock)
+ continue;
/*
* We cannot simply modify proclock->tag.myProc to reassign
@@ -4245,7 +3987,6 @@ lock_twophase_recover(TransactionId xid, uint16 info,
Assert(proc->lockGroupLeader == NULL);
proclock->groupLeader = proc;
proclock->holdMask = 0;
- proclock->releaseMask = 0;
/* Add proclock to appropriate lists */
dlist_push_tail(&lock->procLocks, &proclock->lockLink);
dlist_push_tail(&proc->myProcLocks[partition],
@@ -4382,7 +4123,7 @@ lock_twophase_postabort(TransactionId xid, uint16 info,
*
* We don't bother recording this lock in the local lock table, since it's
* only ever released at the end of a transaction. Instead,
- * LockReleaseAll() calls VirtualXactLockTableCleanup().
+ * ProcReleaseLocks() calls VirtualXactLockTableCleanup().
*/
void
VirtualXactLockTableInsert(VirtualTransactionId vxid)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 5b663a2997..def857479d 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -777,10 +777,17 @@ ProcReleaseLocks(bool isCommit)
return;
/* If waiting, get off wait queue (should only be needed after error) */
LockErrorCleanup();
- /* Release standard locks, including session-level if aborting */
- LockReleaseAll(DEFAULT_LOCKMETHOD, !isCommit);
- /* Release transaction-level advisory locks */
- LockReleaseAll(USER_LOCKMETHOD, false);
+
+ VirtualXactLockTableCleanup();
+
+ /* Release session-level locks if aborting */
+ if (!isCommit)
+ LockReleaseSession(DEFAULT_LOCKMETHOD);
+
+#ifdef USE_ASSERT_CHECKING
+ /* Ensure all locks were released */
+ LockAssertNoneHeld(isCommit);
+#endif
}
@@ -861,6 +868,8 @@ ProcKill(int code, Datum arg)
LWLockRelease(leader_lwlock);
}
+ Assert(MyProc->fpLockBits == 0);
+
/*
* Reset MyLatch to the process local one. This is so that signal
* handlers et al can continue using the latch after the shared latch
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index df4d15a50f..e404cc8ff0 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -1349,10 +1349,10 @@ ShutdownPostgres(int code, Datum arg)
AbortOutOfAnyTransaction();
/*
- * User locks are not released by transaction end, so be sure to release
- * them explicitly.
+ * Session locks are not released by transaction end, so be sure to
+ * release them explicitly.
*/
- LockReleaseAll(USER_LOCKMETHOD, true);
+ LockReleaseSession(USER_LOCKMETHOD);
}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index f926f1faad..d57f8657d7 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -33,6 +33,7 @@
#include "utils/resowner_private.h"
#include "utils/snapmgr.h"
+#include "lib/ilist.h"
/*
* All resource IDs managed by this code are required to fit into a Datum,
@@ -91,24 +92,6 @@ typedef struct ResourceArray
#define RESARRAY_MAX_ITEMS(capacity) \
((capacity) <= RESARRAY_MAX_ARRAY ? (capacity) : (capacity)/4 * 3)
-/*
- * To speed up bulk releasing or reassigning locks from a resource owner to
- * its parent, each resource owner has a small cache of locks it owns. The
- * lock manager has the same information in its local lock hash table, and
- * we fall back on that if cache overflows, but traversing the hash table
- * is slower when there are a lot of locks belonging to other resource owners.
- *
- * MAX_RESOWNER_LOCKS is the size of the per-resource owner cache. It's
- * chosen based on some testing with pg_dump with a large schema. When the
- * tests were done (on 9.2), resource owners in a pg_dump run contained up
- * to 9 locks, regardless of the schema size, except for the top resource
- * owner which contained much more (overflowing the cache). 15 seems like a
- * nice round number that's somewhat higher than what pg_dump needs. Note that
- * making this number larger is not free - the bigger the cache, the slower
- * it is to release locks (in retail), when a resource owner holds many locks.
- */
-#define MAX_RESOWNER_LOCKS 15
-
/*
* ResourceOwner objects look like this
*/
@@ -134,9 +117,7 @@ typedef struct ResourceOwnerData
ResourceArray cryptohasharr; /* cryptohash contexts */
ResourceArray hmacarr; /* HMAC contexts */
- /* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
- int nlocks; /* number of owned locks */
- LOCALLOCK *locks[MAX_RESOWNER_LOCKS]; /* list of owned locks */
+ dlist_head locks; /* dlist of owned locks */
} ResourceOwnerData;
@@ -454,6 +435,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
ResourceArrayInit(&(owner->jitarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->cryptohasharr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->hmacarr), PointerGetDatum(NULL));
+ dlist_init(&owner->locks);
return owner;
}
@@ -606,8 +588,19 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
+ dlist_mutable_iter iter;
+
if (isTopLevel)
{
+ dlist_foreach_modify(iter, &owner->locks)
+ {
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, resowner_node, iter.cur);
+
+ LockReleaseCurrentOwner(owner, locallockowner);
+ }
+
+ Assert(dlist_is_empty(&owner->locks));
+
/*
* For a top-level xact we are going to release all locks (or at
* least all non-session locks), so just do a single lmgr call at
@@ -626,30 +619,30 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
* subtransaction, we do NOT release its locks yet, but transfer
* them to the parent.
*/
- LOCALLOCK **locks;
- int nlocks;
+ if (isCommit)
+ {
+ dlist_foreach_modify(iter, &owner->locks)
+ {
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ resowner_node,
+ iter.cur);
- Assert(owner->parent != NULL);
+ LockReassignCurrentOwner(locallockowner);
+ }
- /*
- * Pass the list of locks owned by this resource owner to the lock
- * manager, unless it has overflowed.
- */
- if (owner->nlocks > MAX_RESOWNER_LOCKS)
- {
- locks = NULL;
- nlocks = 0;
+ Assert(dlist_is_empty(&owner->locks));
}
else
{
- locks = owner->locks;
- nlocks = owner->nlocks;
- }
+ dlist_foreach_modify(iter, &owner->locks)
+ {
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, resowner_node, iter.cur);
- if (isCommit)
- LockReassignCurrentOwner(locks, nlocks);
- else
- LockReleaseCurrentOwner(locks, nlocks);
+ LockReleaseCurrentOwner(owner, locallockowner);
+ }
+
+ Assert(dlist_is_empty(&owner->locks));
+ }
}
}
else if (phase == RESOURCE_RELEASE_AFTER_LOCKS)
@@ -778,7 +771,7 @@ ResourceOwnerDelete(ResourceOwner owner)
Assert(owner->jitarr.nitems == 0);
Assert(owner->cryptohasharr.nitems == 0);
Assert(owner->hmacarr.nitems == 0);
- Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
+ Assert(dlist_is_empty(&owner->locks));
/*
* Delete children. The recursive call will delink the child from me, so
@@ -1038,54 +1031,61 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
/*
* Remember that a Local Lock is owned by a ResourceOwner
- *
- * This is different from the other Remember functions in that the list of
- * locks is only a lossy cache. It can hold up to MAX_RESOWNER_LOCKS entries,
- * and when it overflows, we stop tracking locks. The point of only remembering
- * only up to MAX_RESOWNER_LOCKS entries is that if a lot of locks are held,
- * ResourceOwnerForgetLock doesn't need to scan through a large array to find
- * the entry.
*/
void
-ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCK *locallock)
+ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCKOWNER *locallockowner)
{
- Assert(locallock != NULL);
-
- if (owner->nlocks > MAX_RESOWNER_LOCKS)
- return; /* we have already overflowed */
+ Assert(owner != NULL);
+ Assert(locallockowner != NULL);
- if (owner->nlocks < MAX_RESOWNER_LOCKS)
- owner->locks[owner->nlocks] = locallock;
- else
+#ifdef USE_ASSERT_CHECKING
{
- /* overflowed */
+ dlist_iter iter;
+
+ dlist_foreach(iter, &owner->locks)
+ {
+ LOCALLOCKOWNER *i = dlist_container(LOCALLOCKOWNER, resowner_node, iter.cur);
+
+ Assert(i->locallock != locallockowner->locallock);
+ }
}
- owner->nlocks++;
+#endif
+
+ dlist_push_tail(&owner->locks, &locallockowner->resowner_node);
}
/*
- * Forget that a Local Lock is owned by a ResourceOwner
+ * Forget that a Local Lock is owned by the given LOCALLOCKOWNER.
*/
void
-ResourceOwnerForgetLock(ResourceOwner owner, LOCALLOCK *locallock)
+ResourceOwnerForgetLock(LOCALLOCKOWNER *locallockowner)
{
- int i;
+#ifdef USE_ASSERT_CHECKING
+ ResourceOwner owner;
+
+ Assert(locallockowner != NULL);
- if (owner->nlocks > MAX_RESOWNER_LOCKS)
- return; /* we have overflowed */
+ owner = locallockowner->owner;
- Assert(owner->nlocks > 0);
- for (i = owner->nlocks - 1; i >= 0; i--)
{
- if (locallock == owner->locks[i])
+ dlist_iter iter;
+ bool found = false;
+
+ dlist_foreach(iter, &owner->locks)
{
- owner->locks[i] = owner->locks[owner->nlocks - 1];
- owner->nlocks--;
- return;
+ LOCALLOCKOWNER *owner = dlist_container(LOCALLOCKOWNER, resowner_node, iter.cur);
+
+ if (locallockowner == owner)
+ {
+ Assert(!found);
+ found = true;
+ }
}
+
+ Assert(found);
}
- elog(ERROR, "lock reference %p is not owned by resource owner %s",
- locallock, owner->name);
+#endif
+ dlist_delete(&locallockowner->resowner_node);
}
/*
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 8575bea25c..c7ff792201 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -24,6 +24,7 @@
#include "storage/lwlock.h"
#include "storage/shmem.h"
#include "utils/timestamp.h"
+#include "lib/ilist.h"
/* struct PGPROC is declared in proc.h, but must forward-reference it */
typedef struct PGPROC PGPROC;
@@ -349,10 +350,6 @@ typedef struct LOCK
* Otherwise, proclock objects whose holdMasks are zero are recycled
* as soon as convenient.
*
- * releaseMask is workspace for LockReleaseAll(): it shows the locks due
- * to be released during the current call. This must only be examined or
- * set by the backend owning the PROCLOCK.
- *
* Each PROCLOCK object is linked into lists for both the associated LOCK
* object and the owning PGPROC object. Note that the PROCLOCK is entered
* into these lists as soon as it is created, even if no lock has yet been
@@ -374,7 +371,6 @@ typedef struct PROCLOCK
/* data */
PGPROC *groupLeader; /* proc's lock group leader, or proc itself */
LOCKMASK holdMask; /* bitmask for lock types currently held */
- LOCKMASK releaseMask; /* bitmask for lock types to be released */
dlist_node lockLink; /* list link in LOCK's list of proclocks */
dlist_node procLink; /* list link in PGPROC's list of proclocks */
} PROCLOCK;
@@ -420,6 +416,13 @@ typedef struct LOCALLOCKOWNER
* Must use a forward struct reference to avoid circularity.
*/
struct ResourceOwnerData *owner;
+
+ dlist_node resowner_node; /* dlist link for ResourceOwner.locks */
+
+ dlist_node locallock_node; /* dlist link for LOCALLOCK.locallockowners */
+
+ struct LOCALLOCK *locallock; /* pointer to the corresponding LOCALLOCK */
+
int64 nLocks; /* # of times held by this owner */
} LOCALLOCKOWNER;
@@ -433,9 +436,9 @@ typedef struct LOCALLOCK
LOCK *lock; /* associated LOCK object, if any */
PROCLOCK *proclock; /* associated PROCLOCK object, if any */
int64 nLocks; /* total number of times lock is held */
- int numLockOwners; /* # of relevant ResourceOwners */
- int maxLockOwners; /* allocated size of array */
- LOCALLOCKOWNER *lockOwners; /* dynamically resizable array */
+
+ dlist_head locallockowners; /* dlist of LOCALLOCKOWNER */
+
bool holdsStrongLockCount; /* bumped FastPathStrongRelationLocks */
bool lockCleared; /* we read all sinval msgs for lock */
} LOCALLOCK;
@@ -564,10 +567,16 @@ extern void AbortStrongLockAcquire(void);
extern void MarkLockClear(LOCALLOCK *locallock);
extern bool LockRelease(const LOCKTAG *locktag,
LOCKMODE lockmode, bool sessionLock);
-extern void LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks);
+
+#ifdef USE_ASSERT_CHECKING
+extern void LockAssertNoneHeld(bool isCommit);
+#endif
+
extern void LockReleaseSession(LOCKMETHODID lockmethodid);
-extern void LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks);
-extern void LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks);
+struct ResourceOwnerData;
+extern void LockReleaseCurrentOwner(struct ResourceOwnerData *owner,
+ LOCALLOCKOWNER *locallockowner);
+extern void LockReassignCurrentOwner(LOCALLOCKOWNER *locallockowner);
extern bool LockHeldByMe(const LOCKTAG *locktag, LOCKMODE lockmode);
#ifdef USE_ASSERT_CHECKING
extern HTAB *GetLockMethodLocalHash(void);
@@ -576,8 +585,8 @@ extern bool LockHasWaiters(const LOCKTAG *locktag,
LOCKMODE lockmode, bool sessionLock);
extern VirtualTransactionId *GetLockConflicts(const LOCKTAG *locktag,
LOCKMODE lockmode, int *countp);
-extern void AtPrepare_Locks(void);
-extern void PostPrepare_Locks(TransactionId xid);
+extern HTAB *AtPrepare_Locks(void);
+extern void PostPrepare_Locks(TransactionId xid, HTAB *sessionandxactlocks);
extern bool LockCheckConflicts(LockMethod lockMethodTable,
LOCKMODE lockmode,
LOCK *lock, PROCLOCK *proclock);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index ae58438ec7..29b1cce54b 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -36,8 +36,9 @@ extern void ResourceOwnerRememberBufferIO(ResourceOwner owner, Buffer buffer);
extern void ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer);
/* support for local lock management */
-extern void ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCK *locallock);
-extern void ResourceOwnerForgetLock(ResourceOwner owner, LOCALLOCK *locallock);
+extern void ResourceOwnerRememberLock(ResourceOwner owner,
+ LOCALLOCKOWNER *locallock);
+extern void ResourceOwnerForgetLock(LOCALLOCKOWNER *locallock);
/* support for catcache refcount management */
extern void ResourceOwnerEnlargeCatCacheRefs(ResourceOwner owner);
--
2.40.1.windows.1
On 18/09/2023 07:08, David Rowley wrote:
On Fri, 15 Sept 2023 at 22:37, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I've added a call to LockAssertNoneHeld(false) in there.
I don't see it in the patch?
hmm. I must've git format-patch before committing that part.
I'll try that again... see attached.
This needed a rebase after my ResourceOwner refactoring. Attached.
A few quick comments:
- It would be nice to add a test for the issue that you fixed in patch
v7, i.e. if you prepare a transaction while holding session-level locks.
- GrantLockLocal() now calls MemoryContextAlloc(), which can fail if you
are out of memory. Is that handled gracefully or is the lock leaked?
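A minimal sketch of such a test (not taken from the patch; it assumes max_prepared_transactions > 0 and uses advisory locks, since those can be held at either session or transaction level; the lock keys and transaction name are arbitrary placeholders):
-- Session-level lock: survives transaction end and must stay with this
-- backend rather than being transferred to the prepared transaction.
SELECT pg_advisory_lock(1);
BEGIN;
SELECT pg_advisory_xact_lock(2);        -- transaction-level lock, different tag
PREPARE TRANSACTION 'session_lock_test';
-- Releasing the session lock here should succeed, proving it was not
-- handed over to the prepared transaction.
SELECT pg_advisory_unlock(1);
COMMIT PREPARED 'session_lock_test';
-- Holding both lock levels on the *same* tag should instead make PREPARE
-- fail with "cannot PREPARE while holding both session-level and
-- transaction-level locks on the same object".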
--
Heikki Linnakangas
Neon (https://neon.tech)
Attachments:
v9-wip-resowner-release-all.patch (text/x-patch; charset=UTF-8)
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 74ce5f9491c..8c8585be7ab 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2415,6 +2415,7 @@ PrepareTransaction(void)
TransactionId xid = GetCurrentTransactionId();
GlobalTransaction gxact;
TimestampTz prepared_at;
+ HTAB *sessionandxactlocks;
Assert(!IsInParallelMode());
@@ -2560,7 +2561,17 @@ PrepareTransaction(void)
StartPrepare(gxact);
AtPrepare_Notify();
- AtPrepare_Locks();
+
+ /*
+ * Prepare the locks and save the returned hash table that describes if
+ * the lock is held at the session and/or transaction level. We need to
+ * know if we're dealing with session locks inside PostPrepare_Locks(),
+ * but we're unable to build the hash table there due to that function
+ * only discovering if we're dealing with a session lock while we're in a
+ * critical section, in which case we can't allocate memory for the hash
+ * table.
+ */
+ sessionandxactlocks = AtPrepare_Locks();
AtPrepare_PredicateLocks();
AtPrepare_PgStat();
AtPrepare_MultiXact();
@@ -2587,7 +2598,10 @@ PrepareTransaction(void)
* ProcArrayClearTransaction(). Otherwise, a GetLockConflicts() would
* conclude "xact already committed or aborted" for our locks.
*/
- PostPrepare_Locks(xid);
+ PostPrepare_Locks(xid, sessionandxactlocks);
+
+ /* We no longer need this hash table */
+ hash_destroy(sessionandxactlocks);
/*
* Let others know about no transaction in progress by me. This has to be
diff --git a/src/backend/commands/discard.c b/src/backend/commands/discard.c
index 296dc82d2ee..5baf83ac6ce 100644
--- a/src/backend/commands/discard.c
+++ b/src/backend/commands/discard.c
@@ -71,7 +71,12 @@ DiscardAll(bool isTopLevel)
ResetAllOptions();
DropAllPreparedStatements();
Async_UnlistenAll();
- LockReleaseAll(USER_LOCKMETHOD, true);
+ LockReleaseSession(USER_LOCKMETHOD);
+
+#ifdef USE_ASSERT_CHECKING
+ LockAssertNoneHeld(false);
+#endif
+
ResetPlanCache();
ResetTempTableNamespace();
ResetSequenceCaches();
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 501910b4454..835f112f751 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -841,7 +841,11 @@ logicalrep_worker_onexit(int code, Datum arg)
* The locks will be acquired once the worker is initialized.
*/
if (!InitializingApplyWorker)
- LockReleaseAll(DEFAULT_LOCKMETHOD, true);
+ LockReleaseSession(DEFAULT_LOCKMETHOD);
+
+#ifdef USE_ASSERT_CHECKING
+ LockAssertNoneHeld(false);
+#endif
ApplyLauncherWakeup();
}
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 45de0fd2bd6..e7e0f29347a 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -182,12 +182,6 @@ holdMask -
subset of the PGPROC object's heldLocks mask (if the PGPROC is
currently waiting for another lock mode on this lock).
-releaseMask -
- A bitmask for the lock modes due to be released during LockReleaseAll.
- This must be a subset of the holdMask. Note that it is modified without
- taking the partition LWLock, and therefore it is unsafe for any
- backend except the one owning the PROCLOCK to examine/change it.
-
lockLink -
List link for shared memory queue of all the PROCLOCK objects for the
same LOCK.
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index b8c57b3e16e..ce54b589b0d 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -22,8 +22,7 @@
* Interface:
*
* InitLocks(), GetLocksMethodTable(), GetLockTagsMethodTable(),
- * LockAcquire(), LockRelease(), LockReleaseAll(),
- * LockCheckConflicts(), GrantLock()
+ * LockAcquire(), LockRelease(), LockCheckConflicts(), GrantLock()
*
*-------------------------------------------------------------------------
*/
@@ -162,6 +161,16 @@ typedef struct TwoPhaseLockRecord
LOCKMODE lockmode;
} TwoPhaseLockRecord;
+/*
+ * Used as the hash table entry type in AtPrepare_Locks() to communicate the
+ * session/xact lock status of each held lock for use in PostPrepare_Locks().
+ */
+typedef struct PerLockTagEntry
+{
+ LOCKTAG lock; /* identifies the lockable object */
+ bool sessLock; /* is any lockmode held at session level? */
+ bool xactLock; /* is any lockmode held at xact level? */
+} PerLockTagEntry;
/*
* Count of the number of fast path lock slots we believe to be used. This
@@ -270,6 +279,8 @@ static HTAB *LockMethodLockHash;
static HTAB *LockMethodProcLockHash;
static HTAB *LockMethodLocalHash;
+/* A memory context for storing LOCALLOCKOWNER structs */
+MemoryContext LocalLockOwnerContext;
/* private state for error cleanup */
static LOCALLOCK *StrongLockInProgress;
@@ -277,6 +288,9 @@ static LOCALLOCK *awaitedLock;
static ResourceOwner awaitedOwner;
+static dlist_head session_locks[lengthof(LockMethods)];
+
+
#ifdef LOCK_DEBUG
/*------
@@ -363,8 +377,8 @@ static void GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner);
static void BeginStrongLockAcquire(LOCALLOCK *locallock, uint32 fasthashcode);
static void FinishStrongLockAcquire(void);
static void WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner);
-static void ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock);
-static void LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent);
+static void ReleaseLockIfHeld(LOCALLOCKOWNER *locallockowner, bool sessionLock);
+static void LockReassignOwner(LOCALLOCKOWNER *locallockowner, ResourceOwner parent);
static bool UnGrantLock(LOCK *lock, LOCKMODE lockmode,
PROCLOCK *proclock, LockMethod lockMethodTable);
static void CleanUpLock(LOCK *lock, PROCLOCK *proclock,
@@ -465,6 +479,21 @@ InitLocks(void)
16,
&info,
HASH_ELEM | HASH_BLOBS);
+
+ /* Initialize each element of the session_locks array */
+ for (int i = 0; i < lengthof(LockMethods); i++)
+ dlist_init(&session_locks[i]);
+
+ /*
+ * Create a slab context for storing LOCALLOCKOWNERs. Slab seems like a
+ * good context type for this as it will manage fragmentation better than
+ * aset.c contexts and it will free() excess memory rather than maintain
+ * excessively long freelists after a large surge in locking requirements.
+ */
+ LocalLockOwnerContext = SlabContextCreate(TopMemoryContext,
+ "LOCALLOCKOWNER context",
+ SLAB_DEFAULT_BLOCK_SIZE,
+ sizeof(LOCALLOCKOWNER));
}
@@ -827,26 +856,9 @@ LockAcquireExtended(const LOCKTAG *locktag,
locallock->nLocks = 0;
locallock->holdsStrongLockCount = false;
locallock->lockCleared = false;
- locallock->numLockOwners = 0;
- locallock->maxLockOwners = 8;
- locallock->lockOwners = NULL; /* in case next line fails */
- locallock->lockOwners = (LOCALLOCKOWNER *)
- MemoryContextAlloc(TopMemoryContext,
- locallock->maxLockOwners * sizeof(LOCALLOCKOWNER));
+ dlist_init(&locallock->locallockowners);
}
- else
- {
- /* Make sure there will be room to remember the lock */
- if (locallock->numLockOwners >= locallock->maxLockOwners)
- {
- int newsize = locallock->maxLockOwners * 2;
- locallock->lockOwners = (LOCALLOCKOWNER *)
- repalloc(locallock->lockOwners,
- newsize * sizeof(LOCALLOCKOWNER));
- locallock->maxLockOwners = newsize;
- }
- }
hashcode = locallock->hashcode;
if (locallockp)
@@ -1249,7 +1261,6 @@ SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
proclock->groupLeader = proc->lockGroupLeader != NULL ?
proc->lockGroupLeader : proc;
proclock->holdMask = 0;
- proclock->releaseMask = 0;
/* Add proclock to appropriate lists */
dlist_push_tail(&lock->procLocks, &proclock->lockLink);
dlist_push_tail(&proc->myProcLocks[partition], &proclock->procLink);
@@ -1343,17 +1354,19 @@ CheckAndSetLockHeld(LOCALLOCK *locallock, bool acquired)
static void
RemoveLocalLock(LOCALLOCK *locallock)
{
- int i;
+ dlist_mutable_iter iter;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach_modify(iter, &locallock->locallockowners)
{
- if (locallock->lockOwners[i].owner != NULL)
- ResourceOwnerForgetLock(locallock->lockOwners[i].owner, locallock);
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, locallock_node, iter.cur);
+
+ Assert(locallockowner->owner != NULL);
+ dlist_delete(&locallockowner->locallock_node);
+ ResourceOwnerForgetLock(locallockowner);
+ pfree(locallockowner);
}
- locallock->numLockOwners = 0;
- if (locallock->lockOwners != NULL)
- pfree(locallock->lockOwners);
- locallock->lockOwners = NULL;
+
+ Assert(dlist_is_empty(&locallock->locallockowners));
if (locallock->holdsStrongLockCount)
{
@@ -1659,26 +1672,38 @@ CleanUpLock(LOCK *lock, PROCLOCK *proclock,
static void
GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner)
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
- int i;
+ LOCALLOCKOWNER *locallockowner;
+ dlist_iter iter;
- Assert(locallock->numLockOwners < locallock->maxLockOwners);
/* Count the total */
locallock->nLocks++;
/* Count the per-owner lock */
- for (i = 0; i < locallock->numLockOwners; i++)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == owner)
+ locallockowner = dlist_container(LOCALLOCKOWNER, locallock_node, iter.cur);
+
+ if (locallockowner->owner == owner)
{
- lockOwners[i].nLocks++;
+ locallockowner->nLocks++;
return;
}
}
- lockOwners[i].owner = owner;
- lockOwners[i].nLocks = 1;
- locallock->numLockOwners++;
+ locallockowner = MemoryContextAlloc(LocalLockOwnerContext, sizeof(LOCALLOCKOWNER));
+ locallockowner->owner = owner;
+ locallockowner->nLocks = 1;
+ locallockowner->locallock = locallock;
+
+ dlist_push_tail(&locallock->locallockowners, &locallockowner->locallock_node);
+
if (owner != NULL)
- ResourceOwnerRememberLock(owner, locallock);
+ ResourceOwnerRememberLock(owner, locallockowner);
+ else
+ {
+ LOCKMETHODID lockmethodid = LOCALLOCK_LOCKMETHOD(*locallockowner->locallock);
+
+ Assert(lockmethodid > 0 && lockmethodid <= 2);
+ dlist_push_tail(&session_locks[lockmethodid - 1], &locallockowner->resowner_node);
+ }
/* Indicate that the lock is acquired for certain types of locks. */
CheckAndSetLockHeld(locallock, true);
@@ -1971,9 +1996,9 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
* Decrease the count for the resource owner.
*/
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
ResourceOwner owner;
- int i;
+ dlist_mutable_iter iter;
+ bool found = false;
/* Identify owner for lock */
if (sessionLock)
@@ -1981,24 +2006,33 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
else
owner = CurrentResourceOwner;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach_modify(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == owner)
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, locallock_node, iter.cur);
+
+ if (locallockowner->owner != owner)
+ continue;
+
+ found = true;
+
+ if (--locallockowner->nLocks == 0)
{
- Assert(lockOwners[i].nLocks > 0);
- if (--lockOwners[i].nLocks == 0)
- {
- if (owner != NULL)
- ResourceOwnerForgetLock(owner, locallock);
- /* compact out unused slot */
- locallock->numLockOwners--;
- if (i < locallock->numLockOwners)
- lockOwners[i] = lockOwners[locallock->numLockOwners];
- }
- break;
+ dlist_delete(&locallockowner->locallock_node);
+
+ if (owner != NULL)
+ ResourceOwnerForgetLock(locallockowner);
+ else
+ dlist_delete(&locallockowner->resowner_node);
+ pfree(locallockowner);
+ }
+ else
+ {
+ /* ensure nLocks didn't go negative */
+ Assert(locallockowner->nLocks >= 0);
}
}
- if (i < 0)
+
+ if (!found)
{
/* don't release a lock belonging to another owner */
elog(WARNING, "you don't own a lock of type %s",
@@ -2016,6 +2050,8 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
if (locallock->nLocks > 0)
return true;
+ Assert(locallock->nLocks == 0);
+
/*
* At this point we can no longer suppose we are clear of invalidation
* messages related to this lock. Although we'll delete the LOCALLOCK
@@ -2118,274 +2154,44 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
return true;
}
+#ifdef USE_ASSERT_CHECKING
/*
- * LockReleaseAll -- Release all locks of the specified lock method that
- * are held by the current process.
- *
- * Well, not necessarily *all* locks. The available behaviors are:
- * allLocks == true: release all locks including session locks.
- * allLocks == false: release all non-session locks.
+ * LockAssertNoneHeld -- Assert that we no longer hold any DEFAULT_LOCKMETHOD
+ * locks during an abort.
*/
void
-LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
+LockAssertNoneHeld(bool isCommit)
{
HASH_SEQ_STATUS status;
- LockMethod lockMethodTable;
- int i,
- numLockModes;
LOCALLOCK *locallock;
- LOCK *lock;
- int partition;
- bool have_fast_path_lwlock = false;
-
- if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
- elog(ERROR, "unrecognized lock method: %d", lockmethodid);
- lockMethodTable = LockMethods[lockmethodid];
-
-#ifdef LOCK_DEBUG
- if (*(lockMethodTable->trace_flag))
- elog(LOG, "LockReleaseAll: lockmethod=%d", lockmethodid);
-#endif
-
- /*
- * Get rid of our fast-path VXID lock, if appropriate. Note that this is
- * the only way that the lock we hold on our own VXID can ever get
- * released: it is always and only released when a toplevel transaction
- * ends.
- */
- if (lockmethodid == DEFAULT_LOCKMETHOD)
- VirtualXactLockTableCleanup();
-
- numLockModes = lockMethodTable->numLockModes;
- /*
- * First we run through the locallock table and get rid of unwanted
- * entries, then we scan the process's proclocks and get rid of those. We
- * do this separately because we may have multiple locallock entries
- * pointing to the same proclock, and we daren't end up with any dangling
- * pointers. Fast-path locks are cleaned up during the locallock table
- * scan, though.
- */
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ if (!isCommit)
{
- /*
- * If the LOCALLOCK entry is unused, we must've run out of shared
- * memory while trying to set up this lock. Just forget the local
- * entry.
- */
- if (locallock->nLocks == 0)
- {
- RemoveLocalLock(locallock);
- continue;
- }
-
- /* Ignore items that are not of the lockmethod to be removed */
- if (LOCALLOCK_LOCKMETHOD(*locallock) != lockmethodid)
- continue;
+ hash_seq_init(&status, LockMethodLocalHash);
- /*
- * If we are asked to release all locks, we can just zap the entry.
- * Otherwise, must scan to see if there are session locks. We assume
- * there is at most one lockOwners entry for session locks.
- */
- if (!allLocks)
+ while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
+ dlist_iter local_iter;
- /* If session lock is above array position 0, move it down to 0 */
- for (i = 0; i < locallock->numLockOwners; i++)
- {
- if (lockOwners[i].owner == NULL)
- lockOwners[0] = lockOwners[i];
- else
- ResourceOwnerForgetLock(lockOwners[i].owner, locallock);
- }
+ Assert(locallock->nLocks >= 0);
- if (locallock->numLockOwners > 0 &&
- lockOwners[0].owner == NULL &&
- lockOwners[0].nLocks > 0)
+ dlist_foreach(local_iter, &locallock->locallockowners)
{
- /* Fix the locallock to show just the session locks */
- locallock->nLocks = lockOwners[0].nLocks;
- locallock->numLockOwners = 1;
- /* We aren't deleting this locallock, so done */
- continue;
- }
- else
- locallock->numLockOwners = 0;
- }
-
- /*
- * If the lock or proclock pointers are NULL, this lock was taken via
- * the relation fast-path (and is not known to have been transferred).
- */
- if (locallock->proclock == NULL || locallock->lock == NULL)
- {
- LOCKMODE lockmode = locallock->tag.mode;
- Oid relid;
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ locallock_node,
+ local_iter.cur);
- /* Verify that a fast-path lock is what we've got. */
- if (!EligibleForRelationFastPath(&locallock->tag.lock, lockmode))
- elog(PANIC, "locallock table corrupted");
+ Assert(locallockowner->owner == NULL);
- /*
- * If we don't currently hold the LWLock that protects our
- * fast-path data structures, we must acquire it before attempting
- * to release the lock via the fast-path. We will continue to
- * hold the LWLock until we're done scanning the locallock table,
- * unless we hit a transferred fast-path lock. (XXX is this
- * really such a good idea? There could be a lot of entries ...)
- */
- if (!have_fast_path_lwlock)
- {
- LWLockAcquire(&MyProc->fpInfoLock, LW_EXCLUSIVE);
- have_fast_path_lwlock = true;
- }
-
- /* Attempt fast-path release. */
- relid = locallock->tag.lock.locktag_field2;
- if (FastPathUnGrantRelationLock(relid, lockmode))
- {
- RemoveLocalLock(locallock);
- continue;
+ if (locallockowner->nLocks > 0 &&
+ LOCALLOCK_LOCKMETHOD(*locallock) == DEFAULT_LOCKMETHOD)
+ Assert(false);
}
-
- /*
- * Our lock, originally taken via the fast path, has been
- * transferred to the main lock table. That's going to require
- * some extra work, so release our fast-path lock before starting.
- */
- LWLockRelease(&MyProc->fpInfoLock);
- have_fast_path_lwlock = false;
-
- /*
- * Now dump the lock. We haven't got a pointer to the LOCK or
- * PROCLOCK in this case, so we have to handle this a bit
- * differently than a normal lock release. Unfortunately, this
- * requires an extra LWLock acquire-and-release cycle on the
- * partitionLock, but hopefully it shouldn't happen often.
- */
- LockRefindAndRelease(lockMethodTable, MyProc,
- &locallock->tag.lock, lockmode, false);
- RemoveLocalLock(locallock);
- continue;
}
-
- /* Mark the proclock to show we need to release this lockmode */
- if (locallock->nLocks > 0)
- locallock->proclock->releaseMask |= LOCKBIT_ON(locallock->tag.mode);
-
- /* And remove the locallock hashtable entry */
- RemoveLocalLock(locallock);
}
-
- /* Done with the fast-path data structures */
- if (have_fast_path_lwlock)
- LWLockRelease(&MyProc->fpInfoLock);
-
- /*
- * Now, scan each lock partition separately.
- */
- for (partition = 0; partition < NUM_LOCK_PARTITIONS; partition++)
- {
- LWLock *partitionLock;
- dlist_head *procLocks = &MyProc->myProcLocks[partition];
- dlist_mutable_iter proclock_iter;
-
- partitionLock = LockHashPartitionLockByIndex(partition);
-
- /*
- * If the proclock list for this partition is empty, we can skip
- * acquiring the partition lock. This optimization is trickier than
- * it looks, because another backend could be in process of adding
- * something to our proclock list due to promoting one of our
- * fast-path locks. However, any such lock must be one that we
- * decided not to delete above, so it's okay to skip it again now;
- * we'd just decide not to delete it again. We must, however, be
- * careful to re-fetch the list header once we've acquired the
- * partition lock, to be sure we have a valid, up-to-date pointer.
- * (There is probably no significant risk if pointer fetch/store is
- * atomic, but we don't wish to assume that.)
- *
- * XXX This argument assumes that the locallock table correctly
- * represents all of our fast-path locks. While allLocks mode
- * guarantees to clean up all of our normal locks regardless of the
- * locallock situation, we lose that guarantee for fast-path locks.
- * This is not ideal.
- */
- if (dlist_is_empty(procLocks))
- continue; /* needn't examine this partition */
-
- LWLockAcquire(partitionLock, LW_EXCLUSIVE);
-
- dlist_foreach_modify(proclock_iter, procLocks)
- {
- PROCLOCK *proclock = dlist_container(PROCLOCK, procLink, proclock_iter.cur);
- bool wakeupNeeded = false;
-
- Assert(proclock->tag.myProc == MyProc);
-
- lock = proclock->tag.myLock;
-
- /* Ignore items that are not of the lockmethod to be removed */
- if (LOCK_LOCKMETHOD(*lock) != lockmethodid)
- continue;
-
- /*
- * In allLocks mode, force release of all locks even if locallock
- * table had problems
- */
- if (allLocks)
- proclock->releaseMask = proclock->holdMask;
- else
- Assert((proclock->releaseMask & ~proclock->holdMask) == 0);
-
- /*
- * Ignore items that have nothing to be released, unless they have
- * holdMask == 0 and are therefore recyclable
- */
- if (proclock->releaseMask == 0 && proclock->holdMask != 0)
- continue;
-
- PROCLOCK_PRINT("LockReleaseAll", proclock);
- LOCK_PRINT("LockReleaseAll", lock, 0);
- Assert(lock->nRequested >= 0);
- Assert(lock->nGranted >= 0);
- Assert(lock->nGranted <= lock->nRequested);
- Assert((proclock->holdMask & ~lock->grantMask) == 0);
-
- /*
- * Release the previously-marked lock modes
- */
- for (i = 1; i <= numLockModes; i++)
- {
- if (proclock->releaseMask & LOCKBIT_ON(i))
- wakeupNeeded |= UnGrantLock(lock, i, proclock,
- lockMethodTable);
- }
- Assert((lock->nRequested >= 0) && (lock->nGranted >= 0));
- Assert(lock->nGranted <= lock->nRequested);
- LOCK_PRINT("LockReleaseAll: updated", lock, 0);
-
- proclock->releaseMask = 0;
-
- /* CleanUpLock will wake up waiters if needed. */
- CleanUpLock(lock, proclock,
- lockMethodTable,
- LockTagHashCode(&lock->tag),
- wakeupNeeded);
- } /* loop over PROCLOCKs within this partition */
-
- LWLockRelease(partitionLock);
- } /* loop over partitions */
-
-#ifdef LOCK_DEBUG
- if (*(lockMethodTable->trace_flag))
- elog(LOG, "LockReleaseAll done");
-#endif
+ Assert(MyProc->fpLockBits == 0);
}
+#endif
/*
* LockReleaseSession -- Release all session locks of the specified lock method
@@ -2394,59 +2200,39 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
void
LockReleaseSession(LOCKMETHODID lockmethodid)
{
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
+ dlist_mutable_iter iter;
if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
elog(ERROR, "unrecognized lock method: %d", lockmethodid);
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ dlist_foreach_modify(iter, &session_locks[lockmethodid - 1])
{
- /* Ignore items that are not of the specified lock method */
- if (LOCALLOCK_LOCKMETHOD(*locallock) != lockmethodid)
- continue;
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, resowner_node, iter.cur);
- ReleaseLockIfHeld(locallock, true);
+ Assert(LOCALLOCK_LOCKMETHOD(*locallockowner->locallock) == lockmethodid);
+
+ ReleaseLockIfHeld(locallockowner, true);
}
+
+ Assert(dlist_is_empty(&session_locks[lockmethodid - 1]));
}
/*
* LockReleaseCurrentOwner
- * Release all locks belonging to CurrentResourceOwner
- *
- * If the caller knows what those locks are, it can pass them as an array.
- * That speeds up the call significantly, when a lot of locks are held.
- * Otherwise, pass NULL for locallocks, and we'll traverse through our hash
- * table to find them.
+ * Release all locks belonging to 'owner'
*/
void
-LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks)
+LockReleaseCurrentOwner(ResourceOwner owner, LOCALLOCKOWNER *locallockowner)
{
- if (locallocks == NULL)
- {
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
-
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
- ReleaseLockIfHeld(locallock, false);
- }
- else
- {
- int i;
+ Assert(locallockowner->owner == owner);
- for (i = nlocks - 1; i >= 0; i--)
- ReleaseLockIfHeld(locallocks[i], false);
- }
+ ReleaseLockIfHeld(locallockowner, false);
}
/*
* ReleaseLockIfHeld
- * Release any session-level locks on this lockable object if sessionLock
- * is true; else, release any locks held by CurrentResourceOwner.
+ * Release any session-level locks on this 'locallockowner' if
+ * sessionLock is true; else, release any locks held by 'locallockowner'.
*
* It is tempting to pass this a ResourceOwner pointer (or NULL for session
* locks), but without refactoring LockRelease() we cannot support releasing
@@ -2457,52 +2243,39 @@ LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks)
* convenience.
*/
static void
-ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock)
+ReleaseLockIfHeld(LOCALLOCKOWNER *locallockowner, bool sessionLock)
{
- ResourceOwner owner;
- LOCALLOCKOWNER *lockOwners;
- int i;
+ LOCALLOCK *locallock = locallockowner->locallock;
+
+ /* release all references to the lock by this resource owner */
- /* Identify owner for lock (must match LockRelease!) */
if (sessionLock)
- owner = NULL;
+ Assert(locallockowner->owner == NULL);
else
- owner = CurrentResourceOwner;
+ Assert(locallockowner->owner != NULL);
- /* Scan to see if there are any locks belonging to the target owner */
- lockOwners = locallock->lockOwners;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ /* We will still hold this lock after forgetting this ResourceOwner. */
+ if (locallockowner->nLocks < locallock->nLocks)
{
- if (lockOwners[i].owner == owner)
- {
- Assert(lockOwners[i].nLocks > 0);
- if (lockOwners[i].nLocks < locallock->nLocks)
- {
- /*
- * We will still hold this lock after forgetting this
- * ResourceOwner.
- */
- locallock->nLocks -= lockOwners[i].nLocks;
- /* compact out unused slot */
- locallock->numLockOwners--;
- if (owner != NULL)
- ResourceOwnerForgetLock(owner, locallock);
- if (i < locallock->numLockOwners)
- lockOwners[i] = lockOwners[locallock->numLockOwners];
- }
- else
- {
- Assert(lockOwners[i].nLocks == locallock->nLocks);
- /* We want to call LockRelease just once */
- lockOwners[i].nLocks = 1;
- locallock->nLocks = 1;
- if (!LockRelease(&locallock->tag.lock,
- locallock->tag.mode,
- sessionLock))
- elog(WARNING, "ReleaseLockIfHeld: failed??");
- }
- break;
- }
+ locallock->nLocks -= locallockowner->nLocks;
+ dlist_delete(&locallockowner->locallock_node);
+
+ if (sessionLock)
+ dlist_delete(&locallockowner->resowner_node);
+ else
+ ResourceOwnerForgetLock(locallockowner);
+ }
+ else
+ {
+ Assert(locallockowner->nLocks == locallock->nLocks);
+ /* We want to call LockRelease just once */
+ locallockowner->nLocks = 1;
+ locallock->nLocks = 1;
+
+ if (!LockRelease(&locallock->tag.lock,
+ locallock->tag.mode,
+ sessionLock))
+ elog(WARNING, "ReleaseLockIfHeld: failed??");
}
}
@@ -2510,82 +2283,47 @@ ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock)
* LockReassignCurrentOwner
* Reassign all locks belonging to CurrentResourceOwner to belong
* to its parent resource owner.
- *
- * If the caller knows what those locks are, it can pass them as an array.
- * That speeds up the call significantly, when a lot of locks are held
- * (e.g pg_dump with a large schema). Otherwise, pass NULL for locallocks,
- * and we'll traverse through our hash table to find them.
*/
void
-LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks)
+LockReassignCurrentOwner(LOCALLOCKOWNER *locallockowner)
{
ResourceOwner parent = ResourceOwnerGetParent(CurrentResourceOwner);
- Assert(parent != NULL);
-
- if (locallocks == NULL)
- {
- HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
-
- hash_seq_init(&status, LockMethodLocalHash);
-
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
- LockReassignOwner(locallock, parent);
- }
- else
- {
- int i;
-
- for (i = nlocks - 1; i >= 0; i--)
- LockReassignOwner(locallocks[i], parent);
- }
+ LockReassignOwner(locallockowner, parent);
}
/*
- * Subroutine of LockReassignCurrentOwner. Reassigns a given lock belonging to
- * CurrentResourceOwner to its parent.
+ * Subroutine of LockReassignCurrentOwner. Reassigns the given
+ * 'locallockowner' to 'parent'.
*/
static void
-LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent)
+LockReassignOwner(LOCALLOCKOWNER *locallockowner, ResourceOwner parent)
{
- LOCALLOCKOWNER *lockOwners;
- int i;
- int ic = -1;
- int ip = -1;
+ dlist_iter iter;
+ LOCALLOCK *locallock = locallockowner->locallock;
- /*
- * Scan to see if there are any locks belonging to current owner or its
- * parent
- */
- lockOwners = locallock->lockOwners;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ ResourceOwnerForgetLock(locallockowner);
+
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == CurrentResourceOwner)
- ic = i;
- else if (lockOwners[i].owner == parent)
- ip = i;
- }
+ LOCALLOCKOWNER *parentlocalowner = dlist_container(LOCALLOCKOWNER, locallock_node, iter.cur);
- if (ic < 0)
- return; /* no current locks */
+ Assert(parentlocalowner->locallock == locallock);
- if (ip < 0)
- {
- /* Parent has no slot, so just give it the child's slot */
- lockOwners[ic].owner = parent;
- ResourceOwnerRememberLock(parent, locallock);
- }
- else
- {
- /* Merge child's count with parent's */
- lockOwners[ip].nLocks += lockOwners[ic].nLocks;
- /* compact out unused slot */
- locallock->numLockOwners--;
- if (ic < locallock->numLockOwners)
- lockOwners[ic] = lockOwners[locallock->numLockOwners];
+ if (parentlocalowner->owner != parent)
+ continue;
+
+ parentlocalowner->nLocks += locallockowner->nLocks;
+
+ locallockowner->nLocks = 0;
+ dlist_delete(&locallockowner->locallock_node);
+ pfree(locallockowner);
+ return;
}
- ResourceOwnerForgetLock(CurrentResourceOwner, locallock);
+
+ /* reassign locallockowner to parent resowner */
+ locallockowner->owner = parent;
+ ResourceOwnerRememberLock(parent, locallockowner);
}
/*
@@ -3057,7 +2795,7 @@ GetLockConflicts(const LOCKTAG *locktag, LOCKMODE lockmode, int *countp)
* We currently use this in two situations: first, to release locks held by
* prepared transactions on commit (see lock_twophase_postcommit); and second,
* to release locks taken via the fast-path, transferred to the main hash
- * table, and then released (see LockReleaseAll).
+ * table, and then released (see ResourceOwnerRelease).
*/
static void
LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
@@ -3163,16 +2901,9 @@ LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
* we can't implement this check by examining LOCALLOCK entries in isolation.
* We must build a transient hashtable that is indexed by locktag only.
*/
-static void
+static HTAB *
CheckForSessionAndXactLocks(void)
{
- typedef struct
- {
- LOCKTAG lock; /* identifies the lockable object */
- bool sessLock; /* is any lockmode held at session level? */
- bool xactLock; /* is any lockmode held at xact level? */
- } PerLockTagEntry;
-
HASHCTL hash_ctl;
HTAB *lockhtab;
HASH_SEQ_STATUS status;
@@ -3193,10 +2924,9 @@ CheckForSessionAndXactLocks(void)
while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
PerLockTagEntry *hentry;
bool found;
- int i;
+ dlist_iter iter;
/*
* Ignore VXID locks. We don't want those to be held by prepared
@@ -3217,9 +2947,13 @@ CheckForSessionAndXactLocks(void)
hentry->sessLock = hentry->xactLock = false;
/* Scan to see if we hold lock at session or xact level or both */
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == NULL)
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ locallock_node,
+ iter.cur);
+
+ if (locallockowner->owner == NULL)
hentry->sessLock = true;
else
hentry->xactLock = true;
@@ -3235,8 +2969,7 @@ CheckForSessionAndXactLocks(void)
errmsg("cannot PREPARE while holding both session-level and transaction-level locks on the same object")));
}
- /* Success, so clean up */
- hash_destroy(lockhtab);
+ return lockhtab;
}
/*
@@ -3244,6 +2977,11 @@ CheckForSessionAndXactLocks(void)
* Do the preparatory work for a PREPARE: make 2PC state file records
* for all locks currently held.
*
+ * Returns a hash table of PerLockTagEntry structs with an entry for each
+ * lock held by this backend marking if the lock is held at the session or
+ * xact level, or both. It is up to the calling function to call
+ * hash_destroy() on this table to free the memory used by it.
+ *
* Session-level locks are ignored, as are VXID locks.
*
* For the most part, we don't need to touch shared memory for this ---
@@ -3251,14 +2989,15 @@ CheckForSessionAndXactLocks(void)
* Fast-path locks are an exception, however: we move any such locks to
* the main table before allowing PREPARE TRANSACTION to succeed.
*/
-void
+HTAB *
AtPrepare_Locks(void)
{
HASH_SEQ_STATUS status;
LOCALLOCK *locallock;
+ HTAB *sessionandxactlocks;
/* First, verify there aren't locks of both xact and session level */
- CheckForSessionAndXactLocks();
+ sessionandxactlocks = CheckForSessionAndXactLocks();
/* Now do the per-locallock cleanup work */
hash_seq_init(&status, LockMethodLocalHash);
@@ -3266,10 +3005,9 @@ AtPrepare_Locks(void)
while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
{
TwoPhaseLockRecord record;
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
bool haveSessionLock;
bool haveXactLock;
- int i;
+ dlist_iter iter;
/*
* Ignore VXID locks. We don't want those to be held by prepared
@@ -3284,9 +3022,13 @@ AtPrepare_Locks(void)
/* Scan to see whether we hold it at session or transaction level */
haveSessionLock = haveXactLock = false;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == NULL)
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ locallock_node,
+ iter.cur);
+
+ if (locallockowner->owner == NULL)
haveSessionLock = true;
else
haveXactLock = true;
@@ -3330,6 +3072,8 @@ AtPrepare_Locks(void)
RegisterTwoPhaseRecord(TWOPHASE_RM_LOCK_ID, 0,
&record, sizeof(TwoPhaseLockRecord));
}
+
+ return sessionandxactlocks;
}
/*
@@ -3344,11 +3088,11 @@ AtPrepare_Locks(void)
* pointers in the transaction's resource owner. This is OK at the
* moment since resowner.c doesn't try to free locks retail at a toplevel
* transaction commit or abort. We could alternatively zero out nLocks
- * and leave the LOCALLOCK entries to be garbage-collected by LockReleaseAll,
- * but that probably costs more cycles.
+ * and leave the LOCALLOCK entries to be garbage-collected by
+ * ResourceOwnerRelease, but that probably costs more cycles.
*/
void
-PostPrepare_Locks(TransactionId xid)
+PostPrepare_Locks(TransactionId xid, HTAB *sessionandxactlocks)
{
PGPROC *newproc = TwoPhaseGetDummyProc(xid, false);
HASH_SEQ_STATUS status;
@@ -3378,10 +3122,9 @@ PostPrepare_Locks(TransactionId xid)
while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
{
- LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
bool haveSessionLock;
bool haveXactLock;
- int i;
+ dlist_iter iter;
if (locallock->proclock == NULL || locallock->lock == NULL)
{
@@ -3400,9 +3143,13 @@ PostPrepare_Locks(TransactionId xid)
/* Scan to see whether we hold it at session or transaction level */
haveSessionLock = haveXactLock = false;
- for (i = locallock->numLockOwners - 1; i >= 0; i--)
+ dlist_foreach(iter, &locallock->locallockowners)
{
- if (lockOwners[i].owner == NULL)
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ locallock_node,
+ iter.cur);
+
+ if (locallockowner->owner == NULL)
haveSessionLock = true;
else
haveXactLock = true;
@@ -3418,10 +3165,6 @@ PostPrepare_Locks(TransactionId xid)
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot PREPARE while holding both session-level and transaction-level locks on the same object")));
- /* Mark the proclock to show we need to release this lockmode */
- if (locallock->nLocks > 0)
- locallock->proclock->releaseMask |= LOCKBIT_ON(locallock->tag.mode);
-
/* And remove the locallock hashtable entry */
RemoveLocalLock(locallock);
}
@@ -3439,11 +3182,7 @@ PostPrepare_Locks(TransactionId xid)
/*
* If the proclock list for this partition is empty, we can skip
- * acquiring the partition lock. This optimization is safer than the
- * situation in LockReleaseAll, because we got rid of any fast-path
- * locks during AtPrepare_Locks, so there cannot be any case where
- * another backend is adding something to our lists now. For safety,
- * though, we code this the same way as in LockReleaseAll.
+ * acquiring the partition lock.
*/
if (dlist_is_empty(procLocks))
continue; /* needn't examine this partition */
@@ -3452,6 +3191,8 @@ PostPrepare_Locks(TransactionId xid)
dlist_foreach_modify(proclock_iter, procLocks)
{
+ PerLockTagEntry *locktagentry;
+
proclock = dlist_container(PROCLOCK, procLink, proclock_iter.cur);
Assert(proclock->tag.myProc == MyProc);
@@ -3469,13 +3210,14 @@ PostPrepare_Locks(TransactionId xid)
Assert(lock->nGranted <= lock->nRequested);
Assert((proclock->holdMask & ~lock->grantMask) == 0);
- /* Ignore it if nothing to release (must be a session lock) */
- if (proclock->releaseMask == 0)
- continue;
+ locktagentry = hash_search(sessionandxactlocks,
+ &lock->tag,
+ HASH_FIND,
+ NULL);
- /* Else we should be releasing all locks */
- if (proclock->releaseMask != proclock->holdMask)
- elog(PANIC, "we seem to have dropped a bit somewhere");
+ /* skip session locks */
+ if (locktagentry != NULL && locktagentry->sessLock)
+ continue;
/*
* We cannot simply modify proclock->tag.myProc to reassign
@@ -4245,7 +3987,6 @@ lock_twophase_recover(TransactionId xid, uint16 info,
Assert(proc->lockGroupLeader == NULL);
proclock->groupLeader = proc;
proclock->holdMask = 0;
- proclock->releaseMask = 0;
/* Add proclock to appropriate lists */
dlist_push_tail(&lock->procLocks, &proclock->lockLink);
dlist_push_tail(&proc->myProcLocks[partition],
@@ -4382,7 +4123,7 @@ lock_twophase_postabort(TransactionId xid, uint16 info,
*
* We don't bother recording this lock in the local lock table, since it's
* only ever released at the end of a transaction. Instead,
- * LockReleaseAll() calls VirtualXactLockTableCleanup().
+ * ProcReleaseLocks() calls VirtualXactLockTableCleanup().
*/
void
VirtualXactLockTableInsert(VirtualTransactionId vxid)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e9e445bb216..9a7ea73cdcd 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -777,10 +777,17 @@ ProcReleaseLocks(bool isCommit)
return;
/* If waiting, get off wait queue (should only be needed after error) */
LockErrorCleanup();
- /* Release standard locks, including session-level if aborting */
- LockReleaseAll(DEFAULT_LOCKMETHOD, !isCommit);
- /* Release transaction-level advisory locks */
- LockReleaseAll(USER_LOCKMETHOD, false);
+
+ VirtualXactLockTableCleanup();
+
+ /* Release session-level locks if aborting */
+ if (!isCommit)
+ LockReleaseSession(DEFAULT_LOCKMETHOD);
+
+#ifdef USE_ASSERT_CHECKING
+ /* Ensure all locks were released */
+ LockAssertNoneHeld(isCommit);
+#endif
}
@@ -865,6 +872,8 @@ ProcKill(int code, Datum arg)
LWLockRelease(leader_lwlock);
}
+ Assert(MyProc->fpLockBits == 0);
+
/*
* Reset MyLatch to the process local one. This is so that signal
* handlers et al can continue using the latch after the shared latch
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 552cf9d950a..d64998e4ee8 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -1353,10 +1353,10 @@ ShutdownPostgres(int code, Datum arg)
AbortOutOfAnyTransaction();
/*
- * User locks are not released by transaction end, so be sure to release
- * them explicitly.
+ * Session locks are not released by transaction end, so be sure to
+ * release them explicitly.
*/
- LockReleaseAll(USER_LOCKMETHOD, true);
+ LockReleaseSession(USER_LOCKMETHOD);
}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index f096f3df20a..96417e3cb94 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -91,18 +91,6 @@ typedef struct ResourceElem
StaticAssertDecl(RESOWNER_HASH_MAX_ITEMS(RESOWNER_HASH_INIT_SIZE) >= RESOWNER_ARRAY_SIZE,
"initial hash size too small compared to array size");
-/*
- * MAX_RESOWNER_LOCKS is the size of the per-resource owner locks cache. It's
- * chosen based on some testing with pg_dump with a large schema. When the
- * tests were done (on 9.2), resource owners in a pg_dump run contained up
- * to 9 locks, regardless of the schema size, except for the top resource
- * owner which contained much more (overflowing the cache). 15 seems like a
- * nice round number that's somewhat higher than what pg_dump needs. Note that
- * making this number larger is not free - the bigger the cache, the slower
- * it is to release locks (in retail), when a resource owner holds many locks.
- */
-#define MAX_RESOWNER_LOCKS 15
-
/*
* ResourceOwner objects look like this
*/
@@ -152,10 +140,10 @@ typedef struct ResourceOwnerData
uint32 capacity; /* allocated length of hash[] */
uint32 grow_at; /* grow hash when reach this */
- /* The local locks cache. */
- LOCALLOCK *locks[MAX_RESOWNER_LOCKS]; /* list of owned locks */
+ dlist_head locks; /* dlist of owned locks */
} ResourceOwnerData;
+#include "lib/ilist.h"
/*****************************************************************************
* GLOBAL MEMORY *
@@ -423,6 +411,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
owner = (ResourceOwner) MemoryContextAllocZero(TopMemoryContext,
sizeof(ResourceOwnerData));
owner->name = name;
+ dlist_init(&owner->locks);
if (parent)
{
@@ -729,8 +718,19 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
+ dlist_mutable_iter iter;
+
if (isTopLevel)
{
+ dlist_foreach_modify(iter, &owner->locks)
+ {
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, resowner_node, iter.cur);
+
+ LockReleaseCurrentOwner(owner, locallockowner);
+ }
+
+ Assert(dlist_is_empty(&owner->locks));
+
/*
* For a top-level xact we are going to release all locks (or at
* least all non-session locks), so just do a single lmgr call at
@@ -749,30 +749,30 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
* subtransaction, we do NOT release its locks yet, but transfer
* them to the parent.
*/
- LOCALLOCK **locks;
- int nlocks;
+ if (isCommit)
+ {
+ dlist_foreach_modify(iter, &owner->locks)
+ {
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER,
+ resowner_node,
+ iter.cur);
- Assert(owner->parent != NULL);
+ LockReassignCurrentOwner(locallockowner);
+ }
- /*
- * Pass the list of locks owned by this resource owner to the lock
- * manager, unless it has overflowed.
- */
- if (owner->nlocks > MAX_RESOWNER_LOCKS)
- {
- locks = NULL;
- nlocks = 0;
+ Assert(dlist_is_empty(&owner->locks));
}
else
{
- locks = owner->locks;
- nlocks = owner->nlocks;
- }
+ dlist_foreach_modify(iter, &owner->locks)
+ {
+ LOCALLOCKOWNER *locallockowner = dlist_container(LOCALLOCKOWNER, resowner_node, iter.cur);
- if (isCommit)
- LockReassignCurrentOwner(locks, nlocks);
- else
- LockReleaseCurrentOwner(locks, nlocks);
+ LockReleaseCurrentOwner(owner, locallockowner);
+ }
+
+ Assert(dlist_is_empty(&owner->locks));
+ }
}
}
else if (phase == RESOURCE_RELEASE_AFTER_LOCKS)
@@ -860,7 +860,7 @@ ResourceOwnerDelete(ResourceOwner owner)
/* And it better not own any resources, either */
Assert(owner->narr == 0);
Assert(owner->nhash == 0);
- Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
+ Assert(dlist_is_empty(&owner->locks));
/*
* Delete children. The recursive call will delink the child from me, so
@@ -1034,52 +1034,59 @@ ReleaseAuxProcessResourcesCallback(int code, Datum arg)
/*
* Remember that a Local Lock is owned by a ResourceOwner
- *
- * This is different from the generic ResourceOwnerRemember in that the list of
- * locks is only a lossy cache. It can hold up to MAX_RESOWNER_LOCKS entries,
- * and when it overflows, we stop tracking locks. The point of only remembering
- * only up to MAX_RESOWNER_LOCKS entries is that if a lot of locks are held,
- * ResourceOwnerForgetLock doesn't need to scan through a large array to find
- * the entry.
*/
void
-ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCK *locallock)
+ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCKOWNER *locallockowner)
{
- Assert(locallock != NULL);
+ Assert(owner != NULL);
+ Assert(locallockowner != NULL);
- if (owner->nlocks > MAX_RESOWNER_LOCKS)
- return; /* we have already overflowed */
-
- if (owner->nlocks < MAX_RESOWNER_LOCKS)
- owner->locks[owner->nlocks] = locallock;
- else
+#ifdef USE_ASSERT_CHECKING
{
- /* overflowed */
+ dlist_iter iter;
+
+ dlist_foreach(iter, &owner->locks)
+ {
+ LOCALLOCKOWNER *i = dlist_container(LOCALLOCKOWNER, resowner_node, iter.cur);
+
+ Assert(i->locallock != locallockowner->locallock);
+ }
}
- owner->nlocks++;
+#endif
+
+ dlist_push_tail(&owner->locks, &locallockowner->resowner_node);
}
/*
- * Forget that a Local Lock is owned by a ResourceOwner
+ * Forget that a Local Lock is owned by the given LOCALLOCKOWNER.
*/
void
-ResourceOwnerForgetLock(ResourceOwner owner, LOCALLOCK *locallock)
+ResourceOwnerForgetLock(LOCALLOCKOWNER *locallockowner)
{
- int i;
+#ifdef USE_ASSERT_CHECKING
+ ResourceOwner owner;
+
+ Assert(locallockowner != NULL);
- if (owner->nlocks > MAX_RESOWNER_LOCKS)
- return; /* we have overflowed */
+ owner = locallockowner->owner;
- Assert(owner->nlocks > 0);
- for (i = owner->nlocks - 1; i >= 0; i--)
{
- if (locallock == owner->locks[i])
+ dlist_iter iter;
+ bool found = false;
+
+ dlist_foreach(iter, &owner->locks)
{
- owner->locks[i] = owner->locks[owner->nlocks - 1];
- owner->nlocks--;
- return;
+ LOCALLOCKOWNER *owner = dlist_container(LOCALLOCKOWNER, resowner_node, iter.cur);
+
+ if (locallockowner == owner)
+ {
+ Assert(!found);
+ found = true;
+ }
}
+
+ Assert(found);
}
- elog(ERROR, "lock reference %p is not owned by resource owner %s",
- locallock, owner->name);
+#endif
+ dlist_delete(&locallockowner->resowner_node);
}
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 590c026b5bf..e7f5b25f338 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -24,6 +24,7 @@
#include "storage/lwlock.h"
#include "storage/shmem.h"
#include "utils/timestamp.h"
+#include "lib/ilist.h"
/* struct PGPROC is declared in proc.h, but must forward-reference it */
typedef struct PGPROC PGPROC;
@@ -349,10 +350,6 @@ typedef struct LOCK
* Otherwise, proclock objects whose holdMasks are zero are recycled
* as soon as convenient.
*
- * releaseMask is workspace for LockReleaseAll(): it shows the locks due
- * to be released during the current call. This must only be examined or
- * set by the backend owning the PROCLOCK.
- *
* Each PROCLOCK object is linked into lists for both the associated LOCK
* object and the owning PGPROC object. Note that the PROCLOCK is entered
* into these lists as soon as it is created, even if no lock has yet been
@@ -374,7 +371,6 @@ typedef struct PROCLOCK
/* data */
PGPROC *groupLeader; /* proc's lock group leader, or proc itself */
LOCKMASK holdMask; /* bitmask for lock types currently held */
- LOCKMASK releaseMask; /* bitmask for lock types to be released */
dlist_node lockLink; /* list link in LOCK's list of proclocks */
dlist_node procLink; /* list link in PGPROC's list of proclocks */
} PROCLOCK;
@@ -420,6 +416,13 @@ typedef struct LOCALLOCKOWNER
* Must use a forward struct reference to avoid circularity.
*/
struct ResourceOwnerData *owner;
+
+ dlist_node resowner_node; /* dlist link for ResourceOwner.locks */
+
+ dlist_node locallock_node; /* dlist link for LOCALLOCK.locallockowners */
+
+ struct LOCALLOCK *locallock; /* pointer to the corresponding LOCALLOCK */
+
int64 nLocks; /* # of times held by this owner */
} LOCALLOCKOWNER;
@@ -433,9 +436,9 @@ typedef struct LOCALLOCK
LOCK *lock; /* associated LOCK object, if any */
PROCLOCK *proclock; /* associated PROCLOCK object, if any */
int64 nLocks; /* total number of times lock is held */
- int numLockOwners; /* # of relevant ResourceOwners */
- int maxLockOwners; /* allocated size of array */
- LOCALLOCKOWNER *lockOwners; /* dynamically resizable array */
+
+ dlist_head locallockowners; /* dlist of LOCALLOCKOWNER */
+
bool holdsStrongLockCount; /* bumped FastPathStrongRelationLocks */
bool lockCleared; /* we read all sinval msgs for lock */
} LOCALLOCK;
@@ -564,10 +567,16 @@ extern void AbortStrongLockAcquire(void);
extern void MarkLockClear(LOCALLOCK *locallock);
extern bool LockRelease(const LOCKTAG *locktag,
LOCKMODE lockmode, bool sessionLock);
-extern void LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks);
+
+#ifdef USE_ASSERT_CHECKING
+extern void LockAssertNoneHeld(bool isCommit);
+#endif
+
extern void LockReleaseSession(LOCKMETHODID lockmethodid);
-extern void LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks);
-extern void LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks);
+struct ResourceOwnerData;
+extern void LockReleaseCurrentOwner(struct ResourceOwnerData *owner,
+ LOCALLOCKOWNER *locallockowner);
+extern void LockReassignCurrentOwner(LOCALLOCKOWNER *locallockowner);
extern bool LockHeldByMe(const LOCKTAG *locktag, LOCKMODE lockmode);
#ifdef USE_ASSERT_CHECKING
extern HTAB *GetLockMethodLocalHash(void);
@@ -576,8 +585,8 @@ extern bool LockHasWaiters(const LOCKTAG *locktag,
LOCKMODE lockmode, bool sessionLock);
extern VirtualTransactionId *GetLockConflicts(const LOCKTAG *locktag,
LOCKMODE lockmode, int *countp);
-extern void AtPrepare_Locks(void);
-extern void PostPrepare_Locks(TransactionId xid);
+extern HTAB *AtPrepare_Locks(void);
+extern void PostPrepare_Locks(TransactionId xid, HTAB *sessionandxactlocks);
extern bool LockCheckConflicts(LockMethod lockMethodTable,
LOCKMODE lockmode,
LOCK *lock, PROCLOCK *proclock);
diff --git a/src/include/utils/resowner.h b/src/include/utils/resowner.h
index 0735480214e..cf368cedca4 100644
--- a/src/include/utils/resowner.h
+++ b/src/include/utils/resowner.h
@@ -159,8 +159,8 @@ extern void CreateAuxProcessResourceOwner(void);
extern void ReleaseAuxProcessResources(bool isCommit);
/* special support for local lock management */
-struct LOCALLOCK;
-extern void ResourceOwnerRememberLock(ResourceOwner owner, struct LOCALLOCK *locallock);
-extern void ResourceOwnerForgetLock(ResourceOwner owner, struct LOCALLOCK *locallock);
+struct LOCALLOCKOWNER;
+extern void ResourceOwnerRememberLock(ResourceOwner owner, struct LOCALLOCKOWNER *locallockowner);
+extern void ResourceOwnerForgetLock(struct LOCALLOCKOWNER *locallockowner);
#endif /* RESOWNER_H */
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index ae58438ec76..29b1cce54b7 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -36,8 +36,9 @@ extern void ResourceOwnerRememberBufferIO(ResourceOwner owner, Buffer buffer);
extern void ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer);
/* support for local lock management */
-extern void ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCK *locallock);
-extern void ResourceOwnerForgetLock(ResourceOwner owner, LOCALLOCK *locallock);
+extern void ResourceOwnerRememberLock(ResourceOwner owner,
+ LOCALLOCKOWNER *locallock);
+extern void ResourceOwnerForgetLock(LOCALLOCKOWNER *locallock);
/* support for catcache refcount management */
extern void ResourceOwnerEnlargeCatCacheRefs(ResourceOwner owner);
On Thu, 9 Nov 2023 at 21:48, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 18/09/2023 07:08, David Rowley wrote:
On Fri, 15 Sept 2023 at 22:37, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I've added a call to LockAssertNoneHeld(false) in there.
I don't see it in the patch?
hmm. I must've git format-patch before committing that part.
I'll try that again... see attached.
This needed a rebase after my ResourceOwner refactoring. Attached.
A few quick comments:
- It would be nice to add a test for the issue that you fixed in patch v7, i.e. if you prepare a transaction while holding session-level locks.
- GrantLockLocal() now calls MemoryContextAlloc(), which can fail if you are out of memory. Is that handled gracefully or is the lock leaked?
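To make the concern in that second comment concrete, here is a minimal sketch (an editorial illustration only, not code from the attached patch) of one ordering in which an out-of-memory failure inside GrantLockLocal() cannot leave the lock counted as held but untracked: the allocation that can throw happens before any bookkeeping is touched. The names LocalLockOwnerContext, locallockowners, locallock_node and ResourceOwnerRememberLock() are taken from the patch above; the surrounding lock.c context is assumed, and the search for an existing LOCALLOCKOWNER entry is omitted for brevity.

static void
GrantLockLocal_sketch(LOCALLOCK *locallock, ResourceOwner owner)
{
	LOCALLOCKOWNER *locallockowner;

	/*
	 * Allocate before touching any lock state: if MemoryContextAlloc()
	 * errors out here, no counter has been bumped yet and there is
	 * nothing to leak or undo.
	 */
	locallockowner = MemoryContextAlloc(LocalLockOwnerContext,
										sizeof(LOCALLOCKOWNER));
	locallockowner->owner = owner;
	locallockowner->nLocks = 1;
	locallockowner->locallock = locallock;

	/* Only now bump the counters and publish the new owner entry. */
	locallock->nLocks++;
	dlist_push_tail(&locallock->locallockowners,
					&locallockowner->locallock_node);
	if (owner != NULL)
		ResourceOwnerRememberLock(owner, locallockowner);
}

Whether the patch needs exactly this ordering, or some other form of cleanup on error, is precisely the question raised above.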
CFBot shows that one of the tests has aborted at [1] with:
[20:54:28.535] Core was generated by `postgres: subscriber: logical
replication apply worker for subscription 16397 '.
[20:54:28.535] Program terminated with signal SIGABRT, Aborted.
[20:54:28.535] #0 __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:50
[20:54:28.535] Download failed: Invalid argument. Continuing without
source file ./signal/../sysdeps/unix/sysv/linux/raise.c.
[20:54:28.627]
[20:54:28.627] Thread 1 (Thread 0x7f0ea02d1a40 (LWP 50984)):
[20:54:28.627] #0 __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:50
...
...
[20:54:28.627] #2 0x00005618e989d62f in ExceptionalCondition
(conditionName=conditionName@entry=0x5618e9b40f70
"dlist_is_empty(&(MyProc->myProcLocks[i]))",
fileName=fileName@entry=0x5618e9b40ec0
"../src/backend/storage/lmgr/proc.c", lineNumber=lineNumber@entry=856)
at ../src/backend/utils/error/assert.c:66
[20:54:28.627] No locals.
[20:54:28.627] #3 0x00005618e95e6847 in ProcKill (code=<optimized
out>, arg=<optimized out>) at ../src/backend/storage/lmgr/proc.c:856
[20:54:28.627] i = <optimized out>
[20:54:28.627] proc = <optimized out>
[20:54:28.627] procgloballist = <optimized out>
[20:54:28.627] __func__ = "ProcKill"
[20:54:28.627] #4 0x00005618e959ebcc in shmem_exit
(code=code@entry=1) at ../src/backend/storage/ipc/ipc.c:276
[20:54:28.627] __func__ = "shmem_exit"
[20:54:28.627] #5 0x00005618e959ecd0 in proc_exit_prepare
(code=code@entry=1) at ../src/backend/storage/ipc/ipc.c:198
[20:54:28.627] __func__ = "proc_exit_prepare"
[20:54:28.627] #6 0x00005618e959ee8e in proc_exit (code=code@entry=1)
at ../src/backend/storage/ipc/ipc.c:111
[20:54:28.627] __func__ = "proc_exit"
[20:54:28.627] #7 0x00005618e94aa54d in BackgroundWorkerMain () at
../src/backend/postmaster/bgworker.c:805
[20:54:28.627] local_sigjmp_buf = {{__jmpbuf =
{94665009627112, -3865857745677845768, 0, 0, 140732736634980, 1,
3865354362587970296, 7379258256398875384}, __mask_was_saved = 1,
__saved_mask = {__val = {18446744066192964099, 94665025527920,
94665025527920, 94665025527920, 0, 94665025528120, 8192, 1,
94664997686410, 94665009627040, 94664997622076, 94665025527920, 1, 0,
0, 140732736634980}}}}
[20:54:28.627] worker = 0x5618eb37c570
[20:54:28.627] entrypt = <optimized out>
[20:54:28.627] __func__ = "BackgroundWorkerMain"
[20:54:28.627] #8 0x00005618e94b495c in do_start_bgworker
(rw=rw@entry=0x5618eb3b73c8) at
../src/backend/postmaster/postmaster.c:5697
[20:54:28.627] worker_pid = <optimized out>
[20:54:28.627] __func__ = "do_start_bgworker"
[20:54:28.627] #9 0x00005618e94b4c32 in maybe_start_bgworkers () at
../src/backend/postmaster/postmaster.c:5921
[20:54:28.627] rw = 0x5618eb3b73c8
[20:54:28.627] num_launched = 0
[20:54:28.627] now = 0
[20:54:28.627] iter = {cur = 0x5618eb3b79a8, next =
0x5618eb382a20, prev = 0x5618ea44a980 <BackgroundWorkerList>}
[20:54:28.627] #10 0x00005618e94b574a in process_pm_pmsignal () at
../src/backend/postmaster/postmaster.c:5073
[20:54:28.627] __func__ = "process_pm_pmsignal"
[20:54:28.627] #11 0x00005618e94b5f4a in ServerLoop () at
../src/backend/postmaster/postmaster.c:1760
[1]: https://cirrus-ci.com/task/5118173163290624?logs=cores#L51
Regards,
Vignesh