Moving relation extension locks out of heavyweight lock manager
Hi all,
Currently, the relation extension lock is implemented using the
heavyweight lock manager, and almost all functions (except for
brin_page_cleanup) call LockRelationForExtension with ExclusiveLock
mode. But it doesn't actually need multiple lock modes, deadlock
detection, or any of the other functionality that the heavyweight lock
manager provides. I think it's enough to use something like LWLock. So
I'd like to propose changing relation extension lock management so that
it works using LWLock instead.
The attached draft patch makes relation extension locks use LWLock
rather than the heavyweight lock manager, backed by a shared hash table
that stores the state of each relation extension lock. The basic idea is
that we add a hash table in shared memory for relation extension locks,
where each hash entry contains an LWLock struct. Whenever a process
wants to acquire a relation extension lock, it looks up the appropriate
LWLock entry in the hash table and acquires it. The process can remove a
hash entry when unlocking it if nobody else is holding or waiting on it.
This work would be helpful not only for existing workloads but also for
future work such as the parallel utility commands discussed on other
threads[1]. At least for parallel vacuum, this feature helps to solve an
issue that the current parallel vacuum implementation has.
I ran pgbench three times for 10 min each (scale factor is 5000); here
are the performance measurement results.
clients TPS(HEAD) TPS(Patched)
4 2092.612 2031.277
8 3153.732 3046.789
16 4562.072 4625.419
32 6439.391 6479.526
64 7767.364 7779.636
100 7917.173 7906.567
* 16 core Xeon E5620 2.4GHz
* 32 GB RAM
* ioDrive
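The post doesn't give the exact pgbench commands, so the following is only an assumed invocation matching the stated parameters (scale factor 5000, 10-minute runs, three runs per client count):

```shell
# Assumed invocation -- not quoted from the post.
# Initialize a scale-5000 database once:
pgbench -i -s 5000 bench

# For each client count, run for 10 minutes (600 s), three times:
for c in 4 8 16 32 64 100; do
  for run in 1 2 3; do
    pgbench -c "$c" -j "$c" -T 600 bench
  done
done
```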
With the current implementation, there seems to be no performance degradation so far.
Please give me feedback.
[1]:
* Block level parallel vacuum WIP
</messages/by-id/CAD21AoD1xAqp4zK-Vi1cuY3feq2oO8HcpJiz32UDUfe0BE31Xw@mail.gmail.com>
* CREATE TABLE with parallel workers, 10.0?
</messages/by-id/CAFBoRzeoDdjbPV4riCE+2ApV+Y8nV4HDepYUGftm5SuKWna3rQ@mail.gmail.com>
* utility commands benefiting from parallel plan
</messages/by-id/CAJrrPGcY3SZa40vU+R8d8dunXp9JRcFyjmPn2RF9_4cxjHd7uA@mail.gmail.com>
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachments:
Moving_extension_lock_out_of_heavyweight_lock_v1.patch (application/octet-stream)
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 1725591..69c5c9f 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -613,8 +613,8 @@ brin_page_cleanup(Relation idxrel, Buffer buf)
*/
if (PageIsNew(page))
{
- LockRelationForExtension(idxrel, ShareLock);
- UnlockRelationForExtension(idxrel, ShareLock);
+ LockRelationForExtension(idxrel, LW_SHARED);
+ UnlockRelationForExtension(idxrel, LW_SHARED);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
if (PageIsNew(page))
@@ -706,7 +706,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
*/
if (!RELATION_IS_LOCAL(irel))
{
- LockRelationForExtension(irel, ExclusiveLock);
+ LockRelationForExtension(irel, LW_EXCLUSIVE);
extensionLockHeld = true;
}
buf = ReadBuffer(irel, P_NEW);
@@ -758,7 +758,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
}
if (extensionLockHeld)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel, LW_EXCLUSIVE);
ReleaseBuffer(buf);
return InvalidBuffer;
@@ -768,7 +768,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
if (extensionLockHeld)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel, LW_EXCLUSIVE);
page = BufferGetPage(buf);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index 9ed279b..1d07e10 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -567,7 +567,7 @@ revmap_physical_extend(BrinRevmap *revmap)
else
{
if (needLock)
- LockRelationForExtension(irel, ExclusiveLock);
+ LockRelationForExtension(irel, LW_EXCLUSIVE);
buf = ReadBuffer(irel, P_NEW);
if (BufferGetBlockNumber(buf) != mapBlk)
@@ -579,7 +579,7 @@ revmap_physical_extend(BrinRevmap *revmap)
* page from under whoever is using it.
*/
if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel, LW_EXCLUSIVE);
LockBuffer(revmap->rm_metaBuf, BUFFER_LOCK_UNLOCK);
ReleaseBuffer(buf);
return;
@@ -588,7 +588,7 @@ revmap_physical_extend(BrinRevmap *revmap)
page = BufferGetPage(buf);
if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel, LW_EXCLUSIVE);
}
/* Check that it's a regular block (or an empty page) */
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index d03d59d..c98c194 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -323,13 +323,13 @@ GinNewBuffer(Relation index)
/* Must extend the file */
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, LW_EXCLUSIVE);
buffer = ReadBuffer(index, P_NEW);
LockBuffer(buffer, GIN_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index, LW_EXCLUSIVE);
return buffer;
}
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 26c077a..d139b76 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -716,10 +716,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, LW_EXCLUSIVE);
npages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index, LW_EXCLUSIVE);
totFreePages = 0;
@@ -766,10 +766,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
stats->pages_free = totFreePages;
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, LW_EXCLUSIVE);
stats->num_pages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index, LW_EXCLUSIVE);
return stats;
}
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index cbdaec9..a3e8186 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -801,13 +801,13 @@ gistNewBuffer(Relation r)
needLock = !RELATION_IS_LOCAL(r);
if (needLock)
- LockRelationForExtension(r, ExclusiveLock);
+ LockRelationForExtension(r, LW_EXCLUSIVE);
buffer = ReadBuffer(r, P_NEW);
LockBuffer(buffer, GIST_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(r, ExclusiveLock);
+ UnlockRelationForExtension(r, LW_EXCLUSIVE);
return buffer;
}
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 77d9d12..e85eb7d 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -59,10 +59,10 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
/* try to find deleted pages */
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, LW_EXCLUSIVE);
npages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, LW_EXCLUSIVE);
totFreePages = 0;
for (blkno = GIST_ROOT_BLKNO + 1; blkno < npages; blkno++)
@@ -91,10 +91,10 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
/* return statistics */
stats->pages_free = totFreePages;
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, LW_EXCLUSIVE);
stats->num_pages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, LW_EXCLUSIVE);
return stats;
}
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 6529fe3..812f7e0 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -519,11 +519,11 @@ loop:
if (needLock)
{
if (!use_fsm)
- LockRelationForExtension(relation, ExclusiveLock);
- else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock))
+ LockRelationForExtension(relation, LW_EXCLUSIVE);
+ else if (!ConditionalLockRelationForExtension(relation, LW_EXCLUSIVE))
{
/* Couldn't get the lock immediately; wait for it. */
- LockRelationForExtension(relation, ExclusiveLock);
+ LockRelationForExtension(relation, LW_EXCLUSIVE);
/*
* Check if some other backend has extended a block for us while
@@ -537,7 +537,7 @@ loop:
*/
if (targetBlock != InvalidBlockNumber)
{
- UnlockRelationForExtension(relation, ExclusiveLock);
+ UnlockRelationForExtension(relation, LW_EXCLUSIVE);
goto loop;
}
@@ -576,7 +576,7 @@ loop:
* against vacuumlazy.c --- see comments therein.
*/
if (needLock)
- UnlockRelationForExtension(relation, ExclusiveLock);
+ UnlockRelationForExtension(relation, LW_EXCLUSIVE);
/*
* We need to initialize the empty new page. Double-check that it really
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e5616ce..aaba35b 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -641,7 +641,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
* Note that another backend might have extended or created the relation
* by the time we get the lock.
*/
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, LW_EXCLUSIVE);
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
@@ -679,7 +679,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
/* Update local cache with the up-to-date size */
rel->rd_smgr->smgr_vm_nblocks = vm_nblocks_now;
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, LW_EXCLUSIVE);
pfree(pg);
}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index f815fd4..7ac9a2e 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -658,7 +658,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
needLock = !RELATION_IS_LOCAL(rel);
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, LW_EXCLUSIVE);
buf = ReadBuffer(rel, P_NEW);
@@ -672,7 +672,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
* condition against btvacuumscan --- see comments therein.
*/
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, LW_EXCLUSIVE);
/* Initialize the new page before returning it */
page = BufferGetPage(buf);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 775f2ff..ae076ae 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1059,10 +1059,10 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
{
/* Get the current relation length */
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, LW_EXCLUSIVE);
num_pages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, LW_EXCLUSIVE);
/* Quit if we've scanned the whole relation */
if (blkno >= num_pages)
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index e57ac49..ab88a07 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -230,13 +230,13 @@ SpGistNewBuffer(Relation index)
/* Must extend the file */
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, LW_EXCLUSIVE);
buffer = ReadBuffer(index, P_NEW);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index, LW_EXCLUSIVE);
return buffer;
}
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index cce9b3f..84c3502 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -824,10 +824,10 @@ spgvacuumscan(spgBulkDeleteState *bds)
{
/* Get the current relation length */
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, LW_EXCLUSIVE);
num_pages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index, LW_EXCLUSIVE);
/* Quit if we've scanned the whole relation */
if (blkno >= num_pages)
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 5b43a66..658b98c 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -851,8 +851,8 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* it's got exclusive lock on the whole relation.
*/
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
- LockRelationForExtension(onerel, ExclusiveLock);
- UnlockRelationForExtension(onerel, ExclusiveLock);
+ LockRelationForExtension(onerel, LW_EXCLUSIVE);
+ UnlockRelationForExtension(onerel, LW_EXCLUSIVE);
LockBufferForCleanup(buf);
if (PageIsNew(page))
{
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 4648473..a099fd8 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -624,7 +624,7 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
* Note that another backend might have extended or created the relation
* by the time we get the lock.
*/
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, LW_EXCLUSIVE);
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
@@ -652,7 +652,7 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
/* Update local cache with the up-to-date size */
rel->rd_smgr->smgr_fsm_nblocks = fsm_nblocks_now;
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, LW_EXCLUSIVE);
pfree(pg);
}
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index fe98898..0901bec 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -24,6 +24,31 @@
#include "storage/procarray.h"
#include "utils/inval.h"
+/*
+ * Compute the hash code associated with a RELEXTLOCK.
+ *
+ * To avoid unnecessary recomputations of the hash code, we try to do this
+ * just once per function, and then pass it around as needed. Aside from
+ * passing the hashcode to hash_search_with_hash_value(), we can extract
+ * the lock partition number from the hashcode.
+ */
+#define RelExtLockTargetTagHashCode(relextlocktargettag) \
+ get_hash_value(RelExtLockHash, (const void *) relextlocktargettag)
+
+/*
+ * The lockmgr's shared hash tables are partitioned to reduce contention.
+ * To determine which partition a given relid belongs to, compute the tag's
+ * hash code with RelExtLockTargetTagHashCode(), then apply one of these macros.
+ * NB: NUM_RELEXTLOCK_PARTITIONS must be a power of 2!
+ */
+#define RelExtLockHashPartition(hashcode) \
+ ((hashcode) % NUM_RELEXTLOCK_PARTITIONS)
+#define RelExtLockHashPartitionLock(hashcode) \
+ (&MainLWLockArray[RELEXTLOCK_MANAGER_LWLOCK_OFFSET + \
+ RelExtLockHashPartition(hashcode)].lock)
+#define RelExtLockHashPartitionLockByIndex(i) \
+ (&MainLWLockArray[RELEXTLOCK_MANAGER_LWLOCK_OFFSET + (i)].lock)
+
/*
* Per-backend counter for generating speculative insertion tokens.
@@ -57,6 +82,67 @@ typedef struct XactLockTableWaitInfo
} XactLockTableWaitInfo;
static void XactLockTableWaitErrorCb(void *arg);
+static bool CreateRelExtLock(const RELEXTLOCKTAG *targettag, uint32 hashcode,
+ LWLockMode lockmode, bool conditional);
+static void DeleteRelExtLock(const RELEXTLOCKTAG *targettag, uint32 hashcode);
+static bool RelExtLockExists(const RELEXTLOCKTAG *targettag);
+
+/*
+ * Pointers to hash tables containing lock state
+ *
+ * The RelExtLockHash hash table is in shared memory; LocalRelExtLockHash
+ * hashtable is local to each backend.
+ */
+static HTAB *RelExtLockHash;
+static HTAB *LocalRelExtLockHash;
+
+
+/*
+ * InitRelExtLock
+ * Initialize the relation extension lock manager's data structures.
+ */
+void
+InitRelExtLock(long max_table_size)
+{
+ HASHCTL info;
+ long init_table_size;
+
+ /*
+ * Compute init/max size to request for lock hashtables. Note these
+ * calculations must agree with LockShmemSize!
+ */
+ init_table_size = max_table_size / 2;
+
+ /*
+ * Allocate hash table for RELEXTLOCK structs. This stores the
+ * per-relation extension lock state.
+ */
+ MemSet(&info, 0, sizeof(info));
+ info.keysize = sizeof(Oid);
+ info.entrysize = sizeof(RELEXTLOCK);
+ info.num_partitions = NUM_RELEXTLOCK_PARTITIONS;
+
+ RelExtLockHash = ShmemInitHash("EXTRELLOCK Hash",
+ init_table_size,
+ max_table_size,
+ &info,
+ HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
+
+ if (LocalRelExtLockHash)
+ hash_destroy(LocalRelExtLockHash);
+
+ /*
+ * Allocate the non-shared hash table for LOCALRELEXTLOCK structs. This stores
+ * per-relation extension lock holding information local to this backend.
+ */
+ info.keysize = sizeof(Oid);
+ info.entrysize = sizeof(LOCALRELEXTLOCK);
+
+ LocalRelExtLockHash = hash_create("LOCALRELEXTLOCK hash",
+ 16,
+ &info,
+ HASH_ELEM | HASH_BLOBS);
+}
/*
* RelationInitLockInfo
@@ -321,7 +407,7 @@ UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode)
/*
* LockRelationForExtension
*
- * This lock tag is used to interlock addition of pages to relations.
+ * This lock is used to interlock addition of pages to relations.
* We need such locking because bufmgr/smgr definition of P_NEW is not
* race-condition-proof.
*
@@ -329,15 +415,31 @@ UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode)
* the relation, so no AcceptInvalidationMessages call is needed here.
*/
void
-LockRelationForExtension(Relation relation, LOCKMODE lockmode)
+LockRelationForExtension(Relation relation, LWLockMode lockmode)
{
- LOCKTAG tag;
+ RELEXTLOCKTAG locktag;
+ LOCALRELEXTLOCK *local_lock;
+ bool found;
+ uint32 hashcode;
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
+ locktag.relid = relation->rd_id;
+ locktag.mode = lockmode;
- (void) LockAcquire(&tag, lockmode, false, false);
+ /* Do we have the lock already? */
+ if (RelExtLockExists(&locktag))
+ return;
+
+ hashcode = RelExtLockTargetTagHashCode(&locktag);
+
+ /* Acquire lock in local hash table */
+ local_lock = (LOCALRELEXTLOCK *) hash_search_with_hash_value(LocalRelExtLockHash,
+ (void *) &locktag,
+ hashcode,
+ HASH_ENTER, &found);
+ local_lock->held = true;
+
+ /* Actually create the lock in shared hash table */
+ CreateRelExtLock(&locktag, hashcode, lockmode, false);
}
/*
@@ -347,47 +449,95 @@ LockRelationForExtension(Relation relation, LOCKMODE lockmode)
* Returns TRUE iff the lock was acquired.
*/
bool
-ConditionalLockRelationForExtension(Relation relation, LOCKMODE lockmode)
+ConditionalLockRelationForExtension(Relation relation, LWLockMode lockmode)
{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- return (LockAcquire(&tag, lockmode, false, true) != LOCKACQUIRE_NOT_AVAIL);
+ RELEXTLOCKTAG locktag;
+ LOCALRELEXTLOCK *local_lock;
+ bool found;
+ uint32 hashcode;
+ bool ret;
+
+ locktag.relid = relation->rd_id;
+ locktag.mode = lockmode;
+
+ /* Do we have the lock already? */
+ if (RelExtLockExists(&locktag))
+ return true;
+
+ hashcode = RelExtLockTargetTagHashCode(&locktag);
+
+ /* Acquire lock in local hash table; we don't yet know whether the acquisition will succeed */
+ local_lock = (LOCALRELEXTLOCK *) hash_search_with_hash_value(LocalRelExtLockHash,
+ (void *) &locktag,
+ hashcode,
+ HASH_ENTER, &found);
+ ret = CreateRelExtLock(&locktag, hashcode, lockmode, true);
+ local_lock->held = ret;
+
+ return ret;
}
/*
* RelationExtensionLockWaiterCount
*
* Count the number of processes waiting for the given relation extension lock.
+ * Note that this routine doesn't acquire the partition lock. The caller must
+ * either hold the partition lock in exclusive mode, or call this routine only
+ * after acquiring the relation extension lock of this relation.
*/
int
RelationExtensionLockWaiterCount(Relation relation)
{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- return LockWaiterCount(&tag);
+ RELEXTLOCKTAG locktag;
+ RELEXTLOCK *ext_lock;
+ bool found;
+ int nwaiters;
+ uint32 hashcode;
+
+ locktag.relid = relation->rd_id;
+ locktag.mode = LW_EXCLUSIVE;
+ hashcode = RelExtLockTargetTagHashCode(&locktag);
+
+ ext_lock = (RELEXTLOCK *) hash_search_with_hash_value(RelExtLockHash,
+ (void *) &locktag,
+ hashcode,
+ HASH_FIND, &found);
+ /* We assume that we have already acquired this lock */
+ Assert(found);
+
+ nwaiters = LWLockWaiterCount(&(ext_lock->lock));
+
+ return nwaiters;
}
/*
* UnlockRelationForExtension
*/
void
-UnlockRelationForExtension(Relation relation, LOCKMODE lockmode)
+UnlockRelationForExtension(Relation relation, LWLockMode lockmode)
{
- LOCKTAG tag;
+ RELEXTLOCKTAG locktag;
+ uint32 hashcode;
+ bool found;
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
+ locktag.relid = relation->rd_id;
+ locktag.mode = lockmode;
- LockRelease(&tag, lockmode, false);
+ /* Quick exit if we don't hold the lock */
+ if (!RelExtLockExists(&locktag))
+ return;
+
+ /* Remove hash entry from local hash table */
+ hashcode = RelExtLockTargetTagHashCode(&locktag);
+ hash_search_with_hash_value(LocalRelExtLockHash,
+ (void *) &locktag,
+ hashcode, HASH_REMOVE,
+ &found);
+
+ Assert(found);
+
+ /* Actually remove the lock in shared hash table */
+ DeleteRelExtLock(&locktag, hashcode);
}
/*
@@ -961,12 +1111,6 @@ DescribeLockTag(StringInfo buf, const LOCKTAG *tag)
tag->locktag_field2,
tag->locktag_field1);
break;
- case LOCKTAG_RELATION_EXTEND:
- appendStringInfo(buf,
- _("extension of relation %u of database %u"),
- tag->locktag_field2,
- tag->locktag_field1);
- break;
case LOCKTAG_PAGE:
appendStringInfo(buf,
_("page %u of relation %u of database %u"),
@@ -1042,3 +1186,112 @@ GetLockNameFromTagType(uint16 locktag_type)
return "???";
return LockTagTypeNames[locktag_type];
}
+
+
+/*
+ * Check whether a particular relation extension lock is held by this transaction.
+ *
+ * Note that this function may return false even when the lock exists in the
+ * local hash table, because a failed conditional lock acquisition doesn't
+ * remove the local hash entry.
+ */
+static bool
+RelExtLockExists(const RELEXTLOCKTAG *targettag)
+{
+ LOCALRELEXTLOCK *lock;
+ uint32 hashcode;
+
+ hashcode = RelExtLockTargetTagHashCode(targettag);
+
+ lock = (LOCALRELEXTLOCK *) hash_search_with_hash_value(LocalRelExtLockHash,
+ (void *) targettag,
+ hashcode, HASH_FIND, NULL);
+
+ if (!lock)
+ return false;
+
+ /*
+ * Found entry in the table, but still need to check whether it's actually
+ * held -- it could be just created when acquiring conditional lock.
+ */
+ return lock->held;
+}
+
+/*
+ * Create a RELEXTLOCK hash entry in the shared hash table. To avoid deadlock
+ * between the partition lock and the LWLock, we acquire them here but don't
+ * release them; the caller must call DeleteRelExtLock later to release both.
+ */
+static bool
+CreateRelExtLock(const RELEXTLOCKTAG *targettag, uint32 hashcode, LWLockMode lockmode,
+ bool conditional)
+{
+ RELEXTLOCK *ext_lock;
+ LWLock *partitionLock;
+ bool found;
+ bool ret = false;
+
+ partitionLock = RelExtLockHashPartitionLock(hashcode);
+ LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+
+ ext_lock = (RELEXTLOCK *) hash_search_with_hash_value(RelExtLockHash,
+ (void * ) targettag,
+ hashcode, HASH_ENTER, &found);
+
+ if (!ext_lock)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of shared memory"),
+ errhint("You might need to increase max_locks_per_transaction.")));
+
+ /* This is a new hash entry, initialize it */
+ if (!found)
+ LWLockInitialize(&(ext_lock->lock), LWTRANCHE_RELEXT_LOCK_MANAGER);
+
+ if (conditional)
+ ret = LWLockConditionalAcquire(&(ext_lock->lock), lockmode);
+ else
+ ret = LWLockAcquire(&(ext_lock->lock), lockmode);
+
+ /* Always returns true when the lock is not conditional */
+ return ret;
+}
+
+/*
+ * Remove the RELEXTLOCK from the shared RelExtLockHash hash table. Since other
+ * backends might be acquiring or waiting for this lock, we can delete it only
+ * when no backend is interested in it any longer.
+ *
+ * Note that we assume the partition lock for the hash table was already taken
+ * when the lock was acquired. This routine releases the partition lock as
+ * well, after releasing the LWLock.
+ */
+static void
+DeleteRelExtLock(const RELEXTLOCKTAG *targettag, uint32 hashcode)
+{
+ RELEXTLOCK *ext_lock;
+ LOCALRELEXTLOCK *lock;
+ LWLock *partitionLock;
+ bool found;
+
+ partitionLock = RelExtLockHashPartitionLock(hashcode);
+
+ ext_lock = (RELEXTLOCK *) hash_search_with_hash_value(RelExtLockHash,
+ (void * ) targettag,
+ hashcode,
+ HASH_FIND, &found);
+
+ if (!found)
+ return;
+
+ /*
+ * Remove this hash entry if there is no longer someone who is interested
+ * in extension lock of this relation.
+ */
+ if (LWLockCheckForCleanup(&(ext_lock->lock)))
+ hash_search_with_hash_value(RelExtLockHash, (void *) targettag,
+ hashcode, HASH_REMOVE, &found);
+
+ LWLockRelease(&(ext_lock->lock));
+ LWLockRelease(partitionLock);
+}
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 4315be4..90311da 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -45,6 +45,7 @@
#include "storage/sinvaladt.h"
#include "storage/spin.h"
#include "storage/standby.h"
+#include "storage/lmgr.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
#include "utils/resowner_private.h"
@@ -388,6 +389,10 @@ InitLocks(void)
max_table_size = NLOCKENTS();
init_table_size = max_table_size / 2;
+
+ /* Initialize lock structure for relation extension lock */
+ InitRelExtLock(max_table_size);
+
/*
* Allocate hash table for LOCK structs. This stores per-locked-object
* information.
@@ -3366,6 +3371,7 @@ LockShmemSize(void)
/* lock hash table */
max_table_size = NLOCKENTS();
size = add_size(size, hash_estimate_size(max_table_size, sizeof(LOCK)));
+ size = add_size(size, hash_estimate_size(max_table_size, sizeof(LWLock)));
/* proclock hash table */
max_table_size *= 2;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 3e13394..c004213 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -451,6 +451,13 @@ InitializeLWLocks(void)
for (id = 0; id < NUM_PREDICATELOCK_PARTITIONS; id++, lock++)
LWLockInitialize(&lock->lock, LWTRANCHE_PREDICATE_LOCK_MANAGER);
+ /* Initialize relation extension lmgr's LWLocks in main array */
+ lock = MainLWLockArray + NUM_INDIVIDUAL_LWLOCKS +
+ NUM_BUFFER_PARTITIONS + NUM_LOCK_PARTITIONS +
+ NUM_PREDICATELOCK_PARTITIONS;
+ for (id = 0; id < NUM_RELEXTLOCK_PARTITIONS; id++, lock++)
+ LWLockInitialize(&lock->lock, LWTRANCHE_RELEXT_LOCK_MANAGER);
+
/* Initialize named tranches. */
if (NamedLWLockTrancheRequests > 0)
{
@@ -494,7 +501,7 @@ RegisterLWLockTranches(void)
if (LWLockTrancheArray == NULL)
{
- LWLockTranchesAllocated = 64;
+ LWLockTranchesAllocated = 128;
LWLockTrancheArray = (char **)
MemoryContextAllocZero(TopMemoryContext,
LWLockTranchesAllocated * sizeof(char *));
@@ -508,6 +515,7 @@ RegisterLWLockTranches(void)
LWLockRegisterTranche(LWTRANCHE_LOCK_MANAGER, "lock_manager");
LWLockRegisterTranche(LWTRANCHE_PREDICATE_LOCK_MANAGER,
"predicate_lock_manager");
+ LWLockRegisterTranche(LWTRANCHE_RELEXT_LOCK_MANAGER, "relext_lock_manager");
LWLockRegisterTranche(LWTRANCHE_PARALLEL_QUERY_DSA,
"parallel_query_dsa");
LWLockRegisterTranche(LWTRANCHE_TBM, "tbm");
@@ -1857,3 +1865,46 @@ LWLockHeldByMeInMode(LWLock *l, LWLockMode mode)
}
return false;
}
+
+/*
+ * LWLockCheckForCleanup
+ *
+ * Return true only if no backend is holding this lock or waiting
+ * for it.
+ */
+bool
+LWLockCheckForCleanup(LWLock *lock)
+{
+ uint32 state;
+ bool ret;
+
+ state = pg_atomic_read_u32(&(lock->state));
+
+ ret = (state & LW_LOCK_MASK) == 0;
+ ret &= (state & LW_SHARED_MASK) == 0;
+
+ return ret;
+}
+
+int
+LWLockWaiterCount(LWLock *lock)
+{
+ int nwaiters = 0;
+ proclist_mutable_iter iter;
+ uint32 state;
+
+ state = pg_atomic_read_u32(&(lock->state));
+
+ /* Quick check using state of lock */
+ if ((state & LW_FLAG_HAS_WAITERS) == 0)
+ return 0;
+
+ LWLockWaitListLock(lock);
+
+ proclist_foreach_modify(iter, &lock->waiters, lwWaitLink)
+ nwaiters++;
+
+ LWLockWaitListUnlock(lock);
+
+ return nwaiters;
+}
diff --git a/src/backend/utils/adt/lockfuncs.c b/src/backend/utils/adt/lockfuncs.c
index ef4824f..5205542 100644
--- a/src/backend/utils/adt/lockfuncs.c
+++ b/src/backend/utils/adt/lockfuncs.c
@@ -234,7 +234,6 @@ pg_lock_status(PG_FUNCTION_ARGS)
switch ((LockTagType) instance->locktag.locktag_type)
{
case LOCKTAG_RELATION:
- case LOCKTAG_RELATION_EXTEND:
values[1] = ObjectIdGetDatum(instance->locktag.locktag_field1);
values[2] = ObjectIdGetDatum(instance->locktag.locktag_field2);
nulls[3] = true;
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 2a1244c..e7f4828 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -34,6 +34,36 @@ typedef enum XLTW_Oper
XLTW_RecheckExclusionConstr
} XLTW_Oper;
+typedef struct RELEXTLOCKTAG
+{
+ Oid relid; /* identifies the lockable object */
+ LWLockMode mode; /* lock mode for this table entry */
+} RELEXTLOCKTAG;
+
+/*
+ * This structure holds information per-object relation extension
+ * lock.
+ */
+typedef struct RELEXTLOCK
+{
+ RELEXTLOCKTAG tag; /* hash key -- must be first */
+ LWLock lock; /* LWLock for relation extension */
+} RELEXTLOCK;
+
+/*
+ * The LOCALRELEXTLOCK struct represents a local copy of data which is
+ * also present in the RELEXTLOCK table, organized for fast access without
+ * needing to acquire a LWLock. It is strictly for optimization.
+ */
+typedef struct LOCALRELEXTLOCK
+{
+ /* hash key */
+ RELEXTLOCKTAG relid; /* unique identifier of locktable object */
+
+ /* data */
+ bool held; /* is lock held? */
+} LOCALRELEXTLOCK;
+
extern void RelationInitLockInfo(Relation relation);
/* Lock a relation */
@@ -51,10 +81,10 @@ extern void LockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
/* Lock a relation for extension */
-extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
-extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
-extern bool ConditionalLockRelationForExtension(Relation relation,
- LOCKMODE lockmode);
+extern void InitRelExtLock(long max_table_size);
+extern void LockRelationForExtension(Relation relation, LWLockMode lockmode);
+extern void UnlockRelationForExtension(Relation relation, LWLockMode lockmode);
+extern bool ConditionalLockRelationForExtension(Relation relation, LWLockMode lockmode);
extern int RelationExtensionLockWaiterCount(Relation relation);
/* Lock a page (currently only used within indexes) */
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 7a9c105..9d6e90f 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -139,8 +139,6 @@ typedef uint16 LOCKMETHODID;
typedef enum LockTagType
{
LOCKTAG_RELATION, /* whole relation */
/* ID info for a relation is DB OID + REL OID; DB OID = 0 if shared */
- LOCKTAG_RELATION_EXTEND, /* the right to extend a relation */
- /* same ID info as RELATION */
LOCKTAG_PAGE, /* one page of a relation */
/* ID info for a page is RELATION info + BlockNumber */
@@ -199,14 +197,6 @@ typedef struct LOCKTAG
(locktag).locktag_type = LOCKTAG_RELATION, \
(locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
-#define SET_LOCKTAG_RELATION_EXTEND(locktag,dboid,reloid) \
- ((locktag).locktag_field1 = (dboid), \
- (locktag).locktag_field2 = (reloid), \
- (locktag).locktag_field3 = 0, \
- (locktag).locktag_field4 = 0, \
- (locktag).locktag_type = LOCKTAG_RELATION_EXTEND, \
- (locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
-
#define SET_LOCKTAG_PAGE(locktag,dboid,reloid,blocknum) \
((locktag).locktag_field1 = (dboid), \
(locktag).locktag_field2 = (reloid), \
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 0cd45bb..acab6fb 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -120,14 +120,21 @@ extern PGDLLIMPORT int NamedLWLockTrancheRequests;
#define LOG2_NUM_PREDICATELOCK_PARTITIONS 4
#define NUM_PREDICATELOCK_PARTITIONS (1 << LOG2_NUM_PREDICATELOCK_PARTITIONS)
+/* Number of partitions the shared relation extension lock table is divided into */
+#define LOG2_NUM_RELEXTLOCK_PARTITIONS 4
+#define NUM_RELEXTLOCK_PARTITIONS (1 << LOG2_NUM_RELEXTLOCK_PARTITIONS)
+
/* Offsets for various chunks of preallocated lwlocks. */
#define BUFFER_MAPPING_LWLOCK_OFFSET NUM_INDIVIDUAL_LWLOCKS
#define LOCK_MANAGER_LWLOCK_OFFSET \
(BUFFER_MAPPING_LWLOCK_OFFSET + NUM_BUFFER_PARTITIONS)
#define PREDICATELOCK_MANAGER_LWLOCK_OFFSET \
(LOCK_MANAGER_LWLOCK_OFFSET + NUM_LOCK_PARTITIONS)
-#define NUM_FIXED_LWLOCKS \
+#define RELEXTLOCK_MANAGER_LWLOCK_OFFSET \
(PREDICATELOCK_MANAGER_LWLOCK_OFFSET + NUM_PREDICATELOCK_PARTITIONS)
+#define NUM_FIXED_LWLOCKS \
+ (PREDICATELOCK_MANAGER_LWLOCK_OFFSET + NUM_PREDICATELOCK_PARTITIONS + \
+ NUM_RELEXTLOCK_PARTITIONS)
typedef enum LWLockMode
{
@@ -151,6 +158,8 @@ extern void LWLockReleaseClearVar(LWLock *lock, uint64 *valptr, uint64 val);
extern void LWLockReleaseAll(void);
extern bool LWLockHeldByMe(LWLock *lock);
extern bool LWLockHeldByMeInMode(LWLock *lock, LWLockMode mode);
+extern bool LWLockCheckForCleanup(LWLock *lock);
+extern int LWLockWaiterCount(LWLock *lock);
extern bool LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval);
extern void LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 value);
@@ -211,6 +220,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_BUFFER_MAPPING,
LWTRANCHE_LOCK_MANAGER,
LWTRANCHE_PREDICATE_LOCK_MANAGER,
+ LWTRANCHE_RELEXT_LOCK_MANAGER,
LWTRANCHE_PARALLEL_QUERY_DSA,
LWTRANCHE_TBM,
LWTRANCHE_FIRST_USER_DEFINED
On Wed, May 10, 2017 at 8:39 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Currently, the relation extension lock is implemented using the
heavyweight lock manager, and almost all functions (except for
brin_page_cleanup) that use LockRelationForExtension use it with
ExclusiveLock mode. But actually it doesn't need multiple lock modes
or deadlock detection or any of the other functionality that the
heavyweight lock manager provides. I think it's enough to use
something like LWLock. So I'd like to propose to change relation
extension lock management so that it works using LWLock instead.
That's not a good idea because it'll make the code that executes while
holding that lock noninterruptible.
Possibly something based on condition variables would work better.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, May 10, 2017 at 8:39 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
... I'd like to propose to change relation
extension lock management so that it works using LWLock instead.
That's not a good idea because it'll make the code that executes while
holding that lock noninterruptible.
Is that really a problem? We typically only hold it over one kernel call,
which ought to be noninterruptible anyway. Also, the CheckpointLock is
held for far longer, and we've not heard complaints about that one.
I'm slightly suspicious of the claim that we don't need deadlock
detection. There are places that e.g. touch FSM while holding this
lock. It might be all right but it needs close review, not just an
assertion that it's not a problem.
regards, tom lane
On Thu, May 11, 2017 at 6:09 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
This work would be helpful not only for existing workloads but also
for future work such as parallel utility commands, which are discussed
on other threads [1]. At least for parallel vacuum, this feature helps
to solve an issue that the implementation of parallel vacuum has.

I ran pgbench for 10 min three times (scale factor is 5000); here is
the performance measurement result.

clients TPS(HEAD) TPS(Patched)
4 2092.612 2031.277
8 3153.732 3046.789
16 4562.072 4625.419
32 6439.391 6479.526
64 7767.364 7779.636
100 7917.173 7906.567

* 16 core Xeon E5620 2.4GHz
* 32 GB RAM
* ioDrive

In the current implementation, it seems there is no performance
degradation so far.
I think it is good to check pgbench, but we should also test bulk
loads, since this lock is stressed during such workloads. Some of
the tests we did when improving the performance of bulk loads can be
found in an e-mail [1].

[1]: /messages/by-id/CAFiTN-tkX6gs-jL8VrPxg6OG9VUAKnObUq7r7pWQqASzdF5OwA@mail.gmail.com
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, May 12, 2017 at 9:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, May 10, 2017 at 8:39 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
... I'd like to propose to change relation
extension lock management so that it works using LWLock instead.

That's not a good idea because it'll make the code that executes while
holding that lock noninterruptible.

Is that really a problem? We typically only hold it over one kernel call,
which ought to be noninterruptible anyway.
During parallel bulk load operations, I think we hold it over multiple
kernel calls.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, May 13, 2017 at 8:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, May 11, 2017 at 6:09 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
This work would be helpful not only for existing workloads but also
for future work such as parallel utility commands, which are discussed
on other threads [1]. At least for parallel vacuum, this feature helps
to solve an issue that the implementation of parallel vacuum has.

I ran pgbench for 10 min three times (scale factor is 5000); here is
the performance measurement result.

clients TPS(HEAD) TPS(Patched)
4 2092.612 2031.277
8 3153.732 3046.789
16 4562.072 4625.419
32 6439.391 6479.526
64 7767.364 7779.636
100 7917.173 7906.567

* 16 core Xeon E5620 2.4GHz
* 32 GB RAM
* ioDrive

In the current implementation, it seems there is no performance
degradation so far.

I think it is good to check pgbench, but we should also test bulk
loads, since this lock is stressed during such workloads. Some of
the tests we did when improving the performance of bulk loads can be
found in an e-mail [1].
Thank you for sharing.
I've measured using the two test scripts attached on that thread. Here is the result.
* Copy test script
Client HEAD Patched
4 452.60 455.53
8 561.74 561.09
16 592.50 592.21
32 602.53 599.53
64 605.01 606.42
* Insert test script
Client HEAD Patched
4 159.04 158.44
8 169.41 169.69
16 177.11 178.14
32 182.14 181.99
64 182.11 182.73
It seems there is no performance degradation so far.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Sat, May 13, 2017 at 7:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, May 12, 2017 at 9:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, May 10, 2017 at 8:39 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
... I'd like to propose to change relation
extension lock management so that it works using LWLock instead.

That's not a good idea because it'll make the code that executes while
holding that lock noninterruptible.

Is that really a problem? We typically only hold it over one kernel call,
which ought to be noninterruptible anyway.

During parallel bulk load operations, I think we hold it over multiple
kernel calls.
We do. Also, RelationGetNumberOfBlocks() is not necessarily only one
kernel call, no? Nor is vm_extend.
Also, it's not just the backend doing the filesystem operation that's
non-interruptible, but also any waiters, right?
Maybe this isn't a big problem, but it does seem that it would
be better to avoid it if we can.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, May 17, 2017 at 1:30 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, May 13, 2017 at 7:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, May 12, 2017 at 9:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, May 10, 2017 at 8:39 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
... I'd like to propose to change relation
extension lock management so that it works using LWLock instead.

That's not a good idea because it'll make the code that executes while
holding that lock noninterruptible.

Is that really a problem? We typically only hold it over one kernel call,
which ought to be noninterruptible anyway.

During parallel bulk load operations, I think we hold it over multiple
kernel calls.

We do. Also, RelationGetNumberOfBlocks() is not necessarily only one
kernel call, no? Nor is vm_extend.

Yeah, these functions could make more than one kernel call while
holding the extension lock.

Also, it's not just the backend doing the filesystem operation that's
non-interruptible, but also any waiters, right?

Maybe this isn't a big problem, but it does seem that it would
be better to avoid it if we can.

I agree that it should be made interruptible, for safety.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Fri, May 19, 2017 at 11:12 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, May 17, 2017 at 1:30 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, May 13, 2017 at 7:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, May 12, 2017 at 9:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, May 10, 2017 at 8:39 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
... I'd like to propose to change relation
extension lock management so that it works using LWLock instead.

That's not a good idea because it'll make the code that executes while
holding that lock noninterruptible.

Is that really a problem? We typically only hold it over one kernel call,
which ought to be noninterruptible anyway.

During parallel bulk load operations, I think we hold it over multiple
kernel calls.

We do. Also, RelationGetNumberOfBlocks() is not necessarily only one
kernel call, no? Nor is vm_extend.

Yeah, these functions could make more than one kernel call while
holding the extension lock.

Also, it's not just the backend doing the filesystem operation that's
non-interruptible, but also any waiters, right?

Maybe this isn't a big problem, but it does seem that it would
be better to avoid it if we can.

I agree that it should be made interruptible, for safety.
Attached is an updated version of the patch. To make the lock
mechanism similar to LWLock but interruptible, I introduced a new lock
manager for extension locks. A lot of the code, especially locking and
unlocking, is inspired by LWLock, but it uses condition variables to
wait to acquire the lock. The other parts are unchanged from the
previous patch. This is still a PoC patch and lacks documentation. The
following are the measurement results with the same test scripts I
used before.
* Copy test script
HEAD Patched
4 436.6 436.1
8 561.8 561.8
16 580.7 579.4
32 588.5 597.0
64 596.1 599.0
* Insert test script
HEAD Patched
4 156.5 156.0
8 167.0 167.9
16 176.2 175.6
32 181.1 181.0
64 181.5 183.0
Since I replaced the heavyweight lock with a lightweight one, I
expected performance to improve slightly over HEAD, but the results
were almost the same. I'll continue to look into this in more detail.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachments:
Moving_extension_lock_out_of_heavyweight_lock_v2.patch
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 3609c8a..ac75c73 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -609,8 +609,8 @@ brin_page_cleanup(Relation idxrel, Buffer buf)
*/
if (PageIsNew(page))
{
- LockRelationForExtension(idxrel, ShareLock);
- UnlockRelationForExtension(idxrel, ShareLock);
+ LockRelationForExtension(idxrel, RELEXT_SHARED);
+ UnlockRelationForExtension(idxrel, RELEXT_SHARED);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
if (PageIsNew(page))
@@ -702,7 +702,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
*/
if (!RELATION_IS_LOCAL(irel))
{
- LockRelationForExtension(irel, ExclusiveLock);
+ LockRelationForExtension(irel, RELEXT_EXCLUSIVE);
extensionLockHeld = true;
}
buf = ReadBuffer(irel, P_NEW);
@@ -754,7 +754,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
}
if (extensionLockHeld)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel, RELEXT_EXCLUSIVE);
ReleaseBuffer(buf);
return InvalidBuffer;
@@ -764,7 +764,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
if (extensionLockHeld)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel, RELEXT_EXCLUSIVE);
page = BufferGetPage(buf);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index e778cbc..3b7ca6b 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -570,7 +570,7 @@ revmap_physical_extend(BrinRevmap *revmap)
else
{
if (needLock)
- LockRelationForExtension(irel, ExclusiveLock);
+ LockRelationForExtension(irel, RELEXT_EXCLUSIVE);
buf = ReadBuffer(irel, P_NEW);
if (BufferGetBlockNumber(buf) != mapBlk)
@@ -582,7 +582,7 @@ revmap_physical_extend(BrinRevmap *revmap)
* page from under whoever is using it.
*/
if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel, RELEXT_EXCLUSIVE);
LockBuffer(revmap->rm_metaBuf, BUFFER_LOCK_UNLOCK);
ReleaseBuffer(buf);
return;
@@ -591,7 +591,7 @@ revmap_physical_extend(BrinRevmap *revmap)
page = BufferGetPage(buf);
if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel, RELEXT_EXCLUSIVE);
}
/* Check that it's a regular block (or an empty page) */
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index d03d59d..cbdf51f 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -323,13 +323,13 @@ GinNewBuffer(Relation index)
/* Must extend the file */
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, RELEXT_EXCLUSIVE);
buffer = ReadBuffer(index, P_NEW);
LockBuffer(buffer, GIN_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index, RELEXT_EXCLUSIVE);
return buffer;
}
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 27e502a..d2e9567 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -716,10 +716,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, RELEXT_EXCLUSIVE);
npages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index, RELEXT_EXCLUSIVE);
totFreePages = 0;
@@ -766,10 +766,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
stats->pages_free = totFreePages;
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, RELEXT_EXCLUSIVE);
stats->num_pages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index, RELEXT_EXCLUSIVE);
return stats;
}
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index cbdaec9..4788e2a 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -801,13 +801,13 @@ gistNewBuffer(Relation r)
needLock = !RELATION_IS_LOCAL(r);
if (needLock)
- LockRelationForExtension(r, ExclusiveLock);
+ LockRelationForExtension(r, RELEXT_EXCLUSIVE);
buffer = ReadBuffer(r, P_NEW);
LockBuffer(buffer, GIST_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(r, ExclusiveLock);
+ UnlockRelationForExtension(r, RELEXT_EXCLUSIVE);
return buffer;
}
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 77d9d12..ca45b06 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -59,10 +59,10 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
/* try to find deleted pages */
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
npages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, RELEXT_EXCLUSIVE);
totFreePages = 0;
for (blkno = GIST_ROOT_BLKNO + 1; blkno < npages; blkno++)
@@ -91,10 +91,10 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
/* return statistics */
stats->pages_free = totFreePages;
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
stats->num_pages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, RELEXT_EXCLUSIVE);
return stats;
}
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 6529fe3..8c50c32 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -519,11 +519,11 @@ loop:
if (needLock)
{
if (!use_fsm)
- LockRelationForExtension(relation, ExclusiveLock);
- else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock))
+ LockRelationForExtension(relation, RELEXT_EXCLUSIVE);
+ else if (!ConditionalLockRelationForExtension(relation, RELEXT_EXCLUSIVE))
{
/* Couldn't get the lock immediately; wait for it. */
- LockRelationForExtension(relation, ExclusiveLock);
+ LockRelationForExtension(relation, RELEXT_EXCLUSIVE);
/*
* Check if some other backend has extended a block for us while
@@ -537,7 +537,7 @@ loop:
*/
if (targetBlock != InvalidBlockNumber)
{
- UnlockRelationForExtension(relation, ExclusiveLock);
+ UnlockRelationForExtension(relation, RELEXT_EXCLUSIVE);
goto loop;
}
@@ -576,7 +576,7 @@ loop:
* against vacuumlazy.c --- see comments therein.
*/
if (needLock)
- UnlockRelationForExtension(relation, ExclusiveLock);
+ UnlockRelationForExtension(relation, RELEXT_EXCLUSIVE);
/*
* We need to initialize the empty new page. Double-check that it really
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e5616ce..a722d89 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -641,7 +641,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
* Note that another backend might have extended or created the relation
* by the time we get the lock.
*/
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
@@ -679,7 +679,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
/* Update local cache with the up-to-date size */
rel->rd_smgr->smgr_vm_nblocks = vm_nblocks_now;
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, RELEXT_EXCLUSIVE);
pfree(pg);
}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index f815fd4..85e91e2 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -658,7 +658,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
needLock = !RELATION_IS_LOCAL(rel);
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
buf = ReadBuffer(rel, P_NEW);
@@ -672,7 +672,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
* condition against btvacuumscan --- see comments therein.
*/
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, RELEXT_EXCLUSIVE);
/* Initialize the new page before returning it */
page = BufferGetPage(buf);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 116f5f3..cbc1a46 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1059,10 +1059,10 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
{
/* Get the current relation length */
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
num_pages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, RELEXT_EXCLUSIVE);
/* Quit if we've scanned the whole relation */
if (blkno >= num_pages)
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index e57ac49..4a05dd3 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -230,13 +230,13 @@ SpGistNewBuffer(Relation index)
/* Must extend the file */
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, RELEXT_EXCLUSIVE);
buffer = ReadBuffer(index, P_NEW);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index, RELEXT_EXCLUSIVE);
return buffer;
}
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index cce9b3f..3ae0344 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -824,10 +824,10 @@ spgvacuumscan(spgBulkDeleteState *bds)
{
/* Get the current relation length */
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, RELEXT_EXCLUSIVE);
num_pages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index, RELEXT_EXCLUSIVE);
/* Quit if we've scanned the whole relation */
if (blkno >= num_pages)
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index fc9c4f0..4d4a2e6 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -851,8 +851,8 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* it's got exclusive lock on the whole relation.
*/
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
- LockRelationForExtension(onerel, ExclusiveLock);
- UnlockRelationForExtension(onerel, ExclusiveLock);
+ LockRelationForExtension(onerel, RELEXT_EXCLUSIVE);
+ UnlockRelationForExtension(onerel, RELEXT_EXCLUSIVE);
LockBufferForCleanup(buf);
if (PageIsNew(page))
{
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f453dad..f5fac1b 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3612,6 +3612,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE:
event_name = "LogicalSyncStateChange";
break;
+ case WAIT_EVENT_RELATION_EXTENSION:
+ event_name = "RelationExtension";
+ break;
/* no default case, so that compiler will warn */
}
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 4648473..498223a 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -624,7 +624,7 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
* Note that another backend might have extended or created the relation
* by the time we get the lock.
*/
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
@@ -652,7 +652,7 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
/* Update local cache with the up-to-date size */
rel->rd_smgr->smgr_fsm_nblocks = fsm_nblocks_now;
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, RELEXT_EXCLUSIVE);
pfree(pg);
}
diff --git a/src/backend/storage/lmgr/Makefile b/src/backend/storage/lmgr/Makefile
index e1b787e..2334a40 100644
--- a/src/backend/storage/lmgr/Makefile
+++ b/src/backend/storage/lmgr/Makefile
@@ -13,7 +13,7 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = lmgr.o lock.o proc.o deadlock.o lwlock.o lwlocknames.o spin.o \
- s_lock.o predicate.o condition_variable.o
+ s_lock.o predicate.o condition_variable.o extension_lock.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/lmgr/extension_lock.c b/src/backend/storage/lmgr/extension_lock.c
new file mode 100644
index 0000000..e8bbd5a
--- /dev/null
+++ b/src/backend/storage/lmgr/extension_lock.c
@@ -0,0 +1,380 @@
+/*-------------------------------------------------------------------------
+ *
+ * extension_lock.c
+ * Relation extension lock manager
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/lmgr/extension_lock.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "pg_trace.h"
+#include "postmaster/postmaster.h"
+#include "replication/slot.h"
+#include "storage/ipc.h"
+#include "storage/proc.h"
+#include "storage/proclist.h"
+#include "storage/spin.h"
+#include "storage/extension_lock.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+
+#ifdef LWLOCK_STATS
+#include "utils/hsearch.h"
+#endif
+
+/*
+ * Compute the hash code associated with a RELEXTLOCK.
+ *
+ * To avoid unnecessary recomputations of the hash code, we try to do this
+ * just once per function, and then pass it around as needed. Aside from
+ * passing the hashcode to hash_search_with_hash_value(), we can extract
+ * the lock partition number from the hashcode.
+ */
+#define RelExtLockTargetTagHashCode(relextlocktargettag) \
+ get_hash_value(RelExtLockHash, (const void *) relextlocktargettag)
+
+/*
+ * The lockmgr's shared hash tables are partitioned to reduce contention.
+ * To determine which partition a given relid belongs to, compute the tag's
+ * hash code with RelExtLockTargetTagHashCode(), then apply one of these macros.
+ * NB: NUM_RELEXTLOCK_PARTITIONS must be a power of 2!
+ */
+#define RelExtLockHashPartition(hashcode) \
+ ((hashcode) % NUM_RELEXTLOCK_PARTITIONS)
+#define RelExtLockHashPartitionLock(hashcode) \
+ (&MainLWLockArray[RELEXTLOCK_MANAGER_LWLOCK_OFFSET + \
+ RelExtLockHashPartition(hashcode)].lock)
+#define RelExtLockHashPartitionLockByIndex(i) \
+ (&MainLWLockArray[RELEXTLOCK_MANAGER_LWLOCK_OFFSET + (i)].lock)
+
+#define RELEXT_VAL_EXCLUSIVE ((uint32) 1 << 24)
+#define RELEXT_VAL_SHARED 1
+
+#define RELEXT_LOCKMASK ((uint32) ((1 << 25) - 1))
+
+/* Maximum number of relation extension locks one backend can hold at once */
+#define MAX_SIMUL_EXTLOCKS 8
+
+/*
+ * This structure holds information about one held relation extension
+ * lock; held_relextlocks[] is the array of locks we're holding.
+ */
+typedef struct relextlock_handle
+{
+ Oid relid;
+ RelExtLock *lock;
+ RelExtLockMode mode; /* lock mode for this table entry */
+} relextlock_handle;
+static relextlock_handle held_relextlocks[MAX_SIMUL_EXTLOCKS];
+static int num_held_relextlocks = 0;
+
+static bool RelExtLockAcquire(Oid relid, RelExtLockMode lockmode, bool conditional);
+static void RelExtLockRelease(Oid relid, RelExtLockMode lockmode);
+static bool RelExtLockAttemptLock(RelExtLock *ext_lock, RelExtLockMode lockmode);
+
+/*
+ * Pointers to hash tables containing lock state
+ *
+ * The RelExtLockHash hash table is in shared memory
+ */
+static HTAB *RelExtLockHash;
+
+/*
+ * InitRelExtLock
+ * Initialize the relation extension lock manager's data structures.
+ */
+void
+InitRelExtLock(long max_table_size)
+{
+ HASHCTL info;
+ long init_table_size;
+
+ /*
+ * Compute init/max size to request for lock hashtables. Note these
+ * calculations must agree with LockShmemSize!
+ */
+ init_table_size = max_table_size / 2;
+
+ /*
+ * Allocate hash table for RELEXTLOCK structs. This stores per-relation
+ * lock.
+ */
+ MemSet(&info, 0, sizeof(info));
+ info.keysize = sizeof(Oid);
+ info.entrysize = sizeof(RelExtLock);
+ info.num_partitions = NUM_RELEXTLOCK_PARTITIONS;
+
+ RelExtLockHash = ShmemInitHash("RelExtLock Hash",
+ init_table_size,
+ max_table_size,
+ &info,
+ HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
+}
+
+/*
+ * LockRelationForExtension
+ *
+ * This lock is used to interlock addition of pages to relations.
+ * We need such locking because bufmgr/smgr definition of P_NEW is not
+ * race-condition-proof.
+ *
+ * We assume the caller is already holding some type of regular lock on
+ * the relation, so no AcceptInvalidationMessages call is needed here.
+ */
+void
+LockRelationForExtension(Relation relation, RelExtLockMode lockmode)
+{
+ RelExtLockAcquire(relation->rd_id, lockmode, false);
+}
+
+/*
+ * ConditionalLockRelationForExtension
+ *
+ * As above, but only lock if we can get the lock without blocking.
+ * Returns TRUE iff the lock was acquired.
+ */
+bool
+ConditionalLockRelationForExtension(Relation relation, RelExtLockMode lockmode)
+{
+ return RelExtLockAcquire(relation->rd_id, lockmode, true);
+}
+
+/*
+ * RelationExtensionLockWaiterCount
+ *
+ * Count the number of processes waiting for the given relation extension lock.
+ * Note that this routine does not acquire the partition lock. The caller
+ * must either hold the partition lock in exclusive mode, or call this
+ * routine only while holding the relation extension lock on this relation.
+ */
+int
+RelationExtensionLockWaiterCount(Relation relation)
+{
+ RelExtLock *ext_lock;
+ Oid relid;
+ uint32 nwaiters;
+ uint32 hashcode;
+ bool found;
+
+ relid = relation->rd_id;
+ hashcode = RelExtLockTargetTagHashCode(&relid);
+
+ ext_lock = (RelExtLock *) hash_search_with_hash_value(RelExtLockHash,
+ (void *) &relid,
+ hashcode,
+ HASH_FIND, &found);
+ /* The caller must already hold this lock */
+ Assert(found);
+
+ nwaiters = pg_atomic_read_u32(&(ext_lock->nwaiters));
+
+ return nwaiters;
+}
+
+/*
+ * UnlockRelationForExtension
+ */
+void
+UnlockRelationForExtension(Relation relation, RelExtLockMode lockmode)
+{
+ RelExtLockRelease(relation->rd_id, lockmode);
+}
+
+/*
+ * Acquire a relation extension lock, creating the RELEXTLOCK hash entry in
+ * the shared hash table if necessary. To avoid deadlock between the partition
+ * lock and the extension lock, we acquire the partition lock here but do not
+ * release it; the caller must call RelExtLockRelease later to release both
+ * locks.
+ */
+static bool
+RelExtLockAcquire(Oid relid, RelExtLockMode lockmode, bool conditional)
+{
+ RelExtLock *ext_lock;
+ LWLock *partitionLock;
+ uint32 hashcode;
+ bool found;
+ bool got_lock = false;
+ bool waited = false;
+
+ hashcode = RelExtLockTargetTagHashCode(&relid);
+ partitionLock = RelExtLockHashPartitionLock(hashcode);
+ LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+
+ ext_lock = (RelExtLock *) hash_search_with_hash_value(RelExtLockHash,
+ (void * ) &relid,
+ hashcode, HASH_ENTER, &found);
+
+ if (!ext_lock)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of shared memory"),
+ errhint("You might need to increase max_locks_per_transaction.")));
+
+ for (;;)
+ {
+ /* Try to grab the lock atomically */
+ if (RelExtLockAttemptLock(ext_lock, lockmode))
+ {
+ got_lock = true;
+
+ /* Deregister ourselves from the wait count, if we registered */
+ if (waited)
+ pg_atomic_sub_fetch_u32(&(ext_lock->nwaiters), 1);
+
+ break; /* got the lock */
+ }
+
+ /* Could not get the lock; give up immediately if conditional */
+ if (conditional)
+ break;
+
+ /* Register as a waiter only once, then sleep until woken up */
+ if (!waited)
+ {
+ pg_atomic_add_fetch_u32(&(ext_lock->nwaiters), 1);
+ waited = true;
+ }
+ ConditionVariableSleep(&(ext_lock->cv), WAIT_EVENT_RELATION_EXTENSION);
+ }
+
+ ConditionVariableCancelSleep();
+
+ if (got_lock)
+ {
+ /* Add the lock to the list of relation extension locks held by this backend */
+ held_relextlocks[num_held_relextlocks].relid = relid;
+ held_relextlocks[num_held_relextlocks].lock = ext_lock;
+ held_relextlocks[num_held_relextlocks].mode = lockmode;
+ num_held_relextlocks++;
+ }
+ else
+ LWLockRelease(partitionLock);
+
+ /* Unless this was a conditional request, we always return true here */
+ return got_lock;
+}
+
+/*
+ * RelationExtensionLockReleaseAll - release all currently-held relation extension locks
+ */
+void
+RelationExtensionLockReleaseAll(void)
+{
+ while (num_held_relextlocks > 0)
+ {
+ HOLD_INTERRUPTS();
+
+ RelExtLockRelease(held_relextlocks[num_held_relextlocks - 1].relid,
+ held_relextlocks[num_held_relextlocks - 1].mode);
+ }
+}
+
+/*
+ * RelExtLockRelease
+ *
+ * Remove the RELEXTLOCK entry from the shared RelExtLockHash hash table.
+ * Since other backends might be acquiring or waiting for this lock, we can
+ * delete the entry only when no backend is interested in it any longer.
+ *
+ * Note that we assume the partition lock for the hash table was acquired
+ * when the lock was taken. This routine releases the partition lock after
+ * releasing the relation extension lock.
+ */
+static void
+RelExtLockRelease(Oid relid, RelExtLockMode lockmode)
+{
+ RelExtLock *ext_lock;
+ RelExtLockMode mode;
+ uint32 hashcode;
+ LWLock *partitionLock;
+ uint32 oldstate;
+ uint32 nwaiters;
+ int i;
+
+ hashcode = RelExtLockTargetTagHashCode(&relid);
+ partitionLock = RelExtLockHashPartitionLock(hashcode);
+
+ for (i = num_held_relextlocks; --i >= 0;)
+ if (relid == held_relextlocks[i].relid &&
+ lockmode == held_relextlocks[i].mode)
+ break;
+
+ if (i < 0)
+ elog(ERROR, "relation extension lock for %u with lock mode %d is not held",
+ relid, lockmode);
+
+ ext_lock = held_relextlocks[i].lock;
+ mode = held_relextlocks[i].mode;
+
+ num_held_relextlocks--;
+
+ /* Shift the remaining entries down to close the gap */
+ for (; i < num_held_relextlocks; i++)
+ held_relextlocks[i] = held_relextlocks[i + 1];
+
+ if (mode == RELEXT_EXCLUSIVE)
+ oldstate = pg_atomic_sub_fetch_u32(&(ext_lock->state), RELEXT_VAL_EXCLUSIVE);
+ else
+ oldstate = pg_atomic_sub_fetch_u32(&(ext_lock->state), RELEXT_VAL_SHARED);
+
+ nwaiters = pg_atomic_read_u32(&(ext_lock->nwaiters));
+
+ /* Wake up waiters if there are any; otherwise remove the entry */
+ if (nwaiters > 0)
+ ConditionVariableBroadcast(&(ext_lock->cv));
+ else
+ hash_search_with_hash_value(RelExtLockHash, (void *) &relid,
+ hashcode, HASH_REMOVE, NULL);
+
+ LWLockRelease(partitionLock);
+}
+
+/*
+ * Internal function that tries to atomically acquire the relation extension
+ * lock in the passed in mode. Return true if we got the lock.
+ */
+static bool
+RelExtLockAttemptLock(RelExtLock *ext_lock, RelExtLockMode lockmode)
+{
+ uint32 oldstate;
+
+ oldstate = pg_atomic_read_u32(&ext_lock->state);
+
+ while (true)
+ {
+ uint32 desired_state;
+ bool lock_free;
+
+ desired_state = oldstate;
+
+ if (lockmode == RELEXT_EXCLUSIVE)
+ {
+ lock_free = (oldstate & RELEXT_LOCKMASK) == 0;
+ if (lock_free)
+ desired_state += RELEXT_VAL_EXCLUSIVE;
+ }
+ else
+ {
+ lock_free = (oldstate & RELEXT_VAL_EXCLUSIVE) == 0;
+ if (lock_free)
+ desired_state += RELEXT_VAL_SHARED;
+ }
+
+ if (pg_atomic_compare_exchange_u32(&ext_lock->state,
+ &oldstate, desired_state))
+ return lock_free;
+ }
+ pg_unreachable();
+}
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index fe98898..34095cb 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -319,78 +319,6 @@ UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode)
}
/*
- * LockRelationForExtension
- *
- * This lock tag is used to interlock addition of pages to relations.
- * We need such locking because bufmgr/smgr definition of P_NEW is not
- * race-condition-proof.
- *
- * We assume the caller is already holding some type of regular lock on
- * the relation, so no AcceptInvalidationMessages call is needed here.
- */
-void
-LockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- (void) LockAcquire(&tag, lockmode, false, false);
-}
-
-/*
- * ConditionalLockRelationForExtension
- *
- * As above, but only lock if we can get the lock without blocking.
- * Returns TRUE iff the lock was acquired.
- */
-bool
-ConditionalLockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- return (LockAcquire(&tag, lockmode, false, true) != LOCKACQUIRE_NOT_AVAIL);
-}
-
-/*
- * RelationExtensionLockWaiterCount
- *
- * Count the number of processes waiting for the given relation extension lock.
- */
-int
-RelationExtensionLockWaiterCount(Relation relation)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- return LockWaiterCount(&tag);
-}
-
-/*
- * UnlockRelationForExtension
- */
-void
-UnlockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- LockRelease(&tag, lockmode, false);
-}
-
-/*
* LockPage
*
* Obtain a page-level lock. This is currently used by some index access
@@ -961,12 +889,6 @@ DescribeLockTag(StringInfo buf, const LOCKTAG *tag)
tag->locktag_field2,
tag->locktag_field1);
break;
- case LOCKTAG_RELATION_EXTEND:
- appendStringInfo(buf,
- _("extension of relation %u of database %u"),
- tag->locktag_field2,
- tag->locktag_field1);
- break;
case LOCKTAG_PAGE:
appendStringInfo(buf,
_("page %u of relation %u of database %u"),
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 4315be4..90311da 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -45,6 +45,7 @@
#include "storage/sinvaladt.h"
#include "storage/spin.h"
#include "storage/standby.h"
+#include "storage/lmgr.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
#include "utils/resowner_private.h"
@@ -388,6 +389,10 @@ InitLocks(void)
max_table_size = NLOCKENTS();
init_table_size = max_table_size / 2;
+
+ /* Initialize lock structure for relation extension lock */
+ InitRelExtLock(max_table_size);
+
/*
* Allocate hash table for LOCK structs. This stores per-locked-object
* information.
@@ -3366,6 +3371,7 @@ LockShmemSize(void)
/* lock hash table */
max_table_size = NLOCKENTS();
size = add_size(size, hash_estimate_size(max_table_size, sizeof(LOCK)));
+ /* relation extension lock hash table */
+ size = add_size(size, hash_estimate_size(max_table_size, sizeof(RelExtLock)));
/* proclock hash table */
max_table_size *= 2;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 35536e4..8894665 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -451,6 +451,13 @@ InitializeLWLocks(void)
for (id = 0; id < NUM_PREDICATELOCK_PARTITIONS; id++, lock++)
LWLockInitialize(&lock->lock, LWTRANCHE_PREDICATE_LOCK_MANAGER);
+ /* Initialize relation extension lmgr's LWLocks in main array */
+ lock = MainLWLockArray + NUM_INDIVIDUAL_LWLOCKS +
+ NUM_BUFFER_PARTITIONS + NUM_LOCK_PARTITIONS +
+ NUM_PREDICATELOCK_PARTITIONS;
+ for (id = 0; id < NUM_RELEXTLOCK_PARTITIONS; id++, lock++)
+ LWLockInitialize(&lock->lock, LWTRANCHE_RELEXT_LOCK_MANAGER);
+
/* Initialize named tranches. */
if (NamedLWLockTrancheRequests > 0)
{
@@ -494,7 +501,7 @@ RegisterLWLockTranches(void)
if (LWLockTrancheArray == NULL)
{
- LWLockTranchesAllocated = 64;
+ LWLockTranchesAllocated = 128;
LWLockTrancheArray = (char **)
MemoryContextAllocZero(TopMemoryContext,
LWLockTranchesAllocated * sizeof(char *));
@@ -508,6 +515,7 @@ RegisterLWLockTranches(void)
LWLockRegisterTranche(LWTRANCHE_LOCK_MANAGER, "lock_manager");
LWLockRegisterTranche(LWTRANCHE_PREDICATE_LOCK_MANAGER,
"predicate_lock_manager");
+ LWLockRegisterTranche(LWTRANCHE_RELEXT_LOCK_MANAGER, "relext_lock_manager");
LWLockRegisterTranche(LWTRANCHE_PARALLEL_QUERY_DSA,
"parallel_query_dsa");
LWLockRegisterTranche(LWTRANCHE_TBM, "tbm");
diff --git a/src/backend/utils/adt/lockfuncs.c b/src/backend/utils/adt/lockfuncs.c
index ef4824f..1c165bc 100644
--- a/src/backend/utils/adt/lockfuncs.c
+++ b/src/backend/utils/adt/lockfuncs.c
@@ -25,7 +25,6 @@
/* This must match enum LockTagType! */
const char *const LockTagTypeNames[] = {
"relation",
- "extend",
"page",
"tuple",
"transactionid",
@@ -234,7 +233,6 @@ pg_lock_status(PG_FUNCTION_ARGS)
switch ((LockTagType) instance->locktag.locktag_type)
{
case LOCKTAG_RELATION:
- case LOCKTAG_RELATION_EXTEND:
values[1] = ObjectIdGetDatum(instance->locktag.locktag_field1);
values[2] = ObjectIdGetDatum(instance->locktag.locktag_field2);
nulls[3] = true;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5e029c0..53d1b6c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -812,7 +812,8 @@ typedef enum
WAIT_EVENT_SAFE_SNAPSHOT,
WAIT_EVENT_SYNC_REP,
WAIT_EVENT_LOGICAL_SYNC_DATA,
- WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE
+ WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
+ WAIT_EVENT_RELATION_EXTENSION
} WaitEventIPC;
/* ----------
diff --git a/src/include/storage/extension_lock.h b/src/include/storage/extension_lock.h
new file mode 100644
index 0000000..f178672
--- /dev/null
+++ b/src/include/storage/extension_lock.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * extension_lock.h
+ * Relation extension lock manager
+ *
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/extension_lock.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef EXTENSION_LOCK_H
+#define EXTENSION_LOCK_H
+
+#ifdef FRONTEND
+#error "extension_lock.h may not be included from frontend code"
+#endif
+
+#include "storage/proclist_types.h"
+#include "storage/s_lock.h"
+#include "storage/condition_variable.h"
+#include "port/atomics.h"
+
+typedef struct RelExtLock
+{
+ Oid relid;
+ pg_atomic_uint32 state;
+ pg_atomic_uint32 nwaiters;
+ ConditionVariable cv;
+} RelExtLock;
+
+typedef enum RelExtLockMode
+{
+ RELEXT_EXCLUSIVE,
+ RELEXT_SHARED
+} RelExtLockMode;
+
+/* Lock a relation for extension */
+extern void InitRelExtLock(long max_table_size);
+extern void LockRelationForExtension(Relation relation, RelExtLockMode lockmode);
+extern void UnlockRelationForExtension(Relation relation, RelExtLockMode lockmode);
+extern bool ConditionalLockRelationForExtension(Relation relation, RelExtLockMode lockmode);
+extern int RelationExtensionLockWaiterCount(Relation relation);
+extern void RelationExtensionLockReleaseAll(void);
+
+#endif							/* EXTENSION_LOCK_H */
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 2a1244c..ef980ab 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -15,6 +15,7 @@
#define LMGR_H
#include "lib/stringinfo.h"
+#include "storage/extension_lock.h"
#include "storage/itemptr.h"
#include "storage/lock.h"
#include "utils/rel.h"
@@ -34,6 +35,36 @@ typedef enum XLTW_Oper
XLTW_RecheckExclusionConstr
} XLTW_Oper;
+typedef struct RELEXTLOCKTAG
+{
+ Oid relid; /* identifies the lockable object */
+ LWLockMode mode; /* lock mode for this table entry */
+} RELEXTLOCKTAG;
+
+/*
+ * This structure holds per-object information for a relation extension
+ * lock.
+ */
+typedef struct RELEXTLOCK
+{
+ RELEXTLOCKTAG tag; /* hash key -- must be first */
+ LWLock lock; /* LWLock for relation extension */
+} RELEXTLOCK;
+
+/*
+ * The LOCALRELEXTLOCK struct represents a local copy of data which is
+ * also present in the RELEXTLOCK table, organized for fast access without
+ * needing to acquire a LWLock. It is strictly for optimization.
+ */
+typedef struct LOCALRELEXTLOCK
+{
+ /* hash key */
+ RELEXTLOCKTAG relid; /* unique identifier of locktable object */
+
+ /* data */
+ bool held; /* is lock held? */
+} LOCALRELEXTLOCK;
+
extern void RelationInitLockInfo(Relation relation);
/* Lock a relation */
@@ -50,13 +81,6 @@ extern bool LockHasWaitersRelation(Relation relation, LOCKMODE lockmode);
extern void LockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
-/* Lock a relation for extension */
-extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
-extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
-extern bool ConditionalLockRelationForExtension(Relation relation,
- LOCKMODE lockmode);
-extern int RelationExtensionLockWaiterCount(Relation relation);
-
/* Lock a page (currently only used within indexes) */
extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
extern bool ConditionalLockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 7a9c105..9d6e90f 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -139,8 +139,6 @@ typedef uint16 LOCKMETHODID;
typedef enum LockTagType
{
LOCKTAG_RELATION, /* whole relation */
- /* ID info for a relation is DB OID + REL OID; DB OID = 0 if shared */
- LOCKTAG_RELATION_EXTEND, /* the right to extend a relation */
/* same ID info as RELATION */
LOCKTAG_PAGE, /* one page of a relation */
/* ID info for a page is RELATION info + BlockNumber */
@@ -199,14 +197,6 @@ typedef struct LOCKTAG
(locktag).locktag_type = LOCKTAG_RELATION, \
(locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
-#define SET_LOCKTAG_RELATION_EXTEND(locktag,dboid,reloid) \
- ((locktag).locktag_field1 = (dboid), \
- (locktag).locktag_field2 = (reloid), \
- (locktag).locktag_field3 = 0, \
- (locktag).locktag_field4 = 0, \
- (locktag).locktag_type = LOCKTAG_RELATION_EXTEND, \
- (locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
-
#define SET_LOCKTAG_PAGE(locktag,dboid,reloid,blocknum) \
((locktag).locktag_field1 = (dboid), \
(locktag).locktag_field2 = (reloid), \
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 0cd45bb..30f538b 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -120,14 +120,21 @@ extern PGDLLIMPORT int NamedLWLockTrancheRequests;
#define LOG2_NUM_PREDICATELOCK_PARTITIONS 4
#define NUM_PREDICATELOCK_PARTITIONS (1 << LOG2_NUM_PREDICATELOCK_PARTITIONS)
+/* Number of partitions the shared relation extension lock tables are divided into */
+#define LOG2_NUM_RELEXTLOCK_PARTITIONS 4
+#define NUM_RELEXTLOCK_PARTITIONS (1 << LOG2_NUM_RELEXTLOCK_PARTITIONS)
+
/* Offsets for various chunks of preallocated lwlocks. */
#define BUFFER_MAPPING_LWLOCK_OFFSET NUM_INDIVIDUAL_LWLOCKS
#define LOCK_MANAGER_LWLOCK_OFFSET \
(BUFFER_MAPPING_LWLOCK_OFFSET + NUM_BUFFER_PARTITIONS)
#define PREDICATELOCK_MANAGER_LWLOCK_OFFSET \
(LOCK_MANAGER_LWLOCK_OFFSET + NUM_LOCK_PARTITIONS)
-#define NUM_FIXED_LWLOCKS \
+#define RELEXTLOCK_MANAGER_LWLOCK_OFFSET \
(PREDICATELOCK_MANAGER_LWLOCK_OFFSET + NUM_PREDICATELOCK_PARTITIONS)
+#define NUM_FIXED_LWLOCKS \
+ (PREDICATELOCK_MANAGER_LWLOCK_OFFSET + NUM_PREDICATELOCK_PARTITIONS + \
+ NUM_RELEXTLOCK_PARTITIONS)
typedef enum LWLockMode
{
@@ -151,6 +158,8 @@ extern void LWLockReleaseClearVar(LWLock *lock, uint64 *valptr, uint64 val);
extern void LWLockReleaseAll(void);
extern bool LWLockHeldByMe(LWLock *lock);
extern bool LWLockHeldByMeInMode(LWLock *lock, LWLockMode mode);
+extern bool LWLockCheckForCleanup(LWLock *lock);
+extern int LWLockWaiterCount(LWLock *lock);
extern bool LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval);
extern void LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 value);
@@ -211,6 +220,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_BUFFER_MAPPING,
LWTRANCHE_LOCK_MANAGER,
LWTRANCHE_PREDICATE_LOCK_MANAGER,
+ LWTRANCHE_RELEXT_LOCK_MANAGER,
LWTRANCHE_PARALLEL_QUERY_DSA,
LWTRANCHE_TBM,
LWTRANCHE_FIRST_USER_DEFINED
On Thu, Jun 22, 2017 at 12:03 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, May 19, 2017 at 11:12 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, May 17, 2017 at 1:30 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, May 13, 2017 at 7:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, May 12, 2017 at 9:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, May 10, 2017 at 8:39 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
... I'd like to propose to change relation
extension lock management so that it works using LWLock instead.

That's not a good idea because it'll make the code that executes while
holding that lock noninterruptible.

Is that really a problem? We typically only hold it over one kernel call,
which ought to be noninterruptible anyway.

During parallel bulk load operations, I think we hold it over multiple
kernel calls.

We do. Also, RelationGetNumberOfBlocks() is not necessarily only one
kernel call, no? Nor is vm_extend.

Yeah, these functions could make more than one kernel call while
holding the extension lock.

Also, it's not just the backend doing the filesystem operation that's
non-interruptible, but also any waiters, right? Maybe this isn't a big
problem, but it does seem to me that it would be better to avoid it if
we can.

I agree to change it to be interruptible for more safety.
Attached updated version patch. To use a lock mechanism similar to
LWLock but interruptible, I introduced a new lock manager for the extension
lock. A lot of the code, especially locking and unlocking, is inspired by
LWLock, but it uses condition variables to wait for acquiring the lock.
Other parts are unchanged from the previous patch. This is still a PoC
patch and lacks documentation. The following is the measurement result
with the same test script I used before.

* Copy test script
HEAD Patched
4 436.6 436.1
8 561.8 561.8
16 580.7 579.4
32 588.5 597.0
64 596.1 599.0

* Insert test script
HEAD Patched
4 156.5 156.0
8 167.0 167.9
16 176.2 175.6
32 181.1 181.0
64 181.5 183.0

Since I replaced the heavyweight lock with a lightweight lock, I expected
the performance to improve slightly over HEAD, but the result was almost
the same. I'll continue to look at it in more detail.
The previous patch conflicts with current HEAD, I rebased the patch to
current HEAD.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachments:
Moving_extension_lock_out_of_heavyweight_lock_v3.patch
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 80f803e..b928c1a 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -609,8 +609,8 @@ brin_page_cleanup(Relation idxrel, Buffer buf)
*/
if (PageIsNew(page))
{
- LockRelationForExtension(idxrel, ShareLock);
- UnlockRelationForExtension(idxrel, ShareLock);
+ LockRelationForExtension(idxrel, RELEXT_SHARED);
+ UnlockRelationForExtension(idxrel, RELEXT_SHARED);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
if (PageIsNew(page))
@@ -702,7 +702,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
*/
if (!RELATION_IS_LOCAL(irel))
{
- LockRelationForExtension(irel, ExclusiveLock);
+ LockRelationForExtension(irel, RELEXT_EXCLUSIVE);
extensionLockHeld = true;
}
buf = ReadBuffer(irel, P_NEW);
@@ -754,7 +754,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
}
if (extensionLockHeld)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel, RELEXT_EXCLUSIVE);
ReleaseBuffer(buf);
return InvalidBuffer;
@@ -764,7 +764,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
if (extensionLockHeld)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel, RELEXT_EXCLUSIVE);
page = BufferGetPage(buf);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index 22f2076..4c15b45 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -570,7 +570,7 @@ revmap_physical_extend(BrinRevmap *revmap)
else
{
if (needLock)
- LockRelationForExtension(irel, ExclusiveLock);
+ LockRelationForExtension(irel, RELEXT_EXCLUSIVE);
buf = ReadBuffer(irel, P_NEW);
if (BufferGetBlockNumber(buf) != mapBlk)
@@ -582,7 +582,7 @@ revmap_physical_extend(BrinRevmap *revmap)
* page from under whoever is using it.
*/
if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel, RELEXT_EXCLUSIVE);
LockBuffer(revmap->rm_metaBuf, BUFFER_LOCK_UNLOCK);
ReleaseBuffer(buf);
return;
@@ -591,7 +591,7 @@ revmap_physical_extend(BrinRevmap *revmap)
page = BufferGetPage(buf);
if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel, RELEXT_EXCLUSIVE);
}
/* Check that it's a regular block (or an empty page) */
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 91e4a8c..ac5ed7f 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -323,13 +323,13 @@ GinNewBuffer(Relation index)
/* Must extend the file */
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, RELEXT_EXCLUSIVE);
buffer = ReadBuffer(index, P_NEW);
LockBuffer(buffer, GIN_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index, RELEXT_EXCLUSIVE);
return buffer;
}
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 31425e9..e9f84bc 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -716,10 +716,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, RELEXT_EXCLUSIVE);
npages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index, RELEXT_EXCLUSIVE);
totFreePages = 0;
@@ -766,10 +766,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
stats->pages_free = totFreePages;
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, RELEXT_EXCLUSIVE);
stats->num_pages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index, RELEXT_EXCLUSIVE);
return stats;
}
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index b6ccc1a..e2cadc6 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -801,13 +801,13 @@ gistNewBuffer(Relation r)
needLock = !RELATION_IS_LOCAL(r);
if (needLock)
- LockRelationForExtension(r, ExclusiveLock);
+ LockRelationForExtension(r, RELEXT_EXCLUSIVE);
buffer = ReadBuffer(r, P_NEW);
LockBuffer(buffer, GIST_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(r, ExclusiveLock);
+ UnlockRelationForExtension(r, RELEXT_EXCLUSIVE);
return buffer;
}
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 77d9d12..ca45b06 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -59,10 +59,10 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
/* try to find deleted pages */
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
npages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, RELEXT_EXCLUSIVE);
totFreePages = 0;
for (blkno = GIST_ROOT_BLKNO + 1; blkno < npages; blkno++)
@@ -91,10 +91,10 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
/* return statistics */
stats->pages_free = totFreePages;
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
stats->num_pages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, RELEXT_EXCLUSIVE);
return stats;
}
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 13e3bdc..a8ce6c7 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -519,11 +519,11 @@ loop:
if (needLock)
{
if (!use_fsm)
- LockRelationForExtension(relation, ExclusiveLock);
- else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock))
+ LockRelationForExtension(relation, RELEXT_EXCLUSIVE);
+ else if (!ConditionalLockRelationForExtension(relation, RELEXT_EXCLUSIVE))
{
/* Couldn't get the lock immediately; wait for it. */
- LockRelationForExtension(relation, ExclusiveLock);
+ LockRelationForExtension(relation, RELEXT_EXCLUSIVE);
/*
* Check if some other backend has extended a block for us while
@@ -537,7 +537,7 @@ loop:
*/
if (targetBlock != InvalidBlockNumber)
{
- UnlockRelationForExtension(relation, ExclusiveLock);
+ UnlockRelationForExtension(relation, RELEXT_EXCLUSIVE);
goto loop;
}
@@ -576,7 +576,7 @@ loop:
* against vacuumlazy.c --- see comments therein.
*/
if (needLock)
- UnlockRelationForExtension(relation, ExclusiveLock);
+ UnlockRelationForExtension(relation, RELEXT_EXCLUSIVE);
/*
* We need to initialize the empty new page. Double-check that it really
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 4c2a13a..7dc3088 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -641,7 +641,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
* Note that another backend might have extended or created the relation
* by the time we get the lock.
*/
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
@@ -679,7 +679,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
/* Update local cache with the up-to-date size */
rel->rd_smgr->smgr_vm_nblocks = vm_nblocks_now;
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, RELEXT_EXCLUSIVE);
pfree(pg);
}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 5c817b6..89daab0 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -658,7 +658,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
needLock = !RELATION_IS_LOCAL(rel);
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
buf = ReadBuffer(rel, P_NEW);
@@ -672,7 +672,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
* condition against btvacuumscan --- see comments therein.
*/
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, RELEXT_EXCLUSIVE);
/* Initialize the new page before returning it */
page = BufferGetPage(buf);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 3dbafdd..394a660 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1058,10 +1058,10 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
{
/* Get the current relation length */
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
num_pages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, RELEXT_EXCLUSIVE);
/* Quit if we've scanned the whole relation */
if (blkno >= num_pages)
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 8656af4..3d02a70 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -230,13 +230,13 @@ SpGistNewBuffer(Relation index)
/* Must extend the file */
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, RELEXT_EXCLUSIVE);
buffer = ReadBuffer(index, P_NEW);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index, RELEXT_EXCLUSIVE);
return buffer;
}
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index d7d5e90..3888d93 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -824,10 +824,10 @@ spgvacuumscan(spgBulkDeleteState *bds)
{
/* Get the current relation length */
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, RELEXT_EXCLUSIVE);
num_pages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index, RELEXT_EXCLUSIVE);
/* Quit if we've scanned the whole relation */
if (blkno >= num_pages)
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index e9b4045..3bdafa9 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -849,8 +849,8 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* it's got exclusive lock on the whole relation.
*/
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
- LockRelationForExtension(onerel, ExclusiveLock);
- UnlockRelationForExtension(onerel, ExclusiveLock);
+ LockRelationForExtension(onerel, RELEXT_EXCLUSIVE);
+ UnlockRelationForExtension(onerel, RELEXT_EXCLUSIVE);
LockBufferForCleanup(buf);
if (PageIsNew(page))
{
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 1f75e2e..a6f6f03 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3621,6 +3621,15 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_SYNC_REP:
event_name = "SyncRep";
break;
+ case WAIT_EVENT_LOGICAL_SYNC_DATA:
+ event_name = "LogicalSyncData";
+ break;
+ case WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE:
+ event_name = "LogicalSyncStateChange";
+ break;
+ case WAIT_EVENT_RELATION_EXTENSION:
+ event_name = "RelationExtension";
+ break;
/* no default case, so that compiler will warn */
}
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 4648473..498223a 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -624,7 +624,7 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
* Note that another backend might have extended or created the relation
* by the time we get the lock.
*/
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
@@ -652,7 +652,7 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
/* Update local cache with the up-to-date size */
rel->rd_smgr->smgr_fsm_nblocks = fsm_nblocks_now;
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, RELEXT_EXCLUSIVE);
pfree(pg);
}
diff --git a/src/backend/storage/lmgr/Makefile b/src/backend/storage/lmgr/Makefile
index e1b787e..2334a40 100644
--- a/src/backend/storage/lmgr/Makefile
+++ b/src/backend/storage/lmgr/Makefile
@@ -13,7 +13,7 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = lmgr.o lock.o proc.o deadlock.o lwlock.o lwlocknames.o spin.o \
- s_lock.o predicate.o condition_variable.o
+ s_lock.o predicate.o condition_variable.o extension_lock.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/lmgr/extension_lock.c b/src/backend/storage/lmgr/extension_lock.c
new file mode 100644
index 0000000..e8bbd5a
--- /dev/null
+++ b/src/backend/storage/lmgr/extension_lock.c
@@ -0,0 +1,380 @@
+/*-------------------------------------------------------------------------
+ *
+ * extension_lock.c
+ * Relation extension lock manager
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/lmgr/extension_lock.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "pg_trace.h"
+#include "postmaster/postmaster.h"
+#include "replication/slot.h"
+#include "storage/ipc.h"
+#include "storage/proc.h"
+#include "storage/proclist.h"
+#include "storage/spin.h"
+#include "storage/extension_lock.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+
+#ifdef LWLOCK_STATS
+#include "utils/hsearch.h"
+#endif
+
+/*
+ * Compute the hash code associated with a RELEXTLOCK.
+ *
+ * To avoid unnecessary recomputations of the hash code, we try to do this
+ * just once per function, and then pass it around as needed. Aside from
+ * passing the hashcode to hash_search_with_hash_value(), we can extract
+ * the lock partition number from the hashcode.
+ */
+#define RelExtLockTargetTagHashCode(relextlocktargettag) \
+ get_hash_value(RelExtLockHash, (const void *) relextlocktargettag)
+
+/*
+ * The lockmgr's shared hash tables are partitioned to reduce contention.
+ * To determine which partition a given relid belongs to, compute the tag's
+ * hash code with RelExtLockTargetTagHashCode(), then apply one of these macros.
+ * NB: NUM_RELEXTLOCK_PARTITIONS must be a power of 2!
+ */
+#define RelExtLockHashPartition(hashcode) \
+ ((hashcode) % NUM_RELEXTLOCK_PARTITIONS)
+#define RelExtLockHashPartitionLock(hashcode) \
+ (&MainLWLockArray[RELEXTLOCK_MANAGER_LWLOCK_OFFSET + \
 RelExtLockHashPartition(hashcode)].lock)
+#define RelExtLockHashPartitionLockByIndex(i) \
+ (&MainLWLockArray[RELEXTLOCK_MANAGER_LWLOCK_OFFSET + (i)].lock)
+
+#define RELEXT_VAL_EXCLUSIVE ((uint32) 1 << 24)
+#define RELEXT_VAL_SHARED 1
+
+#define RELEXT_LOCKMASK ((uint32) ((1 << 25) - 1))
+
+/* Maximum number of relation extension locks held simultaneously by one backend */
+#define MAX_SIMUL_EXTLOCKS 8
+
+/*
+ * This structure holds information about one held relation extension lock.
+ * held_relextlocks[] tracks the extension locks this backend is holding.
+ */
+typedef struct relextlock_handle
+{
+ Oid relid;
+ RelExtLock *lock;
+ RelExtLockMode mode; /* lock mode for this table entry */
+} relextlock_handle;
+static relextlock_handle held_relextlocks[MAX_SIMUL_EXTLOCKS];
+static int num_held_relextlocks = 0;
+
+static bool RelExtLockAcquire(Oid relid, RelExtLockMode lockmode, bool conditional);
+static void RelExtLockRelease(Oid relid, RelExtLockMode lockmode);
+static bool RelExtLockAttemptLock(RelExtLock *ext_lock, RelExtLockMode lockmode);
+
+/*
+ * Pointers to hash tables containing lock state
+ *
+ * The RelExtLockHash hash table is in shared memory
+ */
+static HTAB *RelExtLockHash;
+
+/*
+ * InitRelExtLock
+ * Initialize the relation extension lock manager's data structures.
+ */
+void
+InitRelExtLock(long max_table_size)
+{
+ HASHCTL info;
+ long init_table_size;
+
+ /*
+ * Compute init/max size to request for lock hashtables. Note these
+ * calculations must agree with LockShmemSize!
+ */
+ init_table_size = max_table_size / 2;
+
+ /*
+ * Allocate hash table for RELEXTLOCK structs. This stores per-relation
+ * lock.
+ */
+ MemSet(&info, 0, sizeof(info));
+ info.keysize = sizeof(Oid);
+ info.entrysize = sizeof(RelExtLock);
+ info.num_partitions = NUM_RELEXTLOCK_PARTITIONS;
+
+ RelExtLockHash = ShmemInitHash("RelExtLock Hash",
+ init_table_size,
+ max_table_size,
+ &info,
+ HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
+}
+
+/*
+ * LockRelationForExtension
+ *
+ * This lock is used to interlock addition of pages to relations.
+ * We need such locking because bufmgr/smgr definition of P_NEW is not
+ * race-condition-proof.
+ *
+ * We assume the caller is already holding some type of regular lock on
+ * the relation, so no AcceptInvalidationMessages call is needed here.
+ */
+void
+LockRelationForExtension(Relation relation, RelExtLockMode lockmode)
+{
+ RelExtLockAcquire(relation->rd_id, lockmode, false);
+}
+
+/*
+ * ConditionalLockRelationForExtension
+ *
+ * As above, but only lock if we can get the lock without blocking.
+ * Returns TRUE iff the lock was acquired.
+ */
+bool
+ConditionalLockRelationForExtension(Relation relation, RelExtLockMode lockmode)
+{
+ return RelExtLockAcquire(relation->rd_id, lockmode, true);
+}
+
+/*
+ * RelationExtensionLockWaiterCount
+ *
+ * Count the number of processes waiting for the given relation extension lock.
+ * Note that this routine doesn't acquire the partition lock; the caller must
+ * either hold the partition lock in exclusive mode or already hold the
+ * relation extension lock of this relation.
+ */
+int
+RelationExtensionLockWaiterCount(Relation relation)
+{
+ RelExtLock *ext_lock;
+ Oid relid;
+ uint32 nwaiters;
+ uint32 hashcode;
+ bool found;
+
+ relid = relation->rd_id;
+ hashcode = RelExtLockTargetTagHashCode(&relid);
+
+ ext_lock = (RelExtLock *) hash_search_with_hash_value(RelExtLockHash,
+ (void *) &relid,
+ hashcode,
+ HASH_FIND, &found);
+ /* We assume that we already hold this lock */
+ Assert(found);
+
+ nwaiters = pg_atomic_read_u32(&(ext_lock->nwaiters));
+
+ return nwaiters;
+}
+
+/*
+ * UnlockRelationForExtension
+ */
+void
+UnlockRelationForExtension(Relation relation, RelExtLockMode lockmode)
+{
+ RelExtLockRelease(relation->rd_id, lockmode);
+}
+
+/*
+ * Acquire a relation extension lock, creating a RELEXTLOCK hash entry in the
+ * shared hash table if necessary.  To avoid deadlock between the partition
+ * lock and the lock itself, we keep holding the partition lock here; the
+ * caller must release it later by calling RelExtLockRelease.
+ */
+static bool
+RelExtLockAcquire(Oid relid, RelExtLockMode lockmode, bool conditional)
+{
+ RelExtLock *ext_lock;
+ LWLock *partitionLock;
+ uint32 hashcode;
+ bool found;
+ bool got_lock = false;
+ bool waited = false;
+
+ hashcode = RelExtLockTargetTagHashCode(&relid);
+ partitionLock = RelExtLockHashPartitionLock(hashcode);
+ LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+
+ ext_lock = (RelExtLock *) hash_search_with_hash_value(RelExtLockHash,
+ (void *) &relid,
+ hashcode, HASH_ENTER, &found);
+
+ if (!ext_lock)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of shared memory"),
+ errhint("You might need to increase max_locks_per_transaction.")));
+
+ /* Initialize the entry if we just created it */
+ if (!found)
+ {
+ pg_atomic_init_u32(&(ext_lock->state), 0);
+ pg_atomic_init_u32(&(ext_lock->nwaiters), 0);
+ ConditionVariableInit(&(ext_lock->cv));
+ }
+
+ for (;;)
+ {
+ bool ret;
+
+ ret = RelExtLockAttemptLock(ext_lock, lockmode);
+
+ if (ret)
+ {
+ got_lock = true;
+
+ if (waited)
+ pg_atomic_sub_fetch_u32(&(ext_lock->nwaiters), 1);
+
+ break; /* got the lock */
+ }
+
+ /* Could not get the lock; give up if this is a conditional request */
+ if (!ret && conditional)
+ break;
+
+ /* Count ourselves as a waiter (only once, even if we loop) */
+ if (!waited)
+ {
+ pg_atomic_add_fetch_u32(&(ext_lock->nwaiters), 1);
+ waited = true;
+ }
+ ConditionVariableSleep(&(ext_lock->cv), WAIT_EVENT_RELATION_EXTENSION);
+ }
+
+ ConditionVariableCancelSleep();
+
+ if (got_lock)
+ {
+ /* Add lock to list relation extension locks held by this backend */
+ held_relextlocks[num_held_relextlocks].relid = relid;
+ held_relextlocks[num_held_relextlocks].lock = ext_lock;
+ held_relextlocks[num_held_relextlocks].mode = lockmode;
+ num_held_relextlocks++;
+ }
+ else
+ LWLockRelease(partitionLock);
+
+ /* If this is not a conditional request, we always return true */
+ return got_lock;
+}
+
+/*
+ * RelationExtensionLockReleaseAll - release all currently-held relation extension locks
+ */
+void
+RelationExtensionLockReleaseAll(void)
+{
+ while (num_held_relextlocks > 0)
+ {
+ HOLD_INTERRUPTS();
+
+ RelExtLockRelease(held_relextlocks[num_held_relextlocks - 1].relid,
+ held_relextlocks[num_held_relextlocks - 1].mode);
+ }
+}
+
+/*
+ * RelExtLockRelease
+ *
+ * Release the lock and remove the RELEXTLOCK entry from the shared
+ * RelExtLockHash hash table. Since other backends might be acquiring or
+ * waiting for this lock, we can delete the entry only if no backend is
+ * interested in it any longer.
+ *
+ * Note that the partition lock for the hash table was already acquired when
+ * the lock was taken; this routine releases it after releasing the lock.
+ */
+static void
+RelExtLockRelease(Oid relid, RelExtLockMode lockmode)
+{
+ RelExtLock *ext_lock;
+ RelExtLockMode mode;
+ uint32 hashcode;
+ LWLock *partitionLock;
+ uint32 oldstate;
+ uint32 nwaiters;
+ int i;
+
+ hashcode = RelExtLockTargetTagHashCode(&relid);
+ partitionLock = RelExtLockHashPartitionLock(hashcode);
+
+ for (i = num_held_relextlocks; --i >= 0;)
+ if (relid == held_relextlocks[i].relid &&
+ lockmode == held_relextlocks[i].mode)
+ break;
+
+ if (i < 0)
+ elog(ERROR, "relation extension lock for %u with lock mode %d is not held",
+ relid, lockmode);
+
+ ext_lock = held_relextlocks[i].lock;
+ mode = held_relextlocks[i].mode;
+
+ num_held_relextlocks--;
+
+ /* Shrink */
+ for (; i < num_held_relextlocks; i++)
+ held_relextlocks[i] = held_relextlocks[i + 1];
+
+ if (mode == RELEXT_EXCLUSIVE)
+ oldstate = pg_atomic_sub_fetch_u32(&(ext_lock->state), RELEXT_VAL_EXCLUSIVE);
+ else
+ oldstate = pg_atomic_sub_fetch_u32(&(ext_lock->state), RELEXT_VAL_SHARED);
+
+ nwaiters = pg_atomic_read_u32(&(ext_lock->nwaiters));
+
+ /* Wake up waiters if there are any; otherwise remove the unused entry */
+ if (nwaiters > 0)
+ ConditionVariableBroadcast(&(ext_lock->cv));
+ else
+ hash_search_with_hash_value(RelExtLockHash, (void *) &relid,
+ hashcode, HASH_REMOVE, NULL);
+
+ LWLockRelease(partitionLock);
+}
+
+/*
+ * Internal function that tries to atomically acquire the relation extension
+ * lock in the passed in mode. Return true if we got the lock.
+ */
+static bool
+RelExtLockAttemptLock(RelExtLock *ext_lock, RelExtLockMode lockmode)
+{
+ uint32 oldstate;
+
+ oldstate = pg_atomic_read_u32(&ext_lock->state);
+
+ while (true)
+ {
+ uint32 desired_state;
+ bool lock_free;
+
+ desired_state = oldstate;
+
+ if (lockmode == RELEXT_EXCLUSIVE)
+ {
+ lock_free = (oldstate & RELEXT_LOCKMASK) == 0;
+ if (lock_free)
+ desired_state += RELEXT_VAL_EXCLUSIVE;
+ }
+ else
+ {
+ lock_free = (oldstate & RELEXT_VAL_EXCLUSIVE) == 0;
+ if (lock_free)
+ desired_state += RELEXT_VAL_SHARED;
+ }
+
+ if (pg_atomic_compare_exchange_u32(&ext_lock->state,
+ &oldstate, desired_state))
+ {
+ if (lock_free)
+ return true;
+ else
+ return false;
+ }
+ }
+ pg_unreachable();
+}
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index fe98898..34095cb 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -319,78 +319,6 @@ UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode)
}
/*
- * LockRelationForExtension
- *
- * This lock tag is used to interlock addition of pages to relations.
- * We need such locking because bufmgr/smgr definition of P_NEW is not
- * race-condition-proof.
- *
- * We assume the caller is already holding some type of regular lock on
- * the relation, so no AcceptInvalidationMessages call is needed here.
- */
-void
-LockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- (void) LockAcquire(&tag, lockmode, false, false);
-}
-
-/*
- * ConditionalLockRelationForExtension
- *
- * As above, but only lock if we can get the lock without blocking.
- * Returns TRUE iff the lock was acquired.
- */
-bool
-ConditionalLockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- return (LockAcquire(&tag, lockmode, false, true) != LOCKACQUIRE_NOT_AVAIL);
-}
-
-/*
- * RelationExtensionLockWaiterCount
- *
- * Count the number of processes waiting for the given relation extension lock.
- */
-int
-RelationExtensionLockWaiterCount(Relation relation)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- return LockWaiterCount(&tag);
-}
-
-/*
- * UnlockRelationForExtension
- */
-void
-UnlockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- LockRelease(&tag, lockmode, false);
-}
-
-/*
* LockPage
*
* Obtain a page-level lock. This is currently used by some index access
@@ -961,12 +889,6 @@ DescribeLockTag(StringInfo buf, const LOCKTAG *tag)
tag->locktag_field2,
tag->locktag_field1);
break;
- case LOCKTAG_RELATION_EXTEND:
- appendStringInfo(buf,
- _("extension of relation %u of database %u"),
- tag->locktag_field2,
- tag->locktag_field1);
- break;
case LOCKTAG_PAGE:
appendStringInfo(buf,
_("page %u of relation %u of database %u"),
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 2b26173..bc576a7 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -45,6 +45,7 @@
#include "storage/sinvaladt.h"
#include "storage/spin.h"
#include "storage/standby.h"
+#include "storage/lmgr.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
#include "utils/resowner_private.h"
@@ -388,6 +389,10 @@ InitLocks(void)
max_table_size = NLOCKENTS();
init_table_size = max_table_size / 2;
+
+ /* Initialize lock structure for relation extension lock */
+ InitRelExtLock(max_table_size);
+
/*
* Allocate hash table for LOCK structs. This stores per-locked-object
* information.
@@ -3366,6 +3371,7 @@ LockShmemSize(void)
/* lock hash table */
max_table_size = NLOCKENTS();
size = add_size(size, hash_estimate_size(max_table_size, sizeof(LOCK)));
+ size = add_size(size, hash_estimate_size(max_table_size, sizeof(RelExtLock)));
/* proclock hash table */
max_table_size *= 2;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 82a1cf5..3d465a5 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -451,6 +451,13 @@ InitializeLWLocks(void)
for (id = 0; id < NUM_PREDICATELOCK_PARTITIONS; id++, lock++)
LWLockInitialize(&lock->lock, LWTRANCHE_PREDICATE_LOCK_MANAGER);
+ /* Initialize relation extension lmgr's LWLocks in main array */
+ lock = MainLWLockArray + NUM_INDIVIDUAL_LWLOCKS +
+ NUM_BUFFER_PARTITIONS + NUM_LOCK_PARTITIONS +
+ NUM_PREDICATELOCK_PARTITIONS;
+ for (id = 0; id < NUM_RELEXTLOCK_PARTITIONS; id++, lock++)
+ LWLockInitialize(&lock->lock, LWTRANCHE_RELEXT_LOCK_MANAGER);
+
/* Initialize named tranches. */
if (NamedLWLockTrancheRequests > 0)
{
@@ -494,7 +501,7 @@ RegisterLWLockTranches(void)
if (LWLockTrancheArray == NULL)
{
- LWLockTranchesAllocated = 64;
+ LWLockTranchesAllocated = 128;
LWLockTrancheArray = (char **)
MemoryContextAllocZero(TopMemoryContext,
LWLockTranchesAllocated * sizeof(char *));
@@ -508,6 +515,7 @@ RegisterLWLockTranches(void)
LWLockRegisterTranche(LWTRANCHE_LOCK_MANAGER, "lock_manager");
LWLockRegisterTranche(LWTRANCHE_PREDICATE_LOCK_MANAGER,
"predicate_lock_manager");
+ LWLockRegisterTranche(LWTRANCHE_RELEXT_LOCK_MANAGER, "relext_lock_manager");
LWLockRegisterTranche(LWTRANCHE_PARALLEL_QUERY_DSA,
"parallel_query_dsa");
LWLockRegisterTranche(LWTRANCHE_TBM, "tbm");
diff --git a/src/backend/utils/adt/lockfuncs.c b/src/backend/utils/adt/lockfuncs.c
index 9e0a8ab..6d8916c 100644
--- a/src/backend/utils/adt/lockfuncs.c
+++ b/src/backend/utils/adt/lockfuncs.c
@@ -25,7 +25,6 @@
/* This must match enum LockTagType! */
const char *const LockTagTypeNames[] = {
"relation",
- "extend",
"page",
"tuple",
"transactionid",
@@ -234,7 +233,6 @@ pg_lock_status(PG_FUNCTION_ARGS)
switch ((LockTagType) instance->locktag.locktag_type)
{
case LOCKTAG_RELATION:
- case LOCKTAG_RELATION_EXTEND:
values[1] = ObjectIdGetDatum(instance->locktag.locktag_field1);
values[2] = ObjectIdGetDatum(instance->locktag.locktag_field2);
nulls[3] = true;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index cb05d9b..ceac774 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -815,7 +815,10 @@ typedef enum
WAIT_EVENT_REPLICATION_ORIGIN_DROP,
WAIT_EVENT_REPLICATION_SLOT_DROP,
WAIT_EVENT_SAFE_SNAPSHOT,
- WAIT_EVENT_SYNC_REP
+ WAIT_EVENT_SYNC_REP,
+ WAIT_EVENT_LOGICAL_SYNC_DATA,
+ WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
+ WAIT_EVENT_RELATION_EXTENSION
} WaitEventIPC;
/* ----------
diff --git a/src/include/storage/extension_lock.h b/src/include/storage/extension_lock.h
new file mode 100644
index 0000000..f178672
--- /dev/null
+++ b/src/include/storage/extension_lock.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * extension_lock.h
+ * Relation extension lock manager
+ *
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/extension_lock.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef EXTENSION_LOCK_H
+#define EXTENSION_LOCK_H
+
+#ifdef FRONTEND
+#error "extension_lock.h may not be included from frontend code"
+#endif
+
+#include "storage/proclist_types.h"
+#include "storage/s_lock.h"
+#include "storage/condition_variable.h"
+#include "port/atomics.h"
+
+typedef struct RelExtLock
+{
+ Oid relid;
+ pg_atomic_uint32 state;
+ pg_atomic_uint32 nwaiters;
+ ConditionVariable cv;
+} RelExtLock;
+
+typedef enum RelExtLockMode
+{
+ RELEXT_EXCLUSIVE,
+ RELEXT_SHARED
+} RelExtLockMode;
+
+/* Lock a relation for extension */
+extern void InitRelExtLock(long max_table_size);
+extern void LockRelationForExtension(Relation relation, RelExtLockMode lockmode);
+extern void UnlockRelationForExtension(Relation relation, RelExtLockMode lockmode);
+extern bool ConditionalLockRelationForExtension(Relation relation, RelExtLockMode lockmode);
+extern int RelationExtensionLockWaiterCount(Relation relation);
+extern void RelationExtensionLockReleaseAll(void);
+
+#endif /* EXTENSION_LOCK_H */
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 0b92322..ac23354 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -15,6 +15,7 @@
#define LMGR_H
#include "lib/stringinfo.h"
+#include "storage/extension_lock.h"
#include "storage/itemptr.h"
#include "storage/lock.h"
#include "utils/rel.h"
@@ -34,6 +35,36 @@ typedef enum XLTW_Oper
XLTW_RecheckExclusionConstr
} XLTW_Oper;
+typedef struct RELEXTLOCKTAG
+{
+ Oid relid; /* identifies the lockable object */
+ LWLockMode mode; /* lock mode for this table entry */
+} RELEXTLOCKTAG;
+
+/*
+ * This structure holds information per-object relation extension
+ * lock.
+ */
+typedef struct RELEXTLOCK
+{
+ RELEXTLOCKTAG tag; /* hash key -- must be first */
+ LWLock lock; /* LWLock for relation extension */
+} RELEXTLOCK;
+
+/*
+ * The LOCALRELEXTLOCK struct represents a local copy of data which is
+ * also present in the RELEXTLOCK table, organized for fast access without
+ * needing to acquire a LWLock. It is strictly for optimization.
+ */
+typedef struct LOCALRELEXTLOCK
+{
+ /* hash key */
+ RELEXTLOCKTAG relid; /* unique identifier of locktable object */
+
+ /* data */
+ bool held; /* is lock held? */
+} LOCALRELEXTLOCK;
+
extern void RelationInitLockInfo(Relation relation);
/* Lock a relation */
@@ -50,13 +81,6 @@ extern bool LockHasWaitersRelation(Relation relation, LOCKMODE lockmode);
extern void LockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
-/* Lock a relation for extension */
-extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
-extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
-extern bool ConditionalLockRelationForExtension(Relation relation,
- LOCKMODE lockmode);
-extern int RelationExtensionLockWaiterCount(Relation relation);
-
/* Lock a page (currently only used within indexes) */
extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
extern bool ConditionalLockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 765431e..3be18ea 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -138,8 +138,6 @@ typedef uint16 LOCKMETHODID;
typedef enum LockTagType
{
LOCKTAG_RELATION, /* whole relation */
- /* ID info for a relation is DB OID + REL OID; DB OID = 0 if shared */
- LOCKTAG_RELATION_EXTEND, /* the right to extend a relation */
/* same ID info as RELATION */
LOCKTAG_PAGE, /* one page of a relation */
/* ID info for a page is RELATION info + BlockNumber */
@@ -198,14 +196,6 @@ typedef struct LOCKTAG
(locktag).locktag_type = LOCKTAG_RELATION, \
(locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
-#define SET_LOCKTAG_RELATION_EXTEND(locktag,dboid,reloid) \
- ((locktag).locktag_field1 = (dboid), \
- (locktag).locktag_field2 = (reloid), \
- (locktag).locktag_field3 = 0, \
- (locktag).locktag_field4 = 0, \
- (locktag).locktag_type = LOCKTAG_RELATION_EXTEND, \
- (locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
-
#define SET_LOCKTAG_PAGE(locktag,dboid,reloid,blocknum) \
((locktag).locktag_field1 = (dboid), \
(locktag).locktag_field2 = (reloid), \
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 3d16132..c0e6242 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -120,14 +120,21 @@ extern PGDLLIMPORT int NamedLWLockTrancheRequests;
#define LOG2_NUM_PREDICATELOCK_PARTITIONS 4
#define NUM_PREDICATELOCK_PARTITIONS (1 << LOG2_NUM_PREDICATELOCK_PARTITIONS)
+/* Number of partitions the shared relation extension lock tables are divided into */
+#define LOG2_NUM_RELEXTLOCK_PARTITIONS 4
+#define NUM_RELEXTLOCK_PARTITIONS (1 << LOG2_NUM_RELEXTLOCK_PARTITIONS)
+
/* Offsets for various chunks of preallocated lwlocks. */
#define BUFFER_MAPPING_LWLOCK_OFFSET NUM_INDIVIDUAL_LWLOCKS
#define LOCK_MANAGER_LWLOCK_OFFSET \
(BUFFER_MAPPING_LWLOCK_OFFSET + NUM_BUFFER_PARTITIONS)
#define PREDICATELOCK_MANAGER_LWLOCK_OFFSET \
(LOCK_MANAGER_LWLOCK_OFFSET + NUM_LOCK_PARTITIONS)
-#define NUM_FIXED_LWLOCKS \
+#define RELEXTLOCK_MANAGER_LWLOCK_OFFSET \
(PREDICATELOCK_MANAGER_LWLOCK_OFFSET + NUM_PREDICATELOCK_PARTITIONS)
+#define NUM_FIXED_LWLOCKS \
+ (PREDICATELOCK_MANAGER_LWLOCK_OFFSET + NUM_PREDICATELOCK_PARTITIONS + \
+ NUM_RELEXTLOCK_PARTITIONS)
typedef enum LWLockMode
{
@@ -151,6 +158,8 @@ extern void LWLockReleaseClearVar(LWLock *lock, uint64 *valptr, uint64 val);
extern void LWLockReleaseAll(void);
extern bool LWLockHeldByMe(LWLock *lock);
extern bool LWLockHeldByMeInMode(LWLock *lock, LWLockMode mode);
+extern bool LWLockCheckForCleanup(LWLock *lock);
+extern int LWLockWaiterCount(LWLock *lock);
extern bool LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval);
extern void LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 value);
@@ -211,6 +220,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_BUFFER_MAPPING,
LWTRANCHE_LOCK_MANAGER,
LWTRANCHE_PREDICATE_LOCK_MANAGER,
+ LWTRANCHE_RELEXT_LOCK_MANAGER,
LWTRANCHE_PARALLEL_QUERY_DSA,
LWTRANCHE_TBM,
LWTRANCHE_FIRST_USER_DEFINED
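The state word manipulated by RelExtLockAttemptLock in the patch packs an exclusive bit at 1 << 24 above a 24-bit shared-holder count, so a single compare-and-swap can both test for conflicts and take the lock in either mode. A minimal standalone sketch of that encoding using C11 atomics (the relext_* names here are illustrative only; the actual patch uses PostgreSQL's pg_atomic API and layers waiter accounting and condition-variable sleeps on top):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define VAL_EXCLUSIVE ((uint32_t) 1 << 24)   /* exclusive-holder bit */
#define VAL_SHARED    ((uint32_t) 1)         /* one shared holder */
#define LOCKMASK      ((uint32_t) ((1 << 25) - 1))

typedef enum { MODE_EXCLUSIVE, MODE_SHARED } lockmode_t;

typedef struct { _Atomic uint32_t state; } relext_state;

static void relext_init(relext_state *l) { atomic_store(&l->state, 0); }

/* One lock attempt, mirroring the CAS loop in RelExtLockAttemptLock. */
static bool relext_try_lock(relext_state *l, lockmode_t mode)
{
    uint32_t oldstate = atomic_load(&l->state);

    for (;;)
    {
        uint32_t desired = oldstate;
        bool     lock_free;

        if (mode == MODE_EXCLUSIVE)
        {
            /* exclusive conflicts with any holder, shared or exclusive */
            lock_free = (oldstate & LOCKMASK) == 0;
            if (lock_free)
                desired += VAL_EXCLUSIVE;
        }
        else
        {
            /* shared conflicts only with an exclusive holder */
            lock_free = (oldstate & VAL_EXCLUSIVE) == 0;
            if (lock_free)
                desired += VAL_SHARED;
        }

        /* on CAS failure, oldstate is refreshed and we recompute */
        if (atomic_compare_exchange_weak(&l->state, &oldstate, desired))
            return lock_free;
    }
}

static void relext_unlock(relext_state *l, lockmode_t mode)
{
    atomic_fetch_sub(&l->state,
                     mode == MODE_EXCLUSIVE ? VAL_EXCLUSIVE : VAL_SHARED);
}
```

Under this encoding, shared acquisitions stack in the low bits while an exclusive holder excludes everyone, which is the same arithmetic the patch performs with RELEXT_VAL_EXCLUSIVE and RELEXT_VAL_SHARED.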
On Wed, Aug 16, 2017 at 2:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
The previous patch conflicts with current HEAD, I rebased the patch to
current HEAD.
Hi Masahiko-san,
FYI this doesn't build anymore. I think it's just because the wait
event enumerators were re-alphabetised in pgstat.h:
../../../../src/include/pgstat.h:820:2: error: redeclaration of
enumerator ‘WAIT_EVENT_LOGICAL_SYNC_DATA’
WAIT_EVENT_LOGICAL_SYNC_DATA,
^
../../../../src/include/pgstat.h:806:2: note: previous definition of
‘WAIT_EVENT_LOGICAL_SYNC_DATA’ was here
WAIT_EVENT_LOGICAL_SYNC_DATA,
^
../../../../src/include/pgstat.h:821:2: error: redeclaration of
enumerator ‘WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE’
WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
^
../../../../src/include/pgstat.h:807:2: note: previous definition of
‘WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE’ was here
WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
^
--
Thomas Munro
http://www.enterprisedb.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Sep 8, 2017 at 10:24 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Wed, Aug 16, 2017 at 2:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
The previous patch conflicts with current HEAD, I rebased the patch to
current HEAD.

Hi Masahiko-san,
Hi Sawada-san,
I have just learned from a colleague who is knowledgeable about
Japanese customs and kind enough to correct me that the appropriate
term of address for our colleagues in Japan on this mailing list is
<lastname>-san. I was confused about that -- apologies for my
clumsiness.
--
Thomas Munro
http://www.enterprisedb.com
On Fri, Sep 8, 2017 at 8:25 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Fri, Sep 8, 2017 at 10:24 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Wed, Aug 16, 2017 at 2:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
The previous patch conflicts with current HEAD, I rebased the patch to
current HEAD.

Hi Masahiko-san,
Hi Sawada-san,
I have just learned from a colleague who is knowledgeable about
Japanese customs and kind enough to correct me that the appropriate
term of address for our colleagues in Japan on this mailing list is
<lastname>-san. I was confused about that -- apologies for my
clumsiness.
Don't worry about it, either is ok. In Japan there is a custom of
writing <lastname>-san but <firstname>-san is also not incorrect :-)
(also I think it's hard to distinguish between last name and first
name of Japanese name).
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Fri, Sep 8, 2017 at 7:24 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Wed, Aug 16, 2017 at 2:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
The previous patch conflicts with current HEAD, I rebased the patch to
current HEAD.

Hi Masahiko-san,
FYI this doesn't build anymore. I think it's just because the wait
event enumerators were re-alphabetised in pgstat.h:

../../../../src/include/pgstat.h:820:2: error: redeclaration of
enumerator ‘WAIT_EVENT_LOGICAL_SYNC_DATA’
WAIT_EVENT_LOGICAL_SYNC_DATA,
^
../../../../src/include/pgstat.h:806:2: note: previous definition of
‘WAIT_EVENT_LOGICAL_SYNC_DATA’ was here
WAIT_EVENT_LOGICAL_SYNC_DATA,
^
../../../../src/include/pgstat.h:821:2: error: redeclaration of
enumerator ‘WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE’
WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
^
../../../../src/include/pgstat.h:807:2: note: previous definition of
‘WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE’ was here
WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
^
Thank you for the information! Attached rebased patch.
--
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachments:
Moving_extension_lock_out_of_heavyweight_lock_v4.patch (application/octet-stream)
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 80f803e..b928c1a 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -609,8 +609,8 @@ brin_page_cleanup(Relation idxrel, Buffer buf)
*/
if (PageIsNew(page))
{
- LockRelationForExtension(idxrel, ShareLock);
- UnlockRelationForExtension(idxrel, ShareLock);
+ LockRelationForExtension(idxrel, RELEXT_SHARED);
+ UnlockRelationForExtension(idxrel, RELEXT_SHARED);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
if (PageIsNew(page))
@@ -702,7 +702,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
*/
if (!RELATION_IS_LOCAL(irel))
{
- LockRelationForExtension(irel, ExclusiveLock);
+ LockRelationForExtension(irel, RELEXT_EXCLUSIVE);
extensionLockHeld = true;
}
buf = ReadBuffer(irel, P_NEW);
@@ -754,7 +754,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
}
if (extensionLockHeld)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel, RELEXT_EXCLUSIVE);
ReleaseBuffer(buf);
return InvalidBuffer;
@@ -764,7 +764,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
if (extensionLockHeld)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel, RELEXT_EXCLUSIVE);
page = BufferGetPage(buf);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index 22f2076..4c15b45 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -570,7 +570,7 @@ revmap_physical_extend(BrinRevmap *revmap)
else
{
if (needLock)
- LockRelationForExtension(irel, ExclusiveLock);
+ LockRelationForExtension(irel, RELEXT_EXCLUSIVE);
buf = ReadBuffer(irel, P_NEW);
if (BufferGetBlockNumber(buf) != mapBlk)
@@ -582,7 +582,7 @@ revmap_physical_extend(BrinRevmap *revmap)
* page from under whoever is using it.
*/
if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel, RELEXT_EXCLUSIVE);
LockBuffer(revmap->rm_metaBuf, BUFFER_LOCK_UNLOCK);
ReleaseBuffer(buf);
return;
@@ -591,7 +591,7 @@ revmap_physical_extend(BrinRevmap *revmap)
page = BufferGetPage(buf);
if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel, RELEXT_EXCLUSIVE);
}
/* Check that it's a regular block (or an empty page) */
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 136ea27..1690d21 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -325,13 +325,13 @@ GinNewBuffer(Relation index)
/* Must extend the file */
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, RELEXT_EXCLUSIVE);
buffer = ReadBuffer(index, P_NEW);
LockBuffer(buffer, GIN_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index, RELEXT_EXCLUSIVE);
return buffer;
}
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 31425e9..e9f84bc 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -716,10 +716,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, RELEXT_EXCLUSIVE);
npages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index, RELEXT_EXCLUSIVE);
totFreePages = 0;
@@ -766,10 +766,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
stats->pages_free = totFreePages;
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, RELEXT_EXCLUSIVE);
stats->num_pages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index, RELEXT_EXCLUSIVE);
return stats;
}
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index b6ccc1a..e2cadc6 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -801,13 +801,13 @@ gistNewBuffer(Relation r)
needLock = !RELATION_IS_LOCAL(r);
if (needLock)
- LockRelationForExtension(r, ExclusiveLock);
+ LockRelationForExtension(r, RELEXT_EXCLUSIVE);
buffer = ReadBuffer(r, P_NEW);
LockBuffer(buffer, GIST_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(r, ExclusiveLock);
+ UnlockRelationForExtension(r, RELEXT_EXCLUSIVE);
return buffer;
}
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 77d9d12..ca45b06 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -59,10 +59,10 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
/* try to find deleted pages */
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
npages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, RELEXT_EXCLUSIVE);
totFreePages = 0;
for (blkno = GIST_ROOT_BLKNO + 1; blkno < npages; blkno++)
@@ -91,10 +91,10 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
/* return statistics */
stats->pages_free = totFreePages;
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
stats->num_pages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, RELEXT_EXCLUSIVE);
return stats;
}
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 13e3bdc..a8ce6c7 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -519,11 +519,11 @@ loop:
if (needLock)
{
if (!use_fsm)
- LockRelationForExtension(relation, ExclusiveLock);
- else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock))
+ LockRelationForExtension(relation, RELEXT_EXCLUSIVE);
+ else if (!ConditionalLockRelationForExtension(relation, RELEXT_EXCLUSIVE))
{
/* Couldn't get the lock immediately; wait for it. */
- LockRelationForExtension(relation, ExclusiveLock);
+ LockRelationForExtension(relation, RELEXT_EXCLUSIVE);
/*
* Check if some other backend has extended a block for us while
@@ -537,7 +537,7 @@ loop:
*/
if (targetBlock != InvalidBlockNumber)
{
- UnlockRelationForExtension(relation, ExclusiveLock);
+ UnlockRelationForExtension(relation, RELEXT_EXCLUSIVE);
goto loop;
}
@@ -576,7 +576,7 @@ loop:
* against vacuumlazy.c --- see comments therein.
*/
if (needLock)
- UnlockRelationForExtension(relation, ExclusiveLock);
+ UnlockRelationForExtension(relation, RELEXT_EXCLUSIVE);
/*
* We need to initialize the empty new page. Double-check that it really
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 4c2a13a..7dc3088 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -641,7 +641,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
* Note that another backend might have extended or created the relation
* by the time we get the lock.
*/
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
@@ -679,7 +679,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
/* Update local cache with the up-to-date size */
rel->rd_smgr->smgr_vm_nblocks = vm_nblocks_now;
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, RELEXT_EXCLUSIVE);
pfree(pg);
}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 5c817b6..89daab0 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -658,7 +658,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
needLock = !RELATION_IS_LOCAL(rel);
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
buf = ReadBuffer(rel, P_NEW);
@@ -672,7 +672,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
* condition against btvacuumscan --- see comments therein.
*/
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, RELEXT_EXCLUSIVE);
/* Initialize the new page before returning it */
page = BufferGetPage(buf);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 3dbafdd..394a660 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1058,10 +1058,10 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
{
/* Get the current relation length */
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
num_pages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, RELEXT_EXCLUSIVE);
/* Quit if we've scanned the whole relation */
if (blkno >= num_pages)
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 22f64b0..12be667 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -230,13 +230,13 @@ SpGistNewBuffer(Relation index)
/* Must extend the file */
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, RELEXT_EXCLUSIVE);
buffer = ReadBuffer(index, P_NEW);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index, RELEXT_EXCLUSIVE);
return buffer;
}
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index d7d5e90..3888d93 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -824,10 +824,10 @@ spgvacuumscan(spgBulkDeleteState *bds)
{
/* Get the current relation length */
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, RELEXT_EXCLUSIVE);
num_pages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index, RELEXT_EXCLUSIVE);
/* Quit if we've scanned the whole relation */
if (blkno >= num_pages)
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 45b1859..a5d6a28 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -849,8 +849,8 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* it's got exclusive lock on the whole relation.
*/
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
- LockRelationForExtension(onerel, ExclusiveLock);
- UnlockRelationForExtension(onerel, ExclusiveLock);
+ LockRelationForExtension(onerel, RELEXT_EXCLUSIVE);
+ UnlockRelationForExtension(onerel, RELEXT_EXCLUSIVE);
LockBufferForCleanup(buf);
if (PageIsNew(page))
{
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index accf302..cbcc5bf 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3624,6 +3624,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_SYNC_REP:
event_name = "SyncRep";
break;
+ case WAIT_EVENT_RELATION_EXTENSION:
+ event_name = "RelationExtension";
+ break;
/* no default case, so that compiler will warn */
}
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 4648473..498223a 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -624,7 +624,7 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
* Note that another backend might have extended or created the relation
* by the time we get the lock.
*/
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
@@ -652,7 +652,7 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
/* Update local cache with the up-to-date size */
rel->rd_smgr->smgr_fsm_nblocks = fsm_nblocks_now;
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, RELEXT_EXCLUSIVE);
pfree(pg);
}
diff --git a/src/backend/storage/lmgr/Makefile b/src/backend/storage/lmgr/Makefile
index e1b787e..2334a40 100644
--- a/src/backend/storage/lmgr/Makefile
+++ b/src/backend/storage/lmgr/Makefile
@@ -13,7 +13,7 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = lmgr.o lock.o proc.o deadlock.o lwlock.o lwlocknames.o spin.o \
- s_lock.o predicate.o condition_variable.o
+ s_lock.o predicate.o condition_variable.o extension_lock.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/lmgr/extension_lock.c b/src/backend/storage/lmgr/extension_lock.c
new file mode 100644
index 0000000..e8bbd5a
--- /dev/null
+++ b/src/backend/storage/lmgr/extension_lock.c
@@ -0,0 +1,380 @@
+/*-------------------------------------------------------------------------
+ *
+ * extension_lock.c
+ * Relation extension lock manager
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/lmgr/extension_lock.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "pg_trace.h"
+#include "postmaster/postmaster.h"
+#include "replication/slot.h"
+#include "storage/ipc.h"
+#include "storage/proc.h"
+#include "storage/proclist.h"
+#include "storage/spin.h"
+#include "storage/extension_lock.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+
+#ifdef LWLOCK_STATS
+#include "utils/hsearch.h"
+#endif
+
+/*
+ * Compute the hash code associated with a RELEXTLOCK.
+ *
+ * To avoid unnecessary recomputations of the hash code, we try to do this
+ * just once per function, and then pass it around as needed. Aside from
+ * passing the hashcode to hash_search_with_hash_value(), we can extract
+ * the lock partition number from the hashcode.
+ */
+#define RelExtLockTargetTagHashCode(relextlocktargettag) \
+ get_hash_value(RelExtLockHash, (const void *) relextlocktargettag)
+
+/*
+ * The lockmgr's shared hash tables are partitioned to reduce contention.
+ * To determine which partition a given relid belongs to, compute the tag's
+ * hash code with RelExtLockTargetTagHashCode(), then apply one of these macros.
+ * NB: NUM_RELEXTLOCK_PARTITIONS must be a power of 2!
+ */
+#define RelExtLockHashPartition(hashcode) \
+ ((hashcode) % NUM_RELEXTLOCK_PARTITIONS)
+#define RelExtLockHashPartitionLock(hashcode) \
+ (&MainLWLockArray[RELEXTLOCK_MANAGER_LWLOCK_OFFSET + \
+ RelExtLockHashPartition(hashcode)].lock)
+#define RelExtLockHashPartitionLockByIndex(i) \
+ (&MainLWLockArray[RELEXTLOCK_MANAGER_LWLOCK_OFFSET + (i)].lock)
+
+#define RELEXT_VAL_EXCLUSIVE ((uint32) 1 << 24)
+#define RELEXT_VAL_SHARED 1
+
+#define RELEXT_LOCKMASK ((uint32) ((1 << 25) - 1))
+
+/* Maximum number of relation extension locks a backend can hold at once */
+#define MAX_SIMUL_EXTLOCKS 8
+
+/*
+ * This structure holds per-backend information about a held relation
+ * extension lock; held_relextlocks is the array of locks this backend
+ * currently holds.
+ */
+typedef struct relextlock_handle
+{
+ Oid relid;
+ RelExtLock *lock;
+ RelExtLockMode mode; /* lock mode for this table entry */
+} relextlock_handle;
+static relextlock_handle held_relextlocks[MAX_SIMUL_EXTLOCKS];
+static int num_held_relextlocks = 0;
+
+static bool RelExtLockAcquire(Oid relid, RelExtLockMode lockmode, bool conditional);
+static void RelExtLockRelease(Oid relid, RelExtLockMode lockmode);
+static bool RelExtLockAttemptLock(RelExtLock *ext_lock, RelExtLockMode lockmode);
+
+/*
+ * Pointers to hash tables containing lock state
+ *
+ * The RelExtLockHash hash table is in shared memory
+ */
+static HTAB *RelExtLockHash;
+
+/*
+ * InitRelExtLock
+ * Initialize the relation extension lock manager's data structures.
+ */
+void
+InitRelExtLock(long max_table_size)
+{
+ HASHCTL info;
+ long init_table_size;
+
+ /*
+ * Compute init/max size to request for lock hashtables. Note these
+ * calculations must agree with LockShmemSize!
+ */
+ init_table_size = max_table_size / 2;
+
+ /*
+ * Allocate hash table for RelExtLock structs. This stores one lock
+ * entry per relation.
+ */
+ MemSet(&info, 0, sizeof(info));
+ info.keysize = sizeof(Oid);
+ info.entrysize = sizeof(RelExtLock);
+ info.num_partitions = NUM_RELEXTLOCK_PARTITIONS;
+
+ RelExtLockHash = ShmemInitHash("RelExtLock Hash",
+ init_table_size,
+ max_table_size,
+ &info,
+ HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
+}
+
+/*
+ * LockRelationForExtension
+ *
+ * This lock is used to interlock addition of pages to relations.
+ * We need such locking because bufmgr/smgr definition of P_NEW is not
+ * race-condition-proof.
+ *
+ * We assume the caller is already holding some type of regular lock on
+ * the relation, so no AcceptInvalidationMessages call is needed here.
+ */
+void
+LockRelationForExtension(Relation relation, RelExtLockMode lockmode)
+{
+ RelExtLockAcquire(relation->rd_id, lockmode, false);
+}
+
+/*
+ * ConditionalLockRelationForExtension
+ *
+ * As above, but only lock if we can get the lock without blocking.
+ * Returns TRUE iff the lock was acquired.
+ */
+bool
+ConditionalLockRelationForExtension(Relation relation, RelExtLockMode lockmode)
+{
+ return RelExtLockAcquire(relation->rd_id, lockmode, true);
+}
+
+/*
+ * RelationExtensionLockWaiterCount
+ *
+ * Count the number of processes waiting for the given relation extension lock.
+ * Note that this routine doesn't acquire the partition lock; the caller
+ * must either hold the partition lock in exclusive mode, or call this
+ * routine only while holding the relation extension lock on this relation.
+ */
+int
+RelationExtensionLockWaiterCount(Relation relation)
+{
+ RelExtLock *ext_lock;
+ Oid relid;
+ uint32 nwaiters;
+ uint32 hashcode;
+ bool found;
+
+ relid = relation->rd_id;
+ hashcode = RelExtLockTargetTagHashCode(&relid);
+
+ ext_lock = (RelExtLock *) hash_search_with_hash_value(RelExtLockHash,
+ (void *) &relid,
+ hashcode,
+ HASH_FIND, &found);
+ /* We assume that we already hold this lock */
+ Assert(found);
+
+ nwaiters = pg_atomic_read_u32(&(ext_lock->nwaiters));
+
+ return nwaiters;
+}
+
+/*
+ * UnlockRelationForExtension
+ */
+void
+UnlockRelationForExtension(Relation relation, RelExtLockMode lockmode)
+{
+ RelExtLockRelease(relation->rd_id, lockmode);
+}
+
+/*
+ * Acquire a relation extension lock, creating a RelExtLock entry in the
+ * shared hash table if necessary. The partition lock protecting the entry
+ * is acquired here and, on success, kept held; it is released later by
+ * RelExtLockRelease.
+ */
+static bool
+RelExtLockAcquire(Oid relid, RelExtLockMode lockmode, bool conditional)
+{
+ RelExtLock *ext_lock;
+ LWLock *partitionLock;
+ uint32 hashcode;
+ bool found;
+ bool got_lock = false;
+ bool waited = false;
+
+ hashcode = RelExtLockTargetTagHashCode(&relid);
+ partitionLock = RelExtLockHashPartitionLock(hashcode);
+ LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+
+ ext_lock = (RelExtLock *) hash_search_with_hash_value(RelExtLockHash,
+ (void * ) &relid,
+ hashcode, HASH_ENTER, &found);
+
+ if (!ext_lock)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of shared memory"),
+ errhint("You might need to increase max_locks_per_transaction.")));
+
+ for (;;)
+ {
+ bool ret;
+
+ ret = RelExtLockAttemptLock(ext_lock, lockmode);
+
+ if (ret)
+ {
+ got_lock = true;
+
+ if (waited)
+ pg_atomic_sub_fetch_u32(&(ext_lock->nwaiters), 1);
+
+ break; /* got the lock */
+ }
+
+ /* Could not get the lock; give up immediately if conditional */
+ if (conditional)
+ break;
+
+ /* Add ourselves to the wait list, but only once */
+ if (!waited)
+ {
+ pg_atomic_add_fetch_u32(&(ext_lock->nwaiters), 1);
+ waited = true;
+ }
+ ConditionVariableSleep(&(ext_lock->cv), WAIT_EVENT_RELATION_EXTENSION);
+ }
+
+ ConditionVariableCancelSleep();
+
+ if (got_lock)
+ {
+ /* Add lock to list relation extension locks held by this backend */
+ held_relextlocks[num_held_relextlocks].relid = relid;
+ held_relextlocks[num_held_relextlocks].lock = ext_lock;
+ held_relextlocks[num_held_relextlocks].mode = lockmode;
+ num_held_relextlocks++;
+ }
+ else
+ LWLockRelease(partitionLock);
+
+ /* Unless this was a conditional request, we always return with the lock held */
+ return got_lock;
+}
+
+/*
+ * RelationExtensionLockReleaseAll - release all currently-held relation extension locks
+ */
+void
+RelationExtensionLockReleaseAll(void)
+{
+ while (num_held_relextlocks > 0)
+ {
+ HOLD_INTERRUPTS();
+
+ RelExtLockRelease(held_relextlocks[num_held_relextlocks - 1].relid,
+ held_relextlocks[num_held_relextlocks - 1].mode);
+ }
+}
+
+/*
+ * RelExtLockRelease
+ *
+ * Release the lock and remove the RelExtLock entry from the shared
+ * RelExtLockHash hash table. Since other backends might be acquiring or
+ * waiting for this lock, we can delete the entry only when no backend is
+ * still interested in it.
+ *
+ * Note that the partition lock for the hash table was acquired when the
+ * lock was taken; this routine releases it after releasing the extension
+ * lock.
+ */
+static void
+RelExtLockRelease(Oid relid, RelExtLockMode lockmode)
+{
+ RelExtLock *ext_lock;
+ RelExtLockMode mode;
+ uint32 hashcode;
+ LWLock *partitionLock;
+ uint32 oldstate;
+ uint32 nwaiters;
+ int i;
+
+ hashcode = RelExtLockTargetTagHashCode(&relid);
+ partitionLock = RelExtLockHashPartitionLock(hashcode);
+
+ for (i = num_held_relextlocks; --i >= 0;)
+ if (relid == held_relextlocks[i].relid &&
+ lockmode == held_relextlocks[i].mode)
+ break;
+
+ if (i < 0)
+ elog(ERROR, "relation extension lock for %u with lock mode %d is not held",
+ relid, lockmode);
+
+ ext_lock = held_relextlocks[i].lock;
+ mode = held_relextlocks[i].mode;
+
+ num_held_relextlocks--;
+
+ /* Shrink */
+ for (; i < num_held_relextlocks; i++)
+ held_relextlocks[i] = held_relextlocks[i + 1];
+
+ if (mode == RELEXT_EXCLUSIVE)
+ oldstate = pg_atomic_sub_fetch_u32(&(ext_lock->state), RELEXT_VAL_EXCLUSIVE);
+ else
+ oldstate = pg_atomic_sub_fetch_u32(&(ext_lock->state), RELEXT_VAL_SHARED);
+
+ nwaiters = pg_atomic_read_u32(&(ext_lock->nwaiters));
+
+ /* Wake up waiters, if any; otherwise remove the now-unused entry */
+ if (nwaiters > 0)
+ ConditionVariableBroadcast(&(ext_lock->cv));
+ else
+ hash_search_with_hash_value(RelExtLockHash, (void *) &relid,
+ hashcode, HASH_REMOVE, NULL);
+
+ LWLockRelease(partitionLock);
+}
+
+/*
+ * Internal function that tries to atomically acquire the relation extension
+ * lock in the passed-in mode. Returns true if we got the lock.
+ */
+static bool
+RelExtLockAttemptLock(RelExtLock *ext_lock, RelExtLockMode lockmode)
+{
+ uint32 oldstate;
+
+ oldstate = pg_atomic_read_u32(&ext_lock->state);
+
+ while (true)
+ {
+ uint32 desired_state;
+ bool lock_free;
+
+ desired_state = oldstate;
+
+ if (lockmode == RELEXT_EXCLUSIVE)
+ {
+ lock_free = (oldstate & RELEXT_LOCKMASK) == 0;
+ if (lock_free)
+ desired_state += RELEXT_VAL_EXCLUSIVE;
+ }
+ else
+ {
+ lock_free = (oldstate & RELEXT_VAL_EXCLUSIVE) == 0;
+ if (lock_free)
+ desired_state += RELEXT_VAL_SHARED;
+ }
+
+ if (pg_atomic_compare_exchange_u32(&ext_lock->state,
+ &oldstate, desired_state))
+ {
+ if (lock_free)
+ return true;
+ else
+ return false;
+ }
+ }
+ pg_unreachable();
+}
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index fe98898..34095cb 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -319,78 +319,6 @@ UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode)
}
/*
- * LockRelationForExtension
- *
- * This lock tag is used to interlock addition of pages to relations.
- * We need such locking because bufmgr/smgr definition of P_NEW is not
- * race-condition-proof.
- *
- * We assume the caller is already holding some type of regular lock on
- * the relation, so no AcceptInvalidationMessages call is needed here.
- */
-void
-LockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- (void) LockAcquire(&tag, lockmode, false, false);
-}
-
-/*
- * ConditionalLockRelationForExtension
- *
- * As above, but only lock if we can get the lock without blocking.
- * Returns TRUE iff the lock was acquired.
- */
-bool
-ConditionalLockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- return (LockAcquire(&tag, lockmode, false, true) != LOCKACQUIRE_NOT_AVAIL);
-}
-
-/*
- * RelationExtensionLockWaiterCount
- *
- * Count the number of processes waiting for the given relation extension lock.
- */
-int
-RelationExtensionLockWaiterCount(Relation relation)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- return LockWaiterCount(&tag);
-}
-
-/*
- * UnlockRelationForExtension
- */
-void
-UnlockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- LockRelease(&tag, lockmode, false);
-}
-
-/*
* LockPage
*
* Obtain a page-level lock. This is currently used by some index access
@@ -961,12 +889,6 @@ DescribeLockTag(StringInfo buf, const LOCKTAG *tag)
tag->locktag_field2,
tag->locktag_field1);
break;
- case LOCKTAG_RELATION_EXTEND:
- appendStringInfo(buf,
- _("extension of relation %u of database %u"),
- tag->locktag_field2,
- tag->locktag_field1);
- break;
case LOCKTAG_PAGE:
appendStringInfo(buf,
_("page %u of relation %u of database %u"),
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 2b26173..bc576a7 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -45,6 +45,7 @@
#include "storage/sinvaladt.h"
#include "storage/spin.h"
#include "storage/standby.h"
+#include "storage/lmgr.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
#include "utils/resowner_private.h"
@@ -388,6 +389,10 @@ InitLocks(void)
max_table_size = NLOCKENTS();
init_table_size = max_table_size / 2;
+
+ /* Initialize lock structure for relation extension lock */
+ InitRelExtLock(max_table_size);
+
/*
* Allocate hash table for LOCK structs. This stores per-locked-object
* information.
@@ -3366,6 +3371,7 @@ LockShmemSize(void)
/* lock hash table */
max_table_size = NLOCKENTS();
size = add_size(size, hash_estimate_size(max_table_size, sizeof(LOCK)));
+ /* relation extension lock hash table */
+ size = add_size(size, hash_estimate_size(max_table_size, sizeof(RelExtLock)));
/* proclock hash table */
max_table_size *= 2;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 82a1cf5..3d465a5 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -451,6 +451,13 @@ InitializeLWLocks(void)
for (id = 0; id < NUM_PREDICATELOCK_PARTITIONS; id++, lock++)
LWLockInitialize(&lock->lock, LWTRANCHE_PREDICATE_LOCK_MANAGER);
+ /* Initialize relation extension lmgr's LWLocks in main array */
+ lock = MainLWLockArray + NUM_INDIVIDUAL_LWLOCKS +
+ NUM_BUFFER_PARTITIONS + NUM_LOCK_PARTITIONS +
+ NUM_PREDICATELOCK_PARTITIONS;
+ for (id = 0; id < NUM_RELEXTLOCK_PARTITIONS; id++, lock++)
+ LWLockInitialize(&lock->lock, LWTRANCHE_RELEXT_LOCK_MANAGER);
+
/* Initialize named tranches. */
if (NamedLWLockTrancheRequests > 0)
{
@@ -494,7 +501,7 @@ RegisterLWLockTranches(void)
if (LWLockTrancheArray == NULL)
{
- LWLockTranchesAllocated = 64;
+ LWLockTranchesAllocated = 128;
LWLockTrancheArray = (char **)
MemoryContextAllocZero(TopMemoryContext,
LWLockTranchesAllocated * sizeof(char *));
@@ -508,6 +515,7 @@ RegisterLWLockTranches(void)
LWLockRegisterTranche(LWTRANCHE_LOCK_MANAGER, "lock_manager");
LWLockRegisterTranche(LWTRANCHE_PREDICATE_LOCK_MANAGER,
"predicate_lock_manager");
+ LWLockRegisterTranche(LWTRANCHE_RELEXT_LOCK_MANAGER, "relext_lock_manager");
LWLockRegisterTranche(LWTRANCHE_PARALLEL_QUERY_DSA,
"parallel_query_dsa");
LWLockRegisterTranche(LWTRANCHE_TBM, "tbm");
diff --git a/src/backend/utils/adt/lockfuncs.c b/src/backend/utils/adt/lockfuncs.c
index 9e0a8ab..6d8916c 100644
--- a/src/backend/utils/adt/lockfuncs.c
+++ b/src/backend/utils/adt/lockfuncs.c
@@ -25,7 +25,6 @@
/* This must match enum LockTagType! */
const char *const LockTagTypeNames[] = {
"relation",
- "extend",
"page",
"tuple",
"transactionid",
@@ -234,7 +233,6 @@ pg_lock_status(PG_FUNCTION_ARGS)
switch ((LockTagType) instance->locktag.locktag_type)
{
case LOCKTAG_RELATION:
- case LOCKTAG_RELATION_EXTEND:
values[1] = ObjectIdGetDatum(instance->locktag.locktag_field1);
values[2] = ObjectIdGetDatum(instance->locktag.locktag_field2);
nulls[3] = true;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 57ac5d4..b0de147 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -816,7 +816,8 @@ typedef enum
WAIT_EVENT_REPLICATION_ORIGIN_DROP,
WAIT_EVENT_REPLICATION_SLOT_DROP,
WAIT_EVENT_SAFE_SNAPSHOT,
- WAIT_EVENT_SYNC_REP
+ WAIT_EVENT_SYNC_REP,
+ WAIT_EVENT_RELATION_EXTENSION
} WaitEventIPC;
/* ----------
diff --git a/src/include/storage/extension_lock.h b/src/include/storage/extension_lock.h
new file mode 100644
index 0000000..f178672
--- /dev/null
+++ b/src/include/storage/extension_lock.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * extension_lock.h
+ * Relation extension lock manager
+ *
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/extension_lock.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef EXTENSION_LOCK_H
+#define EXTENSION_LOCK_H
+
+#ifdef FRONTEND
+#error "extension_lock.h may not be included from frontend code"
+#endif
+
+#include "storage/proclist_types.h"
+#include "storage/s_lock.h"
+#include "storage/condition_variable.h"
+#include "port/atomics.h"
+
+typedef struct RelExtLock
+{
+ Oid relid;
+ pg_atomic_uint32 state;
+ pg_atomic_uint32 nwaiters;
+ ConditionVariable cv;
+} RelExtLock;
+
+typedef enum RelExtLockMode
+{
+ RELEXT_EXCLUSIVE,
+ RELEXT_SHARED
+} RelExtLockMode;
+
+/* Lock a relation for extension */
+extern void InitRelExtLock(long max_table_size);
+extern void LockRelationForExtension(Relation relation, RelExtLockMode lockmode);
+extern void UnlockRelationForExtension(Relation relation, RelExtLockMode lockmode);
+extern bool ConditionalLockRelationForExtension(Relation relation, RelExtLockMode lockmode);
+extern int RelationExtensionLockWaiterCount(Relation relation);
+extern void RelationExtensionLockReleaseAll(void);
+
+#endif /* EXTENSION_LOCK_H */
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 0b92322..ac23354 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -15,6 +15,7 @@
#define LMGR_H
#include "lib/stringinfo.h"
+#include "storage/extension_lock.h"
#include "storage/itemptr.h"
#include "storage/lock.h"
#include "utils/rel.h"
@@ -34,6 +35,36 @@ typedef enum XLTW_Oper
XLTW_RecheckExclusionConstr
} XLTW_Oper;
+typedef struct RELEXTLOCKTAG
+{
+ Oid relid; /* identifies the lockable object */
+ LWLockMode mode; /* lock mode for this table entry */
+} RELEXTLOCKTAG;
+
+/*
+ * This structure holds information per-object relation extension
+ * lock.
+ */
+typedef struct RELEXTLOCK
+{
+ RELEXTLOCKTAG tag; /* hash key -- must be first */
+ LWLock lock; /* LWLock for relation extension */
+} RELEXTLOCK;
+
+/*
+ * The LOCALRELEXTLOCK struct represents a local copy of data which is
+ * also present in the RELEXTLOCK table, organized for fast access without
+ * needing to acquire a LWLock. It is strictly for optimization.
+ */
+typedef struct LOCALRELEXTLOCK
+{
+ /* hash key */
+ RELEXTLOCKTAG relid; /* unique identifier of locktable object */
+
+ /* data */
+ bool held; /* is lock held? */
+} LOCALRELEXTLOCK;
+
extern void RelationInitLockInfo(Relation relation);
/* Lock a relation */
@@ -50,13 +81,6 @@ extern bool LockHasWaitersRelation(Relation relation, LOCKMODE lockmode);
extern void LockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
-/* Lock a relation for extension */
-extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
-extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
-extern bool ConditionalLockRelationForExtension(Relation relation,
- LOCKMODE lockmode);
-extern int RelationExtensionLockWaiterCount(Relation relation);
-
/* Lock a page (currently only used within indexes) */
extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
extern bool ConditionalLockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 765431e..3be18ea 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -138,8 +138,6 @@ typedef uint16 LOCKMETHODID;
typedef enum LockTagType
{
LOCKTAG_RELATION, /* whole relation */
- /* ID info for a relation is DB OID + REL OID; DB OID = 0 if shared */
- LOCKTAG_RELATION_EXTEND, /* the right to extend a relation */
/* same ID info as RELATION */
LOCKTAG_PAGE, /* one page of a relation */
/* ID info for a page is RELATION info + BlockNumber */
@@ -198,14 +196,6 @@ typedef struct LOCKTAG
(locktag).locktag_type = LOCKTAG_RELATION, \
(locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
-#define SET_LOCKTAG_RELATION_EXTEND(locktag,dboid,reloid) \
- ((locktag).locktag_field1 = (dboid), \
- (locktag).locktag_field2 = (reloid), \
- (locktag).locktag_field3 = 0, \
- (locktag).locktag_field4 = 0, \
- (locktag).locktag_type = LOCKTAG_RELATION_EXTEND, \
- (locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
-
#define SET_LOCKTAG_PAGE(locktag,dboid,reloid,blocknum) \
((locktag).locktag_field1 = (dboid), \
(locktag).locktag_field2 = (reloid), \
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 3d16132..c0e6242 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -120,14 +120,21 @@ extern PGDLLIMPORT int NamedLWLockTrancheRequests;
#define LOG2_NUM_PREDICATELOCK_PARTITIONS 4
#define NUM_PREDICATELOCK_PARTITIONS (1 << LOG2_NUM_PREDICATELOCK_PARTITIONS)
+/* Number of partitions the shared relation extension lock tables are divided into */
+#define LOG2_NUM_RELEXTLOCK_PARTITIONS 4
+#define NUM_RELEXTLOCK_PARTITIONS (1 << LOG2_NUM_RELEXTLOCK_PARTITIONS)
+
/* Offsets for various chunks of preallocated lwlocks. */
#define BUFFER_MAPPING_LWLOCK_OFFSET NUM_INDIVIDUAL_LWLOCKS
#define LOCK_MANAGER_LWLOCK_OFFSET \
(BUFFER_MAPPING_LWLOCK_OFFSET + NUM_BUFFER_PARTITIONS)
#define PREDICATELOCK_MANAGER_LWLOCK_OFFSET \
(LOCK_MANAGER_LWLOCK_OFFSET + NUM_LOCK_PARTITIONS)
-#define NUM_FIXED_LWLOCKS \
+#define RELEXTLOCK_MANAGER_LWLOCK_OFFSET \
(PREDICATELOCK_MANAGER_LWLOCK_OFFSET + NUM_PREDICATELOCK_PARTITIONS)
+#define NUM_FIXED_LWLOCKS \
+ (PREDICATELOCK_MANAGER_LWLOCK_OFFSET + NUM_PREDICATELOCK_PARTITIONS + \
+ NUM_RELEXTLOCK_PARTITIONS)
typedef enum LWLockMode
{
@@ -151,6 +158,8 @@ extern void LWLockReleaseClearVar(LWLock *lock, uint64 *valptr, uint64 val);
extern void LWLockReleaseAll(void);
extern bool LWLockHeldByMe(LWLock *lock);
extern bool LWLockHeldByMeInMode(LWLock *lock, LWLockMode mode);
+extern bool LWLockCheckForCleanup(LWLock *lock);
+extern int LWLockWaiterCount(LWLock *lock);
extern bool LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval);
extern void LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 value);
@@ -211,6 +220,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_BUFFER_MAPPING,
LWTRANCHE_LOCK_MANAGER,
LWTRANCHE_PREDICATE_LOCK_MANAGER,
+ LWTRANCHE_RELEXT_LOCK_MANAGER,
LWTRANCHE_PARALLEL_QUERY_DSA,
LWTRANCHE_TBM,
LWTRANCHE_FIRST_USER_DEFINED
On Fri, Sep 8, 2017 at 4:32 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Sep 8, 2017 at 7:24 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Wed, Aug 16, 2017 at 2:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
The previous patch conflicts with current HEAD, I rebased the patch to
current HEAD.
Hi Masahiko-san,
FYI this doesn't build anymore. I think it's just because the wait
event enumerators were re-alphabetised in pgstat.h:
../../../../src/include/pgstat.h:820:2: error: redeclaration of
enumerator ‘WAIT_EVENT_LOGICAL_SYNC_DATA’
WAIT_EVENT_LOGICAL_SYNC_DATA,
^
../../../../src/include/pgstat.h:806:2: note: previous definition of
‘WAIT_EVENT_LOGICAL_SYNC_DATA’ was here
WAIT_EVENT_LOGICAL_SYNC_DATA,
^
../../../../src/include/pgstat.h:821:2: error: redeclaration of
enumerator ‘WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE’
WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
^
../../../../src/include/pgstat.h:807:2: note: previous definition of
‘WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE’ was here
WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
^
Thank you for the information! Attached rebased patch.
Since the previous patch conflicts with current HEAD, I attached the
updated patch for next CF.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachments:
Moving_extension_lock_out_of_heavyweight_lock_v5.patch (text/x-patch; charset=US-ASCII)
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 80f803e..b928c1a 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -609,8 +609,8 @@ brin_page_cleanup(Relation idxrel, Buffer buf)
*/
if (PageIsNew(page))
{
- LockRelationForExtension(idxrel, ShareLock);
- UnlockRelationForExtension(idxrel, ShareLock);
+ LockRelationForExtension(idxrel, RELEXT_SHARED);
+ UnlockRelationForExtension(idxrel, RELEXT_SHARED);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
if (PageIsNew(page))
@@ -702,7 +702,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
*/
if (!RELATION_IS_LOCAL(irel))
{
- LockRelationForExtension(irel, ExclusiveLock);
+ LockRelationForExtension(irel, RELEXT_EXCLUSIVE);
extensionLockHeld = true;
}
buf = ReadBuffer(irel, P_NEW);
@@ -754,7 +754,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
}
if (extensionLockHeld)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel, RELEXT_EXCLUSIVE);
ReleaseBuffer(buf);
return InvalidBuffer;
@@ -764,7 +764,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
if (extensionLockHeld)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel, RELEXT_EXCLUSIVE);
page = BufferGetPage(buf);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index 22f2076..4c15b45 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -570,7 +570,7 @@ revmap_physical_extend(BrinRevmap *revmap)
else
{
if (needLock)
- LockRelationForExtension(irel, ExclusiveLock);
+ LockRelationForExtension(irel, RELEXT_EXCLUSIVE);
buf = ReadBuffer(irel, P_NEW);
if (BufferGetBlockNumber(buf) != mapBlk)
@@ -582,7 +582,7 @@ revmap_physical_extend(BrinRevmap *revmap)
* page from under whoever is using it.
*/
if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel, RELEXT_EXCLUSIVE);
LockBuffer(revmap->rm_metaBuf, BUFFER_LOCK_UNLOCK);
ReleaseBuffer(buf);
return;
@@ -591,7 +591,7 @@ revmap_physical_extend(BrinRevmap *revmap)
page = BufferGetPage(buf);
if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel, RELEXT_EXCLUSIVE);
}
/* Check that it's a regular block (or an empty page) */
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 136ea27..1690d21 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -325,13 +325,13 @@ GinNewBuffer(Relation index)
/* Must extend the file */
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, RELEXT_EXCLUSIVE);
buffer = ReadBuffer(index, P_NEW);
LockBuffer(buffer, GIN_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index, RELEXT_EXCLUSIVE);
return buffer;
}
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 31425e9..e9f84bc 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -716,10 +716,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, RELEXT_EXCLUSIVE);
npages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index, RELEXT_EXCLUSIVE);
totFreePages = 0;
@@ -766,10 +766,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
stats->pages_free = totFreePages;
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, RELEXT_EXCLUSIVE);
stats->num_pages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index, RELEXT_EXCLUSIVE);
return stats;
}
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 26d89f7..cd351d8 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -821,13 +821,13 @@ gistNewBuffer(Relation r)
needLock = !RELATION_IS_LOCAL(r);
if (needLock)
- LockRelationForExtension(r, ExclusiveLock);
+ LockRelationForExtension(r, RELEXT_EXCLUSIVE);
buffer = ReadBuffer(r, P_NEW);
LockBuffer(buffer, GIST_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(r, ExclusiveLock);
+ UnlockRelationForExtension(r, RELEXT_EXCLUSIVE);
return buffer;
}
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 77d9d12..ca45b06 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -59,10 +59,10 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
/* try to find deleted pages */
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
npages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, RELEXT_EXCLUSIVE);
totFreePages = 0;
for (blkno = GIST_ROOT_BLKNO + 1; blkno < npages; blkno++)
@@ -91,10 +91,10 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
/* return statistics */
stats->pages_free = totFreePages;
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
stats->num_pages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, RELEXT_EXCLUSIVE);
return stats;
}
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 13e3bdc..a8ce6c7 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -519,11 +519,11 @@ loop:
if (needLock)
{
if (!use_fsm)
- LockRelationForExtension(relation, ExclusiveLock);
- else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock))
+ LockRelationForExtension(relation, RELEXT_EXCLUSIVE);
+ else if (!ConditionalLockRelationForExtension(relation, RELEXT_EXCLUSIVE))
{
/* Couldn't get the lock immediately; wait for it. */
- LockRelationForExtension(relation, ExclusiveLock);
+ LockRelationForExtension(relation, RELEXT_EXCLUSIVE);
/*
* Check if some other backend has extended a block for us while
@@ -537,7 +537,7 @@ loop:
*/
if (targetBlock != InvalidBlockNumber)
{
- UnlockRelationForExtension(relation, ExclusiveLock);
+ UnlockRelationForExtension(relation, RELEXT_EXCLUSIVE);
goto loop;
}
@@ -576,7 +576,7 @@ loop:
* against vacuumlazy.c --- see comments therein.
*/
if (needLock)
- UnlockRelationForExtension(relation, ExclusiveLock);
+ UnlockRelationForExtension(relation, RELEXT_EXCLUSIVE);
/*
* We need to initialize the empty new page. Double-check that it really
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 4c2a13a..7dc3088 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -641,7 +641,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
* Note that another backend might have extended or created the relation
* by the time we get the lock.
*/
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
@@ -679,7 +679,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
/* Update local cache with the up-to-date size */
rel->rd_smgr->smgr_vm_nblocks = vm_nblocks_now;
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, RELEXT_EXCLUSIVE);
pfree(pg);
}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 10697e9..e1407ac 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -658,7 +658,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
needLock = !RELATION_IS_LOCAL(rel);
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
buf = ReadBuffer(rel, P_NEW);
@@ -672,7 +672,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
* condition against btvacuumscan --- see comments therein.
*/
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, RELEXT_EXCLUSIVE);
/* Initialize the new page before returning it */
page = BufferGetPage(buf);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 3dbafdd..394a660 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1058,10 +1058,10 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
{
/* Get the current relation length */
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
num_pages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, RELEXT_EXCLUSIVE);
/* Quit if we've scanned the whole relation */
if (blkno >= num_pages)
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 22f64b0..12be667 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -230,13 +230,13 @@ SpGistNewBuffer(Relation index)
/* Must extend the file */
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, RELEXT_EXCLUSIVE);
buffer = ReadBuffer(index, P_NEW);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index, RELEXT_EXCLUSIVE);
return buffer;
}
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index d7d5e90..3888d93 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -824,10 +824,10 @@ spgvacuumscan(spgBulkDeleteState *bds)
{
/* Get the current relation length */
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, RELEXT_EXCLUSIVE);
num_pages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index, RELEXT_EXCLUSIVE);
/* Quit if we've scanned the whole relation */
if (blkno >= num_pages)
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 30b1c08..443e230 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -849,8 +849,8 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* it's got exclusive lock on the whole relation.
*/
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
- LockRelationForExtension(onerel, ExclusiveLock);
- UnlockRelationForExtension(onerel, ExclusiveLock);
+ LockRelationForExtension(onerel, RELEXT_EXCLUSIVE);
+ UnlockRelationForExtension(onerel, RELEXT_EXCLUSIVE);
LockBufferForCleanup(buf);
if (PageIsNew(page))
{
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3a0b49c..64e26df 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3628,6 +3628,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_SYNC_REP:
event_name = "SyncRep";
break;
+ case WAIT_EVENT_RELATION_EXTENSION:
+ event_name = "RelationExtension";
+ break;
/* no default case, so that compiler will warn */
}
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 4648473..498223a 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -624,7 +624,7 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
* Note that another backend might have extended or created the relation
* by the time we get the lock.
*/
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
@@ -652,7 +652,7 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
/* Update local cache with the up-to-date size */
rel->rd_smgr->smgr_fsm_nblocks = fsm_nblocks_now;
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, RELEXT_EXCLUSIVE);
pfree(pg);
}
diff --git a/src/backend/storage/lmgr/Makefile b/src/backend/storage/lmgr/Makefile
index e1b787e..2334a40 100644
--- a/src/backend/storage/lmgr/Makefile
+++ b/src/backend/storage/lmgr/Makefile
@@ -13,7 +13,7 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = lmgr.o lock.o proc.o deadlock.o lwlock.o lwlocknames.o spin.o \
- s_lock.o predicate.o condition_variable.o
+ s_lock.o predicate.o condition_variable.o extension_lock.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/lmgr/extension_lock.c b/src/backend/storage/lmgr/extension_lock.c
new file mode 100644
index 0000000..e8bbd5a
--- /dev/null
+++ b/src/backend/storage/lmgr/extension_lock.c
@@ -0,0 +1,380 @@
+/*-------------------------------------------------------------------------
+ *
+ * extension_lock.c
+ * Relation extension lock manager
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/lmgr/extension_lock.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "pg_trace.h"
+#include "postmaster/postmaster.h"
+#include "replication/slot.h"
+#include "storage/ipc.h"
+#include "storage/proc.h"
+#include "storage/proclist.h"
+#include "storage/spin.h"
+#include "storage/extension_lock.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+
+#ifdef LWLOCK_STATS
+#include "utils/hsearch.h"
+#endif
+
+/*
+ * Compute the hash code associated with a RELEXTLOCK.
+ *
+ * To avoid unnecessary recomputations of the hash code, we try to do this
+ * just once per function, and then pass it around as needed. Aside from
+ * passing the hashcode to hash_search_with_hash_value(), we can extract
+ * the lock partition number from the hashcode.
+ */
+#define RelExtLockTargetTagHashCode(relextlocktargettag) \
+ get_hash_value(RelExtLockHash, (const void *) relextlocktargettag)
+
+/*
+ * The lock manager's shared hash tables are partitioned to reduce contention.
+ * To determine which partition a given relid belongs to, compute the tag's
+ * hash code with RelExtLockTargetTagHashCode(), then apply one of these macros.
+ * NB: NUM_RELEXTLOCK_PARTITIONS must be a power of 2!
+ */
+#define RelExtLockHashPartition(hashcode) \
+ ((hashcode) % NUM_RELEXTLOCK_PARTITIONS)
+#define RelExtLockHashPartitionLock(hashcode) \
+ (&MainLWLockArray[RELEXTLOCK_MANAGER_LWLOCK_OFFSET + \
+ RelExtLockHashPartition(hashcode)].lock)
+#define RelExtLockHashPartitionLockByIndex(i) \
+ (&MainLWLockArray[RELEXTLOCK_MANAGER_LWLOCK_OFFSET + (i)].lock)
+
+#define RELEXT_VAL_EXCLUSIVE ((uint32) 1 << 24)
+#define RELEXT_VAL_SHARED 1
+
+#define RELEXT_LOCKMASK ((uint32) ((1 << 25) - 1))
+
+/* Maximum number of relation extension locks held at once by one backend */
+#define MAX_SIMUL_EXTLOCKS 8
+
+/*
+ * A relextlock_handle records one relation extension lock currently held
+ * by this backend; held_relextlocks[] is the list of locks we hold.
+ */
+typedef struct relextlock_handle
+{
+ Oid relid;
+ RelExtLock *lock;
+ RelExtLockMode mode; /* lock mode for this table entry */
+} relextlock_handle;
+static relextlock_handle held_relextlocks[MAX_SIMUL_EXTLOCKS];
+static int num_held_relextlocks = 0;
+
+static bool RelExtLockAcquire(Oid relid, RelExtLockMode lockmode, bool conditional);
+static void RelExtLockRelease(Oid relid, RelExtLockMode lockmode);
+static bool RelExtLockAttemptLock(RelExtLock *ext_lock, RelExtLockMode lockmode);
+
+/*
+ * Pointers to hash tables containing lock state
+ *
+ * The RelExtLockHash hash table is in shared memory
+ */
+static HTAB *RelExtLockHash;
+
+/*
+ * InitRelExtLock
+ * Initialize the relation extension lock manager's data structures.
+ */
+void
+InitRelExtLock(long max_table_size)
+{
+ HASHCTL info;
+ long init_table_size;
+
+ /*
+ * Compute init/max size to request for lock hashtables. Note these
+ * calculations must agree with LockShmemSize!
+ */
+ init_table_size = max_table_size / 2;
+
+ /*
+ * Allocate the hash table for RelExtLock structs; it stores one lock
+ * per relation being extended.
+ */
+ MemSet(&info, 0, sizeof(info));
+ info.keysize = sizeof(Oid);
+ info.entrysize = sizeof(RelExtLock);
+ info.num_partitions = NUM_RELEXTLOCK_PARTITIONS;
+
+ RelExtLockHash = ShmemInitHash("RelExtLock Hash",
+ init_table_size,
+ max_table_size,
+ &info,
+ HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
+}
+
+/*
+ * LockRelationForExtension
+ *
+ * This lock is used to interlock addition of pages to relations.
+ * We need such locking because bufmgr/smgr definition of P_NEW is not
+ * race-condition-proof.
+ *
+ * We assume the caller is already holding some type of regular lock on
+ * the relation, so no AcceptInvalidationMessages call is needed here.
+ */
+void
+LockRelationForExtension(Relation relation, RelExtLockMode lockmode)
+{
+ RelExtLockAcquire(relation->rd_id, lockmode, false);
+}
+
+/*
+ * ConditionalLockRelationForExtension
+ *
+ * As above, but only lock if we can get the lock without blocking.
+ * Returns TRUE iff the lock was acquired.
+ */
+bool
+ConditionalLockRelationForExtension(Relation relation, RelExtLockMode lockmode)
+{
+ return RelExtLockAcquire(relation->rd_id, lockmode, true);
+}
+
+/*
+ * RelationExtensionLockWaiterCount
+ *
+ * Count the number of processes waiting for the given relation extension lock.
+ * Note that this routine does not acquire the partition lock; the caller must
+ * either hold the partition lock in exclusive mode or already hold the
+ * relation extension lock for this relation.
+ */
+int
+RelationExtensionLockWaiterCount(Relation relation)
+{
+ RelExtLock *ext_lock;
+ Oid relid;
+ uint32 nwaiters;
+ uint32 hashcode;
+ bool found;
+
+ relid = relation->rd_id;
+ hashcode = RelExtLockTargetTagHashCode(&relid);
+
+ ext_lock = (RelExtLock *) hash_search_with_hash_value(RelExtLockHash,
+ (void *) &relid,
+ hashcode,
+ HASH_FIND, &found);
+ /* We assume that we already hold this lock */
+ Assert(found);
+
+ nwaiters = pg_atomic_read_u32(&(ext_lock->nwaiters));
+
+ return nwaiters;
+}
+
+/*
+ * UnlockRelationForExtension
+ */
+void
+UnlockRelationForExtension(Relation relation, RelExtLockMode lockmode)
+{
+ RelExtLockRelease(relation->rd_id, lockmode);
+}
+
+/*
+ * Acquire a relation extension lock, creating the RelExtLock hash entry in the
+ * shared hash table if it doesn't exist yet. To avoid deadlock between the
+ * partition lock and the extension lock, the partition lock acquired here is
+ * not released on success; it is released later by RelExtLockRelease.
+ */
+static bool
+RelExtLockAcquire(Oid relid, RelExtLockMode lockmode, bool conditional)
+{
+ RelExtLock *ext_lock;
+ LWLock *partitionLock;
+ uint32 hashcode;
+ bool found;
+ bool got_lock = false;
+ bool waited = false;
+
+ hashcode = RelExtLockTargetTagHashCode(&relid);
+ partitionLock = RelExtLockHashPartitionLock(hashcode);
+ LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+
+ ext_lock = (RelExtLock *) hash_search_with_hash_value(RelExtLockHash,
+ (void * ) &relid,
+ hashcode, HASH_ENTER, &found);
+
+ if (!ext_lock)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of shared memory"),
+ errhint("You might need to increase max_locks_per_transaction.")));
+
+ for (;;)
+ {
+ bool ret;
+
+ ret = RelExtLockAttemptLock(ext_lock, lockmode);
+
+ if (ret)
+ {
+ got_lock = true;
+
+ if (waited)
+ pg_atomic_sub_fetch_u32(&(ext_lock->nwaiters), 1);
+
+ break; /* got the lock */
+ }
+
+ /* Could not get the lock; if this is a conditional request, give up */
+ if (!ret && conditional)
+ break;
+
+ /* Register ourselves as a waiter (only once), then sleep */
+ if (!waited)
+ {
+ pg_atomic_add_fetch_u32(&(ext_lock->nwaiters), 1);
+ waited = true;
+ }
+ ConditionVariableSleep(&(ext_lock->cv), WAIT_EVENT_RELATION_EXTENSION);
+ }
+
+ ConditionVariableCancelSleep();
+
+ if (got_lock)
+ {
+ /* Add lock to list relation extension locks held by this backend */
+ held_relextlocks[num_held_relextlocks].relid = relid;
+ held_relextlocks[num_held_relextlocks].lock = ext_lock;
+ held_relextlocks[num_held_relextlocks].mode = lockmode;
+ num_held_relextlocks++;
+ }
+ else
+ LWLockRelease(partitionLock);
+
+ /* Unless this was a conditional request, we always return true here */
+ return got_lock;
+}
+
+/*
+ * RelationExtensionLockReleaseAll - release all currently-held relation extension locks
+ */
+void
+RelationExtensionLockReleaseAll(void)
+{
+ while (num_held_relextlocks > 0)
+ {
+ HOLD_INTERRUPTS();
+
+ RelExtLockRelease(held_relextlocks[num_held_relextlocks - 1].relid,
+ held_relextlocks[num_held_relextlocks - 1].mode);
+ }
+}
+
+/*
+ * RelExtLockRelease
+ *
+ * Remove the RelExtLock entry from the shared RelExtLockHash hash table. Since
+ * other backends might be acquiring or waiting for this lock, we can delete
+ * the entry only when no backend is interested in it any longer.
+ *
+ * Note that we assume the partition lock for the hash table was acquired when
+ * the lock itself was acquired; this routine releases the partition lock
+ * after releasing the extension lock.
+ */
+static void
+RelExtLockRelease(Oid relid, RelExtLockMode lockmode)
+{
+ RelExtLock *ext_lock;
+ RelExtLockMode mode;
+ uint32 hashcode;
+ LWLock *partitionLock;
+ uint32 oldstate;
+ uint32 nwaiters;
+ int i;
+
+ hashcode = RelExtLockTargetTagHashCode(&relid);
+ partitionLock = RelExtLockHashPartitionLock(hashcode);
+
+ for (i = num_held_relextlocks; --i >= 0;)
+ if (relid == held_relextlocks[i].relid &&
+ lockmode == held_relextlocks[i].mode)
+ break;
+
+ if (i < 0)
+ elog(ERROR, "relation extension lock for %u with lock mode %d is not held",
+ relid, lockmode);
+
+ ext_lock = held_relextlocks[i].lock;
+ mode = held_relextlocks[i].mode;
+
+ num_held_relextlocks--;
+
+ /* Shrink */
+ for (; i < num_held_relextlocks; i++)
+ held_relextlocks[i] = held_relextlocks[i + 1];
+
+ if (mode == RELEXT_EXCLUSIVE)
+ oldstate = pg_atomic_sub_fetch_u32(&(ext_lock->state), RELEXT_VAL_EXCLUSIVE);
+ else
+ oldstate = pg_atomic_sub_fetch_u32(&(ext_lock->state), RELEXT_VAL_SHARED);
+
+ nwaiters = pg_atomic_read_u32(&(ext_lock->nwaiters));
+
+ /* Wake up waiters if there are any; otherwise remove the unused entry */
+ if (nwaiters > 0)
+ ConditionVariableBroadcast(&(ext_lock->cv));
+ else
+ hash_search_with_hash_value(RelExtLockHash, (void *) &relid,
+ hashcode, HASH_REMOVE, NULL);
+
+ LWLockRelease(partitionLock);
+}
+
+/*
+ * Internal function that tries to atomically acquire the relation extension
+ * lock in the passed in mode. Return true if we got the lock.
+ */
+static bool
+RelExtLockAttemptLock(RelExtLock *ext_lock, RelExtLockMode lockmode)
+{
+ uint32 oldstate;
+
+ oldstate = pg_atomic_read_u32(&ext_lock->state);
+
+ while (true)
+ {
+ uint32 desired_state;
+ bool lock_free;
+
+ desired_state = oldstate;
+
+ if (lockmode == RELEXT_EXCLUSIVE)
+ {
+ lock_free = (oldstate & RELEXT_LOCKMASK) == 0;
+ if (lock_free)
+ desired_state += RELEXT_VAL_EXCLUSIVE;
+ }
+ else
+ {
+ lock_free = (oldstate & RELEXT_VAL_EXCLUSIVE) == 0;
+ if (lock_free)
+ desired_state += RELEXT_VAL_SHARED;
+ }
+
+ if (pg_atomic_compare_exchange_u32(&ext_lock->state,
+ &oldstate, desired_state))
+ {
+ if (lock_free)
+ return true;
+ else
+ return false;
+ }
+ }
+ pg_unreachable();
+}
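To make the lock-word encoding in RelExtLockAttemptLock easier to review: bit 24 is the exclusive bit and the low 24 bits count shared holders, so an exclusive request conflicts with any holder while a shared request conflicts only with an exclusive holder. A minimal standalone sketch of that compare-and-swap loop, using C11 atomics in place of the pg_atomic_* wrappers (all names here are illustrative, not taken from the patch):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Same encoding as the patch: bit 24 = exclusive, low 24 bits = shared count */
#define VAL_EXCLUSIVE ((uint32_t) 1 << 24)
#define VAL_SHARED    ((uint32_t) 1)
#define LOCKMASK      ((uint32_t) ((1 << 25) - 1))

typedef enum { MODE_SHARED, MODE_EXCLUSIVE } Mode;

/* Mirror of RelExtLockAttemptLock: one attempt to grab the lock word */
static bool
attempt_lock(_Atomic uint32_t *state, Mode mode)
{
	uint32_t oldstate = atomic_load(state);

	for (;;)
	{
		uint32_t desired = oldstate;
		bool     lock_free;

		if (mode == MODE_EXCLUSIVE)
		{
			/* exclusive conflicts with any holder at all */
			lock_free = (oldstate & LOCKMASK) == 0;
			if (lock_free)
				desired += VAL_EXCLUSIVE;
		}
		else
		{
			/* shared conflicts only with an exclusive holder */
			lock_free = (oldstate & VAL_EXCLUSIVE) == 0;
			if (lock_free)
				desired += VAL_SHARED;
		}

		/* on CAS failure, oldstate is refreshed and we retry */
		if (atomic_compare_exchange_weak(state, &oldstate, desired))
			return lock_free;
	}
}
```

When the lock is busy, the CAS installs `desired == oldstate` (a no-op) and returns false, which is what sends the caller into the condition-variable wait loop.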
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index fe98898..34095cb 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -319,78 +319,6 @@ UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode)
}
/*
- * LockRelationForExtension
- *
- * This lock tag is used to interlock addition of pages to relations.
- * We need such locking because bufmgr/smgr definition of P_NEW is not
- * race-condition-proof.
- *
- * We assume the caller is already holding some type of regular lock on
- * the relation, so no AcceptInvalidationMessages call is needed here.
- */
-void
-LockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- (void) LockAcquire(&tag, lockmode, false, false);
-}
-
-/*
- * ConditionalLockRelationForExtension
- *
- * As above, but only lock if we can get the lock without blocking.
- * Returns TRUE iff the lock was acquired.
- */
-bool
-ConditionalLockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- return (LockAcquire(&tag, lockmode, false, true) != LOCKACQUIRE_NOT_AVAIL);
-}
-
-/*
- * RelationExtensionLockWaiterCount
- *
- * Count the number of processes waiting for the given relation extension lock.
- */
-int
-RelationExtensionLockWaiterCount(Relation relation)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- return LockWaiterCount(&tag);
-}
-
-/*
- * UnlockRelationForExtension
- */
-void
-UnlockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- LockRelease(&tag, lockmode, false);
-}
-
-/*
* LockPage
*
* Obtain a page-level lock. This is currently used by some index access
@@ -961,12 +889,6 @@ DescribeLockTag(StringInfo buf, const LOCKTAG *tag)
tag->locktag_field2,
tag->locktag_field1);
break;
- case LOCKTAG_RELATION_EXTEND:
- appendStringInfo(buf,
- _("extension of relation %u of database %u"),
- tag->locktag_field2,
- tag->locktag_field1);
- break;
case LOCKTAG_PAGE:
appendStringInfo(buf,
_("page %u of relation %u of database %u"),
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 2b26173..bc576a7 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -45,6 +45,7 @@
#include "storage/sinvaladt.h"
#include "storage/spin.h"
#include "storage/standby.h"
+#include "storage/lmgr.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
#include "utils/resowner_private.h"
@@ -388,6 +389,10 @@ InitLocks(void)
max_table_size = NLOCKENTS();
init_table_size = max_table_size / 2;
+
+ /* Initialize lock structure for relation extension lock */
+ InitRelExtLock(max_table_size);
+
/*
* Allocate hash table for LOCK structs. This stores per-locked-object
* information.
@@ -3366,6 +3371,7 @@ LockShmemSize(void)
/* lock hash table */
max_table_size = NLOCKENTS();
size = add_size(size, hash_estimate_size(max_table_size, sizeof(LOCK)));
+ size = add_size(size, hash_estimate_size(max_table_size, sizeof(RelExtLock)));
/* proclock hash table */
max_table_size *= 2;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index f1060f9..bc25a53 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -451,6 +451,13 @@ InitializeLWLocks(void)
for (id = 0; id < NUM_PREDICATELOCK_PARTITIONS; id++, lock++)
LWLockInitialize(&lock->lock, LWTRANCHE_PREDICATE_LOCK_MANAGER);
+ /* Initialize relation extension lmgr's LWLocks in main array */
+ lock = MainLWLockArray + NUM_INDIVIDUAL_LWLOCKS +
+ NUM_BUFFER_PARTITIONS + NUM_LOCK_PARTITIONS +
+ NUM_PREDICATELOCK_PARTITIONS;
+ for (id = 0; id < NUM_RELEXTLOCK_PARTITIONS; id++, lock++)
+ LWLockInitialize(&lock->lock, LWTRANCHE_RELEXT_LOCK_MANAGER);
+
/* Initialize named tranches. */
if (NamedLWLockTrancheRequests > 0)
{
@@ -508,6 +515,7 @@ RegisterLWLockTranches(void)
LWLockRegisterTranche(LWTRANCHE_LOCK_MANAGER, "lock_manager");
LWLockRegisterTranche(LWTRANCHE_PREDICATE_LOCK_MANAGER,
"predicate_lock_manager");
+ LWLockRegisterTranche(LWTRANCHE_RELEXT_LOCK_MANAGER, "relext_lock_manager");
LWLockRegisterTranche(LWTRANCHE_PARALLEL_QUERY_DSA,
"parallel_query_dsa");
LWLockRegisterTranche(LWTRANCHE_SESSION_DSA,
diff --git a/src/backend/utils/adt/lockfuncs.c b/src/backend/utils/adt/lockfuncs.c
index 9e0a8ab..6d8916c 100644
--- a/src/backend/utils/adt/lockfuncs.c
+++ b/src/backend/utils/adt/lockfuncs.c
@@ -25,7 +25,6 @@
/* This must match enum LockTagType! */
const char *const LockTagTypeNames[] = {
"relation",
- "extend",
"page",
"tuple",
"transactionid",
@@ -234,7 +233,6 @@ pg_lock_status(PG_FUNCTION_ARGS)
switch ((LockTagType) instance->locktag.locktag_type)
{
case LOCKTAG_RELATION:
- case LOCKTAG_RELATION_EXTEND:
values[1] = ObjectIdGetDatum(instance->locktag.locktag_field1);
values[2] = ObjectIdGetDatum(instance->locktag.locktag_field2);
nulls[3] = true;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 089b7c3..958822f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -816,7 +816,8 @@ typedef enum
WAIT_EVENT_REPLICATION_ORIGIN_DROP,
WAIT_EVENT_REPLICATION_SLOT_DROP,
WAIT_EVENT_SAFE_SNAPSHOT,
- WAIT_EVENT_SYNC_REP
+ WAIT_EVENT_SYNC_REP,
+ WAIT_EVENT_RELATION_EXTENSION
} WaitEventIPC;
/* ----------
diff --git a/src/include/storage/extension_lock.h b/src/include/storage/extension_lock.h
new file mode 100644
index 0000000..f178672
--- /dev/null
+++ b/src/include/storage/extension_lock.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * extension_lock.h
+ * Relation extension lock manager
+ *
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/extension_lock.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef EXTENSION_H
+#define EXTENSION_H
+
+#ifdef FRONTEND
+#error "extension_lock.h may not be included from frontend code"
+#endif
+
+#include "storage/proclist_types.h"
+#include "storage/s_lock.h"
+#include "storage/condition_variable.h"
+#include "port/atomics.h"
+
+typedef struct RelExtLock
+{
+ Oid relid;
+ pg_atomic_uint32 state;
+ pg_atomic_uint32 nwaiters;
+ ConditionVariable cv;
+} RelExtLock;
+
+typedef enum RelExtLockMode
+{
+ RELEXT_EXCLUSIVE,
+ RELEXT_SHARED
+} RelExtLockMode;
+
+/* Lock a relation for extension */
+extern void InitRelExtLock(long max_table_size);
+extern void LockRelationForExtension(Relation relation, RelExtLockMode lockmode);
+extern void UnlockRelationForExtension(Relation relation, RelExtLockMode lockmode);
+extern bool ConditionalLockRelationForExtension(Relation relation, RelExtLockMode lockmode);
+extern int RelationExtensionLockWaiterCount(Relation relation);
+extern void RelationExtensionLockReleaseAll(void);
+
+#endif /* EXTENSION_H */
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 0b92322..ac23354 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -15,6 +15,7 @@
#define LMGR_H
#include "lib/stringinfo.h"
+#include "storage/extension_lock.h"
#include "storage/itemptr.h"
#include "storage/lock.h"
#include "utils/rel.h"
@@ -34,6 +35,36 @@ typedef enum XLTW_Oper
XLTW_RecheckExclusionConstr
} XLTW_Oper;
+typedef struct RELEXTLOCKTAG
+{
+ Oid relid; /* identifies the lockable object */
+ LWLockMode mode; /* lock mode for this table entry */
+} RELEXTLOCKTAG;
+
+/*
+ * This structure holds information per-object relation extension
+ * lock.
+ */
+typedef struct RELEXTLOCK
+{
+ RELEXTLOCKTAG tag; /* hash key -- must be first */
+ LWLock lock; /* LWLock for relation extension */
+} RELEXTLOCK;
+
+/*
+ * The LOCALRELEXTLOCK struct represents a local copy of data which is
+ * also present in the RELEXTLOCK table, organized for fast access without
+ * needing to acquire a LWLock. It is strictly for optimization.
+ */
+typedef struct LOCALRELEXTLOCK
+{
+ /* hash key */
+ RELEXTLOCKTAG relid; /* unique identifier of locktable object */
+
+ /* data */
+ bool held; /* is lock held? */
+} LOCALRELEXTLOCK;
+
extern void RelationInitLockInfo(Relation relation);
/* Lock a relation */
@@ -50,13 +81,6 @@ extern bool LockHasWaitersRelation(Relation relation, LOCKMODE lockmode);
extern void LockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
-/* Lock a relation for extension */
-extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
-extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
-extern bool ConditionalLockRelationForExtension(Relation relation,
- LOCKMODE lockmode);
-extern int RelationExtensionLockWaiterCount(Relation relation);
-
/* Lock a page (currently only used within indexes) */
extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
extern bool ConditionalLockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 765431e..3be18ea 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -138,8 +138,6 @@ typedef uint16 LOCKMETHODID;
typedef enum LockTagType
{
LOCKTAG_RELATION, /* whole relation */
- /* ID info for a relation is DB OID + REL OID; DB OID = 0 if shared */
- LOCKTAG_RELATION_EXTEND, /* the right to extend a relation */
/* same ID info as RELATION */
LOCKTAG_PAGE, /* one page of a relation */
/* ID info for a page is RELATION info + BlockNumber */
@@ -198,14 +196,6 @@ typedef struct LOCKTAG
(locktag).locktag_type = LOCKTAG_RELATION, \
(locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
-#define SET_LOCKTAG_RELATION_EXTEND(locktag,dboid,reloid) \
- ((locktag).locktag_field1 = (dboid), \
- (locktag).locktag_field2 = (reloid), \
- (locktag).locktag_field3 = 0, \
- (locktag).locktag_field4 = 0, \
- (locktag).locktag_type = LOCKTAG_RELATION_EXTEND, \
- (locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
-
#define SET_LOCKTAG_PAGE(locktag,dboid,reloid,blocknum) \
((locktag).locktag_field1 = (dboid), \
(locktag).locktag_field2 = (reloid), \
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index f4c4aed..2e9a1ac 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -120,14 +120,21 @@ extern PGDLLIMPORT int NamedLWLockTrancheRequests;
#define LOG2_NUM_PREDICATELOCK_PARTITIONS 4
#define NUM_PREDICATELOCK_PARTITIONS (1 << LOG2_NUM_PREDICATELOCK_PARTITIONS)
+/* Number of partitions the shared relation extension lock tables are divided into */
+#define LOG2_NUM_RELEXTLOCK_PARTITIONS 4
+#define NUM_RELEXTLOCK_PARTITIONS (1 << LOG2_NUM_RELEXTLOCK_PARTITIONS)
+
/* Offsets for various chunks of preallocated lwlocks. */
#define BUFFER_MAPPING_LWLOCK_OFFSET NUM_INDIVIDUAL_LWLOCKS
#define LOCK_MANAGER_LWLOCK_OFFSET \
(BUFFER_MAPPING_LWLOCK_OFFSET + NUM_BUFFER_PARTITIONS)
#define PREDICATELOCK_MANAGER_LWLOCK_OFFSET \
(LOCK_MANAGER_LWLOCK_OFFSET + NUM_LOCK_PARTITIONS)
-#define NUM_FIXED_LWLOCKS \
+#define RELEXTLOCK_MANAGER_LWLOCK_OFFSET \
(PREDICATELOCK_MANAGER_LWLOCK_OFFSET + NUM_PREDICATELOCK_PARTITIONS)
+#define NUM_FIXED_LWLOCKS \
+ (PREDICATELOCK_MANAGER_LWLOCK_OFFSET + NUM_PREDICATELOCK_PARTITIONS + \
+ NUM_RELEXTLOCK_PARTITIONS)
typedef enum LWLockMode
{
@@ -151,6 +158,8 @@ extern void LWLockReleaseClearVar(LWLock *lock, uint64 *valptr, uint64 val);
extern void LWLockReleaseAll(void);
extern bool LWLockHeldByMe(LWLock *lock);
extern bool LWLockHeldByMeInMode(LWLock *lock, LWLockMode mode);
+extern bool LWLockCheckForCleanup(LWLock *lock);
+extern int LWLockWaiterCount(LWLock *lock);
extern bool LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval);
extern void LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 value);
@@ -211,6 +220,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_BUFFER_MAPPING,
LWTRANCHE_LOCK_MANAGER,
LWTRANCHE_PREDICATE_LOCK_MANAGER,
+ LWTRANCHE_RELEXT_LOCK_MANAGER,
LWTRANCHE_PARALLEL_QUERY_DSA,
LWTRANCHE_SESSION_DSA,
LWTRANCHE_SESSION_RECORD_TABLE,
On Thu, Oct 26, 2017 at 12:36 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Since the previous patch conflicts with current HEAD, I attached the
updated patch for next CF.
I think we should back up here and ask ourselves a couple of questions:
1. What are we trying to accomplish here?
2. Is this the best way to accomplish it?
To the first question, the problem as I understand it is as follows:
Heavyweight locks don't conflict between members of a parallel group.
However, this is wrong for LOCKTAG_RELATION_EXTENSION, LOCKTAG_PAGE,
LOCKTAG_TUPLE, and LOCKTAG_SPECULATIVE_TOKEN. Currently, those cases
don't arise, because parallel operations are strictly read-only
(except for inserts by the leader into a just-created table, when only
one member of the group can be taking the lock anyway). However, once
we allow writes, they become possible, so some solution is needed.
To the second question, there are a couple of ways we could fix this.
First, we could continue to allow these locks to be taken in the
heavyweight lock manager, but make them conflict even between members
of the same lock group. This is, however, complicated. A significant
problem (or so I think) is that the deadlock detector logic, which is
already quite hard to test, will become even more complicated, since
wait edges between members of a lock group need to exist at some times
and not other times. Moreover, to the best of my knowledge, the
increased complexity would have no benefit, because it doesn't look to
me like we ever take any other heavyweight lock while holding one of
these four kinds of locks. Therefore, no deadlock can occur: if we're
waiting for one of these locks, the process that holds it is not
waiting for any other heavyweight lock. This gives rise to a second
idea: move these locks out of the heavyweight lock manager and handle
them with separate code that does not have deadlock detection and
doesn't need as many lock modes. I think that idea is basically
sound, although it's possibly not the only sound idea.
However, that makes me wonder whether we shouldn't be a bit more
aggressive with this patch: why JUST relation extension locks? Why
not all four types of locks listed above? Actually, tuple locks are a
bit sticky, because they have four lock modes. The other three kinds
are very similar -- all you can do is "take it" (implicitly, in
exclusive mode), "try to take it" (again, implicitly, in exclusive
mode), or "wait for it to be released" (i.e. share lock and then
release). Another idea is to try to handle those three types and
leave the tuple locking problem for another day.
I suggest that a good thing to do more or less immediately, regardless
of when this patch ends up being ready, would be to add an assertion
that LockAcquire() is never called while holding a lock of
one of these types. If that assertion ever fails, then the whole
theory that these lock types don't need deadlock detection is wrong,
and we'd like to find out about that sooner or later.
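As a rough illustration of that proposed assertion, here is a hypothetical standalone model (all names invented, not the actual PostgreSQL code): keep a per-backend count of held extension-style locks and assert it is zero on entry to the heavyweight lock manager.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical standalone model (invented names, not PostgreSQL code):
 * track, per backend, how many of the "no deadlock detection" lock types
 * (relation extension, page, speculative token) are currently held, and
 * assert in the LockAcquire() stand-in that the count is zero. */

static int held_relext_locks = 0;   /* models a per-backend global */

static void
rel_ext_lock_acquire_model(void)
{
    held_relext_locks++;
}

static void
rel_ext_lock_release_model(void)
{
    assert(held_relext_locks > 0);
    held_relext_locks--;
}

/* The proposed assertion at the heavyweight lock manager's entry point. */
static bool
lock_acquire_model(void)
{
    /* If this ever fires, some path takes a heavyweight lock while holding
     * one of these lock types, and the no-deadlock argument is wrong. */
    assert(held_relext_locks == 0);
    return true;
}
```

In an assert-enabled build, any code path that violated the invariant would fail immediately rather than silently deadlocking later.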
On the details of the patch, it appears that RelExtLockAcquire()
executes the wait-for-lock code with the partition lock held, and then
continues to hold the partition lock for the entire time that the
relation extension lock is held. That not only makes all code that
runs while holding the lock non-interruptible but makes a lot of the
rest of this code pointless. How is any of this atomics code going to
be reached by more than one process at the same time if the entire
bucket is exclusive-locked? I would guess that the concurrency is not
very good here for the same reason. Of course, just releasing the
bucket lock wouldn't be right either, because then ext_lock might go
away while we've got a pointer to it, which wouldn't be good. I think
you could make this work if each lock had both a locker count and a
pin count, and the object can only be removed when the pin_count is 0.
So the lock algorithm would look like this:
- Acquire the partition LWLock.
- Find the item of interest, creating it if necessary. If out of
memory for more elements, sweep through the table and reclaim
0-pin-count entries, then retry.
- Increment the pin count.
- Attempt to acquire the lock atomically; if we succeed, release the
partition lock and return.
- If this was a conditional-acquire, then decrement the pin count,
release the partition lock and return.
- Release the partition lock.
- Sleep on the condition variable until we manage to atomically
acquire the lock.
The unlock algorithm would just decrement the pin count and, if the
resulting value is non-zero, broadcast on the condition variable.
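The acquire/release steps above can be sketched with C11 atomics as follows. This is a hypothetical single-entry model: the partition LWLock, the hash table, and the condition-variable sleep are elided into comments, and the names are invented rather than taken from the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of one hash entry: a locker state plus a pin count;
 * the entry may be reclaimed from the shared hash table only when
 * pin_count is zero. */
typedef struct
{
    atomic_uint state;      /* nonzero while the extension lock is held */
    atomic_uint pin_count;  /* pins keep the entry from being reclaimed */
} RelExtLockModel;

static bool
rel_ext_acquire(RelExtLockModel *l, bool conditional)
{
    unsigned expected = 0;

    /* ... partition LWLock acquired; entry found or created ... */
    atomic_fetch_add(&l->pin_count, 1);

    if (atomic_compare_exchange_strong(&l->state, &expected, 1))
        return true;            /* got it; caller drops the partition lock */

    if (conditional)
    {
        atomic_fetch_sub(&l->pin_count, 1);
        return false;           /* don't wait */
    }

    /* ... drop the partition lock, then loop in ConditionVariableSleep()
     * retrying the compare-and-swap; elided in this model ... */
    return false;
}

static void
rel_ext_release(RelExtLockModel *l)
{
    atomic_store(&l->state, 0);
    if (atomic_fetch_sub(&l->pin_count, 1) > 1)
    {
        /* other pinners remain: ConditionVariableBroadcast() here */
    }
}
```

Note that the pin taken in the acquire path is held for the whole time the lock is held, which is what makes it safe to keep a pointer to the entry after dropping the partition lock.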
Although I think this will work, I'm not sure this is actually a great
algorithm. Every lock acquisition has to take and release the
partition lock, use at least two more atomic ops (to take the pin and
the lock), and search a hash table. I don't think that's going to be
staggeringly fast. Maybe it's OK. It's not that much worse, possibly
not any worse, than what the main lock manager does now. However,
especially if we implement a solution specific to relation locks, it
seems like it would be better if we could somehow optimize based on
the facts that (1) many relation locks will not conflict and (2) it's
very common for the same backend to take and release the same
extension lock over and over again. I don't have a specific proposal
right now.
Whatever we end up with, I think we should write some kind of a test
harness to benchmark the number of acquire/release cycles per second
that we can do with the current relation extension lock system vs. the
proposed new system. Ideally, we'd be faster, since we're proposing a
more specialized mechanism. But at least we should not be slower.
pgbench isn't a good test because the relation extension lock will
barely be taken let alone contended; we need to check something like
parallel copies into the same table to see any effect.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Oct 27, 2017 at 12:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Thank you for summarizing the purpose and discussion of this patch.
I'm on the same page.
Understood. I'll check that first. If this direction has no problem
and we change these three lock types to use the new lock mechanism,
we will no longer be able to hold more than one of these locks at the
same time. Since that also imposes a limitation on future development,
we should think about it carefully. We could implement deadlock
detection for the new mechanism again later, but that would not make
sense.
Thank you for the suggestion!
Yeah, we can optimize based on the purpose of the solution. In either
case I should answer the above question first.
I did a benchmark using a custom script that always updates the
primary key (disabling HOT updates). But parallel copies into the same
table would also be good. Thank you.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Mon, Oct 30, 2017 at 3:17 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've checked whether LockAcquire() is called while holding a lock of
one of the four types: LOCKTAG_RELATION_EXTENSION, LOCKTAG_PAGE,
LOCKTAG_TUPLE, and LOCKTAG_SPECULATIVE_TOKEN. To summarize, I think
that we cannot move all four lock types out of the heavyweight lock
manager together, but we can move the relation extension lock alone
with some tricks.
Here are the details of the survey.
* LOCKTAG_RELATION_EXTENSION
There is a path where LockRelationForExtension() can be called while
already holding another relation extension lock. In
brin_getinsertbuffer(), we acquire a relation extension lock on an
index relation and may initialize a new buffer
(brin_initialize_empty_new_buffer()). While initializing the new
buffer, we call RecordPageWithFreeSpace(), which can eventually call
fsm_readbuf(rel, addr, true), where the third argument means "extend".
We can handle this problem by keeping a list (or local hash) of
acquired locks and skipping the acquisition if the lock is already
held. For the other call paths that call LockRelationForExtension(), I
don't see any problem.
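That skip-if-already-held idea could look roughly like this. This is a hypothetical per-backend tracking sketch: the array size and function names are invented, and the real patch would presumably integrate this with its local hash instead.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int Oid;           /* stand-in for PostgreSQL's Oid */

/* Hypothetical per-backend list of relations whose extension lock we
 * already hold; the size and names are invented for illustration. */
#define MAX_HELD_RELEXT_LOCKS 8
static Oid held_relids[MAX_HELD_RELEXT_LOCKS];
static int nheld = 0;

/* True if we already hold the extension lock on relid, so a nested
 * caller (e.g. the fsm_readbuf() extend path reached from
 * brin_getinsertbuffer()) can skip re-acquiring it. */
static bool
rel_ext_already_held(Oid relid)
{
    for (int i = 0; i < nheld; i++)
        if (held_relids[i] == relid)
            return true;
    return false;
}

static void
rel_ext_remember(Oid relid)
{
    assert(nheld < MAX_HELD_RELEXT_LOCKS);
    held_relids[nheld++] = relid;
}

static void
rel_ext_forget(Oid relid)
{
    for (int i = 0; i < nheld; i++)
        if (held_relids[i] == relid)
        {
            held_relids[i] = held_relids[--nheld];
            return;
        }
}
```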
* LOCKTAG_PAGE, LOCKTAG_TUPLE, LOCKTAG_SPECULATIVE_TOKEN
There are paths where we can acquire a relation extension lock while
holding these locks.
For LOCKTAG_PAGE, ginInsertCleanup() acquires a page lock on the meta
page and processes the pending list, which can acquire a relation
extension lock on an index relation. For LOCKTAG_TUPLE, heap_update()
acquires a tuple lock and can call RelationGetBufferForTuple(). For
LOCKTAG_SPECULATIVE_TOKEN, ExecInsert() acquires a speculative
insertion lock and calls heap_insert() and ExecInsertIndexTuples().
The operations called while holding each of these lock types can
acquire a relation extension lock.
Also, below is the list of places where we call LockAcquire() with
these four lock types (the result of git grep "XXX"). I checked based
on this list.
* LockRelationForExtension()
contrib/bloom/blutils.c:
LockRelationForExtension(index, ExclusiveLock);
contrib/pgstattuple/pgstattuple.c:
LockRelationForExtension(rel, ExclusiveLock);
src/backend/access/brin/brin_pageops.c:
LockRelationForExtension(idxrel, ShareLock);
src/backend/access/brin/brin_pageops.c:
LockRelationForExtension(irel, ExclusiveLock);
src/backend/access/brin/brin_revmap.c:
LockRelationForExtension(irel, ExclusiveLock);
src/backend/access/gin/ginutil.c:
LockRelationForExtension(index, ExclusiveLock);
src/backend/access/gin/ginvacuum.c:
LockRelationForExtension(index, ExclusiveLock);
src/backend/access/gin/ginvacuum.c:
LockRelationForExtension(index, ExclusiveLock);
src/backend/access/gist/gistutil.c:
LockRelationForExtension(r, ExclusiveLock);
src/backend/access/gist/gistvacuum.c:
LockRelationForExtension(rel, ExclusiveLock);
src/backend/access/gist/gistvacuum.c:
LockRelationForExtension(rel, ExclusiveLock);
src/backend/access/heap/hio.c:
LockRelationForExtension(relation, ExclusiveLock);
src/backend/access/heap/hio.c:
LockRelationForExtension(relation, ExclusiveLock);
src/backend/access/heap/visibilitymap.c:
LockRelationForExtension(rel, ExclusiveLock);
src/backend/access/nbtree/nbtpage.c:
LockRelationForExtension(rel, ExclusiveLock);
src/backend/access/nbtree/nbtree.c:
LockRelationForExtension(rel, ExclusiveLock);
src/backend/access/spgist/spgutils.c:
LockRelationForExtension(index, ExclusiveLock);
src/backend/access/spgist/spgvacuum.c:
LockRelationForExtension(index, ExclusiveLock);
src/backend/commands/vacuumlazy.c:
LockRelationForExtension(onerel, ExclusiveLock);
src/backend/storage/freespace/freespace.c:
LockRelationForExtension(rel, ExclusiveLock);
src/backend/storage/lmgr/lmgr.c:LockRelationForExtension(Relation
relation, LOCKMODE lockmode)
* ConditionalLockRelationForExtension
src/backend/access/heap/hio.c: else if
(!ConditionalLockRelationForExtension(relation, ExclusiveLock))
src/backend/storage/lmgr/lmgr.c:ConditionalLockRelationForExtension(Relation
relation, LOCKMODE lockmode)
* LockPage
src/backend/access/gin/ginfast.c: LockPage(index,
GIN_METAPAGE_BLKNO, ExclusiveLock);
* ConditionalLockPage
src/backend/access/gin/ginfast.c: if
(!ConditionalLockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock))
* LockTuple
src/backend/access/heap/heapam.c: LockTuple((rel), (tup),
tupleLockExtraInfo[mode].hwlock)
* ConditionalLockTuple
src/backend/access/heap/heapam.c: ConditionalLockTuple((rel),
(tup), tupleLockExtraInfo[mode].hwlock)
src/backend/storage/lmgr/lmgr.c:ConditionalLockTuple(Relation
relation, ItemPointer tid, LOCKMODE lockmode)
* SpeculativeInsertionLockAcquire
src/backend/executor/nodeModifyTable.c: specToken =
SpeculativeInsertionLockAcquire(GetCurrentTransactionId());
If this direction has no problem
and we changed these three locks so that it uses new lock mechanism,
we'll not be able to use these locks at the same time. Since it also
means that we impose a limitation to the future we should think
carefully about it. We can implement the deadlock detection mechanism
for it again but it doesn't make sense.On the details of the patch, it appears that RelExtLockAcquire()
executes the wait-for-lock code with the partition lock held, and then
continues to hold the partition lock for the entire time that the
relation extension lock is held. That not only makes all code that
runs while holding the lock non-interruptible but makes a lot of the
rest of this code pointless. How is any of this atomics code going to
be reached by more than one process at the same time if the entire
bucket is exclusive-locked? I would guess that the concurrency is not
very good here for the same reason. Of course, just releasing the
bucket lock wouldn't be right either, because then ext_lock might go
away while we've got a pointer to it, which wouldn't be good. I think
you could make this work if each lock had both a locker count and a
pin count, and the object can only be removed when the pin_count is 0.
So the lock algorithm would look like this:- Acquire the partition LWLock.
- Find the item of interest, creating it if necessary. If out of
memory for more elements, sweep through the table and reclaim
0-pin-count entries, then retry.
- Increment the pin count.
- Attempt to acquire the lock atomically; if we succeed, release the
partition lock and return.
- If this was a conditional-acquire, then decrement the pin count,
release the partition lock and return.
- Release the partition lock.
- Sleep on the condition variable until we manage to atomically
acquire the lock.The unlock algorithm would just decrement the pin count and, if the
resulting value is non-zero, broadcast on the condition variable.Thank you for the suggestion!
Although I think this will work, I'm not sure this is actually a great
algorithm. Every lock acquisition has to take and release the
partition lock, use at least two more atomic ops (to take the pin and
the lock), and search a hash table. I don't think that's going to be
staggeringly fast. Maybe it's OK. It's not that much worse, possibly
not any worse, than what the main lock manager does now. However,
especially if we implement a solution specific to relation locks, it
seems like it would be better if we could somehow optimize based on
the facts that (1) many relation locks will not conflict and (2) it's
very common for the same backend to take and release the same
extension lock over and over again. I don't have a specific proposal
right now.

Yeah, we can optimize based on the purpose of the solution. In either
case I should answer the above question first.

Whatever we end up with, I think we should write some kind of a test
harness to benchmark the number of acquire/release cycles per second
that we can do with the current relation extension lock system vs. the
proposed new system. Ideally, we'd be faster, since we're proposing a
more specialized mechanism. But at least we should not be slower.
pgbench isn't a good test because the relation extension lock will
barely be taken let alone contended; we need to check something like
parallel copies into the same table to see any effect.

I did a benchmark using a custom script that always updates the
primary key (disabling HOT updates). But parallel copies into the same
table would also be good. Thank you.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Nov 6, 2017 at 4:42 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I suggest that a good thing to do more or less immediately, regardless
of when this patch ends up being ready, would be to insert an
assertion that LockAcquire() is never called while holding a lock of
one of these types. If that assertion ever fails, then the whole
theory that these lock types don't need deadlock detection is wrong,
and we'd like to find out about that sooner or later.

I understood. I'll check that first.
I've checked whether LockAcquire is called while holding a lock of one
of four types: LOCKTAG_RELATION_EXTENSION, LOCKTAG_PAGE,
LOCKTAG_TUPLE, and LOCKTAG_SPECULATIVE_TOKEN. To summarize, I think that
we cannot move these four lock types together out of heavy-weight
lock, but can move only the relation extension lock with tricks.

Here are the details of the survey.
Thanks for these details, but I'm not sure I fully understand.
* LOCKTAG_RELATION_EXTENSION
There is a path where LockRelationForExtension() can be called while
holding another relation extension lock. In brin_getinsertbuffer(), we
acquire a relation extension lock for an index relation and could
initialize a new buffer (brin_initialize_empty_new_buffer()). While
initializing a new buffer, we call RecordPageWithFreeSpace(), which
eventually can call fsm_readbuf(rel, addr, true), where the third
argument is "extend". We can address this problem by keeping a list
(or local hash) of acquired locks and skipping the acquisition if the
lock is already held. For the other call paths calling
LockRelationForExtension, I don't see any problem.
Does calling fsm_readbuf(rel,addr,true) take some heavyweight lock?
Basically, what matters here in the end is whether we can articulate a
deadlock-proof rule around the order in which these locks are
acquired. The simplest such rule would be "you can only acquire one
lock of any of these types at a time, and you can't subsequently
acquire a heavyweight lock". But a more complicated rule would be OK
too, e.g. "you can acquire as many heavyweight locks as you want, and
after that you can optionally acquire one page, tuple, or speculative
token lock, and after that you can acquire a relation extension lock".
The latter rule, although more complex, is still deadlock-proof,
because the heavyweight locks still use the deadlock detector, and the
rest has a consistent order of lock acquisition that precludes one
backend taking A then B while another backend takes B then A. I'm not
entirely clear whether your survey leads us to a place where we can
articulate such a deadlock-proof rule.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Nov 8, 2017 at 5:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Nov 6, 2017 at 4:42 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I suggest that a good thing to do more or less immediately, regardless
of when this patch ends up being ready, would be to insert an
assertion that LockAcquire() is never called while holding a lock of
one of these types. If that assertion ever fails, then the whole
theory that these lock types don't need deadlock detection is wrong,
and we'd like to find out about that sooner or later.

I understood. I'll check that first.
I've checked whether LockAcquire is called while holding a lock of one
of four types: LOCKTAG_RELATION_EXTENSION, LOCKTAG_PAGE,
LOCKTAG_TUPLE, and LOCKTAG_SPECULATIVE_TOKEN. To summarize, I think that
we cannot move these four lock types together out of heavy-weight
lock, but can move only the relation extension lock with tricks.

Here are the details of the survey.
Thanks for these details, but I'm not sure I fully understand.
* LOCKTAG_RELATION_EXTENSION
There is a path where LockRelationForExtension() can be called while
holding another relation extension lock. In brin_getinsertbuffer(), we
acquire a relation extension lock for an index relation and could
initialize a new buffer (brin_initialize_empty_new_buffer()). While
initializing a new buffer, we call RecordPageWithFreeSpace(), which
eventually can call fsm_readbuf(rel, addr, true), where the third
argument is "extend". We can address this problem by keeping a list
(or local hash) of acquired locks and skipping the acquisition if the
lock is already held. For the other call paths calling
LockRelationForExtension, I don't see any problem.

Does calling fsm_readbuf(rel,addr,true) take some heavyweight lock?
No, I meant fsm_readbuf(rel,addr,true) can acquire a relation
extension lock. So it's not a problem.
Basically, what matters here in the end is whether we can articulate a
deadlock-proof rule around the order in which these locks are
acquired.
You're right; my survey was not enough to make a decision.

As far as acquiring these four lock types goes, there are two call
paths that acquire one of them while holding another. The first is
acquiring a relation extension lock and then acquiring a relation
extension lock for the same relation again. As explained before, this
can be resolved by remembering the held lock (perhaps remembering only
the last one is enough). The second is acquiring either a tuple lock,
a page lock, or a speculative insertion lock and then acquiring a
relation extension lock. In the second case, we always try to acquire
the two locks in the same order: one of the three lock types first,
then the extension lock. So it's not a problem if we apply the rule
that we disallow acquiring any of these three lock types while holding
a relation extension lock. Also, as far as I surveyed, there is no
path that acquires a relation extension lock while holding the other
three lock types.
The simplest such rule would be "you can only acquire one
lock of any of these types at a time, and you can't subsequently
acquire a heavyweight lock". But a more complicated rule would be OK
too, e.g. "you can acquire as many heavyweight locks as you want, and
after that you can optionally acquire one page, tuple, or speculative
token lock, and after that you can acquire a relation extension lock".
The latter rule, although more complex, is still deadlock-proof,
because the heavyweight locks still use the deadlock detector, and the
rest has a consistent order of lock acquisition that precludes one
backend taking A then B while another backend takes B then A. I'm not
entirely clear whether your survey leads us to a place where we can
articulate such a deadlock-proof rule.
Speaking of these four lock types and heavyweight locks, there
obviously are call paths that acquire any of the four lock types while
holding a heavyweight lock. In the reverse direction, there is also a
call path where we acquire a heavyweight lock while holding one of the
four lock types. The call path I found is that in heap_delete we
acquire a tuple lock and call XactLockTableWait or MultiXactIdWait,
which eventually could acquire LOCKTAG_TRANSACTION in order to wait
for the concurrent transactions to finish. But IIUC, since these
functions acquire the lock for the concurrent transaction's
transaction id, deadlocks don't happen.
However, there might be other similar call paths I'm missing. For
example, we do some operations that might acquire heavyweight locks
other than LOCKTAG_TRANSACTION while holding a page lock (in
ginInsertCleanup) or a speculative insertion lock (in
nodeModifyTable).

To summarize, I think we can adopt the following rules in order to
move the four lock types out of the heavyweight lock manager:

1. Do not acquire a tuple lock, a page lock, or a speculative
insertion lock while holding an extension lock.
2. Do not acquire any heavyweight lock except for LOCKTAG_TRANSACTION
while holding any of these four lock types.

Also, I'm concerned that this imposes rules on developers that are
difficult to check statically. We can put several assertions into the
source code, but it's hard to test all possible paths with regression
tests.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Wed, Nov 8, 2017 at 9:40 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Speaking of these four lock types and heavyweight locks, there
obviously are call paths that acquire any of the four lock types while
holding a heavyweight lock. In the reverse direction, there is also a
call path where we acquire a heavyweight lock while holding one of the
four lock types. The call path I found is that in heap_delete we
acquire a tuple lock and call XactLockTableWait or MultiXactIdWait,
which eventually could acquire LOCKTAG_TRANSACTION in order to wait
for the concurrent transactions to finish. But IIUC, since these
functions acquire the lock for the concurrent transaction's
transaction id, deadlocks don't happen.
No, that's not right. Now that you mention it, I realize that tuple
locks can definitely cause deadlocks. Example:
setup:
rhaas=# create table foo (a int, b text);
CREATE TABLE
rhaas=# create table bar (a int, b text);
CREATE TABLE
rhaas=# insert into foo values (1, 'hoge');
INSERT 0 1
session 1:
rhaas=# begin;
BEGIN
rhaas=# update foo set b = 'hogehoge' where a = 1;
UPDATE 1
session 2:
rhaas=# begin;
BEGIN
rhaas=# update foo set b = 'quux' where a = 1;
session 3:
rhaas=# begin;
BEGIN
rhaas=# lock bar;
LOCK TABLE
rhaas=# update foo set b = 'blarfle' where a = 1;
back to session 1:
rhaas=# select * from bar;
ERROR: deadlock detected
LINE 1: select * from bar;
^
DETAIL: Process 88868 waits for AccessShareLock on relation 16391 of
database 16384; blocked by process 88845.
Process 88845 waits for ExclusiveLock on tuple (0,1) of relation 16385
of database 16384; blocked by process 88840.
Process 88840 waits for ShareLock on transaction 1193; blocked by process 88868.
HINT: See server log for query details.
So what I said before was wrong: we definitely cannot exclude tuple
locks from deadlock detection. However, we might be able to handle
the problem in another way: introduce a separate, parallel-query
specific mechanism to avoid having two participants try to update
and/or delete the same tuple at the same time - e.g. advertise the
BufferTag + offset within the page in DSM, and if somebody else
already has that same combination advertised, wait until they no
longer do. That shouldn't ever deadlock, because the other worker
shouldn't be able to find itself waiting for us while it's busy
updating a tuple.
After some further study, speculative insertion locks look problematic
too. I'm worried about the code path ExecInsert() [taking a
speculative insertion lock] -> heap_insert -> heap_prepare_insert ->
toast_insert_or_update -> toast_save_datum ->
heap_open(rel->rd_rel->reltoastrelid, RowExclusiveLock). That sure
looks like we can end up waiting for a relation lock while holding a
speculative insertion lock, which seems to mean that speculative
insertion locks are subject to at least theoretical deadlock hazards
as well. Note that even if we were guaranteed to be holding the lock
on the toast relation already at this point, it wouldn't fix the
problem, because we might still have to build or refresh a relcache
entry at this point, which could end up scanning (and thus locking)
system catalogs. Any syscache lookup can theoretically take a lock,
even though most of the time it doesn't, and thus taking a lock that
has been removed from the deadlock detector (or, say, an lwlock) and
then performing a syscache lookup with it held is not OK. So I don't
think we can remove speculative insertion locks from the deadlock
detector either.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
No, that's not right. Now that you mention it, I realize that tuple
locks can definitely cause deadlocks. Example:
Yeah. Foreign-key-related tuple locks are another rich source of
examples.
... So I don't
think we can remove speculative insertion locks from the deadlock
detector either.
That scares me too. I think that relation extension can safely
be transferred to some lower-level mechanism, because what has to
be done while holding the lock is circumscribed and below the level
of database operations (which might need other locks). These other
ideas seem a lot riskier.
(But see recent conversation where I discouraged Alvaro from holding
extension locks across BRIN summarization activity. We'll need to look
and make sure that nobody else has had creative ideas like that.)
regards, tom lane
Thank you for pointing out and comments.
On Fri, Nov 10, 2017 at 12:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
No, that's not right. Now that you mention it, I realize that tuple
locks can definitely cause deadlocks. Example:

Yeah. Foreign-key-related tuple locks are another rich source of
examples.

... So I don't think we can remove speculative insertion locks from
the deadlock detector either.

That scares me too. I think that relation extension can safely
be transferred to some lower-level mechanism, because what has to
be done while holding the lock is circumscribed and below the level
of database operations (which might need other locks). These other
ideas seem a lot riskier.

(But see recent conversation where I discouraged Alvaro from holding
extension locks across BRIN summarization activity. We'll need to look
and make sure that nobody else has had creative ideas like that.)
It seems that we should focus on transferring only relation extension
locks as a first step. The page locks would also be safe, but that
might require some fundamental changes related to fast insertion,
which is discussed on another thread[1]. Also in this case, I think
it's better to focus on relation extension locks so that we can
optimize the lower-level lock mechanism for them.

So I'll update the patch based on the comments I got from Robert before.

[1]: /messages/by-id/CAD21AoBLUSyiYKnTYtSAbC+F=XDjiaBrOUEGK+zUXdQ8owfPKw@mail.gmail.com
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Tue, Nov 14, 2017 at 4:36 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Thank you for pointing out and comments.
On Fri, Nov 10, 2017 at 12:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
No, that's not right. Now that you mention it, I realize that tuple
locks can definitely cause deadlocks. Example:

Yeah. Foreign-key-related tuple locks are another rich source of
examples.

... So I don't think we can remove speculative insertion locks from
the deadlock detector either.

That scares me too. I think that relation extension can safely
be transferred to some lower-level mechanism, because what has to
be done while holding the lock is circumscribed and below the level
of database operations (which might need other locks). These other
ideas seem a lot riskier.

(But see recent conversation where I discouraged Alvaro from holding
extension locks across BRIN summarization activity. We'll need to look
and make sure that nobody else has had creative ideas like that.)

It seems that we should focus on transferring only relation extension
locks as a first step. The page locks would also be safe, but that
might require some fundamental changes related to fast insertion,
which is discussed on another thread[1]. Also in this case, I think
it's better to focus on relation extension locks so that we can
optimize the lower-level lock mechanism for them.

So I'll update the patch based on the comments I got from Robert before.
Attached is an updated version of the patch. I've moved only relation
extension locks out of the heavyweight lock manager, as per the
discussion so far.
I've done a write-heavy benchmark on my laptop: loading 24kB of data
into one table using COPY with 1 client, for 10 seconds. The
throughput of the patched version is about 10% better than current
HEAD. The results of 5 runs are the following.
----- PATCHED -----
tps = 178.791515 (excluding connections establishing)
tps = 176.522693 (excluding connections establishing)
tps = 168.705442 (excluding connections establishing)
tps = 158.158009 (excluding connections establishing)
tps = 161.145709 (excluding connections establishing)
----- HEAD -----
tps = 147.079803 (excluding connections establishing)
tps = 149.079540 (excluding connections establishing)
tps = 149.082275 (excluding connections establishing)
tps = 148.255376 (excluding connections establishing)
tps = 145.542552 (excluding connections establishing)
Also, I've done a micro-benchmark: calling LockRelationForExtension
and UnlockRelationForExtension in a tight loop in order to measure the
number of lock/unlock cycles per second. The result is:

PATCHED = 3.95892e+06 (cycles/sec)
HEAD = 1.15284e+06 (cycles/sec)

The patched version is more than 3 times faster than current HEAD.
Attached are the updated patch and the function I used for the
micro-benchmark. Please review them.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachments:
Moving_extension_lock_out_of_heavyweight_lock_v6.patch (text/x-patch; charset=US-ASCII)
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 09db5c6..4e64258 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -623,8 +623,8 @@ brin_page_cleanup(Relation idxrel, Buffer buf)
*/
if (PageIsNew(page))
{
- LockRelationForExtension(idxrel, ShareLock);
- UnlockRelationForExtension(idxrel, ShareLock);
+ LockRelationForExtension(idxrel, RELEXT_SHARED);
+ UnlockRelationForExtension(idxrel, RELEXT_SHARED);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
if (PageIsNew(page))
@@ -716,7 +716,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
*/
if (!RELATION_IS_LOCAL(irel))
{
- LockRelationForExtension(irel, ExclusiveLock);
+ LockRelationForExtension(irel, RELEXT_EXCLUSIVE);
extensionLockHeld = true;
}
buf = ReadBuffer(irel, P_NEW);
@@ -768,7 +768,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
}
if (extensionLockHeld)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel, RELEXT_EXCLUSIVE);
ReleaseBuffer(buf);
return InvalidBuffer;
@@ -778,7 +778,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
if (extensionLockHeld)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel, RELEXT_EXCLUSIVE);
page = BufferGetPage(buf);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index 03e53ce..c8fc1ab 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -570,7 +570,7 @@ revmap_physical_extend(BrinRevmap *revmap)
else
{
if (needLock)
- LockRelationForExtension(irel, ExclusiveLock);
+ LockRelationForExtension(irel, RELEXT_EXCLUSIVE);
buf = ReadBuffer(irel, P_NEW);
if (BufferGetBlockNumber(buf) != mapBlk)
@@ -582,7 +582,7 @@ revmap_physical_extend(BrinRevmap *revmap)
* page from under whoever is using it.
*/
if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel, RELEXT_EXCLUSIVE);
LockBuffer(revmap->rm_metaBuf, BUFFER_LOCK_UNLOCK);
ReleaseBuffer(buf);
return;
@@ -591,7 +591,7 @@ revmap_physical_extend(BrinRevmap *revmap)
page = BufferGetPage(buf);
if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel, RELEXT_EXCLUSIVE);
}
/* Check that it's a regular block (or an empty page) */
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index d9c6483..1af884a 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -325,13 +325,13 @@ GinNewBuffer(Relation index)
/* Must extend the file */
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, RELEXT_EXCLUSIVE);
buffer = ReadBuffer(index, P_NEW);
LockBuffer(buffer, GIN_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index, RELEXT_EXCLUSIVE);
return buffer;
}
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 394bc83..c1a89f9 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -716,10 +716,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, RELEXT_EXCLUSIVE);
npages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index, RELEXT_EXCLUSIVE);
totFreePages = 0;
@@ -766,10 +766,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
stats->pages_free = totFreePages;
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, RELEXT_EXCLUSIVE);
stats->num_pages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index, RELEXT_EXCLUSIVE);
return stats;
}
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index d8d1c0a..5f4fe13 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -821,13 +821,13 @@ gistNewBuffer(Relation r)
needLock = !RELATION_IS_LOCAL(r);
if (needLock)
- LockRelationForExtension(r, ExclusiveLock);
+ LockRelationForExtension(r, RELEXT_EXCLUSIVE);
buffer = ReadBuffer(r, P_NEW);
LockBuffer(buffer, GIST_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(r, ExclusiveLock);
+ UnlockRelationForExtension(r, RELEXT_EXCLUSIVE);
return buffer;
}
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 77d9d12..ca45b06 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -59,10 +59,10 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
/* try to find deleted pages */
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
npages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, RELEXT_EXCLUSIVE);
totFreePages = 0;
for (blkno = GIST_ROOT_BLKNO + 1; blkno < npages; blkno++)
@@ -91,10 +91,10 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
/* return statistics */
stats->pages_free = totFreePages;
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
stats->num_pages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, RELEXT_EXCLUSIVE);
return stats;
}
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 13e3bdc..a8ce6c7 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -519,11 +519,11 @@ loop:
if (needLock)
{
if (!use_fsm)
- LockRelationForExtension(relation, ExclusiveLock);
- else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock))
+ LockRelationForExtension(relation, RELEXT_EXCLUSIVE);
+ else if (!ConditionalLockRelationForExtension(relation, RELEXT_EXCLUSIVE))
{
/* Couldn't get the lock immediately; wait for it. */
- LockRelationForExtension(relation, ExclusiveLock);
+ LockRelationForExtension(relation, RELEXT_EXCLUSIVE);
/*
* Check if some other backend has extended a block for us while
@@ -537,7 +537,7 @@ loop:
*/
if (targetBlock != InvalidBlockNumber)
{
- UnlockRelationForExtension(relation, ExclusiveLock);
+ UnlockRelationForExtension(relation, RELEXT_EXCLUSIVE);
goto loop;
}
@@ -576,7 +576,7 @@ loop:
* against vacuumlazy.c --- see comments therein.
*/
if (needLock)
- UnlockRelationForExtension(relation, ExclusiveLock);
+ UnlockRelationForExtension(relation, RELEXT_EXCLUSIVE);
/*
* We need to initialize the empty new page. Double-check that it really
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 4c2a13a..7dc3088 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -641,7 +641,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
* Note that another backend might have extended or created the relation
* by the time we get the lock.
*/
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
@@ -679,7 +679,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
/* Update local cache with the up-to-date size */
rel->rd_smgr->smgr_vm_nblocks = vm_nblocks_now;
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, RELEXT_EXCLUSIVE);
pfree(pg);
}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index c774349..0eb1102 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -659,7 +659,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
needLock = !RELATION_IS_LOCAL(rel);
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
buf = ReadBuffer(rel, P_NEW);
@@ -673,7 +673,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
* condition against btvacuumscan --- see comments therein.
*/
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, RELEXT_EXCLUSIVE);
/* Initialize the new page before returning it */
page = BufferGetPage(buf);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 399e6a1..be457b0 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1058,10 +1058,10 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
{
/* Get the current relation length */
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
num_pages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, RELEXT_EXCLUSIVE);
/* Quit if we've scanned the whole relation */
if (blkno >= num_pages)
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index bd5301f..8f54015 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -230,13 +230,13 @@ SpGistNewBuffer(Relation index)
/* Must extend the file */
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, RELEXT_EXCLUSIVE);
buffer = ReadBuffer(index, P_NEW);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index, RELEXT_EXCLUSIVE);
return buffer;
}
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index d7d5e90..3888d93 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -824,10 +824,10 @@ spgvacuumscan(spgBulkDeleteState *bds)
{
/* Get the current relation length */
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, RELEXT_EXCLUSIVE);
num_pages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index, RELEXT_EXCLUSIVE);
/* Quit if we've scanned the whole relation */
if (blkno >= num_pages)
diff --git a/src/backend/commands/discard.c b/src/backend/commands/discard.c
index f0dcd87..216c197 100644
--- a/src/backend/commands/discard.c
+++ b/src/backend/commands/discard.c
@@ -19,6 +19,7 @@
#include "commands/discard.h"
#include "commands/prepare.h"
#include "commands/sequence.h"
+#include "storage/extension_lock.h"
#include "utils/guc.h"
#include "utils/portal.h"
@@ -71,6 +72,7 @@ DiscardAll(bool isTopLevel)
ResetAllOptions();
DropAllPreparedStatements();
Async_UnlistenAll();
+ RelExtLockReleaseAll();
LockReleaseAll(USER_LOCKMETHOD, true);
ResetPlanCache();
ResetTempTableNamespace();
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 6587db7..56ee82b 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -860,8 +860,8 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* it's got exclusive lock on the whole relation.
*/
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
- LockRelationForExtension(onerel, ExclusiveLock);
- UnlockRelationForExtension(onerel, ExclusiveLock);
+ LockRelationForExtension(onerel, RELEXT_EXCLUSIVE);
+ UnlockRelationForExtension(onerel, RELEXT_EXCLUSIVE);
LockBufferForCleanup(buf);
if (PageIsNew(page))
{
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 5c256ff..5beba70 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3628,6 +3628,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_SYNC_REP:
event_name = "SyncRep";
break;
+ case WAIT_EVENT_RELATION_EXTENSION:
+ event_name = "RelationExtension";
+ break;
/* no default case, so that compiler will warn */
}
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 4648473..498223a 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -624,7 +624,7 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
* Note that another backend might have extended or created the relation
* by the time we get the lock.
*/
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
@@ -652,7 +652,7 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
/* Update local cache with the up-to-date size */
rel->rd_smgr->smgr_fsm_nblocks = fsm_nblocks_now;
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel, RELEXT_EXCLUSIVE);
pfree(pg);
}
diff --git a/src/backend/storage/lmgr/Makefile b/src/backend/storage/lmgr/Makefile
index e1b787e..2334a40 100644
--- a/src/backend/storage/lmgr/Makefile
+++ b/src/backend/storage/lmgr/Makefile
@@ -13,7 +13,7 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = lmgr.o lock.o proc.o deadlock.o lwlock.o lwlocknames.o spin.o \
- s_lock.o predicate.o condition_variable.o
+ s_lock.o predicate.o condition_variable.o extension_lock.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 56b0a12..5e0a394 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -29,6 +29,12 @@ process has to wait for an LWLock, it blocks on a SysV semaphore so as
to not consume CPU time. Waiting processes will be granted the lock in
arrival order. There is no timeout.
+* Relation extension locks.  The relation extension lock manager is
+specialized for relation extension.  These locks have been moved out of
+the regular (heavyweight) lock manager.  They are similar to regular
+locks but provide neither full deadlock detection nor group locking.
+On conflict, a waiting process sleeps on a condition variable.
+
* Regular locks (a/k/a heavyweight locks). The regular lock manager
supports a variety of lock modes with table-driven semantics, and it has
full deadlock detection and automatic release at transaction end.
@@ -40,9 +46,9 @@ Acquisition of either a spinlock or a lightweight lock causes query
cancel and die() interrupts to be held off until all such locks are
released. No such restriction exists for regular locks, however. Also
note that we can accept query cancel and die() interrupts while waiting
-for a regular lock, but we will not accept them while waiting for
-spinlocks or LW locks. It is therefore not a good idea to use LW locks
-when the wait time might exceed a few seconds.
+for a relation extension lock or a regular lock, but we will not accept
+them while waiting for spinlocks or LW locks. It is therefore not a good
+idea to use LW locks when the wait time might exceed a few seconds.
The rest of this README file discusses the regular lock manager in detail.
diff --git a/src/backend/storage/lmgr/extension_lock.c b/src/backend/storage/lmgr/extension_lock.c
new file mode 100644
index 0000000..13acef7
--- /dev/null
+++ b/src/backend/storage/lmgr/extension_lock.c
@@ -0,0 +1,494 @@
+/*-------------------------------------------------------------------------
+ *
+ * extension_lock.c
+ * Relation extension lock manager
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/lmgr/extension_lock.c
+ *
+ * NOTES:
+ *
+ * This lock manager is specialized in relation extension locks; it is
+ * a lightweight and interruptible lock manager.  It is similar to the
+ * heavyweight lock manager but has neither a deadlock detection
+ * mechanism nor group locking.
+ *
+ * For lock acquisition we use an atomic compare-and-exchange on the
+ * state variable.  When a process tries to acquire a lock that
+ * conflicts with an existing lock, it is put to sleep on a condition
+ * variable unless the acquisition is conditional.  On release we use
+ * an atomic decrement to update the state, but we do not remove the
+ * RELEXTLOCK entry from the hash table; all unused entries are
+ * reclaimed during acquisition once the hash table becomes full.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "pg_trace.h"
+#include "postmaster/postmaster.h"
+#include "replication/slot.h"
+#include "storage/ipc.h"
+#include "storage/proc.h"
+#include "storage/proclist.h"
+#include "storage/spin.h"
+#include "storage/extension_lock.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+
+/*
+ * Compute the hash code associated with a RELEXTLOCK.
+ *
+ * To avoid unnecessary recomputations of the hash code, we try to do this
+ * just once per function, and then pass it around as needed. Aside from
+ * passing the hashcode to hash_search_with_hash_value(), we can extract
+ * the lock partition number from the hashcode.
+ */
+#define RelExtLockTargetTagHashCode(relextlocktargettag) \
+ get_hash_value(RelExtLockHash, (const void *) relextlocktargettag)
+
+/*
+ * The relation extension lock manager's shared hash table is
+ * partitioned to reduce contention.  To determine which partition a
+ * given relid belongs to, compute the tag's hash code with
+ * RelExtLockTargetTagHashCode(), then apply one of these macros.
+ * NB: NUM_RELEXTLOCK_PARTITIONS must be a power of 2!
+ */
+#define RelExtLockHashPartition(hashcode) \
+ ((hashcode) % NUM_RELEXTLOCK_PARTITIONS)
+#define RelExtLockHashPartitionLock(hashcode) \
+ (&MainLWLockArray[RELEXTLOCK_MANAGER_LWLOCK_OFFSET + \
+ RelExtLockHashPartition(hashcode)].lock)
+#define RelExtLockHashPartitionLockByIndex(i) \
+ (&MainLWLockArray[RELEXTLOCK_MANAGER_LWLOCK_OFFSET + (i)].lock)
+
+#define RELEXT_VAL_EXCLUSIVE ((uint32) 1 << 24)
+#define RELEXT_VAL_SHARED 1
+
+#define RELEXT_LOCK_MASK ((uint32) ((1 << 25) - 1))
+
+typedef struct RELEXTLOCK
+{
+ /* hash key -- must be first */
+ Oid relid;
+
+ /* state of exclusive/non-exclusive lock */
+ pg_atomic_uint32 state;
+ pg_atomic_uint32 pin_counts;
+
+ ConditionVariable cv;
+} RELEXTLOCK;
+
+/*
+ * This structure holds information about a relation extension lock
+ * held by this backend.  held_relextlock represents the lock we are
+ * currently holding.
+ */
+typedef struct relextlock_handle
+{
+ RELEXTLOCK *lock;
+ RelExtLockMode mode; /* lock mode for this table entry */
+ int nLocks;
+} relextlock_handle;
+
+/*
+ * We use this structure to keep track of the relation extension locks
+ * we hold, for release during error recovery.  Normally, at most one
+ * relation extension lock is held at once.  However, sometimes we
+ * acquire a new one while already holding another; for example, when
+ * adding extra blocks to both a relation and its free space map.
+ */
+static relextlock_handle held_relextlock;
+static int num_held_relextlocks = 0;
+
+static bool RelExtLockAcquire(Oid relid, RelExtLockMode lockmode, bool conditional);
+static void RelExtLockRelease(Oid relid, RelExtLockMode lockmode);
+static bool RelExtLockAttemptLock(RELEXTLOCK *extlock, RelExtLockMode lockmode);
+static bool RelExtLockShrinkLocks(void);
+
+/*
+ * Pointers to hash tables containing lock state
+ *
+ * The RelExtLockHash hash table is in shared memory
+ */
+static HTAB *RelExtLockHash;
+
+/*
+ * InitRelExtLock
+ * Initialize the relation extension lock manager's data structures.
+ */
+void
+InitRelExtLock(long max_table_size)
+{
+ HASHCTL info;
+ long init_table_size;
+
+ /*
+ * Compute init/max size to request for lock hashtables. Note these
+ * calculations must agree with LockShmemSize!
+ */
+ init_table_size = max_table_size / 2;
+
+ /*
+ * Allocate hash table for RELEXTLOCK structs. This stores per-relation
+ * lock.
+ */
+ MemSet(&info, 0, sizeof(info));
+ info.keysize = sizeof(Oid);
+ info.entrysize = sizeof(RELEXTLOCK);
+ info.num_partitions = NUM_RELEXTLOCK_PARTITIONS;
+
+ RelExtLockHash = ShmemInitHash("RELEXTLOCK Hash",
+ init_table_size,
+ max_table_size,
+ &info,
+ HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
+}
+
+/*
+ * LockRelationForExtension
+ *
+ * This lock is used to interlock addition of pages to relations.
+ * We need such locking because bufmgr/smgr definition of P_NEW is not
+ * race-condition-proof.
+ *
+ * We assume the caller is already holding some type of regular lock on
+ * the relation, so no AcceptInvalidationMessages call is needed here.
+ */
+void
+LockRelationForExtension(Relation relation, RelExtLockMode lockmode)
+{
+ RelExtLockAcquire(relation->rd_id, lockmode, false);
+}
+
+/*
+ * ConditionalLockRelationForExtension
+ *
+ * As above, but only lock if we can get the lock without blocking.
+ * Returns TRUE iff the lock was acquired.
+ */
+bool
+ConditionalLockRelationForExtension(Relation relation, RelExtLockMode lockmode)
+{
+ return RelExtLockAcquire(relation->rd_id, lockmode, true);
+}
+
+/*
+ * RelationExtensionLockWaiterCount
+ *
+ * Count the number of processes waiting for the given relation extension lock.
+ */
+int
+RelationExtensionLockWaiterCount(Relation relation)
+{
+ LWLock *partitionLock;
+ RELEXTLOCK *extlock;
+ Oid relid;
+ uint32 hashcode;
+ uint32 pin_counts;
+ bool found;
+
+ relid = RelationGetRelid(relation);
+
+ hashcode = RelExtLockTargetTagHashCode(&relid);
+ partitionLock = RelExtLockHashPartitionLock(hashcode);
+
+ LWLockAcquire(partitionLock, LW_SHARED);
+
+ extlock = (RELEXTLOCK *) hash_search_with_hash_value(RelExtLockHash,
+ (void *) &relid,
+ hashcode,
+ HASH_FIND, &found);
+
+ LWLockRelease(partitionLock);
+
+ /* We assume that we have already acquired this lock */
+ Assert(found);
+
+ pin_counts = pg_atomic_read_u32(&(extlock->pin_counts));
+
+ /* Exclude ourselves */
+ return pin_counts - 1;
+}
+
+/*
+ * UnlockRelationForExtension
+ */
+void
+UnlockRelationForExtension(Relation relation, RelExtLockMode lockmode)
+{
+ RelExtLockRelease(relation->rd_id, lockmode);
+}
+
+/*
+ * RelExtLockReleaseAll - release all currently-held relation extension locks
+ */
+void
+RelExtLockReleaseAll(void)
+{
+ if (num_held_relextlocks > 0)
+ {
+ HOLD_INTERRUPTS();
+ RelExtLockRelease(held_relextlock.lock->relid, held_relextlock.mode);
+ }
+}
+
+/*
+ * Acquire a relation extension lock, creating the RELEXTLOCK hash
+ * entry in the shared hash table if it does not exist yet.  If the
+ * lock conflicts with an existing one, sleep on the lock's condition
+ * variable until it becomes available, unless the acquisition is
+ * conditional.
+ */
+static bool
+RelExtLockAcquire(Oid relid, RelExtLockMode lockmode, bool conditional)
+{
+ RELEXTLOCK *extlock = NULL;
+ LWLock *partitionLock;
+ uint32 hashcode;
+ bool found;
+ bool mustwait;
+
+ hashcode = RelExtLockTargetTagHashCode(&relid);
+ partitionLock = RelExtLockHashPartitionLock(hashcode);
+
+ /* If we already hold the lock, we can just increase the count locally */
+ if (num_held_relextlocks > 0 &&
+ relid == held_relextlock.lock->relid &&
+ lockmode == held_relextlock.mode)
+ {
+ held_relextlock.nLocks++;
+ return true;
+ }
+
+ for (;;)
+ {
+
+ LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+
+ if (!extlock)
+ {
+ extlock = (RELEXTLOCK *) hash_search_with_hash_value(RelExtLockHash,
+ (void *) &relid,
+ hashcode, HASH_ENTER_NULL,
+ &found);
+
+ /*
+ * Failed to create a new hash entry. Try to shrink the hash
+ * table and retry.
+ */
+ if (!extlock)
+ {
+ bool succeeded;
+
+ LWLockRelease(partitionLock);
+ succeeded = RelExtLockShrinkLocks();
+
+ if (!succeeded)
+ ereport(ERROR,
+ (errmsg("out of shared memory"),
+ errhint("You might need to increase max_locks_per_transaction.")));
+
+ continue;
+ }
+
+ if (!found)
+ {
+ extlock->relid = relid;
+ pg_atomic_init_u32(&(extlock->state), 0);
+ pg_atomic_init_u32(&(extlock->pin_counts), 0);
+ ConditionVariableInit(&(extlock->cv));
+ }
+
+ /*
+ * Pin the entry so that it cannot be reclaimed while we are using
+ * it.  We pin only once per acquisition; retry iterations after a
+ * wakeup must not pin it again, since release decrements the pin
+ * count only once.
+ */
+ pg_atomic_add_fetch_u32(&(extlock->pin_counts), 1);
+ }
+
+ mustwait = RelExtLockAttemptLock(extlock, lockmode);
+
+ if (!mustwait)
+ break; /* got the lock */
+
+ /* Could not get the lock; in conditional locking, give up now */
+ if (conditional)
+ {
+ pg_atomic_sub_fetch_u32(&(extlock->pin_counts), 1);
+ LWLockRelease(partitionLock);
+ return false;
+ }
+
+ /* Release the partition lock before sleep */
+ LWLockRelease(partitionLock);
+
+ /* Sleep until the lock is released */
+ ConditionVariableSleep(&(extlock->cv), WAIT_EVENT_RELATION_EXTENSION);
+ }
+
+ LWLockRelease(partitionLock);
+ ConditionVariableCancelSleep();
+
+ Assert(!mustwait);
+
+ /* Remember lock held by this backend */
+ held_relextlock.lock = extlock;
+ held_relextlock.mode = lockmode;
+ held_relextlock.nLocks = 1;
+ num_held_relextlocks++;
+
+ /* Always return true if not conditional lock */
+ return true;
+}
+
+/*
+ * RelExtLockRelease
+ *
+ * Release a previously acquired relation extension lock.  We do not
+ * remove the hash entry here; once the hash table becomes full, all
+ * un-pinned entries are removed in bulk.
+ */
+static void
+RelExtLockRelease(Oid relid, RelExtLockMode lockmode)
+{
+ RELEXTLOCK *extlock;
+ RelExtLockMode mode;
+ LWLock *partitionLock;
+ uint32 hashcode;
+ uint32 pin_counts;
+
+ /* We should have acquired a lock before releasing */
+ Assert(num_held_relextlocks > 0);
+
+ /* Decrease the lock count locally */
+ held_relextlock.nLocks--;
+
+ /* If we are still holding the lock, we're done */
+ if (held_relextlock.nLocks > 0)
+ return;
+
+ hashcode = RelExtLockTargetTagHashCode(&relid);
+ partitionLock = RelExtLockHashPartitionLock(hashcode);
+
+ LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+
+ Assert(num_held_relextlocks > 0);
+
+ if (relid != held_relextlock.lock->relid || lockmode != held_relextlock.mode)
+ ereport(ERROR,
+ (errcode(ERRCODE_LOCK_NOT_AVAILABLE),
+ errmsg("relation extension lock for %u in lock mode %d is not held",
+ relid, lockmode)));
+
+ extlock = held_relextlock.lock;
+ mode = held_relextlock.mode;
+
+ num_held_relextlocks--;
+
+ if (mode == RELEXT_EXCLUSIVE)
+ pg_atomic_sub_fetch_u32(&(extlock->state), RELEXT_VAL_EXCLUSIVE);
+ else
+ pg_atomic_sub_fetch_u32(&(extlock->state), RELEXT_VAL_SHARED);
+
+ /* Decrement pin counter */
+ pin_counts = pg_atomic_sub_fetch_u32(&(extlock->pin_counts), 1);
+
+ LWLockRelease(partitionLock);
+
+ /* Wake up waiters if anyone else is looking at this lock */
+ if (pin_counts > 0)
+ ConditionVariableBroadcast(&(extlock->cv));
+}
+
+/*
+ * Internal function that tries to atomically acquire the relation extension
+ * lock in the passed in mode.
+ *
+ * Returns true if the lock isn't free and we need to wait.
+ */
+static bool
+RelExtLockAttemptLock(RELEXTLOCK *extlock, RelExtLockMode lockmode)
+{
+ uint32 oldstate;
+
+ oldstate = pg_atomic_read_u32(&extlock->state);
+
+ while (true)
+ {
+ uint32 desired_state;
+ bool lock_free;
+
+ desired_state = oldstate;
+
+ if (lockmode == RELEXT_EXCLUSIVE)
+ {
+ lock_free = (oldstate & RELEXT_LOCK_MASK) == 0;
+ if (lock_free)
+ desired_state += RELEXT_VAL_EXCLUSIVE;
+ }
+ else
+ {
+ lock_free = (oldstate & RELEXT_VAL_EXCLUSIVE) == 0;
+ if (lock_free)
+ desired_state += RELEXT_VAL_SHARED;
+ }
+
+ if (pg_atomic_compare_exchange_u32(&extlock->state,
+ &oldstate, desired_state))
+ {
+ if (lock_free)
+ return false;
+ else
+ return true;
+ }
+ }
+ pg_unreachable();
+}
+
+/*
+ * Reclaim all un-pinned RELEXTLOCK entries from the hash table.
+ * Returns false if there was nothing to reclaim.
+ */
+static bool
+RelExtLockShrinkLocks(void)
+{
+ HASH_SEQ_STATUS hstat;
+ RELEXTLOCK *extlock;
+ List *entries_to_remove = NIL;
+ ListCell *cell;
+ int i;
+
+ /*
+ * To ensure consistency, take all partition locks in exclusive
+ * mode.
+ */
+ for (i = 0; i < NUM_RELEXTLOCK_PARTITIONS; i++)
+ LWLockAcquire(RelExtLockHashPartitionLockByIndex(i), LW_EXCLUSIVE);
+
+ /* Collect all un-pinned RELEXTLOCK entries */
+ hash_seq_init(&hstat, RelExtLockHash);
+ while ((extlock = (RELEXTLOCK *) hash_seq_search(&hstat)) != NULL)
+ {
+ uint32 pin_count = pg_atomic_read_u32(&(extlock->pin_counts));
+
+ if (pin_count == 0)
+ entries_to_remove = lappend(entries_to_remove, extlock);
+ }
+
+ /* We could not find any entries to remove right now */
+ if (list_length(entries_to_remove) == 0)
+ {
+ /* Release all partition locks before reporting failure */
+ for (i = 0; i < NUM_RELEXTLOCK_PARTITIONS; i++)
+ LWLockRelease(RelExtLockHashPartitionLockByIndex(i));
+ return false;
+ }
+
+ /* Remove collected entries from the RelExtLockHash hash table */
+ foreach (cell, entries_to_remove)
+ {
+ RELEXTLOCK *el = (RELEXTLOCK *) lfirst(cell);
+ uint32 hc = RelExtLockTargetTagHashCode(&(el->relid));
+
+ hash_search_with_hash_value(RelExtLockHash, (void *) &(el->relid),
+ hc, HASH_REMOVE, NULL);
+ }
+
+ /* Release all partition locks */
+ for (i = 0; i < NUM_RELEXTLOCK_PARTITIONS; i++)
+ LWLockRelease(RelExtLockHashPartitionLockByIndex(i));
+
+ return true;
+}
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index da5679b..4fbc0c4 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -319,78 +319,6 @@ UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode)
}
/*
- * LockRelationForExtension
- *
- * This lock tag is used to interlock addition of pages to relations.
- * We need such locking because bufmgr/smgr definition of P_NEW is not
- * race-condition-proof.
- *
- * We assume the caller is already holding some type of regular lock on
- * the relation, so no AcceptInvalidationMessages call is needed here.
- */
-void
-LockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- (void) LockAcquire(&tag, lockmode, false, false);
-}
-
-/*
- * ConditionalLockRelationForExtension
- *
- * As above, but only lock if we can get the lock without blocking.
- * Returns true iff the lock was acquired.
- */
-bool
-ConditionalLockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- return (LockAcquire(&tag, lockmode, false, true) != LOCKACQUIRE_NOT_AVAIL);
-}
-
-/*
- * RelationExtensionLockWaiterCount
- *
- * Count the number of processes waiting for the given relation extension lock.
- */
-int
-RelationExtensionLockWaiterCount(Relation relation)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- return LockWaiterCount(&tag);
-}
-
-/*
- * UnlockRelationForExtension
- */
-void
-UnlockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- LockRelease(&tag, lockmode, false);
-}
-
-/*
* LockPage
*
* Obtain a page-level lock. This is currently used by some index access
@@ -961,12 +889,6 @@ DescribeLockTag(StringInfo buf, const LOCKTAG *tag)
tag->locktag_field2,
tag->locktag_field1);
break;
- case LOCKTAG_RELATION_EXTEND:
- appendStringInfo(buf,
- _("extension of relation %u of database %u"),
- tag->locktag_field2,
- tag->locktag_field1);
- break;
case LOCKTAG_PAGE:
appendStringInfo(buf,
_("page %u of relation %u of database %u"),
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 5833086..5ca1c27 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -45,6 +45,7 @@
#include "storage/sinvaladt.h"
#include "storage/spin.h"
#include "storage/standby.h"
+#include "storage/lmgr.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
#include "utils/resowner_private.h"
@@ -388,6 +389,9 @@ InitLocks(void)
max_table_size = NLOCKENTS();
init_table_size = max_table_size / 2;
+ /* Initialize lock structure for relation extension lock */
+ InitRelExtLock(max_table_size);
+
/*
* Allocate hash table for LOCK structs. This stores per-locked-object
* information.
@@ -3366,6 +3370,7 @@ LockShmemSize(void)
/* lock hash table */
max_table_size = NLOCKENTS();
size = add_size(size, hash_estimate_size(max_table_size, sizeof(LOCK)));
+ /* relation extension lock hash table */
+ size = add_size(size, hash_estimate_size(max_table_size, sizeof(LWLock)));
/* proclock hash table */
max_table_size *= 2;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index e5c3e86..b12aba0 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -451,6 +451,13 @@ InitializeLWLocks(void)
for (id = 0; id < NUM_PREDICATELOCK_PARTITIONS; id++, lock++)
LWLockInitialize(&lock->lock, LWTRANCHE_PREDICATE_LOCK_MANAGER);
+ /* Initialize relation extension lmgr's LWLocks in main array */
+ lock = MainLWLockArray + NUM_INDIVIDUAL_LWLOCKS +
+ NUM_BUFFER_PARTITIONS + NUM_LOCK_PARTITIONS +
+ NUM_PREDICATELOCK_PARTITIONS;
+ for (id = 0; id < NUM_RELEXTLOCK_PARTITIONS; id++, lock++)
+ LWLockInitialize(&lock->lock, LWTRANCHE_RELEXT_LOCK_MANAGER);
+
/* Initialize named tranches. */
if (NamedLWLockTrancheRequests > 0)
{
@@ -508,6 +515,7 @@ RegisterLWLockTranches(void)
LWLockRegisterTranche(LWTRANCHE_LOCK_MANAGER, "lock_manager");
LWLockRegisterTranche(LWTRANCHE_PREDICATE_LOCK_MANAGER,
"predicate_lock_manager");
+ LWLockRegisterTranche(LWTRANCHE_RELEXT_LOCK_MANAGER, "relext_lock_manager");
LWLockRegisterTranche(LWTRANCHE_PARALLEL_QUERY_DSA,
"parallel_query_dsa");
LWLockRegisterTranche(LWTRANCHE_SESSION_DSA,
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 5f6727d..f698e9c 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -765,6 +765,8 @@ ProcReleaseLocks(bool isCommit)
return;
/* If waiting, get off wait queue (should only be needed after error) */
LockErrorCleanup();
+ /* Release relation extension locks */
+ RelExtLockReleaseAll();
/* Release standard locks, including session-level if aborting */
LockReleaseAll(DEFAULT_LOCKMETHOD, !isCommit);
/* Release transaction-level advisory locks */
diff --git a/src/backend/utils/adt/lockfuncs.c b/src/backend/utils/adt/lockfuncs.c
index 9e0a8ab..6d8916c 100644
--- a/src/backend/utils/adt/lockfuncs.c
+++ b/src/backend/utils/adt/lockfuncs.c
@@ -25,7 +25,6 @@
/* This must match enum LockTagType! */
const char *const LockTagTypeNames[] = {
"relation",
- "extend",
"page",
"tuple",
"transactionid",
@@ -234,7 +233,6 @@ pg_lock_status(PG_FUNCTION_ARGS)
switch ((LockTagType) instance->locktag.locktag_type)
{
case LOCKTAG_RELATION:
- case LOCKTAG_RELATION_EXTEND:
values[1] = ObjectIdGetDatum(instance->locktag.locktag_field1);
values[2] = ObjectIdGetDatum(instance->locktag.locktag_field2);
nulls[3] = true;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 20f1d27..c004844 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -1153,6 +1153,7 @@ ShutdownPostgres(int code, Datum arg)
* User locks are not released by transaction end, so be sure to release
* them explicitly.
*/
+ RelExtLockReleaseAll();
LockReleaseAll(USER_LOCKMETHOD, true);
}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 089b7c3..958822f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -816,7 +816,8 @@ typedef enum
WAIT_EVENT_REPLICATION_ORIGIN_DROP,
WAIT_EVENT_REPLICATION_SLOT_DROP,
WAIT_EVENT_SAFE_SNAPSHOT,
- WAIT_EVENT_SYNC_REP
+ WAIT_EVENT_SYNC_REP,
+ WAIT_EVENT_RELATION_EXTENSION
} WaitEventIPC;
/* ----------
diff --git a/src/include/storage/extension_lock.h b/src/include/storage/extension_lock.h
new file mode 100644
index 0000000..d373b04
--- /dev/null
+++ b/src/include/storage/extension_lock.h
@@ -0,0 +1,41 @@
+/*-------------------------------------------------------------------------
+ *
+ * extension_lock.h
+ * Relation extension lock manager
+ *
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/extension_lock.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef EXTENSION_LOCK_H
+#define EXTENSION_LOCK_H
+
+#ifdef FRONTEND
+#error "extension_lock.h may not be included from frontend code"
+#endif
+
+#include "storage/proclist_types.h"
+#include "storage/s_lock.h"
+#include "storage/condition_variable.h"
+#include "port/atomics.h"
+
+typedef enum RelExtLockMode
+{
+ RELEXT_EXCLUSIVE,
+ RELEXT_SHARED
+} RelExtLockMode;
+
+/* Relation extension lock functions */
+extern void InitRelExtLock(long max_table_size);
+extern void LockRelationForExtension(Relation relation, RelExtLockMode lockmode);
+extern void UnlockRelationForExtension(Relation relation, RelExtLockMode lockmode);
+extern bool ConditionalLockRelationForExtension(Relation relation, RelExtLockMode lockmode);
+extern int RelationExtensionLockWaiterCount(Relation relation);
+extern void RelExtLockReleaseAll(void);
+
+#endif /* EXTENSION_LOCK_H */
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 0b92322..6b357aa 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -15,6 +15,7 @@
#define LMGR_H
#include "lib/stringinfo.h"
+#include "storage/extension_lock.h"
#include "storage/itemptr.h"
#include "storage/lock.h"
#include "utils/rel.h"
@@ -50,13 +51,6 @@ extern bool LockHasWaitersRelation(Relation relation, LOCKMODE lockmode);
extern void LockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
-/* Lock a relation for extension */
-extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
-extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
-extern bool ConditionalLockRelationForExtension(Relation relation,
- LOCKMODE lockmode);
-extern int RelationExtensionLockWaiterCount(Relation relation);
-
/* Lock a page (currently only used within indexes) */
extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
extern bool ConditionalLockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 765431e..3be18ea 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -138,8 +138,6 @@ typedef uint16 LOCKMETHODID;
typedef enum LockTagType
{
LOCKTAG_RELATION, /* whole relation */
- /* ID info for a relation is DB OID + REL OID; DB OID = 0 if shared */
- LOCKTAG_RELATION_EXTEND, /* the right to extend a relation */
/* same ID info as RELATION */
LOCKTAG_PAGE, /* one page of a relation */
/* ID info for a page is RELATION info + BlockNumber */
@@ -198,14 +196,6 @@ typedef struct LOCKTAG
(locktag).locktag_type = LOCKTAG_RELATION, \
(locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
-#define SET_LOCKTAG_RELATION_EXTEND(locktag,dboid,reloid) \
- ((locktag).locktag_field1 = (dboid), \
- (locktag).locktag_field2 = (reloid), \
- (locktag).locktag_field3 = 0, \
- (locktag).locktag_field4 = 0, \
- (locktag).locktag_type = LOCKTAG_RELATION_EXTEND, \
- (locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
-
#define SET_LOCKTAG_PAGE(locktag,dboid,reloid,blocknum) \
((locktag).locktag_field1 = (dboid), \
(locktag).locktag_field2 = (reloid), \
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 596fdad..b138aad 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -120,14 +120,21 @@ extern PGDLLIMPORT int NamedLWLockTrancheRequests;
#define LOG2_NUM_PREDICATELOCK_PARTITIONS 4
#define NUM_PREDICATELOCK_PARTITIONS (1 << LOG2_NUM_PREDICATELOCK_PARTITIONS)
+/* Number of partitions the shared relation extension lock tables are divided into */
+#define LOG2_NUM_RELEXTLOCK_PARTITIONS 4
+#define NUM_RELEXTLOCK_PARTITIONS (1 << LOG2_NUM_RELEXTLOCK_PARTITIONS)
+
/* Offsets for various chunks of preallocated lwlocks. */
#define BUFFER_MAPPING_LWLOCK_OFFSET NUM_INDIVIDUAL_LWLOCKS
#define LOCK_MANAGER_LWLOCK_OFFSET \
(BUFFER_MAPPING_LWLOCK_OFFSET + NUM_BUFFER_PARTITIONS)
#define PREDICATELOCK_MANAGER_LWLOCK_OFFSET \
(LOCK_MANAGER_LWLOCK_OFFSET + NUM_LOCK_PARTITIONS)
-#define NUM_FIXED_LWLOCKS \
+#define RELEXTLOCK_MANAGER_LWLOCK_OFFSET \
(PREDICATELOCK_MANAGER_LWLOCK_OFFSET + NUM_PREDICATELOCK_PARTITIONS)
+#define NUM_FIXED_LWLOCKS \
+ (RELEXTLOCK_MANAGER_LWLOCK_OFFSET + NUM_RELEXTLOCK_PARTITIONS)
typedef enum LWLockMode
{
@@ -151,6 +158,8 @@ extern void LWLockReleaseClearVar(LWLock *lock, uint64 *valptr, uint64 val);
extern void LWLockReleaseAll(void);
extern bool LWLockHeldByMe(LWLock *lock);
extern bool LWLockHeldByMeInMode(LWLock *lock, LWLockMode mode);
+extern bool LWLockCheckForCleanup(LWLock *lock);
+extern int LWLockWaiterCount(LWLock *lock);
extern bool LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval);
extern void LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 value);
@@ -211,6 +220,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_BUFFER_MAPPING,
LWTRANCHE_LOCK_MANAGER,
LWTRANCHE_PREDICATE_LOCK_MANAGER,
+ LWTRANCHE_RELEXT_LOCK_MANAGER,
LWTRANCHE_PARALLEL_QUERY_DSA,
LWTRANCHE_SESSION_DSA,
LWTRANCHE_SESSION_RECORD_TABLE,
On Mon, Nov 20, 2017 at 5:19 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Attached updated version patch. I've moved only relation extension
locks out of heavy-weight lock as per discussion so far.

I've done a write-heavy benchmark on my laptop; loading 24kB data to
one table using COPY by 1 client, for 10 seconds. The throughput of
the patched version is 10% better than current HEAD. The results of 5
runs are the following.

----- PATCHED -----
tps = 178.791515 (excluding connections establishing)
tps = 176.522693 (excluding connections establishing)
tps = 168.705442 (excluding connections establishing)
tps = 158.158009 (excluding connections establishing)
tps = 161.145709 (excluding connections establishing)

----- HEAD -----
tps = 147.079803 (excluding connections establishing)
tps = 149.079540 (excluding connections establishing)
tps = 149.082275 (excluding connections establishing)
tps = 148.255376 (excluding connections establishing)
tps = 145.542552 (excluding connections establishing)

Also I've done a micro-benchmark; calling LockRelationForExtension and
UnlockRelationForExtension tightly in order to measure the number of
lock/unlock cycles per second. The result is:

PATCHED = 3.95892e+06 (cycles/sec)
HEAD = 1.15284e+06 (cycles/sec)

The patched version is 3 times faster than current HEAD.

Attached updated patch and the function I used for the micro-benchmark.
Please review it.
That's a nice speed-up.
How about a preliminary patch that asserts that we never take another
heavyweight lock while holding a relation extension lock?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Nov 22, 2017 at 5:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Nov 20, 2017 at 5:19 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Attached updated version patch. I've moved only relation extension
locks out of heavy-weight lock as per discussion so far. [...]

That's a nice speed-up.
How about a preliminary patch that asserts that we never take another
heavyweight lock while holding a relation extension lock?
Agreed. Also, since we disallow holding locks on more than one relation
at once, I'll add an assertion for that as well.

I think we no longer need to pass the lock level to
UnlockRelationForExtension(). Now that the relation extension lock is
simpler, we can release it in the mode in which it was acquired, as
LWLock does.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Wed, Nov 22, 2017 at 11:32 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Nov 22, 2017 at 5:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Nov 20, 2017 at 5:19 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Attached updated version patch. I've moved only relation extension
locks out of the heavyweight lock manager, as per the discussion so far.

I've done a write-heavy benchmark on my laptop: loading 24kB of data into
one table using COPY with 1 client, for 10 seconds. The throughput of the
patched version is 10% better than current HEAD. The results of 5 runs are
the following.

----- PATCHED -----
tps = 178.791515 (excluding connections establishing)
tps = 176.522693 (excluding connections establishing)
tps = 168.705442 (excluding connections establishing)
tps = 158.158009 (excluding connections establishing)
tps = 161.145709 (excluding connections establishing)

----- HEAD -----
tps = 147.079803 (excluding connections establishing)
tps = 149.079540 (excluding connections establishing)
tps = 149.082275 (excluding connections establishing)
tps = 148.255376 (excluding connections establishing)
tps = 145.542552 (excluding connections establishing)

Also I've done a micro-benchmark: calling LockRelationForExtension and
UnlockRelationForExtension in a tight loop in order to measure the number
of lock/unlock cycles per second. The result is:
PATCHED = 3.95892e+06 (cycles/sec)
HEAD = 1.15284e+06 (cycles/sec)
The patched version is about 3 times faster than current HEAD.

Attached updated patch and the function I used for the micro-benchmark.
Please review it.

That's a nice speed-up.
How about a preliminary patch that asserts that we never take another
heavyweight lock while holding a relation extension lock?

Agreed. Also, since we disallow holding locks on more than one relation
at once, I'll add an assertion for that as well.

I think we no longer need to pass the lock level to
UnlockRelationForExtension(). Now that the relation extension lock is
simple, we can release it in the mode in which it was acquired, as
LWLock does.
Attached is the latest patch, incorporating all comments so far. Please review it.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachments:
Moving_extension_lock_out_of_heavyweight_lock_v7.patch
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 09db5c6..56d4836 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -623,8 +623,8 @@ brin_page_cleanup(Relation idxrel, Buffer buf)
*/
if (PageIsNew(page))
{
- LockRelationForExtension(idxrel, ShareLock);
- UnlockRelationForExtension(idxrel, ShareLock);
+ LockRelationForExtension(idxrel, RELEXT_SHARED);
+ UnlockRelationForExtension(idxrel);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
if (PageIsNew(page))
@@ -716,7 +716,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
*/
if (!RELATION_IS_LOCAL(irel))
{
- LockRelationForExtension(irel, ExclusiveLock);
+ LockRelationForExtension(irel, RELEXT_EXCLUSIVE);
extensionLockHeld = true;
}
buf = ReadBuffer(irel, P_NEW);
@@ -768,7 +768,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
}
if (extensionLockHeld)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel);
ReleaseBuffer(buf);
return InvalidBuffer;
@@ -778,7 +778,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
if (extensionLockHeld)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel);
page = BufferGetPage(buf);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index 03e53ce..f84be0c 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -570,7 +570,7 @@ revmap_physical_extend(BrinRevmap *revmap)
else
{
if (needLock)
- LockRelationForExtension(irel, ExclusiveLock);
+ LockRelationForExtension(irel, RELEXT_EXCLUSIVE);
buf = ReadBuffer(irel, P_NEW);
if (BufferGetBlockNumber(buf) != mapBlk)
@@ -582,7 +582,7 @@ revmap_physical_extend(BrinRevmap *revmap)
* page from under whoever is using it.
*/
if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel);
LockBuffer(revmap->rm_metaBuf, BUFFER_LOCK_UNLOCK);
ReleaseBuffer(buf);
return;
@@ -591,7 +591,7 @@ revmap_physical_extend(BrinRevmap *revmap)
page = BufferGetPage(buf);
if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel);
}
/* Check that it's a regular block (or an empty page) */
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index d9c6483..af2679c 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -325,13 +325,13 @@ GinNewBuffer(Relation index)
/* Must extend the file */
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, RELEXT_EXCLUSIVE);
buffer = ReadBuffer(index, P_NEW);
LockBuffer(buffer, GIN_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
return buffer;
}
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 394bc83..b383423 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -716,10 +716,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, RELEXT_EXCLUSIVE);
npages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
totFreePages = 0;
@@ -766,10 +766,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
stats->pages_free = totFreePages;
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, RELEXT_EXCLUSIVE);
stats->num_pages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
return stats;
}
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index d8d1c0a..d313f70 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -821,13 +821,13 @@ gistNewBuffer(Relation r)
needLock = !RELATION_IS_LOCAL(r);
if (needLock)
- LockRelationForExtension(r, ExclusiveLock);
+ LockRelationForExtension(r, RELEXT_EXCLUSIVE);
buffer = ReadBuffer(r, P_NEW);
LockBuffer(buffer, GIST_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(r, ExclusiveLock);
+ UnlockRelationForExtension(r);
return buffer;
}
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 77d9d12..ecef5c9 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -59,10 +59,10 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
/* try to find deleted pages */
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
npages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
totFreePages = 0;
for (blkno = GIST_ROOT_BLKNO + 1; blkno < npages; blkno++)
@@ -91,10 +91,10 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
/* return statistics */
stats->pages_free = totFreePages;
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
stats->num_pages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
return stats;
}
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 13e3bdc..0f81815 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -519,11 +519,11 @@ loop:
if (needLock)
{
if (!use_fsm)
- LockRelationForExtension(relation, ExclusiveLock);
- else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock))
+ LockRelationForExtension(relation, RELEXT_EXCLUSIVE);
+ else if (!ConditionalLockRelationForExtension(relation, RELEXT_EXCLUSIVE))
{
/* Couldn't get the lock immediately; wait for it. */
- LockRelationForExtension(relation, ExclusiveLock);
+ LockRelationForExtension(relation, RELEXT_EXCLUSIVE);
/*
* Check if some other backend has extended a block for us while
@@ -537,7 +537,7 @@ loop:
*/
if (targetBlock != InvalidBlockNumber)
{
- UnlockRelationForExtension(relation, ExclusiveLock);
+ UnlockRelationForExtension(relation);
goto loop;
}
@@ -576,7 +576,7 @@ loop:
* against vacuumlazy.c --- see comments therein.
*/
if (needLock)
- UnlockRelationForExtension(relation, ExclusiveLock);
+ UnlockRelationForExtension(relation);
/*
* We need to initialize the empty new page. Double-check that it really
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 4c2a13a..0c57021 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -641,7 +641,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
* Note that another backend might have extended or created the relation
* by the time we get the lock.
*/
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
@@ -679,7 +679,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
/* Update local cache with the up-to-date size */
rel->rd_smgr->smgr_vm_nblocks = vm_nblocks_now;
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
pfree(pg);
}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index c774349..f22202f 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -659,7 +659,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
needLock = !RELATION_IS_LOCAL(rel);
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
buf = ReadBuffer(rel, P_NEW);
@@ -673,7 +673,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
* condition against btvacuumscan --- see comments therein.
*/
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
/* Initialize the new page before returning it */
page = BufferGetPage(buf);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 399e6a1..c110737 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1058,10 +1058,10 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
{
/* Get the current relation length */
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
num_pages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
/* Quit if we've scanned the whole relation */
if (blkno >= num_pages)
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index bd5301f..e635fe3 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -230,13 +230,13 @@ SpGistNewBuffer(Relation index)
/* Must extend the file */
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, RELEXT_EXCLUSIVE);
buffer = ReadBuffer(index, P_NEW);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
return buffer;
}
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index d7d5e90..991db10 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -824,10 +824,10 @@ spgvacuumscan(spgBulkDeleteState *bds)
{
/* Get the current relation length */
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index, RELEXT_EXCLUSIVE);
num_pages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
/* Quit if we've scanned the whole relation */
if (blkno >= num_pages)
diff --git a/src/backend/commands/discard.c b/src/backend/commands/discard.c
index f0dcd87..216c197 100644
--- a/src/backend/commands/discard.c
+++ b/src/backend/commands/discard.c
@@ -19,6 +19,7 @@
#include "commands/discard.h"
#include "commands/prepare.h"
#include "commands/sequence.h"
+#include "storage/extension_lock.h"
#include "utils/guc.h"
#include "utils/portal.h"
@@ -71,6 +72,7 @@ DiscardAll(bool isTopLevel)
ResetAllOptions();
DropAllPreparedStatements();
Async_UnlistenAll();
+ RelExtLockReleaseAll();
LockReleaseAll(USER_LOCKMETHOD, true);
ResetPlanCache();
ResetTempTableNamespace();
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 6587db7..6880706 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -860,8 +860,8 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* it's got exclusive lock on the whole relation.
*/
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
- LockRelationForExtension(onerel, ExclusiveLock);
- UnlockRelationForExtension(onerel, ExclusiveLock);
+ LockRelationForExtension(onerel, RELEXT_EXCLUSIVE);
+ UnlockRelationForExtension(onerel);
LockBufferForCleanup(buf);
if (PageIsNew(page))
{
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 5c256ff..5beba70 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3628,6 +3628,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_SYNC_REP:
event_name = "SyncRep";
break;
+ case WAIT_EVENT_RELATION_EXTENSION:
+ event_name = "RelationExtension";
+ break;
/* no default case, so that compiler will warn */
}
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 4648473..010f7ca 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -624,7 +624,7 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
* Note that another backend might have extended or created the relation
* by the time we get the lock.
*/
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel, RELEXT_EXCLUSIVE);
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
@@ -652,7 +652,7 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
/* Update local cache with the up-to-date size */
rel->rd_smgr->smgr_fsm_nblocks = fsm_nblocks_now;
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
pfree(pg);
}
diff --git a/src/backend/storage/lmgr/Makefile b/src/backend/storage/lmgr/Makefile
index e1b787e..2334a40 100644
--- a/src/backend/storage/lmgr/Makefile
+++ b/src/backend/storage/lmgr/Makefile
@@ -13,7 +13,7 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = lmgr.o lock.o proc.o deadlock.o lwlock.o lwlocknames.o spin.o \
- s_lock.o predicate.o condition_variable.o
+ s_lock.o predicate.o condition_variable.o extension_lock.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 56b0a12..5e0a394 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -29,6 +29,12 @@ process has to wait for an LWLock, it blocks on a SysV semaphore so as
to not consume CPU time. Waiting processes will be granted the lock in
arrival order. There is no timeout.
+* Relation extension locks. The relation extension lock manager is
+specialized for relation extension. Relation extension locks have been
+moved out of the regular lock manager. They are similar to regular locks
+but have neither full deadlock detection nor group locking; on conflict,
+a relation extension lock waits on a condition variable.
+
* Regular locks (a/k/a heavyweight locks). The regular lock manager
supports a variety of lock modes with table-driven semantics, and it has
full deadlock detection and automatic release at transaction end.
@@ -40,9 +46,9 @@ Acquisition of either a spinlock or a lightweight lock causes query
cancel and die() interrupts to be held off until all such locks are
released. No such restriction exists for regular locks, however. Also
note that we can accept query cancel and die() interrupts while waiting
-for a regular lock, but we will not accept them while waiting for
-spinlocks or LW locks. It is therefore not a good idea to use LW locks
-when the wait time might exceed a few seconds.
+for a relation extension lock or a regular lock, but we will not accept
+them while waiting for spinlocks or LW locks. It is therefore not a good
+idea to use LW locks when the wait time might exceed a few seconds.
The rest of this README file discusses the regular lock manager in detail.
diff --git a/src/backend/storage/lmgr/extension_lock.c b/src/backend/storage/lmgr/extension_lock.c
new file mode 100644
index 0000000..5b50890
--- /dev/null
+++ b/src/backend/storage/lmgr/extension_lock.c
@@ -0,0 +1,503 @@
+/*-------------------------------------------------------------------------
+ *
+ * extension_lock.c
+ * Relation extension lock manager
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/lmgr/extension_lock.c
+ *
+ * NOTES:
+ *
+ * This lock manager is specialized for relation extension locks; it is
+ * a lightweight and interruptible lock manager. It is similar to the
+ * heavyweight lock manager but has neither a deadlock detection
+ * mechanism nor group locking.
+ *
+ * For lock acquisition we use an atomic compare-and-exchange on the
+ * state variable. When a process tries to acquire a lock that conflicts
+ * with an existing lock, it is put to sleep on a condition variable,
+ * unless a conditional lock was requested. On release, we use an atomic
+ * decrement to release the lock but do not remove the RELEXTLOCK entry
+ * from the hash table; all unused entries are reclaimed during lock
+ * acquisition once the hash table gets full.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/extension_lock.h"
+#include "utils/rel.h"
+
+/*
+ * Compute the hash code associated with a RELEXTLOCK.
+ *
+ * To avoid unnecessary recomputations of the hash code, we try to do this
+ * just once per function, and then pass it around as needed. Aside from
+ * passing the hashcode to hash_search_with_hash_value(), we can extract
+ * the lock partition number from the hashcode.
+ */
+#define RelExtLockTargetTagHashCode(relextlocktargettag) \
+ get_hash_value(RelExtLockHash, (const void *) relextlocktargettag)
+
+/*
+ * The lockmgr's shared hash tables are partitioned to reduce contention.
+ * To determine which partition a given relid belongs to, compute the tag's
+ * hash code with RelExtLockTargetTagHashCode(), then apply one of these macros.
+ * NB: NUM_RELEXTLOCK_PARTITIONS must be a power of 2!
+ */
+#define RelExtLockHashPartition(hashcode) \
+ ((hashcode) % NUM_RELEXTLOCK_PARTITIONS)
+#define RelExtLockHashPartitionLock(hashcode) \
+ (&MainLWLockArray[RELEXTLOCK_MANAGER_LWLOCK_OFFSET + \
+ RelExtLockHashPartition(hashcode)].lock)
+#define RelExtLockHashPartitionLockByIndex(i) \
+ (&MainLWLockArray[RELEXTLOCK_MANAGER_LWLOCK_OFFSET + (i)].lock)
+
+#define RELEXT_VAL_EXCLUSIVE ((uint32) 1 << 24)
+#define RELEXT_VAL_SHARED 1
+
+#define RELEXT_LOCK_MASK ((uint32) ((1 << 25) - 1))
+
+typedef struct RELEXTLOCK
+{
+ /* hash key -- must be first */
+ Oid relid;
+
+ /* state of exclusive/non-exclusive lock */
+ pg_atomic_uint32 state;
+ pg_atomic_uint32 pin_counts;
+
+ ConditionVariable cv;
+} RELEXTLOCK;
+
+/*
+ * This structure holds information about the relation extension lock
+ * held by this backend. held_relextlock represents the RelExtLock we
+ * are holding, and we use it to release the lock during error recovery.
+ * At most one lock can be held at once. Note that sometimes we could
+ * try to acquire a lock for an additional fork while holding the lock
+ * for the main fork; for example, when adding extra relation blocks for
+ * both a relation and its free space map. But since this lock manager
+ * doesn't distinguish between the forks, we just increment nLocks in
+ * that case.
+ */
+typedef struct relextlock_handle
+{
+ RELEXTLOCK *lock;
+ RelExtLockMode mode; /* lock mode for this table entry */
+ int nLocks;
+} relextlock_handle;
+
+static relextlock_handle held_relextlock;
+static int num_held_relextlocks = 0;
+
+static bool RelExtLockAcquire(Oid relid, RelExtLockMode lockmode, bool conditional);
+static void RelExtLockRelease(Oid relid);
+static bool RelExtLockAttemptLock(RELEXTLOCK *extlock, RelExtLockMode lockmode);
+static bool RelExtLockShrinkLocks(void);
+
+/*
+ * Pointers to hash tables containing relation extension lock state
+ *
+ * The RelExtLockHash hash table is in shared memory
+ */
+static HTAB *RelExtLockHash;
+
+/*
+ * InitRelExtLock
+ * Initialize the relation extension lock manager's data structures.
+ */
+void
+InitRelExtLock(long max_table_size)
+{
+ HASHCTL info;
+ long init_table_size;
+
+ /*
+ * Compute init/max size to request for lock hashtables. Note these
+ * calculations must agree with LockShmemSize!
+ */
+ init_table_size = max_table_size / 2;
+
+ /*
+ * Allocate hash table for RELEXTLOCK structs. This stores per-relation
+ * lock.
+ */
+ MemSet(&info, 0, sizeof(info));
+ info.keysize = sizeof(Oid);
+ info.entrysize = sizeof(RELEXTLOCK);
+ info.num_partitions = NUM_RELEXTLOCK_PARTITIONS;
+
+ RelExtLockHash = ShmemInitHash("RELEXTLOCK Hash",
+ init_table_size,
+ max_table_size,
+ &info,
+ HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
+}
+
+/*
+ * LockRelationForExtension
+ *
+ * This lock is used to interlock addition of pages to relations.
+ * We need such locking because bufmgr/smgr definition of P_NEW is not
+ * race-condition-proof.
+ *
+ * We assume the caller is already holding some type of regular lock on
+ * the relation, so no AcceptInvalidationMessages call is needed here.
+ */
+void
+LockRelationForExtension(Relation relation, RelExtLockMode lockmode)
+{
+ RelExtLockAcquire(relation->rd_id, lockmode, false);
+}
+
+/*
+ * ConditionalLockRelationForExtension
+ *
+ * As above, but only lock if we can get the lock without blocking.
+ * Returns TRUE iff the lock was acquired.
+ */
+bool
+ConditionalLockRelationForExtension(Relation relation, RelExtLockMode lockmode)
+{
+ return RelExtLockAcquire(relation->rd_id, lockmode, true);
+}
+
+/*
+ * RelationExtensionLockWaiterCount
+ *
+ * Count the number of processes waiting for the given relation extension lock.
+ */
+int
+RelationExtensionLockWaiterCount(Relation relation)
+{
+ LWLock *partitionLock;
+ RELEXTLOCK *extlock;
+ Oid relid;
+ uint32 hashcode;
+ uint32 pin_counts;
+ bool found;
+
+ relid = RelationGetRelid(relation);
+
+ hashcode = RelExtLockTargetTagHashCode(&relid);
+ partitionLock = RelExtLockHashPartitionLock(hashcode);
+
+ LWLockAcquire(partitionLock, LW_SHARED);
+
+ extlock = (RELEXTLOCK *) hash_search_with_hash_value(RelExtLockHash,
+ (void *) &relid,
+ hashcode,
+ HASH_FIND, &found);
+
+ LWLockRelease(partitionLock);
+
+ /* We assume that we have already acquired this lock */
+ Assert(found);
+
+ pin_counts = pg_atomic_read_u32(&(extlock->pin_counts));
+
+ /* Except for me */
+ return pin_counts - 1;
+}
+
+/*
+ * UnlockRelationForExtension
+ */
+void
+UnlockRelationForExtension(Relation relation)
+{
+ RelExtLockRelease(relation->rd_id);
+}
+
+/*
+ * RelExtLockReleaseAll - release all currently-held relation extension locks
+ */
+void
+RelExtLockReleaseAll(void)
+{
+ if (num_held_relextlocks > 0)
+ {
+ HOLD_INTERRUPTS();
+ RelExtLockRelease(held_relextlock.lock->relid);
+ }
+}
+
+/*
+ * Return the number of relation extension locks currently held.
+ */
+int
+RelExtLockHoldingLockCount(void)
+{
+ return num_held_relextlocks;
+}
+
+/*
+ * Acquire a relation extension lock and create a RELEXTLOCK hash entry
+ * in the shared hash table. If we are trying to acquire the same lock
+ * as the one we already hold, we just increment nLocks locally and
+ * return without touching the hash table.
+ */
+static bool
+RelExtLockAcquire(Oid relid, RelExtLockMode lockmode, bool conditional)
+{
+ RELEXTLOCK *extlock = NULL;
+ LWLock *partitionLock;
+ uint32 hashcode;
+ bool found;
+ bool mustwait;
+
+ hashcode = RelExtLockTargetTagHashCode(&relid);
+ partitionLock = RelExtLockHashPartitionLock(hashcode);
+
+ /*
+ * If we already hold the lock, we can just increase the count locally.
+ * Since we don't support deadlock detection for relation extension
+ * locks, it cannot happen that we try to take a new lock while already
+ * holding a lock on another relation.
+ */
+ if (num_held_relextlocks > 0)
+ {
+ if (relid == held_relextlock.lock->relid &&
+ lockmode == held_relextlock.mode)
+ {
+ held_relextlock.nLocks++;
+ return true;
+ }
+ else
+ Assert(false); /* cannot happen */
+ }
+
+ for (;;)
+ {
+
+ LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+
+ if (!extlock)
+ extlock = (RELEXTLOCK *) hash_search_with_hash_value(RelExtLockHash,
+ (void * ) &relid,
+ hashcode, HASH_ENTER_NULL,
+ &found);
+
+ /*
+ * Failed to create new hash entry. Try to shrink the hash table and
+ * retry.
+ */
+ if (!extlock)
+ {
+ bool succeeded;
+ LWLockRelease(partitionLock);
+ succeeded = RelExtLockShrinkLocks();
+
+ if (!succeeded)
+ ereport(ERROR,
+ (errmsg("out of shared memory"),
+ errhint("You might need to increase max_locks_per_transaction.")));
+
+ continue;
+ }
+
+ /* Not found, initialize */
+ if (!found)
+ {
+ extlock->relid = relid;
+ pg_atomic_init_u32(&(extlock->state), 0);
+ pg_atomic_init_u32(&(extlock->pin_counts), 0);
+ ConditionVariableInit(&(extlock->cv));
+ }
+
+ /* Increment pin count */
+ pg_atomic_add_fetch_u32(&(extlock->pin_counts), 1);
+
+ mustwait = RelExtLockAttemptLock(extlock, lockmode);
+
+ if (!mustwait)
+ break; /* got the lock */
+
+ /* Could not get the lock; if this is a conditional attempt, give up */
+ if (mustwait && conditional)
+ {
+ pg_atomic_sub_fetch_u32(&(extlock->pin_counts), 1);
+ LWLockRelease(partitionLock);
+ return false;
+ }
+
+ /* Release the partition lock before sleep */
+ LWLockRelease(partitionLock);
+
+ /* Sleep until the lock is released */
+ ConditionVariableSleep(&(extlock->cv), WAIT_EVENT_RELATION_EXTENSION);
+ }
+
+ LWLockRelease(partitionLock);
+ ConditionVariableCancelSleep();
+
+ Assert(!mustwait);
+
+ /* Remember lock held by this backend */
+ held_relextlock.lock = extlock;
+ held_relextlock.mode = lockmode;
+ held_relextlock.nLocks = 1;
+ num_held_relextlocks++;
+
+ /* Always return true if not conditional lock */
+ return true;
+}
+
+/*
+ * RelExtLockRelease
+ *
+ * Release a previously acquired relation extension lock. We don't remove
+ * the hash entry at this time; once the hash table gets full, all
+ * unpinned hash entries are removed.
+ */
+static void
+RelExtLockRelease(Oid relid)
+{
+ RELEXTLOCK *extlock;
+ RelExtLockMode mode;
+ LWLock *partitionLock;
+ uint32 hashcode;
+ uint32 pin_counts;
+
+ /* We should have acquired a lock before releasing */
+ Assert(num_held_relextlocks > 0);
+
+ if (relid != held_relextlock.lock->relid)
+ ereport(ERROR,
+ (errcode(ERRCODE_LOCK_NOT_AVAILABLE),
+ errmsg("relation extension lock for %u is not held",
+ relid)));
+
+ /* Decrease the lock count locally */
+ held_relextlock.nLocks--;
+
+ /* If we are still holding the lock, we're done */
+ if (held_relextlock.nLocks > 0)
+ return;
+
+ hashcode = RelExtLockTargetTagHashCode(&relid);
+ partitionLock = RelExtLockHashPartitionLock(hashcode);
+
+ /* Keep holding the partition lock until unlocking is done */
+ LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+
+ extlock = held_relextlock.lock;
+ mode = held_relextlock.mode;
+
+ if (mode == RELEXT_EXCLUSIVE)
+ pg_atomic_sub_fetch_u32(&(extlock->state), RELEXT_VAL_EXCLUSIVE);
+ else
+ pg_atomic_sub_fetch_u32(&(extlock->state), RELEXT_VAL_SHARED);
+
+ num_held_relextlocks--;
+
+ /* Decrement pin counter */
+ pin_counts = pg_atomic_sub_fetch_u32(&(extlock->pin_counts), 1);
+
+ LWLockRelease(partitionLock);
+
+ /* Wake up waiters if anyone else is still looking at this lock */
+ if (pin_counts > 0)
+ ConditionVariableBroadcast(&(extlock->cv));
+}
+
+/*
+ * Internal function that attempts to atomically acquire the relation
+ * extension lock in the passed in mode.
+ *
+ * Returns true if the lock isn't free and we need to wait.
+ */
+static bool
+RelExtLockAttemptLock(RELEXTLOCK *extlock, RelExtLockMode lockmode)
+{
+ uint32 oldstate;
+
+ oldstate = pg_atomic_read_u32(&extlock->state);
+
+ while (true)
+ {
+ uint32 desired_state;
+ bool lock_free;
+
+ desired_state = oldstate;
+
+ if (lockmode == RELEXT_EXCLUSIVE)
+ {
+ lock_free = (oldstate & RELEXT_LOCK_MASK) == 0;
+ if (lock_free)
+ desired_state += RELEXT_VAL_EXCLUSIVE;
+ }
+ else
+ {
+ lock_free = (oldstate & RELEXT_VAL_EXCLUSIVE) == 0;
+ if (lock_free)
+ desired_state += RELEXT_VAL_SHARED;
+ }
+
+ if (pg_atomic_compare_exchange_u32(&extlock->state,
+ &oldstate, desired_state))
+ {
+ if (lock_free)
+ return false;
+ else
+ return true;
+ }
+ }
+ pg_unreachable();
+}
+
+/*
+ * Reclaim all un-pinned RELEXTLOCK entries from the hash table.
+ */
+static bool
+RelExtLockShrinkLocks(void)
+{
+ HASH_SEQ_STATUS hstat;
+ RELEXTLOCK *extlock;
+ List *entries_to_remove = NIL;
+ ListCell *cell;
+ int i;
+
+ /*
+ * To ensure consistency, take all partition locks in exclusive
+ * mode.
+ */
+ for (i = 0; i < NUM_RELEXTLOCK_PARTITIONS; i++)
+ LWLockAcquire(RelExtLockHashPartitionLockByIndex(i), LW_EXCLUSIVE);
+
+ /* Collect all un-pinned RELEXTLOCK entries */
+ hash_seq_init(&hstat, RelExtLockHash);
+ while ((extlock = (RELEXTLOCK *) hash_seq_search(&hstat)) != NULL)
+ {
+ uint32 pin_count = pg_atomic_read_u32(&(extlock->pin_counts));
+
+ if (pin_count == 0)
+ entries_to_remove = lappend(entries_to_remove, extlock);
+ }
+
+ /* We could not find any entries that we can remove right now */
+ if (list_length(entries_to_remove) == 0)
+ {
+ /* Release all partition locks before giving up */
+ for (i = 0; i < NUM_RELEXTLOCK_PARTITIONS; i++)
+ LWLockRelease(RelExtLockHashPartitionLockByIndex(i));
+ return false;
+ }
+
+ /* Remove collected entries from the RelExtLockHash hash table */
+ foreach (cell, entries_to_remove)
+ {
+ RELEXTLOCK *el = (RELEXTLOCK *) lfirst(cell);
+ uint32 hc = RelExtLockTargetTagHashCode(&(el->relid));
+
+ hash_search_with_hash_value(RelExtLockHash, (void *) &(el->relid),
+ hc, HASH_REMOVE, NULL);
+ }
+
+ /* Release all partition locks */
+ for (i = 0; i < NUM_RELEXTLOCK_PARTITIONS; i++)
+ LWLockRelease(RelExtLockHashPartitionLockByIndex(i));
+
+ return true;
+}
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index da5679b..4fbc0c4 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -319,78 +319,6 @@ UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode)
}
/*
- * LockRelationForExtension
- *
- * This lock tag is used to interlock addition of pages to relations.
- * We need such locking because bufmgr/smgr definition of P_NEW is not
- * race-condition-proof.
- *
- * We assume the caller is already holding some type of regular lock on
- * the relation, so no AcceptInvalidationMessages call is needed here.
- */
-void
-LockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- (void) LockAcquire(&tag, lockmode, false, false);
-}
-
-/*
- * ConditionalLockRelationForExtension
- *
- * As above, but only lock if we can get the lock without blocking.
- * Returns true iff the lock was acquired.
- */
-bool
-ConditionalLockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- return (LockAcquire(&tag, lockmode, false, true) != LOCKACQUIRE_NOT_AVAIL);
-}
-
-/*
- * RelationExtensionLockWaiterCount
- *
- * Count the number of processes waiting for the given relation extension lock.
- */
-int
-RelationExtensionLockWaiterCount(Relation relation)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- return LockWaiterCount(&tag);
-}
-
-/*
- * UnlockRelationForExtension
- */
-void
-UnlockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- LockRelease(&tag, lockmode, false);
-}
-
-/*
* LockPage
*
* Obtain a page-level lock. This is currently used by some index access
@@ -961,12 +889,6 @@ DescribeLockTag(StringInfo buf, const LOCKTAG *tag)
tag->locktag_field2,
tag->locktag_field1);
break;
- case LOCKTAG_RELATION_EXTEND:
- appendStringInfo(buf,
- _("extension of relation %u of database %u"),
- tag->locktag_field2,
- tag->locktag_field1);
- break;
case LOCKTAG_PAGE:
appendStringInfo(buf,
_("page %u of relation %u of database %u"),
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 5833086..ad6f057 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -45,6 +45,7 @@
#include "storage/sinvaladt.h"
#include "storage/spin.h"
#include "storage/standby.h"
+#include "storage/lmgr.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
#include "utils/resowner_private.h"
@@ -388,6 +389,9 @@ InitLocks(void)
max_table_size = NLOCKENTS();
init_table_size = max_table_size / 2;
+ /* Initialize lock structure for relation extension lock */
+ InitRelExtLock(max_table_size);
+
/*
* Allocate hash table for LOCK structs. This stores per-locked-object
* information.
@@ -717,6 +721,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
int status;
bool log_lock = false;
+ /*
+ * We allow taking a relation extension lock after acquiring a
+ * heavyweight lock. However, since there is no deadlock detection
+ * between heavyweight locks and relation extension locks, taking
+ * another heavyweight lock while holding a relation extension lock
+ * is not allowed.
+ */
+ Assert(RelExtLockHoldingLockCount() == 0);
+
if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
elog(ERROR, "unrecognized lock method: %d", lockmethodid);
lockMethodTable = LockMethods[lockmethodid];
@@ -3366,6 +3379,7 @@ LockShmemSize(void)
/* lock hash table */
max_table_size = NLOCKENTS();
size = add_size(size, hash_estimate_size(max_table_size, sizeof(LOCK)));
+ size = add_size(size, hash_estimate_size(max_table_size, sizeof(LWLock)));
/* proclock hash table */
max_table_size *= 2;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index e5c3e86..b12aba0 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -451,6 +451,13 @@ InitializeLWLocks(void)
for (id = 0; id < NUM_PREDICATELOCK_PARTITIONS; id++, lock++)
LWLockInitialize(&lock->lock, LWTRANCHE_PREDICATE_LOCK_MANAGER);
+ /* Initialize relation extension lmgr's LWLocks in main array */
+ lock = MainLWLockArray + NUM_INDIVIDUAL_LWLOCKS +
+ NUM_BUFFER_PARTITIONS + NUM_LOCK_PARTITIONS +
+ NUM_PREDICATELOCK_PARTITIONS;
+ for (id = 0; id < NUM_RELEXTLOCK_PARTITIONS; id++, lock++)
+ LWLockInitialize(&lock->lock, LWTRANCHE_RELEXT_LOCK_MANAGER);
+
/* Initialize named tranches. */
if (NamedLWLockTrancheRequests > 0)
{
@@ -508,6 +515,7 @@ RegisterLWLockTranches(void)
LWLockRegisterTranche(LWTRANCHE_LOCK_MANAGER, "lock_manager");
LWLockRegisterTranche(LWTRANCHE_PREDICATE_LOCK_MANAGER,
"predicate_lock_manager");
+ LWLockRegisterTranche(LWTRANCHE_RELEXT_LOCK_MANAGER, "relext_lock_manager");
LWLockRegisterTranche(LWTRANCHE_PARALLEL_QUERY_DSA,
"parallel_query_dsa");
LWLockRegisterTranche(LWTRANCHE_SESSION_DSA,
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 5f6727d..f698e9c 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -765,6 +765,8 @@ ProcReleaseLocks(bool isCommit)
return;
/* If waiting, get off wait queue (should only be needed after error) */
LockErrorCleanup();
+ /* Release relation extension locks */
+ RelExtLockReleaseAll();
/* Release standard locks, including session-level if aborting */
LockReleaseAll(DEFAULT_LOCKMETHOD, !isCommit);
/* Release transaction-level advisory locks */
diff --git a/src/backend/utils/adt/lockfuncs.c b/src/backend/utils/adt/lockfuncs.c
index 9e0a8ab..6d8916c 100644
--- a/src/backend/utils/adt/lockfuncs.c
+++ b/src/backend/utils/adt/lockfuncs.c
@@ -25,7 +25,6 @@
/* This must match enum LockTagType! */
const char *const LockTagTypeNames[] = {
"relation",
- "extend",
"page",
"tuple",
"transactionid",
@@ -234,7 +233,6 @@ pg_lock_status(PG_FUNCTION_ARGS)
switch ((LockTagType) instance->locktag.locktag_type)
{
case LOCKTAG_RELATION:
- case LOCKTAG_RELATION_EXTEND:
values[1] = ObjectIdGetDatum(instance->locktag.locktag_field1);
values[2] = ObjectIdGetDatum(instance->locktag.locktag_field2);
nulls[3] = true;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 20f1d27..c004844 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -1153,6 +1153,7 @@ ShutdownPostgres(int code, Datum arg)
* User locks are not released by transaction end, so be sure to release
* them explicitly.
*/
+ RelExtLockReleaseAll();
LockReleaseAll(USER_LOCKMETHOD, true);
}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 089b7c3..958822f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -816,7 +816,8 @@ typedef enum
WAIT_EVENT_REPLICATION_ORIGIN_DROP,
WAIT_EVENT_REPLICATION_SLOT_DROP,
WAIT_EVENT_SAFE_SNAPSHOT,
- WAIT_EVENT_SYNC_REP
+ WAIT_EVENT_SYNC_REP,
+ WAIT_EVENT_RELATION_EXTENSION
} WaitEventIPC;
/* ----------
diff --git a/src/include/storage/extension_lock.h b/src/include/storage/extension_lock.h
new file mode 100644
index 0000000..daa6416
--- /dev/null
+++ b/src/include/storage/extension_lock.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * extension_lock.h
+ * Relation extension lock manager
+ *
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/extension_lock.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef EXTENSION_LOCK_H
+#define EXTENSION_LOCK_H
+
+#ifdef FRONTEND
+#error "extension_lock.h may not be included from frontend code"
+#endif
+
+#include "storage/proclist_types.h"
+#include "storage/s_lock.h"
+#include "storage/condition_variable.h"
+#include "port/atomics.h"
+
+typedef enum RelExtLockMode
+{
+ RELEXT_EXCLUSIVE,
+ RELEXT_SHARED
+} RelExtLockMode;
+
+/* Lock a relation for extension */
+extern void InitRelExtLock(long max_table_size);
+extern void LockRelationForExtension(Relation relation, RelExtLockMode lockmode);
+extern void UnlockRelationForExtension(Relation relation);
+extern bool ConditionalLockRelationForExtension(Relation relation, RelExtLockMode lockmode);
+extern int RelationExtensionLockWaiterCount(Relation relation);
+extern void RelExtLockReleaseAll(void);
+extern int RelExtLockHoldingLockCount(void);
+
+#endif /* EXTENSION_LOCK_H */
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 0b92322..6b357aa 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -15,6 +15,7 @@
#define LMGR_H
#include "lib/stringinfo.h"
+#include "storage/extension_lock.h"
#include "storage/itemptr.h"
#include "storage/lock.h"
#include "utils/rel.h"
@@ -50,13 +51,6 @@ extern bool LockHasWaitersRelation(Relation relation, LOCKMODE lockmode);
extern void LockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
-/* Lock a relation for extension */
-extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
-extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
-extern bool ConditionalLockRelationForExtension(Relation relation,
- LOCKMODE lockmode);
-extern int RelationExtensionLockWaiterCount(Relation relation);
-
/* Lock a page (currently only used within indexes) */
extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
extern bool ConditionalLockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 765431e..3be18ea 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -138,8 +138,6 @@ typedef uint16 LOCKMETHODID;
typedef enum LockTagType
{
LOCKTAG_RELATION, /* whole relation */
- /* ID info for a relation is DB OID + REL OID; DB OID = 0 if shared */
- LOCKTAG_RELATION_EXTEND, /* the right to extend a relation */
/* same ID info as RELATION */
LOCKTAG_PAGE, /* one page of a relation */
/* ID info for a page is RELATION info + BlockNumber */
@@ -198,14 +196,6 @@ typedef struct LOCKTAG
(locktag).locktag_type = LOCKTAG_RELATION, \
(locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
-#define SET_LOCKTAG_RELATION_EXTEND(locktag,dboid,reloid) \
- ((locktag).locktag_field1 = (dboid), \
- (locktag).locktag_field2 = (reloid), \
- (locktag).locktag_field3 = 0, \
- (locktag).locktag_field4 = 0, \
- (locktag).locktag_type = LOCKTAG_RELATION_EXTEND, \
- (locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
-
#define SET_LOCKTAG_PAGE(locktag,dboid,reloid,blocknum) \
((locktag).locktag_field1 = (dboid), \
(locktag).locktag_field2 = (reloid), \
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 596fdad..b138aad 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -120,14 +120,21 @@ extern PGDLLIMPORT int NamedLWLockTrancheRequests;
#define LOG2_NUM_PREDICATELOCK_PARTITIONS 4
#define NUM_PREDICATELOCK_PARTITIONS (1 << LOG2_NUM_PREDICATELOCK_PARTITIONS)
+/* Number of partitions the shared relation extension lock tables are divided into */
+#define LOG2_NUM_RELEXTLOCK_PARTITIONS 4
+#define NUM_RELEXTLOCK_PARTITIONS (1 << LOG2_NUM_RELEXTLOCK_PARTITIONS)
+
/* Offsets for various chunks of preallocated lwlocks. */
#define BUFFER_MAPPING_LWLOCK_OFFSET NUM_INDIVIDUAL_LWLOCKS
#define LOCK_MANAGER_LWLOCK_OFFSET \
(BUFFER_MAPPING_LWLOCK_OFFSET + NUM_BUFFER_PARTITIONS)
#define PREDICATELOCK_MANAGER_LWLOCK_OFFSET \
(LOCK_MANAGER_LWLOCK_OFFSET + NUM_LOCK_PARTITIONS)
-#define NUM_FIXED_LWLOCKS \
+#define RELEXTLOCK_MANAGER_LWLOCK_OFFSET \
(PREDICATELOCK_MANAGER_LWLOCK_OFFSET + NUM_PREDICATELOCK_PARTITIONS)
+#define NUM_FIXED_LWLOCKS \
+ (PREDICATELOCK_MANAGER_LWLOCK_OFFSET + NUM_PREDICATELOCK_PARTITIONS + \
+ NUM_RELEXTLOCK_PARTITIONS)
typedef enum LWLockMode
{
@@ -151,6 +158,8 @@ extern void LWLockReleaseClearVar(LWLock *lock, uint64 *valptr, uint64 val);
extern void LWLockReleaseAll(void);
extern bool LWLockHeldByMe(LWLock *lock);
extern bool LWLockHeldByMeInMode(LWLock *lock, LWLockMode mode);
+extern bool LWLockCheckForCleanup(LWLock *lock);
+extern int LWLockWaiterCount(LWLock *lock);
extern bool LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval);
extern void LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 value);
@@ -211,6 +220,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_BUFFER_MAPPING,
LWTRANCHE_LOCK_MANAGER,
LWTRANCHE_PREDICATE_LOCK_MANAGER,
+ LWTRANCHE_RELEXT_LOCK_MANAGER,
LWTRANCHE_PARALLEL_QUERY_DSA,
LWTRANCHE_SESSION_DSA,
LWTRANCHE_SESSION_RECORD_TABLE,
On Sun, Nov 26, 2017 at 9:33 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Attached latest patch incorporated all comments so far. Please review it.
I think you only need RelExtLockReleaseAll() where we currently have
LockReleaseAll(DEFAULT_LOCKMETHOD, ...) not where we have
LockReleaseAll(USER_LOCKMETHOD, ...). That's because relation
extension locks use the default lock method, not USER_LOCKMETHOD.
You need to update the table of wait events in the documentation.
Please be sure to actually build the documentation afterwards and make
sure it looks OK. Maybe the wait event name should be
RelationExtensionLock rather than just RelationExtension; we are not
waiting for the extension itself.
You have a typo/thinko in lmgr/README: confliction is not a word.
Maybe you mean "When conflicts occur, lock waits are implemented using
condition variables."
Instead of having shared and exclusive locks, how about just having
exclusive locks and introducing a new primitive operation that waits
for the lock to be free and returns without acquiring it? That is
essentially what brin_pageops.c is doing by taking and releasing the
shared lock, and it's the only caller that takes anything but an
exclusive lock. This seems like it would permit a considerable
simplification of the locking mechanism, since there would then be
only two possible states: 1 (locked) and 0 (not locked).
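A minimal sketch of that single-state design, using a POSIX condition variable in place of PostgreSQL's ConditionVariable infrastructure. The names (RelExtLockSketch, relext_wait_for_free, and so on) are illustrative only, not the patch's actual API:

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical single-state extension lock: locked or not locked. */
typedef struct RelExtLockSketch
{
    pthread_mutex_t mutex;
    pthread_cond_t  cv;
    bool            locked;
} RelExtLockSketch;

void
relext_init(RelExtLockSketch *l)
{
    pthread_mutex_init(&l->mutex, NULL);
    pthread_cond_init(&l->cv, NULL);
    l->locked = false;
}

void
relext_acquire(RelExtLockSketch *l)
{
    pthread_mutex_lock(&l->mutex);
    while (l->locked)
        pthread_cond_wait(&l->cv, &l->mutex);
    l->locked = true;
    pthread_mutex_unlock(&l->mutex);
}

void
relext_release(RelExtLockSketch *l)
{
    pthread_mutex_lock(&l->mutex);
    l->locked = false;
    pthread_cond_broadcast(&l->cv);
    pthread_mutex_unlock(&l->mutex);
}

/*
 * The proposed new primitive: wait until the lock is momentarily
 * free, then return without having acquired it -- the behavior
 * brin_pageops.c simulates today by taking and releasing ShareLock.
 */
void
relext_wait_for_free(RelExtLockSketch *l)
{
    pthread_mutex_lock(&l->mutex);
    while (l->locked)
        pthread_cond_wait(&l->cv, &l->mutex);
    pthread_mutex_unlock(&l->mutex);
}
```

The design choice is that a waiter never needs to be granted anything; it only needs to observe the free state once, so no lock mode beyond exclusive is required.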
In RelExtLockAcquire, I can't endorse this sort of coding:
+ if (relid == held_relextlock.lock->relid &&
+ lockmode == held_relextlock.mode)
+ {
+ held_relextlock.nLocks++;
+ return true;
+ }
+ else
+ Assert(false); /* cannot happen */
Either convert the Assert() to an elog(), or change the if-statement
to an Assert() of the same condition. I'd probably vote for the first
one. As it is, if that Assert(false) is ever hit, chaos will (maybe)
ensue. Let's make sure we nip any such problems in the bud.
"successed" is not a good variable name; that's not an English word.
+ /* Could not got the lock, return iff in conditional locking */
+ if (mustwait && conditional)
Comment contradicts code. The comment is right; the code need not
test mustwait, as that's already been done.
The way this is hooked into the shared-memory initialization stuff
looks strange in a number of ways:
- Apparently, you're initializing enough space for as many
relation extension locks as the size of the main heavyweight lock
table, but that seems like overkill. I'm not sure how much space we
actually need for relation extension locks, but I bet it's a lot less
than we need for regular heavyweight locks.
- The error message emitted when you run out of space also claims that
you can fix the issue by raising max_pred_locks_per_transaction, but
that has no effect on the size of the main lock table or this table.
- The changes to LockShmemSize() suppose that the hash table elements
have a size equal to the size of an LWLock, but the actual size is
sizeof(RELEXTLOCK).
- I don't really know why the code for this should be daisy-chained
off of the lock.c code instead of being called from
CreateSharedMemoryAndSemaphores() just like (almost) all of the other
subsystems.
This code ignores the existence of multiple databases; RELEXTLOCK
contains a relid, but no database OID. That's easy enough to fix, but
it actually causes no problem unless, by bad luck, you have two
relations with the same OID in different databases that are both being
rapidly extended at the same time -- and even then, it's only a
performance problem, not a correctness problem. In fact, I wonder if
we shouldn't go further: instead of creating these RELEXTLOCK
structures dynamically, let's just have a fixed number of them, say
1024. When we get a request to take a lock, hash <dboid, reloid> and
take the result modulo 1024; lock the RELEXTLOCK at that offset in the
array.
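That fixed-table scheme can be sketched as follows. The constant and the mixing function are placeholders (the real code would presumably reuse the existing tag-hashing machinery such as hash_any()), shown only to make the <dboid, reloid> to slot mapping concrete:

```c
#include <stdint.h>

#define N_RELEXTLOCK_ENTS 1024  /* fixed pool size suggested above */

/*
 * Map a (database OID, relation OID) pair to one of the fixed lock
 * slots. Distinct relations may collide on a slot; that can cause
 * needless waiting but never incorrect results, since holding the
 * slot's lock is strictly stronger than holding a per-relation lock.
 */
static uint32_t
relext_lock_slot(uint32_t dboid, uint32_t reloid)
{
    uint32_t h = dboid * 0x9E3779B1u ^ reloid * 0x85EBCA77u;

    return h % N_RELEXTLOCK_ENTS;
}
```

With a fixed array there is nothing to allocate, remove, or shrink at runtime; the dynamic hash table and its partition locks disappear entirely.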
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Nov 29, 2017 at 5:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Nov 26, 2017 at 9:33 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Attached latest patch incorporated all comments so far. Please review it.
I think you only need RelExtLockReleaseAll() where we currently have
LockReleaseAll(DEFAULT_LOCKMETHOD, ...) not where we have
LockReleaseAll(USER_LOCKMETHOD, ...). That's because relation
extension locks use the default lock method, not USER_LOCKMETHOD.
Latest review is fresh. I am moving this to next CF with "waiting on
author" as status.
--
Michael
On Thu, Nov 30, 2017 at 10:52 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Wed, Nov 29, 2017 at 5:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Nov 26, 2017 at 9:33 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Attached latest patch incorporated all comments so far. Please review it.
I think you only need RelExtLockReleaseAll() where we currently have
LockReleaseAll(DEFAULT_LOCKMETHOD, ...) not where we have
LockReleaseAll(USER_LOCKMETHOD, ...). That's because relation
extension locks use the default lock method, not USER_LOCKMETHOD.
Latest review is fresh. I am moving this to next CF with "waiting on
author" as status.
Thank you Michael-san, I'll submit a latest patch.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Wed, Nov 29, 2017 at 5:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Nov 26, 2017 at 9:33 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Attached latest patch incorporated all comments so far. Please review it.
I think you only need RelExtLockReleaseAll() where we currently have
LockReleaseAll(DEFAULT_LOCKMETHOD, ...) not where we have
LockReleaseAll(USER_LOCKMETHOD, ...). That's because relation
extension locks use the default lock method, not USER_LOCKMETHOD.
Fixed.
You need to update the table of wait events in the documentation.
Please be sure to actually build the documentation afterwards and make
sure it looks OK. Maybe the wait event name should be
RelationExtensionLock rather than just RelationExtension; we are not
waiting for the extension itself.
Fixed. I added both a new wait_event and a wait_event_type for the
relext lock. Also checked that the documentation build passes.
You have a typo/thinko in lmgr/README: confliction is not a word.
Maybe you mean "When conflicts occur, lock waits are implemented using
condition variables."
Fixed.
Instead of having shared and exclusive locks, how about just having
exclusive locks and introducing a new primitive operation that waits
for the lock to be free and returns without acquiring it? That is
essentially what brin_pageops.c is doing by taking and releasing the
shared lock, and it's the only caller that takes anything but an
exclusive lock. This seems like it would permit a considerable
simplification of the locking mechanism, since there would then be
only two possible states: 1 (locked) and 0 (not locked).
I think it's a good idea. With this change, the concurrency of
executing brin_page_cleanup() decreases, but since brin_page_cleanup()
is so far called only during vacuum, that's not a problem. I think we
can handle the code in vacuumlazy.c in the same manner as well. I've
changed the patch so that it has only exclusive locks and introduces a
WaitForRelationExtensionLockToBeFree() function to wait for the lock
to be free.
Also, now that we got rid of shared locks, I gathered the lock state
and pin count into a single atomic uint32.
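A sketch of such a combined word using C11 atomics. The bit layout and names here are assumptions made for illustration (bit 31 as the exclusive-lock flag, bits 0-30 as the pin count), not the patch's actual definitions:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumed layout: high bit = exclusive lock, low 31 bits = pin count. */
#define RELEXT_LOCK_BIT ((uint32_t) 1 << 31)
#define RELEXT_PIN_MASK (RELEXT_LOCK_BIT - 1)

static _Atomic uint32_t relext_state;   /* lock flag + pin count in one word */

/* Try to set the lock bit; report whether we were the one to set it. */
static int
relext_try_lock(void)
{
    uint32_t old = atomic_fetch_or(&relext_state, RELEXT_LOCK_BIT);

    return (old & RELEXT_LOCK_BIT) == 0;
}

/* Clear the lock bit, leaving the pin count untouched. */
static void
relext_unlock(void)
{
    atomic_fetch_and(&relext_state, ~RELEXT_LOCK_BIT);
}

static void relext_pin(void)   { atomic_fetch_add(&relext_state, 1); }
static void relext_unpin(void) { atomic_fetch_sub(&relext_state, 1); }

static uint32_t
relext_pin_count(void)
{
    return atomic_load(&relext_state) & RELEXT_PIN_MASK;
}
```

The benefit of the single word is that one atomic load observes the lock bit and the pin count consistently, with no spinlock needed to keep them in sync.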
In RelExtLockAcquire, I can't endorse this sort of coding:
+ if (relid == held_relextlock.lock->relid &&
+ lockmode == held_relextlock.mode)
+ {
+ held_relextlock.nLocks++;
+ return true;
+ }
+ else
+ Assert(false); /* cannot happen */
Either convert the Assert() to an elog(), or change the if-statement
to an Assert() of the same condition. I'd probably vote for the first
one. As it is, if that Assert(false) is ever hit, chaos will (maybe)
ensue. Let's make sure we nip any such problems in the bud.
Agreed, fixed.
"successed" is not a good variable name; that's not an English word.
Fixed.
+ /* Could not got the lock, return iff in conditional locking */
+ if (mustwait && conditional)
Comment contradicts code. The comment is right; the code need not
test mustwait, as that's already been done.
Fixed.
The way this is hooked into the shared-memory initialization stuff
looks strange in a number of ways:
- Apparently, you're initializing enough space for as many
relation extension locks as the size of the main heavyweight lock
table, but that seems like overkill. I'm not sure how much space we
actually need for relation extension locks, but I bet it's a lot less
than we need for regular heavyweight locks.
Agreed. The maximum number of relext locks is the number of relations
in the database cluster; it is not related to the number of clients.
Currently NLOCKENTS() counts the number of locks including relation
extension locks. One idea is to introduce a new GUC to control the
memory size, although the total memory size for locks would increase.
Probably we can make it behave similarly to
max_pred_locks_per_relation. Or, in order not to change the total
memory size for locks even after moving them out of the heavyweight
lock manager, we can divide NLOCKENTS() between heavyweight locks and
relation extension locks (for example, 80% for heavyweight locks and
20% for relation extension locks). But the latter would make parameter
tuning hard. I'd vote for the first one to keep it simple. Any ideas?
This part is not fixed in the patch yet.
- The error message emitted when you run out of space also claims that
you can fix the issue by raising max_pred_locks_per_transaction, but
that has no effect on the size of the main lock table or this table.
Fixed.
- The changes to LockShmemSize() suppose that the hash table elements
have a size equal to the size of an LWLock, but the actual size is
sizeof(RELEXTLOCK).
Fixed.
- I don't really know why the code for this should be daisy-chained
off of the lock.c code instead of being called from
CreateSharedMemoryAndSemaphores() just like (almost) all of the other
subsystems.
Fixed.
This code ignores the existence of multiple databases; RELEXTLOCK
contains a relid, but no database OID. That's easy enough to fix, but
it actually causes no problem unless, by bad luck, you have two
relations with the same OID in different databases that are both being
rapidly extended at the same time -- and even then, it's only a
performance problem, not a correctness problem. In fact, I wonder if
we shouldn't go further: instead of creating these RELEXTLOCK
structures dynamically, let's just have a fixed number of them, say
1024. When we get a request to take a lock, hash <dboid, reloid> and
take the result modulo 1024; lock the RELEXTLOCK at that offset in the
array.
Attached is the latest patch, which incorporates the comments except
for the fix of the memory size for relext locks.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachments:
Moving_extension_lock_out_of_heavyweight_lock_v8.patch (application/octet-stream)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 8d461c8..cffc70d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -675,6 +675,13 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</listitem>
<listitem>
<para>
+ <literal>RelationExtensionLock</literal>: The backend is waiting for
+ a relation extension lock. This lock protects a particular relation
+ such as table against concurrent relation extension.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
<literal>BufferPin</literal>: The server process is waiting to access to
a data buffer during a period when no other process can be
examining that buffer. Buffer pin waits can be protracted if
@@ -1158,6 +1165,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to acquire an advisory user lock.</entry>
</row>
<row>
+ <entry><literal>RelationExtensionLock</literal></entry>
+ <entry><literal>RelationExtensionLock</literal></entry>
+ <entry>Waiting to acquire a relation extension lock on a relation.</entry>
+ </row>
+ <row>
<entry><literal>BufferPin</literal></entry>
<entry><literal>BufferPin</literal></entry>
<entry>Waiting to acquire a pin on a buffer.</entry>
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 09db5c6..05cca9d 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -17,6 +17,7 @@
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/freespace.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
@@ -623,8 +624,7 @@ brin_page_cleanup(Relation idxrel, Buffer buf)
*/
if (PageIsNew(page))
{
- LockRelationForExtension(idxrel, ShareLock);
- UnlockRelationForExtension(idxrel, ShareLock);
+ WaitForRelationExtensionLockToBeFree(idxrel);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
if (PageIsNew(page))
@@ -716,7 +716,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
*/
if (!RELATION_IS_LOCAL(irel))
{
- LockRelationForExtension(irel, ExclusiveLock);
+ LockRelationForExtension(irel);
extensionLockHeld = true;
}
buf = ReadBuffer(irel, P_NEW);
@@ -768,7 +768,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
}
if (extensionLockHeld)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel);
ReleaseBuffer(buf);
return InvalidBuffer;
@@ -778,7 +778,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
if (extensionLockHeld)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel);
page = BufferGetPage(buf);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index 03e53ce..af8f5ce 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -29,6 +29,7 @@
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/lmgr.h"
#include "utils/rel.h"
@@ -570,7 +571,7 @@ revmap_physical_extend(BrinRevmap *revmap)
else
{
if (needLock)
- LockRelationForExtension(irel, ExclusiveLock);
+ LockRelationForExtension(irel);
buf = ReadBuffer(irel, P_NEW);
if (BufferGetBlockNumber(buf) != mapBlk)
@@ -582,7 +583,7 @@ revmap_physical_extend(BrinRevmap *revmap)
* page from under whoever is using it.
*/
if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel);
LockBuffer(revmap->rm_metaBuf, BUFFER_LOCK_UNLOCK);
ReleaseBuffer(buf);
return;
@@ -591,7 +592,7 @@ revmap_physical_extend(BrinRevmap *revmap)
page = BufferGetPage(buf);
if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel);
}
/* Check that it's a regular block (or an empty page) */
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index d9c6483..8d35918 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -21,6 +21,7 @@
#include "catalog/pg_collation.h"
#include "catalog/pg_type.h"
#include "miscadmin.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/builtins.h"
@@ -325,13 +326,13 @@ GinNewBuffer(Relation index)
/* Must extend the file */
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
buffer = ReadBuffer(index, P_NEW);
LockBuffer(buffer, GIN_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
return buffer;
}
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 394bc83..d769a76 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -20,6 +20,7 @@
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "postmaster/autovacuum.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/memutils.h"
@@ -716,10 +717,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
npages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
totFreePages = 0;
@@ -766,10 +767,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
stats->pages_free = totFreePages;
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
stats->num_pages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
return stats;
}
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index d8d1c0a..76171a5 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -20,6 +20,7 @@
#include "access/htup_details.h"
#include "access/reloptions.h"
#include "catalog/pg_opclass.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/builtins.h"
@@ -821,13 +822,13 @@ gistNewBuffer(Relation r)
needLock = !RELATION_IS_LOCAL(r);
if (needLock)
- LockRelationForExtension(r, ExclusiveLock);
+ LockRelationForExtension(r);
buffer = ReadBuffer(r, P_NEW);
LockBuffer(buffer, GIST_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(r, ExclusiveLock);
+ UnlockRelationForExtension(r);
return buffer;
}
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 77d9d12..42ef36a 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -18,6 +18,7 @@
#include "access/gist_private.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
@@ -59,10 +60,10 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
/* try to find deleted pages */
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
npages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
totFreePages = 0;
for (blkno = GIST_ROOT_BLKNO + 1; blkno < npages; blkno++)
@@ -91,10 +92,10 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
/* return statistics */
stats->pages_free = totFreePages;
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
stats->num_pages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
return stats;
}
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 13e3bdc..9287f2d 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -20,6 +20,7 @@
#include "access/htup_details.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/freespace.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
@@ -519,11 +520,11 @@ loop:
if (needLock)
{
if (!use_fsm)
- LockRelationForExtension(relation, ExclusiveLock);
- else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock))
+ LockRelationForExtension(relation);
+ else if (!ConditionalLockRelationForExtension(relation))
{
/* Couldn't get the lock immediately; wait for it. */
- LockRelationForExtension(relation, ExclusiveLock);
+ LockRelationForExtension(relation);
/*
* Check if some other backend has extended a block for us while
@@ -537,7 +538,7 @@ loop:
*/
if (targetBlock != InvalidBlockNumber)
{
- UnlockRelationForExtension(relation, ExclusiveLock);
+ UnlockRelationForExtension(relation);
goto loop;
}
@@ -576,7 +577,7 @@ loop:
* against vacuumlazy.c --- see comments therein.
*/
if (needLock)
- UnlockRelationForExtension(relation, ExclusiveLock);
+ UnlockRelationForExtension(relation);
/*
* We need to initialize the empty new page. Double-check that it really
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 4c2a13a..2efee68 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -90,6 +90,7 @@
#include "access/xlog.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
#include "utils/inval.h"
@@ -641,7 +642,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
* Note that another backend might have extended or created the relation
* by the time we get the lock.
*/
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
@@ -679,7 +680,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
/* Update local cache with the up-to-date size */
rel->rd_smgr->smgr_vm_nblocks = vm_nblocks_now;
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
pfree(pg);
}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index c774349..7824c92 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -28,6 +28,7 @@
#include "access/xlog.h"
#include "access/xloginsert.h"
#include "miscadmin.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
@@ -659,7 +660,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
needLock = !RELATION_IS_LOCAL(rel);
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
buf = ReadBuffer(rel, P_NEW);
@@ -673,7 +674,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
* condition against btvacuumscan --- see comments therein.
*/
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
/* Initialize the new page before returning it */
page = BufferGetPage(buf);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 399e6a1..5af1c21 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -25,6 +25,7 @@
#include "commands/vacuum.h"
#include "pgstat.h"
#include "storage/condition_variable.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
@@ -1058,10 +1059,10 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
{
/* Get the current relation length */
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
num_pages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
/* Quit if we've scanned the whole relation */
if (blkno >= num_pages)
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index bd5301f..0ff53a3 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -23,6 +23,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/builtins.h"
+#include "storage/extension_lock.h"
#include "utils/index_selfuncs.h"
#include "utils/lsyscache.h"
@@ -230,13 +231,13 @@ SpGistNewBuffer(Relation index)
/* Must extend the file */
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
buffer = ReadBuffer(index, P_NEW);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
return buffer;
}
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index d7d5e90..385d1cb 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -24,6 +24,7 @@
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/snapmgr.h"
@@ -824,10 +825,10 @@ spgvacuumscan(spgBulkDeleteState *bds)
{
/* Get the current relation length */
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
num_pages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
/* Quit if we've scanned the whole relation */
if (blkno >= num_pages)
diff --git a/src/backend/commands/discard.c b/src/backend/commands/discard.c
index f0dcd87..4a470eb 100644
--- a/src/backend/commands/discard.c
+++ b/src/backend/commands/discard.c
@@ -19,6 +19,7 @@
#include "commands/discard.h"
#include "commands/prepare.h"
#include "commands/sequence.h"
+#include "storage/extension_lock.h"
#include "utils/guc.h"
#include "utils/portal.h"
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 20ce431..4a72223 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -54,6 +54,7 @@
#include "portability/instr_time.h"
#include "postmaster/autovacuum.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/freespace.h"
#include "storage/lmgr.h"
#include "utils/lsyscache.h"
@@ -860,8 +861,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* it's got exclusive lock on the whole relation.
*/
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
- LockRelationForExtension(onerel, ExclusiveLock);
- UnlockRelationForExtension(onerel, ExclusiveLock);
+ WaitForRelationExtensionLockToBeFree(onerel);
LockBufferForCleanup(buf);
if (PageIsNew(page))
{
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 5c256ff..d12777a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3351,6 +3351,9 @@ pgstat_get_wait_event_type(uint32 wait_event_info)
case PG_WAIT_LOCK:
event_type = "Lock";
break;
+ case PG_WAIT_RELEXTLOCK:
+ event_type = "RelationExtensionLock";
+ break;
case PG_WAIT_BUFFER_PIN:
event_type = "BufferPin";
break;
@@ -3408,6 +3411,9 @@ pgstat_get_wait_event(uint32 wait_event_info)
case PG_WAIT_LOCK:
event_name = GetLockNameFromTagType(eventId);
break;
+ case PG_WAIT_RELEXTLOCK:
+ event_name = "RelationExtensionLock";
+ break;
case PG_WAIT_BUFFER_PIN:
event_name = "BufferPin";
break;
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 4648473..172a48c 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -26,6 +26,7 @@
#include "access/htup_details.h"
#include "access/xlogutils.h"
#include "miscadmin.h"
+#include "storage/extension_lock.h"
#include "storage/freespace.h"
#include "storage/fsm_internals.h"
#include "storage/lmgr.h"
@@ -624,7 +625,7 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
* Note that another backend might have extended or created the relation
* by the time we get the lock.
*/
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
@@ -652,7 +653,7 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
/* Update local cache with the up-to-date size */
rel->rd_smgr->smgr_fsm_nblocks = fsm_nblocks_now;
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
pfree(pg);
}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2d1ed14..a322296 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -35,6 +35,7 @@
#include "replication/origin.h"
#include "storage/bufmgr.h"
#include "storage/dsm.h"
+#include "storage/extension_lock.h"
#include "storage/ipc.h"
#include "storage/pg_shmem.h"
#include "storage/pmsignal.h"
@@ -133,6 +134,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
size = add_size(size, BackgroundWorkerShmemSize());
size = add_size(size, MultiXactShmemSize());
size = add_size(size, LWLockShmemSize());
+ size = add_size(size, RelExtLockShmemSize());
size = add_size(size, ProcArrayShmemSize());
size = add_size(size, BackendStatusShmemSize());
size = add_size(size, SInvalShmemSize());
@@ -235,6 +237,11 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
InitPredicateLocks();
/*
+ * Set up relation extension lock manager
+ */
+ InitRelExtLocks();
+
+ /*
* Set up process table
*/
if (!IsUnderPostmaster)
diff --git a/src/backend/storage/lmgr/Makefile b/src/backend/storage/lmgr/Makefile
index e1b787e..2334a40 100644
--- a/src/backend/storage/lmgr/Makefile
+++ b/src/backend/storage/lmgr/Makefile
@@ -13,7 +13,7 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = lmgr.o lock.o proc.o deadlock.o lwlock.o lwlocknames.o spin.o \
- s_lock.o predicate.o condition_variable.o
+ s_lock.o predicate.o condition_variable.o extension_lock.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 56b0a12..71eb293 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -29,6 +29,13 @@ process has to wait for an LWLock, it blocks on a SysV semaphore so as
to not consume CPU time. Waiting processes will be granted the lock in
arrival order. There is no timeout.
+* Relation extension locks.  The relation extension lock manager is
+specialized for relation extension.  As of PostgreSQL 11, relation
+extension locks have been moved out of the regular lock manager.  They
+are similar to regular locks but have no full deadlock detection, group
+locking, or multiple lock modes.  When conflicts occur, lock waits are
+implemented using condition variables.
+
* Regular locks (a/k/a heavyweight locks). The regular lock manager
supports a variety of lock modes with table-driven semantics, and it has
full deadlock detection and automatic release at transaction end.
@@ -40,9 +47,9 @@ Acquisition of either a spinlock or a lightweight lock causes query
cancel and die() interrupts to be held off until all such locks are
released. No such restriction exists for regular locks, however. Also
note that we can accept query cancel and die() interrupts while waiting
-for a regular lock, but we will not accept them while waiting for
-spinlocks or LW locks. It is therefore not a good idea to use LW locks
-when the wait time might exceed a few seconds.
+for a relation extension lock or a regular lock, but we will not accept
+them while waiting for spinlocks or LW locks. It is therefore not a good
+idea to use LW locks when the wait time might exceed a few seconds.
The rest of this README file discusses the regular lock manager in detail.
diff --git a/src/backend/storage/lmgr/extension_lock.c b/src/backend/storage/lmgr/extension_lock.c
new file mode 100644
index 0000000..788f294
--- /dev/null
+++ b/src/backend/storage/lmgr/extension_lock.c
@@ -0,0 +1,584 @@
+/*-------------------------------------------------------------------------
+ *
+ * extension_lock.c
+ * Relation extension lock manager
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/lmgr/extension_lock.c
+ *
+ * NOTES:
+ *
+ * This lock manager is specialized for relation extension locks; it is
+ * lightweight and interruptible.  It is similar to the heavyweight lock
+ * manager but has no deadlock detection, no group locking, and no
+ * multiple lock modes.
+ *
+ * For lock acquisition we use an atomic compare-and-exchange on the
+ * state variable.  When a process tries to acquire a lock that conflicts
+ * with an existing lock, it is put to sleep on a condition variable
+ * unless conditional locking was requested.  On release we use an atomic
+ * decrement, but we don't remove the RELEXTLOCK entry from the hash
+ * table; all unused entries are reclaimed during acquisition once the
+ * hash table becomes full.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/extension_lock.h"
+#include "utils/rel.h"
+
+/*
+ * Compute the hash code associated with a RELEXTLOCK.
+ *
+ * To avoid unnecessary recomputations of the hash code, we try to do this
+ * just once per function, and then pass it around as needed. Aside from
+ * passing the hashcode to hash_search_with_hash_value(), we can extract
+ * the lock partition number from the hashcode.
+ */
+#define RelExtLockTargetTagHashCode(relextlocktargettag) \
+ get_hash_value(RelExtLockHash, (const void *) relextlocktargettag)
+
+/*
+ * The lockmgr's shared hash tables are partitioned to reduce contention.
+ * To determine which partition a given relid belongs to, compute the tag's
+ * hash code with RelExtLockTargetTagHashCode(), then apply one of these macros.
+ * NB: NUM_RELEXTLOCK_PARTITIONS must be a power of 2!
+ */
+#define RelExtLockHashPartition(hashcode) \
+ ((hashcode) % NUM_RELEXTLOCK_PARTITIONS)
+#define RelExtLockHashPartitionLock(hashcode) \
+ (&MainLWLockArray[RELEXTLOCK_MANAGER_LWLOCK_OFFSET + \
+ RelExtLockHashPartition(hashcode)].lock)
+#define RelExtLockHashPartitionLockByIndex(i) \
+ (&MainLWLockArray[RELEXTLOCK_MANAGER_LWLOCK_OFFSET + (i)].lock)
+
+#define RELEXT_VAL_LOCK ((uint32) ((1 << 25)))
+#define RELEXT_LOCK_MASK ((uint32) ((1 << 25)))
+
+/* Must be greater than MAX_BACKENDS - which is 2^23-1, so we're fine. */
+#define RELEXT_PIN_COUNT_MASK ((uint32) ((1 << 24) - 1))
+
+/* FIXME */
+#define N_RELEXTENTS 512
+
+typedef struct RELEXTLOCK
+{
+ /* hash key -- must be first */
+ Oid relid;
+
+ /* state of exclusive lock and pin counts */
+ pg_atomic_uint32 state;
+
+ ConditionVariable cv;
+} RELEXTLOCK;
+
+/*
+ * This structure holds information about the relation extension lock
+ * held by this backend.  held_relextlock represents the RELEXTLOCK we
+ * are holding; we use it to release the lock during error recovery.
+ * At most one lock can be held at once.  Note that we can sometimes try
+ * to acquire a lock for an additional fork while holding the lock for
+ * the main fork; for example, when adding extra relation blocks for
+ * both a relation and its free space map.  But since this lock manager
+ * doesn't distinguish between forks, we just increment nLocks in that
+ * case.
+ */
+typedef struct relextlock_handle
+{
+ RELEXTLOCK *lock;
+ int nLocks;
+} relextlock_handle;
+
+static relextlock_handle held_relextlock;
+static int num_held_relextlocks = 0;
+
+static bool RelExtLockAcquire(Oid relid, bool conditional);
+static void RelExtLockRelease(Oid relid);
+static bool RelExtLockAttemptLock(RELEXTLOCK *extlock);
+static bool RelExtLockShrinkLocks(void);
+
+/*
+ * Pointers to hash tables containing relation extension lock state
+ *
+ * The RelExtLockHash hash table is in shared memory
+ */
+static HTAB *RelExtLockHash;
+
+Size
+RelExtLockShmemSize(void)
+{
+ /* Relation extension lock hash table */
+ return hash_estimate_size(N_RELEXTENTS, sizeof(RELEXTLOCK));
+}
+
+/*
+ * InitRelExtLocks
+ * Initialize the relation extension lock manager's data structures.
+ */
+void
+InitRelExtLocks(void)
+{
+ HASHCTL info;
+ long max_table_size = N_RELEXTENTS;
+ long init_table_size = max_table_size / 2;
+
+ /*
+ * Allocate hash table for RELEXTLOCK structs.  This stores one lock
+ * entry per relation.
+ */
+ MemSet(&info, 0, sizeof(info));
+ info.keysize = sizeof(Oid);
+ info.entrysize = sizeof(RELEXTLOCK);
+ info.num_partitions = NUM_RELEXTLOCK_PARTITIONS;
+
+ RelExtLockHash = ShmemInitHash("RELEXTLOCK Hash",
+ init_table_size,
+ max_table_size,
+ &info,
+ HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
+}
+
+/*
+ * LockRelationForExtension
+ *
+ * This lock is used to interlock addition of pages to relations.
+ * We need such locking because bufmgr/smgr definition of P_NEW is not
+ * race-condition-proof.
+ *
+ * We assume the caller is already holding some type of regular lock on
+ * the relation, so no AcceptInvalidationMessages call is needed here.
+ */
+void
+LockRelationForExtension(Relation relation)
+{
+ RelExtLockAcquire(relation->rd_id, false);
+}
+
+/*
+ * ConditionalLockRelationForExtension
+ *
+ * As above, but only lock if we can get the lock without blocking.
+ * Returns TRUE iff the lock was acquired.
+ */
+bool
+ConditionalLockRelationForExtension(Relation relation)
+{
+ return RelExtLockAcquire(relation->rd_id, true);
+}
+
+/*
+ * RelationExtensionLockWaiterCount
+ *
+ * Count the number of processes waiting for the given relation extension lock.
+ */
+int
+RelationExtensionLockWaiterCount(Relation relation)
+{
+ LWLock *partitionLock;
+ RELEXTLOCK *extlock;
+ Oid relid;
+ uint32 hashcode;
+ uint32 state;
+ uint32 pin_counts;
+ bool found;
+
+ relid = RelationGetRelid(relation);
+
+ hashcode = RelExtLockTargetTagHashCode(&relid);
+ partitionLock = RelExtLockHashPartitionLock(hashcode);
+
+ LWLockAcquire(partitionLock, LW_SHARED);
+
+ extlock = (RELEXTLOCK *) hash_search_with_hash_value(RelExtLockHash,
+ (void *) &relid,
+ hashcode,
+ HASH_FIND, &found);
+
+ LWLockRelease(partitionLock);
+
+ /* We assume that we already acquired this lock */
+ Assert(found);
+
+ state = pg_atomic_read_u32(&(extlock->state));
+ pin_counts = state & RELEXT_PIN_COUNT_MASK;
+
+ /* Subtract one for ourselves, the lock holder */
+ return pin_counts - 1;
+}
+
+/*
+ * UnlockRelationForExtension
+ */
+void
+UnlockRelationForExtension(Relation relation)
+{
+ RelExtLockRelease(relation->rd_id);
+}
+
+/*
+ * RelExtLockReleaseAll
+ *
+ * release all currently-held relation extension locks
+ */
+void
+RelExtLockReleaseAll(void)
+{
+ if (num_held_relextlocks > 0)
+ {
+ HOLD_INTERRUPTS();
+ RelExtLockRelease(held_relextlock.lock->relid);
+ }
+}
+
+/*
+ * RelExtLockHoldingLockCount
+ *
+ * Return the number of relation extension locks currently held.
+ */
+int
+RelExtLockHoldingLockCount(void)
+{
+ return num_held_relextlocks;
+}
+
+/*
+ * WaitForRelationExtensionLockToBeFree
+ *
+ * Wait for the relation extension lock on the given relation to
+ * be free without acquiring it.
+ */
+void
+WaitForRelationExtensionLockToBeFree(Relation relation)
+{
+ RELEXTLOCK *extlock = NULL;
+ Oid relid;
+ LWLock *partitionLock;
+ uint32 hashcode;
+ bool found;
+ bool pin_counted = false;
+
+ relid = RelationGetRelid(relation);
+
+ /* If the lock is held by me, no need to wait */
+ if (num_held_relextlocks > 0 && relid == held_relextlock.lock->relid)
+ return;
+
+ hashcode = RelExtLockTargetTagHashCode(&relid);
+ partitionLock = RelExtLockHashPartitionLock(hashcode);
+
+ for (;;)
+ {
+ uint32 state;
+
+ LWLockAcquire(partitionLock, LW_SHARED);
+
+ if (!extlock)
+ {
+ extlock = (RELEXTLOCK *) hash_search_with_hash_value(RelExtLockHash,
+ (void *) &relid,
+ hashcode, HASH_FIND,
+ &found);
+
+ /* Break if there is no RELEXTLOCK entry for this relation */
+ if (!found)
+ break;
+ }
+
+ /*
+ * Break if nobody is holding the lock on this relation. Before
+ * leaving, decrement pin count if we had been waiting.
+ */
+ state = pg_atomic_read_u32(&(extlock)->state);
+ if ((state & RELEXT_LOCK_MASK) == 0)
+ {
+ /* Decrement pin counter */
+ if (pin_counted)
+ state = pg_atomic_sub_fetch_u32(&(extlock->state), 1);
+ break;
+ }
+
+ /* Increment pin count so that the lock holder will wake us up */
+ if (!pin_counted)
+ {
+ pg_atomic_add_fetch_u32(&(extlock->state), 1);
+ pin_counted = true;
+ }
+
+ /* Release the partition lock before sleep */
+ LWLockRelease(partitionLock);
+
+ /* Sleep until the lock is released */
+ ConditionVariableSleep(&(extlock->cv), PG_WAIT_RELEXTLOCK);
+ }
+
+ LWLockRelease(partitionLock);
+ ConditionVariableCancelSleep();
+
+ return;
+}
+
+/*
+ * Acquire a relation extension lock, creating a RELEXTLOCK hash entry in the
+ * shared hash table if needed.  If we're trying to acquire the same lock we
+ * already hold, we just increment nLocks locally and return without touching
+ * the hash table.
+ */
+static bool
+RelExtLockAcquire(Oid relid, bool conditional)
+{
+ RELEXTLOCK *extlock = NULL;
+ LWLock *partitionLock;
+ uint32 hashcode;
+ bool found;
+ bool mustwait;
+
+ /*
+ * If we already hold the lock, we can just increase the count locally.
+ * Since we don't support deadlock detection for relation extension
+ * locks and don't control the order of lock acquisition, it must never
+ * happen that we try to take a new lock while holding another one.
+ */
+ if (num_held_relextlocks > 0)
+ {
+ if (relid == held_relextlock.lock->relid)
+ {
+ held_relextlock.nLocks++;
+ return true;
+ }
+ else
+ elog(ERROR,
+ "cannot acquire relation extension locks for multiple relations at the same time");
+ }
+
+ hashcode = RelExtLockTargetTagHashCode(&relid);
+ partitionLock = RelExtLockHashPartitionLock(hashcode);
+
+ for (;;)
+ {
+
+ LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+
+ if (!extlock)
+ {
+ extlock = (RELEXTLOCK *) hash_search_with_hash_value(RelExtLockHash,
+ (void *) &relid,
+ hashcode, HASH_ENTER_NULL,
+ &found);
+
+ /*
+ * Failed to create new hash entry. Try to shrink the hash table and
+ * retry.
+ */
+ if (!extlock)
+ {
+ bool ret;
+
+ /*
+ * Release the partition lock before shrinking, since all partition
+ * locks are taken during the shrink.
+ */
+ LWLockRelease(partitionLock);
+
+ /* Shrink */
+ ret = RelExtLockShrinkLocks();
+
+ if (!ret)
+ ereport(ERROR,
+ (errmsg("out of shared memory"),
+ errhint("You might need to increase max_locks_per_transaction.")));
+ continue;
+ }
+
+ /* Not found, initialize */
+ if (!found)
+ {
+ extlock->relid = relid;
+ pg_atomic_init_u32(&(extlock->state), 0);
+ ConditionVariableInit(&(extlock->cv));
+ }
+ }
+
+ /* Increment pin count */
+ pg_atomic_add_fetch_u32(&(extlock->state), 1);
+
+ mustwait = RelExtLockAttemptLock(extlock);
+
+ if (!mustwait)
+ break; /* got the lock */
+
+ /* Could not get the lock; return if we're doing conditional locking */
+ if (conditional)
+ {
+ pg_atomic_sub_fetch_u32(&(extlock->state), 1);
+ LWLockRelease(partitionLock);
+ return false;
+ }
+
+ /* Release the partition lock before sleep */
+ LWLockRelease(partitionLock);
+
+ /* Sleep until the lock is released */
+ ConditionVariableSleep(&(extlock->cv), PG_WAIT_RELEXTLOCK);
+ }
+
+ LWLockRelease(partitionLock);
+ ConditionVariableCancelSleep();
+
+ Assert(!mustwait);
+
+ /* Remember lock held by this backend */
+ held_relextlock.lock = extlock;
+ held_relextlock.nLocks = 1;
+ num_held_relextlocks++;
+
+ /* Always return true if not conditional lock */
+ return true;
+}
+
+/*
+ * RelExtLockRelease
+ *
+ * Release a previously acquired relation extension lock.  We don't remove
+ * the hash entry at this point; once the hash table becomes full, all
+ * un-pinned entries are removed.
+ */
+static void
+RelExtLockRelease(Oid relid)
+{
+ RELEXTLOCK *extlock;
+ LWLock *partitionLock;
+ uint32 hashcode;
+ uint32 state;
+ uint32 pin_counts;
+ /* Release the lock and decrement the pin counter in a single atomic op */
+ uint32 val = RELEXT_VAL_LOCK | 1;
+
+ /* We should have acquired a lock before releasing */
+ Assert(num_held_relextlocks > 0);
+
+ if (relid != held_relextlock.lock->relid)
+ ereport(ERROR,
+ (errcode(ERRCODE_LOCK_NOT_AVAILABLE),
+ errmsg("relation extension lock for %u is not held",
+ relid)));
+
+ /* Decrease the lock count locally */
+ held_relextlock.nLocks--;
+
+ /* If we are still holding the lock, we're done */
+ if (held_relextlock.nLocks > 0)
+ return;
+
+ hashcode = RelExtLockTargetTagHashCode(&relid);
+ partitionLock = RelExtLockHashPartitionLock(hashcode);
+
+ extlock = held_relextlock.lock;
+
+ /* Keep holding the partition lock until unlocking is done */
+ LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+
+ /* Release the lock and decrement pin counter */
+ state = pg_atomic_sub_fetch_u32(&(extlock->state), val);
+
+ LWLockRelease(partitionLock);
+
+ num_held_relextlocks--;
+
+ /* Wake up waiters if there is someone looking at this lock */
+ pin_counts = state & RELEXT_PIN_COUNT_MASK;
+ if (pin_counts > 0)
+ ConditionVariableBroadcast(&(extlock->cv));
+}
+
+/*
+ * Internal function that attempts to atomically acquire the relation
+ * extension lock.
+ *
+ * Returns true if the lock isn't free and we need to wait.
+ */
+static bool
+RelExtLockAttemptLock(RELEXTLOCK *extlock)
+{
+ uint32 oldstate;
+
+ oldstate = pg_atomic_read_u32(&extlock->state);
+
+ while (true)
+ {
+ uint32 desired_state;
+ bool lock_free;
+
+ desired_state = oldstate;
+
+ lock_free = (oldstate & RELEXT_LOCK_MASK) == 0;
+ if (lock_free)
+ desired_state += RELEXT_VAL_LOCK;
+
+ if (pg_atomic_compare_exchange_u32(&extlock->state,
+ &oldstate, desired_state))
+ {
+ if (lock_free)
+ return false;
+ else
+ return true;
+ }
+ }
+ pg_unreachable();
+}
+
+/*
+ * Reclaim all un-pinned RELEXTLOCK entries from the hash table.
+ */
+static bool
+RelExtLockShrinkLocks(void)
+{
+ HASH_SEQ_STATUS hstat;
+ RELEXTLOCK *extlock;
+ List *entries_to_remove = NIL;
+ ListCell *cell;
+ int i;
+
+ /*
+ * To ensure consistency, take all partition locks in exclusive
+ * mode.
+ */
+ for (i = 0; i < NUM_RELEXTLOCK_PARTITIONS; i++)
+ LWLockAcquire(RelExtLockHashPartitionLockByIndex(i), LW_EXCLUSIVE);
+
+ /* Collect all un-pinned RELEXTLOCK entries */
+ hash_seq_init(&hstat, RelExtLockHash);
+ while ((extlock = (RELEXTLOCK *) hash_seq_search(&hstat)) != NULL)
+ {
+ uint32 state = pg_atomic_read_u32(&(extlock->state));
+ uint32 pin_counts = state & RELEXT_PIN_COUNT_MASK;
+
+ if (pin_counts == 0)
+ entries_to_remove = lappend(entries_to_remove, extlock);
+ }
+
+ /* We could not find any entries that we can remove right now */
+ if (list_length(entries_to_remove) == 0)
+ {
+ /* Release all partition locks before returning */
+ for (i = 0; i < NUM_RELEXTLOCK_PARTITIONS; i++)
+ LWLockRelease(RelExtLockHashPartitionLockByIndex(i));
+ return false;
+ }
+
+ /* Remove collected entries from the RelExtLockHash hash table */
+ foreach (cell, entries_to_remove)
+ {
+ RELEXTLOCK *el = (RELEXTLOCK *) lfirst(cell);
+ uint32 hc = RelExtLockTargetTagHashCode(&(el->relid));
+
+ hash_search_with_hash_value(RelExtLockHash, (void *) &(el->relid),
+ hc, HASH_REMOVE, NULL);
+ }
+
+ /* Release all partition locks */
+ for (i = 0; i < NUM_RELEXTLOCK_PARTITIONS; i++)
+ LWLockRelease(RelExtLockHashPartitionLockByIndex(i));
+
+ return true;
+}
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index da5679b..4fbc0c4 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -319,78 +319,6 @@ UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode)
}
/*
- * LockRelationForExtension
- *
- * This lock tag is used to interlock addition of pages to relations.
- * We need such locking because bufmgr/smgr definition of P_NEW is not
- * race-condition-proof.
- *
- * We assume the caller is already holding some type of regular lock on
- * the relation, so no AcceptInvalidationMessages call is needed here.
- */
-void
-LockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- (void) LockAcquire(&tag, lockmode, false, false);
-}
-
-/*
- * ConditionalLockRelationForExtension
- *
- * As above, but only lock if we can get the lock without blocking.
- * Returns true iff the lock was acquired.
- */
-bool
-ConditionalLockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- return (LockAcquire(&tag, lockmode, false, true) != LOCKACQUIRE_NOT_AVAIL);
-}
-
-/*
- * RelationExtensionLockWaiterCount
- *
- * Count the number of processes waiting for the given relation extension lock.
- */
-int
-RelationExtensionLockWaiterCount(Relation relation)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- return LockWaiterCount(&tag);
-}
-
-/*
- * UnlockRelationForExtension
- */
-void
-UnlockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- LockRelease(&tag, lockmode, false);
-}
-
-/*
* LockPage
*
* Obtain a page-level lock. This is currently used by some index access
@@ -961,12 +889,6 @@ DescribeLockTag(StringInfo buf, const LOCKTAG *tag)
tag->locktag_field2,
tag->locktag_field1);
break;
- case LOCKTAG_RELATION_EXTEND:
- appendStringInfo(buf,
- _("extension of relation %u of database %u"),
- tag->locktag_field2,
- tag->locktag_field1);
- break;
case LOCKTAG_PAGE:
appendStringInfo(buf,
_("page %u of relation %u of database %u"),
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 5833086..8b642ed 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -40,11 +40,13 @@
#include "miscadmin.h"
#include "pg_trace.h"
#include "pgstat.h"
+#include "storage/extension_lock.h"
#include "storage/proc.h"
#include "storage/procarray.h"
#include "storage/sinvaladt.h"
#include "storage/spin.h"
#include "storage/standby.h"
+#include "storage/lmgr.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
#include "utils/resowner_private.h"
@@ -717,6 +719,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
int status;
bool log_lock = false;
+ /*
+ * We allow to take a relation extension lock after took a
+ * heavy-weight lock. However, since we don't have dead lock
+ * detection mechanism between heavy-weight lock and relation
+ * extension lock it's not allowed taking an another heavy-weight
+ * lock while holding a relation extension lock.
+ */
+ Assert(RelExtLockHoldingLockCount() == 0);
+
if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
elog(ERROR, "unrecognized lock method: %d", lockmethodid);
lockMethodTable = LockMethods[lockmethodid];
@@ -3366,6 +3377,7 @@ LockShmemSize(void)
/* lock hash table */
max_table_size = NLOCKENTS();
size = add_size(size, hash_estimate_size(max_table_size, sizeof(LOCK)));
+ size = add_size(size, hash_estimate_size(max_table_size, sizeof(LWLock)));
/* proclock hash table */
max_table_size *= 2;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index e5c3e86..746e263 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -81,6 +81,7 @@
#include "pg_trace.h"
#include "postmaster/postmaster.h"
#include "replication/slot.h"
+#include "storage/extension_lock.h"
#include "storage/ipc.h"
#include "storage/predicate.h"
#include "storage/proc.h"
@@ -451,6 +452,13 @@ InitializeLWLocks(void)
for (id = 0; id < NUM_PREDICATELOCK_PARTITIONS; id++, lock++)
LWLockInitialize(&lock->lock, LWTRANCHE_PREDICATE_LOCK_MANAGER);
+ /* Initialize relation extension lmgr's LWLocks in main array */
+ lock = MainLWLockArray + NUM_INDIVIDUAL_LWLOCKS +
+ NUM_BUFFER_PARTITIONS + NUM_LOCK_PARTITIONS +
+ NUM_PREDICATELOCK_PARTITIONS;
+ for (id = 0; id < NUM_RELEXTLOCK_PARTITIONS; id++, lock++)
+ LWLockInitialize(&lock->lock, LWTRANCHE_RELEXT_LOCK_MANAGER);
+
/* Initialize named tranches. */
if (NamedLWLockTrancheRequests > 0)
{
@@ -508,6 +516,8 @@ RegisterLWLockTranches(void)
LWLockRegisterTranche(LWTRANCHE_LOCK_MANAGER, "lock_manager");
LWLockRegisterTranche(LWTRANCHE_PREDICATE_LOCK_MANAGER,
"predicate_lock_manager");
+ LWLockRegisterTranche(LWTRANCHE_RELEXT_LOCK_MANAGER,
+ "relext_lock_manager");
LWLockRegisterTranche(LWTRANCHE_PARALLEL_QUERY_DSA,
"parallel_query_dsa");
LWLockRegisterTranche(LWTRANCHE_SESSION_DSA,
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 5f6727d..40ab31a 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -44,6 +44,7 @@
#include "replication/slot.h"
#include "replication/syncrep.h"
#include "storage/condition_variable.h"
+#include "storage/extension_lock.h"
#include "storage/standby.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
@@ -765,6 +766,8 @@ ProcReleaseLocks(bool isCommit)
return;
/* If waiting, get off wait queue (should only be needed after error) */
LockErrorCleanup();
+ /* Release relation extension locks */
+ RelExtLockReleaseAll();
/* Release standard locks, including session-level if aborting */
LockReleaseAll(DEFAULT_LOCKMETHOD, !isCommit);
/* Release transaction-level advisory locks */
diff --git a/src/backend/utils/adt/lockfuncs.c b/src/backend/utils/adt/lockfuncs.c
index 9e0a8ab..6d8916c 100644
--- a/src/backend/utils/adt/lockfuncs.c
+++ b/src/backend/utils/adt/lockfuncs.c
@@ -25,7 +25,6 @@
/* This must match enum LockTagType! */
const char *const LockTagTypeNames[] = {
"relation",
- "extend",
"page",
"tuple",
"transactionid",
@@ -234,7 +233,6 @@ pg_lock_status(PG_FUNCTION_ARGS)
switch ((LockTagType) instance->locktag.locktag_type)
{
case LOCKTAG_RELATION:
- case LOCKTAG_RELATION_EXTEND:
values[1] = ObjectIdGetDatum(instance->locktag.locktag_field1);
values[2] = ObjectIdGetDatum(instance->locktag.locktag_field2);
nulls[3] = true;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 089b7c3..c6c6571 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -735,6 +735,7 @@ typedef enum BackendState
* ----------
*/
#define PG_WAIT_LWLOCK 0x01000000U
+#define PG_WAIT_RELEXTLOCK 0x02000000U
#define PG_WAIT_LOCK 0x03000000U
#define PG_WAIT_BUFFER_PIN 0x04000000U
#define PG_WAIT_ACTIVITY 0x05000000U
diff --git a/src/include/storage/extension_lock.h b/src/include/storage/extension_lock.h
new file mode 100644
index 0000000..cbce89b
--- /dev/null
+++ b/src/include/storage/extension_lock.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * extension_lock.h
+ * Relation extension lock manager
+ *
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/extension_lock.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef EXTENSION_LOCK_H
+#define EXTENSION_LOCK_H
+
+#ifdef FRONTEND
+#error "extension_lock.h may not be included from frontend code"
+#endif
+
+#include "port/atomics.h"
+#include "storage/s_lock.h"
+#include "storage/condition_variable.h"
+#include "storage/proclist_types.h"
+
+/* Number of partitions the shared relation extension lock tables are divided into */
+#define LOG2_NUM_RELEXTLOCK_PARTITIONS 4
+#define NUM_RELEXTLOCK_PARTITIONS (1 << LOG2_NUM_RELEXTLOCK_PARTITIONS)
+
+/* Relation extension lock functions */
+extern Size RelExtLockShmemSize(void);
+extern void InitRelExtLocks(void);
+extern void LockRelationForExtension(Relation relation);
+extern void UnlockRelationForExtension(Relation relation);
+extern bool ConditionalLockRelationForExtension(Relation relation);
+extern int RelationExtensionLockWaiterCount(Relation relation);
+extern void WaitForRelationExtensionLockToBeFree(Relation relation);
+extern void RelExtLockReleaseAll(void);
+extern int RelExtLockHoldingLockCount(void);
+
+#endif /* EXTENSION_LOCK_H */
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 0b92322..7e6b80c 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -50,13 +50,6 @@ extern bool LockHasWaitersRelation(Relation relation, LOCKMODE lockmode);
extern void LockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
-/* Lock a relation for extension */
-extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
-extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
-extern bool ConditionalLockRelationForExtension(Relation relation,
- LOCKMODE lockmode);
-extern int RelationExtensionLockWaiterCount(Relation relation);
-
/* Lock a page (currently only used within indexes) */
extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
extern bool ConditionalLockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 765431e..3be18ea 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -138,8 +138,6 @@ typedef uint16 LOCKMETHODID;
typedef enum LockTagType
{
LOCKTAG_RELATION, /* whole relation */
- /* ID info for a relation is DB OID + REL OID; DB OID = 0 if shared */
- LOCKTAG_RELATION_EXTEND, /* the right to extend a relation */
/* same ID info as RELATION */
LOCKTAG_PAGE, /* one page of a relation */
/* ID info for a page is RELATION info + BlockNumber */
@@ -198,14 +196,6 @@ typedef struct LOCKTAG
(locktag).locktag_type = LOCKTAG_RELATION, \
(locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
-#define SET_LOCKTAG_RELATION_EXTEND(locktag,dboid,reloid) \
- ((locktag).locktag_field1 = (dboid), \
- (locktag).locktag_field2 = (reloid), \
- (locktag).locktag_field3 = 0, \
- (locktag).locktag_field4 = 0, \
- (locktag).locktag_type = LOCKTAG_RELATION_EXTEND, \
- (locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
-
#define SET_LOCKTAG_PAGE(locktag,dboid,reloid,blocknum) \
((locktag).locktag_field1 = (dboid), \
(locktag).locktag_field2 = (reloid), \
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 596fdad..bef48ea 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -126,8 +126,11 @@ extern PGDLLIMPORT int NamedLWLockTrancheRequests;
(BUFFER_MAPPING_LWLOCK_OFFSET + NUM_BUFFER_PARTITIONS)
#define PREDICATELOCK_MANAGER_LWLOCK_OFFSET \
(LOCK_MANAGER_LWLOCK_OFFSET + NUM_LOCK_PARTITIONS)
-#define NUM_FIXED_LWLOCKS \
+#define RELEXTLOCK_MANAGER_LWLOCK_OFFSET \
(PREDICATELOCK_MANAGER_LWLOCK_OFFSET + NUM_PREDICATELOCK_PARTITIONS)
+#define NUM_FIXED_LWLOCKS \
+ (PREDICATELOCK_MANAGER_LWLOCK_OFFSET + NUM_PREDICATELOCK_PARTITIONS + \
+ NUM_RELEXTLOCK_PARTITIONS)
typedef enum LWLockMode
{
@@ -211,6 +214,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_BUFFER_MAPPING,
LWTRANCHE_LOCK_MANAGER,
LWTRANCHE_PREDICATE_LOCK_MANAGER,
+ LWTRANCHE_RELEXT_LOCK_MANAGER,
LWTRANCHE_PARALLEL_QUERY_DSA,
LWTRANCHE_SESSION_DSA,
LWTRANCHE_SESSION_RECORD_TABLE,
On Thu, Nov 30, 2017 at 6:20 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
This code ignores the existence of multiple databases; RELEXTLOCK
contains a relid, but no database OID. That's easy enough to fix, but
it actually causes no problem unless, by bad luck, you have two
relations with the same OID in different databases that are both being
rapidly extended at the same time -- and even then, it's only a
performance problem, not a correctness problem. In fact, I wonder if
we shouldn't go further: instead of creating these RELEXTLOCK
structures dynamically, let's just have a fixed number of them, say
1024. When we get a request to take a lock, hash <dboid, reloid> and
take the result modulo 1024; lock the RELEXTLOCK at that offset in the
array.

Attached the latest patch incorporating the comments, except for the fix of
the memory size for relext locks.
It doesn't do anything about the comment of mine quoted above. Since
it's only possible to hold one relation extension lock at a time, we
don't really need the hash table here at all. We can just have an
array of 1024 or so locks and map every <db,relid> pair onto one of
them by hashing. The worst thing we'll get is some false contention,
but that doesn't seem awful, and it would permit considerable further
simplification of this code -- and maybe make it faster in the
process, because we'd no longer need the hash table, or the pin count,
or the extra LWLocks that protect the hash table. All we would have
is atomic operations manipulating the lock state, which seems like it
would be quite a lot faster and simpler.
BTW, I think RelExtLockReleaseAll is broken because it shouldn't
HOLD_INTERRUPTS(); I also think it's kind of silly to loop here when
we know we can only hold one lock. Maybe RelExtLockRelease can take
bool force and do if (force) held_relextlock.nLocks = 0; else
held_relextlock.nLocks--. Or, better yet, have the caller adjust that
value and then only call RelExtLockRelease() if we needed to release
the lock in shared memory. That avoids needless branching. On a
related note, is there any point in having both held_relextlock.nLocks
and num_held_relextlocks?
I think RelationExtensionLock should be a new type of IPC wait event,
rather than a whole new category.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Dec 1, 2017 at 3:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Nov 30, 2017 at 6:20 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
This code ignores the existence of multiple databases; RELEXTLOCK
contains a relid, but no database OID. That's easy enough to fix, but
it actually causes no problem unless, by bad luck, you have two
relations with the same OID in different databases that are both being
rapidly extended at the same time -- and even then, it's only a
performance problem, not a correctness problem. In fact, I wonder if
we shouldn't go further: instead of creating these RELEXTLOCK
structures dynamically, let's just have a fixed number of them, say
1024. When we get a request to take a lock, hash <dboid, reloid> and
take the result modulo 1024; lock the RELEXTLOCK at that offset in the
array.

Attached the latest patch incorporating the comments, except for the fix of
the memory size for relext locks.

It doesn't do anything about the comment of mine quoted above.
Sorry, I'd missed the comment.
Since it's only possible to hold one relation extension lock at a time, we
don't really need the hash table here at all. We can just have an
array of 1024 or so locks and map every <db,relid> pair onto one of
them by hashing. The worst thing we'll get is some false contention,
but that doesn't seem awful, and it would permit considerable further
simplification of this code -- and maybe make it faster in the
process, because we'd no longer need the hash table, or the pin count,
or the extra LWLocks that protect the hash table. All we would have
is atomic operations manipulating the lock state, which seems like it
would be quite a lot faster and simpler.
Agreed. With this change, we will have an array of structs, each holding
a lock state and a condition variable. The lock state holds the wait
count as well as the lock status.
BTW, I think RelExtLockReleaseAll is broken because it shouldn't
HOLD_INTERRUPTS(); I also think it's kind of silly to loop here when
we know we can only hold one lock. Maybe RelExtLockRelease can take
bool force and do if (force) held_relextlock.nLocks = 0; else
held_relextlock.nLocks--. Or, better yet, have the caller adjust that
value and then only call RelExtLockRelease() if we needed to release
the lock in shared memory. That avoids needless branching.
Agreed. I'd vote for the latter.
On a
related note, is there any point in having both held_relextlock.nLocks
and num_held_relextlocks?
num_held_relextlocks is actually unnecessary, will be removed.
I think RelationExtensionLock should be a new type of IPC wait event,
rather than a whole new category.
Hmm, I thought the IPC wait event types were for events where a process
communicates with other processes for a shared purpose, for example
parallel query, synchronous replication, etc. On the other hand,
relation extension locks are a kind of locking mechanism. That's why I
added a new category. But maybe it can fit into the IPC wait events.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Fri, Dec 1, 2017 at 10:26 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Attached is an updated patch. I've run the performance measurement again
on the same configuration as before, since the acquiring/releasing
procedures have changed.
----- PATCHED -----
tps = 162.579320 (excluding connections establishing)
tps = 162.144352 (excluding connections establishing)
tps = 160.659403 (excluding connections establishing)
tps = 161.213995 (excluding connections establishing)
tps = 164.560460 (excluding connections establishing)
----- HEAD -----
tps = 157.738645 (excluding connections establishing)
tps = 146.178575 (excluding connections establishing)
tps = 143.788961 (excluding connections establishing)
tps = 144.886594 (excluding connections establishing)
tps = 145.496337 (excluding connections establishing)
* micro-benchmark
PATCHED = 1.61757e+07 (cycles/sec)
HEAD = 1.48685e+06 (cycles/sec)
The patched version is about 10 times faster than current HEAD.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachments:
Moving_extension_lock_out_of_heavyweight_lock_v9.patch (text/x-patch)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 8d461c8..7aa7981 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -669,8 +669,8 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
Heavyweight locks, also known as lock manager locks or simply locks,
primarily protect SQL-visible objects such as tables. However,
they are also used to ensure mutual exclusion for certain internal
- operations such as relation extension. <literal>wait_event</literal> will
- identify the type of lock awaited.
+ operations such as waiting for a transaction to finish.
+ <literal>wait_event</literal> will identify the type of lock awaited.
</para>
</listitem>
<listitem>
@@ -1122,10 +1122,6 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to acquire a lock on a relation.</entry>
</row>
<row>
- <entry><literal>extend</literal></entry>
- <entry>Waiting to extend a relation.</entry>
- </row>
- <row>
<entry><literal>page</literal></entry>
<entry>Waiting to acquire a lock on page of a relation.</entry>
</row>
@@ -1315,6 +1311,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting for group leader to update transaction status at transaction end.</entry>
</row>
<row>
+ <entry><literal>RelationExtensionLock</literal></entry>
+ <entry>Waiting to extend a relation.</entry>
+ </row>
+ <row>
<entry><literal>ReplicationOriginDrop</literal></entry>
<entry>Waiting for a replication origin to become inactive to be dropped.</entry>
</row>
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 09db5c6..05cca9d 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -17,6 +17,7 @@
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/freespace.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
@@ -623,8 +624,7 @@ brin_page_cleanup(Relation idxrel, Buffer buf)
*/
if (PageIsNew(page))
{
- LockRelationForExtension(idxrel, ShareLock);
- UnlockRelationForExtension(idxrel, ShareLock);
+ WaitForRelationExtensionLockToBeFree(idxrel);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
if (PageIsNew(page))
@@ -716,7 +716,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
*/
if (!RELATION_IS_LOCAL(irel))
{
- LockRelationForExtension(irel, ExclusiveLock);
+ LockRelationForExtension(irel);
extensionLockHeld = true;
}
buf = ReadBuffer(irel, P_NEW);
@@ -768,7 +768,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
}
if (extensionLockHeld)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel);
ReleaseBuffer(buf);
return InvalidBuffer;
@@ -778,7 +778,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
if (extensionLockHeld)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel);
page = BufferGetPage(buf);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index 03e53ce..af8f5ce 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -29,6 +29,7 @@
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/lmgr.h"
#include "utils/rel.h"
@@ -570,7 +571,7 @@ revmap_physical_extend(BrinRevmap *revmap)
else
{
if (needLock)
- LockRelationForExtension(irel, ExclusiveLock);
+ LockRelationForExtension(irel);
buf = ReadBuffer(irel, P_NEW);
if (BufferGetBlockNumber(buf) != mapBlk)
@@ -582,7 +583,7 @@ revmap_physical_extend(BrinRevmap *revmap)
* page from under whoever is using it.
*/
if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel);
LockBuffer(revmap->rm_metaBuf, BUFFER_LOCK_UNLOCK);
ReleaseBuffer(buf);
return;
@@ -591,7 +592,7 @@ revmap_physical_extend(BrinRevmap *revmap)
page = BufferGetPage(buf);
if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel);
}
/* Check that it's a regular block (or an empty page) */
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index d9c6483..8d35918 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -21,6 +21,7 @@
#include "catalog/pg_collation.h"
#include "catalog/pg_type.h"
#include "miscadmin.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/builtins.h"
@@ -325,13 +326,13 @@ GinNewBuffer(Relation index)
/* Must extend the file */
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
buffer = ReadBuffer(index, P_NEW);
LockBuffer(buffer, GIN_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
return buffer;
}
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 394bc83..d769a76 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -20,6 +20,7 @@
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "postmaster/autovacuum.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/memutils.h"
@@ -716,10 +717,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
npages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
totFreePages = 0;
@@ -766,10 +767,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
stats->pages_free = totFreePages;
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
stats->num_pages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
return stats;
}
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index d8d1c0a..76171a5 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -20,6 +20,7 @@
#include "access/htup_details.h"
#include "access/reloptions.h"
#include "catalog/pg_opclass.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/builtins.h"
@@ -821,13 +822,13 @@ gistNewBuffer(Relation r)
needLock = !RELATION_IS_LOCAL(r);
if (needLock)
- LockRelationForExtension(r, ExclusiveLock);
+ LockRelationForExtension(r);
buffer = ReadBuffer(r, P_NEW);
LockBuffer(buffer, GIST_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(r, ExclusiveLock);
+ UnlockRelationForExtension(r);
return buffer;
}
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 77d9d12..42ef36a 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -18,6 +18,7 @@
#include "access/gist_private.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
@@ -59,10 +60,10 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
/* try to find deleted pages */
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
npages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
totFreePages = 0;
for (blkno = GIST_ROOT_BLKNO + 1; blkno < npages; blkno++)
@@ -91,10 +92,10 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
/* return statistics */
stats->pages_free = totFreePages;
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
stats->num_pages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
return stats;
}
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 13e3bdc..9287f2d 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -20,6 +20,7 @@
#include "access/htup_details.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/freespace.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
@@ -519,11 +520,11 @@ loop:
if (needLock)
{
if (!use_fsm)
- LockRelationForExtension(relation, ExclusiveLock);
- else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock))
+ LockRelationForExtension(relation);
+ else if (!ConditionalLockRelationForExtension(relation))
{
/* Couldn't get the lock immediately; wait for it. */
- LockRelationForExtension(relation, ExclusiveLock);
+ LockRelationForExtension(relation);
/*
* Check if some other backend has extended a block for us while
@@ -537,7 +538,7 @@ loop:
*/
if (targetBlock != InvalidBlockNumber)
{
- UnlockRelationForExtension(relation, ExclusiveLock);
+ UnlockRelationForExtension(relation);
goto loop;
}
@@ -576,7 +577,7 @@ loop:
* against vacuumlazy.c --- see comments therein.
*/
if (needLock)
- UnlockRelationForExtension(relation, ExclusiveLock);
+ UnlockRelationForExtension(relation);
/*
* We need to initialize the empty new page. Double-check that it really
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 4c2a13a..2efee68 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -90,6 +90,7 @@
#include "access/xlog.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
#include "utils/inval.h"
@@ -641,7 +642,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
* Note that another backend might have extended or created the relation
* by the time we get the lock.
*/
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
@@ -679,7 +680,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
/* Update local cache with the up-to-date size */
rel->rd_smgr->smgr_vm_nblocks = vm_nblocks_now;
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
pfree(pg);
}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index c774349..7824c92 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -28,6 +28,7 @@
#include "access/xlog.h"
#include "access/xloginsert.h"
#include "miscadmin.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
@@ -659,7 +660,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
needLock = !RELATION_IS_LOCAL(rel);
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
buf = ReadBuffer(rel, P_NEW);
@@ -673,7 +674,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
* condition against btvacuumscan --- see comments therein.
*/
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
/* Initialize the new page before returning it */
page = BufferGetPage(buf);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 399e6a1..5af1c21 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -25,6 +25,7 @@
#include "commands/vacuum.h"
#include "pgstat.h"
#include "storage/condition_variable.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
@@ -1058,10 +1059,10 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
{
/* Get the current relation length */
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
num_pages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
/* Quit if we've scanned the whole relation */
if (blkno >= num_pages)
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index bd5301f..0ff53a3 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -23,6 +23,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/builtins.h"
+#include "storage/extension_lock.h"
#include "utils/index_selfuncs.h"
#include "utils/lsyscache.h"
@@ -230,13 +231,13 @@ SpGistNewBuffer(Relation index)
/* Must extend the file */
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
buffer = ReadBuffer(index, P_NEW);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
return buffer;
}
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index d7d5e90..385d1cb 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -24,6 +24,7 @@
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/snapmgr.h"
@@ -824,10 +825,10 @@ spgvacuumscan(spgBulkDeleteState *bds)
{
/* Get the current relation length */
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
num_pages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
/* Quit if we've scanned the whole relation */
if (blkno >= num_pages)
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 20ce431..4a72223 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -54,6 +54,7 @@
#include "portability/instr_time.h"
#include "postmaster/autovacuum.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/freespace.h"
#include "storage/lmgr.h"
#include "utils/lsyscache.h"
@@ -860,8 +861,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* it's got exclusive lock on the whole relation.
*/
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
- LockRelationForExtension(onerel, ExclusiveLock);
- UnlockRelationForExtension(onerel, ExclusiveLock);
+ WaitForRelationExtensionLockToBeFree(onerel);
LockBufferForCleanup(buf);
if (PageIsNew(page))
{
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 5c256ff..210552f 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3616,6 +3616,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_CLOG_GROUP_UPDATE:
event_name = "ClogGroupUpdate";
break;
+ case WAIT_EVENT_RELATION_EXTENSION_LOCK:
+ event_name = "RelationExtensionLock";
+ break;
case WAIT_EVENT_REPLICATION_ORIGIN_DROP:
event_name = "ReplicationOriginDrop";
break;
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 4648473..172a48c 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -26,6 +26,7 @@
#include "access/htup_details.h"
#include "access/xlogutils.h"
#include "miscadmin.h"
+#include "storage/extension_lock.h"
#include "storage/freespace.h"
#include "storage/fsm_internals.h"
#include "storage/lmgr.h"
@@ -624,7 +625,7 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
* Note that another backend might have extended or created the relation
* by the time we get the lock.
*/
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
@@ -652,7 +653,7 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
/* Update local cache with the up-to-date size */
rel->rd_smgr->smgr_fsm_nblocks = fsm_nblocks_now;
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
pfree(pg);
}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2d1ed14..3b6a6f7 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -35,6 +35,7 @@
#include "replication/origin.h"
#include "storage/bufmgr.h"
#include "storage/dsm.h"
+#include "storage/extension_lock.h"
#include "storage/ipc.h"
#include "storage/pg_shmem.h"
#include "storage/pmsignal.h"
@@ -133,6 +134,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
size = add_size(size, BackgroundWorkerShmemSize());
size = add_size(size, MultiXactShmemSize());
size = add_size(size, LWLockShmemSize());
+ size = add_size(size, RelExtLockShmemSize());
size = add_size(size, ProcArrayShmemSize());
size = add_size(size, BackendStatusShmemSize());
size = add_size(size, SInvalShmemSize());
@@ -235,6 +237,11 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
InitPredicateLocks();
/*
+ * Set up relation extension lock manager
+ */
+ InitRelExtLocks();
+
+ /*
* Set up process table
*/
if (!IsUnderPostmaster)
diff --git a/src/backend/storage/lmgr/Makefile b/src/backend/storage/lmgr/Makefile
index e1b787e..2334a40 100644
--- a/src/backend/storage/lmgr/Makefile
+++ b/src/backend/storage/lmgr/Makefile
@@ -13,7 +13,7 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = lmgr.o lock.o proc.o deadlock.o lwlock.o lwlocknames.o spin.o \
- s_lock.o predicate.o condition_variable.o
+ s_lock.o predicate.o condition_variable.o extension_lock.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 56b0a12..71eb293 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -29,6 +29,13 @@ process has to wait for an LWLock, it blocks on a SysV semaphore so as
to not consume CPU time. Waiting processes will be granted the lock in
arrival order. There is no timeout.
+* Relation extension locks. The relation extension lock manager is
+specialized in relation extensions. In PostgreSQL 11 relation extension
+lock has been moved out of regular lock. It's similar to regular locks
+but doesn't have full dead lock detection, group locking and multiple
+lock modes. When conflicts occur, lock waits are implemented using
+condition variables.
+
* Regular locks (a/k/a heavyweight locks). The regular lock manager
supports a variety of lock modes with table-driven semantics, and it has
full deadlock detection and automatic release at transaction end.
@@ -40,9 +47,9 @@ Acquisition of either a spinlock or a lightweight lock causes query
cancel and die() interrupts to be held off until all such locks are
released. No such restriction exists for regular locks, however. Also
note that we can accept query cancel and die() interrupts while waiting
-for a regular lock, but we will not accept them while waiting for
-spinlocks or LW locks. It is therefore not a good idea to use LW locks
-when the wait time might exceed a few seconds.
+for a relation extension lock or a regular lock, but we will not accept
+them while waiting for spinlocks or LW locks. It is therefore not a good
+idea to use LW locks when the wait time might exceed a few seconds.
The rest of this README file discusses the regular lock manager in detail.
diff --git a/src/backend/storage/lmgr/extension_lock.c b/src/backend/storage/lmgr/extension_lock.c
new file mode 100644
index 0000000..ed11d58
--- /dev/null
+++ b/src/backend/storage/lmgr/extension_lock.c
@@ -0,0 +1,463 @@
+/*-------------------------------------------------------------------------
+ *
+ * extension_lock.c
+ * Relation extension lock manager
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/lmgr/extension_lock.c
+ *
+ * NOTES:
+ *
+ * This lock manager is specialized for relation extension locks; it is
+ * lightweight and interruptible. It is similar to the heavyweight lock
+ * manager, but it has no deadlock detection, no group locking, and only
+ * a single lock mode.
+ *
+ * The entries for relation extension locks are allocated on the shared
+ * memory as an array. The pair of database id and relation id maps to
+ * one of them by hashing.
+ *
+ * For lock acquisition we use an atomic compare-and-exchange on the
+ * state variable. When a process tries to acquire a lock that conflicts
+ * with an existing lock, it is put to sleep on a condition variable,
+ * unless conditional locking was requested. To release the lock, we
+ * use an atomic decrement.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/extension_lock.h"
+#include "utils/rel.h"
+
+/* The total entries of relation extension lock on shared memory */
+#define N_RELEXTLOCK_ENTS 1024
+
+/*
+ * Compute the hash code associated with a RelExtLock.
+ *
+ * To avoid unnecessary recomputations of the hash code, we try to do this
+ * just once per function, and then pass it around as needed. From the
+ * hash code we derive the index into RelExtLockArray.
+ */
+#define RelExtLockTargetTagToIndex(relextlock_tag) \
+ (tag_hash((const void *) relextlock_tag, sizeof(RelExtLockTag)) \
+ % N_RELEXTLOCK_ENTS)
+
+#define SET_RELEXTLOCK_TAG(locktag, d, r) \
+ ((locktag).dbid = (d), \
+ (locktag).relid = (r))
+
+#define RELEXT_VAL_LOCK ((uint32) ((1 << 25)))
+#define RELEXT_LOCK_MASK ((uint32) ((1 << 25)))
+
+/* Must be greater than MAX_BACKENDS - which is 2^23-1, so we're fine. */
+#define RELEXT_WAIT_COUNT_MASK ((uint32) ((1 << 24) - 1))
+
+/* This tag maps to one of entries on the RelExtLockArray array by hashing */
+typedef struct RelExtLockTag
+{
+ Oid dbid;
+ Oid relid;
+} RelExtLockTag;
+
+typedef struct RelExtLock
+{
+ pg_atomic_uint32 state; /* state of exclusive lock */
+ ConditionVariable cv;
+} RelExtLock;
+
+/*
+ * This structure holds information about the relation extension lock
+ * we are holding, or held most recently. The "lock" field points to
+ * the RelExtLockArray entry. If we hold a relation extension lock,
+ * nLocks > 0; nLocks == 0 means we hold no lock. Besides tracking the
+ * lock we hold, this structure also acts as a cache: we do not reset
+ * the "lock" field on release, so if the next lock requested is the
+ * same as the one just released, we can reuse the pointer without
+ * touching RelExtLockArray.
+ *
+ * At most one lock can be held at once. Note that we could sometimes
+ * try to acquire a lock for an additional fork while holding the lock
+ * for the main fork; for example, when adding extra relation blocks
+ * for both a relation and its free space map. But since this lock
+ * manager doesn't distinguish between forks, we just increment nLocks
+ * in that case.
+ */
+typedef struct relextlock_handle
+{
+ Oid relid;
+ RelExtLock *lock;
+ int nLocks;
+} relextlock_handle;
+
+static relextlock_handle held_relextlock;
+
+/* Pointer to array containing relation extension lock states */
+static RelExtLock *RelExtLockArray;
+
+static bool RelExtLockAcquire(Oid relid, bool conditional);
+static void RelExtLockRelease(Oid relid, bool force);
+static bool RelExtLockAttemptLock(RelExtLock *relextlock);
+
+Size
+RelExtLockShmemSize(void)
+{
+ /* Relation extension locks array */
+ return mul_size(N_RELEXTLOCK_ENTS, sizeof(RelExtLock));
+}
+
+/*
+ * InitRelExtLocks
+ * Initialize the relation extension lock manager's data structures.
+ */
+void
+InitRelExtLocks(void)
+{
+ Size size;
+ bool found;
+ int i;
+
+ size = mul_size(N_RELEXTLOCK_ENTS, sizeof(RelExtLock));
+ RelExtLockArray = (RelExtLock *)
+ ShmemInitStruct("Relation Extension Lock", size, &found);
+
+ /* we're the first - initialize */
+ if (!found)
+ {
+ for (i = 0; i < N_RELEXTLOCK_ENTS; i++)
+ {
+ RelExtLock *relextlock = &RelExtLockArray[i];
+
+ pg_atomic_init_u32(&(relextlock->state), 0);
+ ConditionVariableInit(&(relextlock->cv));
+ }
+ }
+}
+
+/*
+ * LockRelationForExtension
+ *
+ * This lock is used to interlock addition of pages to relations.
+ * We need such locking because bufmgr/smgr definition of P_NEW is not
+ * race-condition-proof.
+ *
+ * We assume the caller is already holding some type of regular lock on
+ * the relation, so no AcceptInvalidationMessages call is needed here.
+ */
+void
+LockRelationForExtension(Relation relation)
+{
+ RelExtLockAcquire(relation->rd_id, false);
+}
+
+/*
+ * ConditionalLockRelationForExtension
+ *
+ * As above, but only lock if we can get the lock without blocking.
+ * Returns TRUE iff the lock was acquired.
+ */
+bool
+ConditionalLockRelationForExtension(Relation relation)
+{
+ return RelExtLockAcquire(relation->rd_id, true);
+}
+
+/*
+ * RelationExtensionLockWaiterCount
+ *
+ * Count the number of processes waiting for the given relation extension lock.
+ */
+int
+RelationExtensionLockWaiterCount(Relation relation)
+{
+ RelExtLockTag tag;
+ RelExtLock *relextlock;
+ Oid relid;
+ uint32 state;
+
+ relid = RelationGetRelid(relation);
+
+ /* Make a lock tag */
+ SET_RELEXTLOCK_TAG(tag, MyDatabaseId, relid);
+
+ relextlock = &RelExtLockArray[RelExtLockTargetTagToIndex(&tag)];
+ state = pg_atomic_read_u32(&(relextlock->state));
+
+ return (state & RELEXT_WAIT_COUNT_MASK);
+}
+
+/*
+ * UnlockRelationForExtension
+ */
+void
+UnlockRelationForExtension(Relation relation)
+{
+ RelExtLockRelease(relation->rd_id, false);
+}
+
+/*
+ * RelExtLockReleaseAll
+ *
+ * release all currently-held relation extension locks
+ */
+void
+RelExtLockReleaseAll(void)
+{
+ if (held_relextlock.nLocks > 0)
+ {
+ RelExtLockRelease(held_relextlock.relid, true);
+ }
+}
+
+/*
+ * RelExtLockHoldingLockCount
+ *
+ * Return the number of relation extension locks currently held.
+ */
+int
+RelExtLockHoldingLockCount(void)
+{
+ return held_relextlock.nLocks;
+}
+
+/*
+ * WaitForRelationExtensionLockToBeFree
+ *
+ * Wait for the relation extension lock on the given relation to
+ * be free without acquiring it.
+ */
+void
+WaitForRelationExtensionLockToBeFree(Relation relation)
+{
+ RelExtLock *relextlock;
+ Oid relid;
+ bool registered_wait_list = false;
+
+ relid = RelationGetRelid(relation);
+
+ /* If the lock is held by me, no need to wait */
+ if (held_relextlock.nLocks > 0 && relid == held_relextlock.relid)
+ return;
+
+ /*
+ * Luckily if we're trying to acquire the same lock as what we
+ * had held just before, we don't need to get the entry from the
+ * array by hashing.
+ */
+ if (relid == held_relextlock.relid)
+ relextlock = held_relextlock.lock;
+ else
+ {
+ RelExtLockTag tag;
+
+ SET_RELEXTLOCK_TAG(tag, MyDatabaseId, relid);
+ relextlock = &RelExtLockArray[RelExtLockTargetTagToIndex(&tag)];
+ }
+
+ for (;;)
+ {
+ uint32 state;
+
+ state = pg_atomic_read_u32(&(relextlock)->state);
+
+ /* Break if nobody is holding the lock on this relation */
+ if ((state & RELEXT_LOCK_MASK) == 0)
+ break;
+
+ if (!registered_wait_list)
+ {
+ /* Increment wait count so the lock holder will wake us up */
+ pg_atomic_add_fetch_u32(&(relextlock->state), 1);
+ registered_wait_list = true;
+ }
+
+ /* Sleep until the lock is released */
+ ConditionVariableSleep(&(relextlock->cv), WAIT_EVENT_RELATION_EXTENSION_LOCK);
+ }
+
+ ConditionVariableCancelSleep();
+
+ /* Before retuning, decrement the wait count if we had been waiting */
+ if (registered_wait_list)
+ pg_atomic_sub_fetch_u32(&(relextlock->state), 1);
+
+ return;
+}
+
+/*
+ * Acquire the relation extension lock. If we're trying to acquire the same
+ * lock as what already held, we just increment nLock locally and return
+ * without touching the RelExtLock array.
+ */
+static bool
+RelExtLockAcquire(Oid relid, bool conditional)
+{
+ RelExtLock *relextlock;
+ bool mustwait;
+ bool registered_wait_list = false;
+
+ /*
+ * If we already hold the lock, we can just increase the count locally.
+ * Since we don't support dead lock detection for relation extension
+ * lock and don't control the order of lock acquisition, it cannot not
+ * happen that trying to take a new lock while holding an another lock.
+ */
+ if (held_relextlock.nLocks > 0)
+ {
+ if (relid == held_relextlock.relid)
+ {
+ held_relextlock.nLocks++;
+ return true;
+ }
+ else
+ elog(ERROR,
+ "cannot acquire relation extension locks for multiple relations at the same");
+ }
+
+ /*
+ * If we're trying to acquire the same lock as what we just released
+ * we don't need to get the entry from the array by hashing. we expect
+ * to happen this case because it's a common case in acquisition of
+ * relation extension locks.
+ */
+ if (relid == held_relextlock.relid)
+ relextlock = held_relextlock.lock;
+ else
+ {
+ RelExtLockTag tag;
+
+ SET_RELEXTLOCK_TAG(tag, MyDatabaseId, relid);
+ relextlock = &RelExtLockArray[RelExtLockTargetTagToIndex(&tag)];
+ }
+
+ for (;;)
+ {
+ mustwait = RelExtLockAttemptLock(relextlock);
+
+ if (!mustwait)
+ break; /* got the lock */
+
+ /* Could not got the lock, return iff in conditional locking */
+ if (conditional)
+ return false;
+
+ if (!registered_wait_list)
+ {
+ /* Increment wait count to register the wait list */
+ pg_atomic_add_fetch_u32(&(relextlock->state), 1);
+ registered_wait_list = true;
+ }
+
+ /* Sleep until the lock is released */
+ ConditionVariableSleep(&(relextlock->cv), WAIT_EVENT_RELATION_EXTENSION_LOCK);
+ }
+
+ ConditionVariableCancelSleep();
+
+ /* Decrement wait count if we had been waiting */
+ if (registered_wait_list)
+ pg_atomic_sub_fetch_u32(&(relextlock->state), 1);
+
+ Assert(!mustwait);
+
+ /* Remember lock held by this backend */
+ held_relextlock.relid = relid;
+ held_relextlock.lock = relextlock;
+ held_relextlock.nLocks = 1;
+
+ /* Always return true if not conditional lock */
+ return true;
+}
+
+/*
+ * RelExtLockRelease
+ *
+ * Release a previously acquired relation extension lock. If force is
+ * true, release all locks we hold on the given relation.
+ */
+static void
+RelExtLockRelease(Oid relid, bool force)
+{
+ RelExtLockTag tag;
+ RelExtLock *relextlock;
+ uint32 state;
+ uint32 wait_counts;
+
+ /* We should have acquired a lock before releasing */
+ Assert(held_relextlock.nLocks > 0);
+
+ if (relid != held_relextlock.relid)
+ ereport(ERROR,
+ (errcode(ERRCODE_LOCK_NOT_AVAILABLE),
+ errmsg("relation extension lock for %u is not held",
+ relid)));
+
+ /* If force releasing, release all locks we're holding */
+ if (force)
+ held_relextlock.nLocks = 0;
+ else
+ held_relextlock.nLocks--;
+
+ Assert(held_relextlock.nLocks >= 0);
+
+ /* Return if we're still holding the lock even after computation */
+ if (held_relextlock.nLocks > 0)
+ return;
+
+ /* Get RelExtLock entry from the array */
+ SET_RELEXTLOCK_TAG(tag, MyDatabaseId, relid);
+ relextlock = &RelExtLockArray[RelExtLockTargetTagToIndex(&tag)];
+
+ /* Release the lock */
+ state = pg_atomic_sub_fetch_u32(&(relextlock->state), RELEXT_VAL_LOCK);
+
+ /* Wake up waiters if there is someone looking at this lock */
+ wait_counts = state & RELEXT_WAIT_COUNT_MASK;
+
+ if (wait_counts > 0)
+ ConditionVariableBroadcast(&(relextlock->cv));
+}
+
+/*
+ * Internal function that attempts to atomically acquire the relation
+ * extension lock.
+ *
+ * Returns true if the lock isn't free and we need to wait.
+ */
+static bool
+RelExtLockAttemptLock(RelExtLock *relextlock)
+{
+ uint32 oldstate;
+
+ oldstate = pg_atomic_read_u32(&relextlock->state);
+
+ while (true)
+ {
+ uint32 desired_state;
+ bool lock_free;
+
+ desired_state = oldstate;
+
+ lock_free = (oldstate & RELEXT_LOCK_MASK) == 0;
+ if (lock_free)
+ desired_state += RELEXT_VAL_LOCK;
+
+ if (pg_atomic_compare_exchange_u32(&relextlock->state,
+ &oldstate, desired_state))
+ {
+ if (lock_free)
+ return false;
+ else
+ return true;
+ }
+ }
+ pg_unreachable();
+}
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index da5679b..4fbc0c4 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -319,78 +319,6 @@ UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode)
}
/*
- * LockRelationForExtension
- *
- * This lock tag is used to interlock addition of pages to relations.
- * We need such locking because bufmgr/smgr definition of P_NEW is not
- * race-condition-proof.
- *
- * We assume the caller is already holding some type of regular lock on
- * the relation, so no AcceptInvalidationMessages call is needed here.
- */
-void
-LockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- (void) LockAcquire(&tag, lockmode, false, false);
-}
-
-/*
- * ConditionalLockRelationForExtension
- *
- * As above, but only lock if we can get the lock without blocking.
- * Returns true iff the lock was acquired.
- */
-bool
-ConditionalLockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- return (LockAcquire(&tag, lockmode, false, true) != LOCKACQUIRE_NOT_AVAIL);
-}
-
-/*
- * RelationExtensionLockWaiterCount
- *
- * Count the number of processes waiting for the given relation extension lock.
- */
-int
-RelationExtensionLockWaiterCount(Relation relation)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- return LockWaiterCount(&tag);
-}
-
-/*
- * UnlockRelationForExtension
- */
-void
-UnlockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- LockRelease(&tag, lockmode, false);
-}
-
-/*
* LockPage
*
* Obtain a page-level lock. This is currently used by some index access
@@ -961,12 +889,6 @@ DescribeLockTag(StringInfo buf, const LOCKTAG *tag)
tag->locktag_field2,
tag->locktag_field1);
break;
- case LOCKTAG_RELATION_EXTEND:
- appendStringInfo(buf,
- _("extension of relation %u of database %u"),
- tag->locktag_field2,
- tag->locktag_field1);
- break;
case LOCKTAG_PAGE:
appendStringInfo(buf,
_("page %u of relation %u of database %u"),
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 5833086..4583c64 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -40,6 +40,7 @@
#include "miscadmin.h"
#include "pg_trace.h"
#include "pgstat.h"
+#include "storage/extension_lock.h"
#include "storage/proc.h"
#include "storage/procarray.h"
#include "storage/sinvaladt.h"
@@ -717,6 +718,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
int status;
bool log_lock = false;
+ /*
+ * We allow to take a relation extension lock after took a
+ * heavy-weight lock. However, since we don't have dead lock
+ * detection mechanism between heavy-weight lock and relation
+ * extension lock it's not allowed taking an another heavy-weight
+ * lock while holding a relation extension lock.
+ */
+ Assert(RelExtLockHoldingLockCount() == 0);
+
if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
elog(ERROR, "unrecognized lock method: %d", lockmethodid);
lockMethodTable = LockMethods[lockmethodid];
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 5f6727d..40ab31a 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -44,6 +44,7 @@
#include "replication/slot.h"
#include "replication/syncrep.h"
#include "storage/condition_variable.h"
+#include "storage/extension_lock.h"
#include "storage/standby.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
@@ -765,6 +766,8 @@ ProcReleaseLocks(bool isCommit)
return;
/* If waiting, get off wait queue (should only be needed after error) */
LockErrorCleanup();
+ /* Release relation extension locks */
+ RelExtLockReleaseAll();
/* Release standard locks, including session-level if aborting */
LockReleaseAll(DEFAULT_LOCKMETHOD, !isCommit);
/* Release transaction-level advisory locks */
diff --git a/src/backend/utils/adt/lockfuncs.c b/src/backend/utils/adt/lockfuncs.c
index 9e0a8ab..6d8916c 100644
--- a/src/backend/utils/adt/lockfuncs.c
+++ b/src/backend/utils/adt/lockfuncs.c
@@ -25,7 +25,6 @@
/* This must match enum LockTagType! */
const char *const LockTagTypeNames[] = {
"relation",
- "extend",
"page",
"tuple",
"transactionid",
@@ -234,7 +233,6 @@ pg_lock_status(PG_FUNCTION_ARGS)
switch ((LockTagType) instance->locktag.locktag_type)
{
case LOCKTAG_RELATION:
- case LOCKTAG_RELATION_EXTEND:
values[1] = ObjectIdGetDatum(instance->locktag.locktag_field1);
values[2] = ObjectIdGetDatum(instance->locktag.locktag_field2);
nulls[3] = true;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 089b7c3..b3611c3 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -813,6 +813,7 @@ typedef enum
WAIT_EVENT_PARALLEL_BITMAP_SCAN,
WAIT_EVENT_PROCARRAY_GROUP_UPDATE,
WAIT_EVENT_CLOG_GROUP_UPDATE,
+ WAIT_EVENT_RELATION_EXTENSION_LOCK,
WAIT_EVENT_REPLICATION_ORIGIN_DROP,
WAIT_EVENT_REPLICATION_SLOT_DROP,
WAIT_EVENT_SAFE_SNAPSHOT,
diff --git a/src/include/storage/extension_lock.h b/src/include/storage/extension_lock.h
new file mode 100644
index 0000000..cbce89b
--- /dev/null
+++ b/src/include/storage/extension_lock.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * extension_lock.h
+ * Relation extension lock manager
+ *
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/extension_lock.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef EXTENSION_LOCK_H
+#define EXTENSION_LOCK_H
+
+#ifdef FRONTEND
+#error "extension_lock.h may not be included from frontend code"
+#endif
+
+#include "port/atomics.h"
+#include "storage/s_lock.h"
+#include "storage/condition_variable.h"
+#include "storage/proclist_types.h"
+
+/* Number of partitions the shared relation extension lock tables are divided into */
+#define LOG2_NUM_RELEXTLOCK_PARTITIONS 4
+#define NUM_RELEXTLOCK_PARTITIONS (1 << LOG2_NUM_RELEXTLOCK_PARTITIONS)
+
+/* Lock a relation for extension */
+extern Size RelExtLockShmemSize(void);
+extern void InitRelExtLocks(void);
+extern void LockRelationForExtension(Relation relation);
+extern void UnlockRelationForExtension(Relation relation);
+extern bool ConditionalLockRelationForExtension(Relation relation);
+extern int RelationExtensionLockWaiterCount(Relation relation);
+extern void WaitForRelationExtensionLockToBeFree(Relation relation);
+extern void RelExtLockReleaseAll(void);
+extern int RelExtLockHoldingLockCount(void);
+
+#endif /* EXTENSION_LOCK_H */
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 0b92322..7e6b80c 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -50,13 +50,6 @@ extern bool LockHasWaitersRelation(Relation relation, LOCKMODE lockmode);
extern void LockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
-/* Lock a relation for extension */
-extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
-extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
-extern bool ConditionalLockRelationForExtension(Relation relation,
- LOCKMODE lockmode);
-extern int RelationExtensionLockWaiterCount(Relation relation);
-
/* Lock a page (currently only used within indexes) */
extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
extern bool ConditionalLockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 765431e..3be18ea 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -138,8 +138,6 @@ typedef uint16 LOCKMETHODID;
typedef enum LockTagType
{
LOCKTAG_RELATION, /* whole relation */
- /* ID info for a relation is DB OID + REL OID; DB OID = 0 if shared */
- LOCKTAG_RELATION_EXTEND, /* the right to extend a relation */
/* same ID info as RELATION */
LOCKTAG_PAGE, /* one page of a relation */
/* ID info for a page is RELATION info + BlockNumber */
@@ -198,14 +196,6 @@ typedef struct LOCKTAG
(locktag).locktag_type = LOCKTAG_RELATION, \
(locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
-#define SET_LOCKTAG_RELATION_EXTEND(locktag,dboid,reloid) \
- ((locktag).locktag_field1 = (dboid), \
- (locktag).locktag_field2 = (reloid), \
- (locktag).locktag_field3 = 0, \
- (locktag).locktag_field4 = 0, \
- (locktag).locktag_type = LOCKTAG_RELATION_EXTEND, \
- (locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
-
#define SET_LOCKTAG_PAGE(locktag,dboid,reloid,blocknum) \
((locktag).locktag_field1 = (dboid), \
(locktag).locktag_field2 = (reloid), \
On Fri, Dec 1, 2017 at 10:14 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
The patched is 10 times faster than current HEAD.
Nifty.
The first hunk in monitoring.sgml looks unnecessary.
The second hunk breaks the formatting of the documentation; you need
to adjust the "morerows" value from 9 to 8 here:
<entry morerows="9"><literal>Lock</literal></entry>
And similarly make this one 18:
<entry morerows="17"><literal>IPC</literal></entry>
+* Relation extension locks. The relation extension lock manager is
+specialized in relation extensions. In PostgreSQL 11 relation extension
+lock has been moved out of regular lock. It's similar to regular locks
+but doesn't have full dead lock detection, group locking and multiple
+lock modes. When conflicts occur, lock waits are implemented using
+condition variables.
Higher up, it says that "Postgres uses four types of interprocess
locks", but because you added this, it's now a list of five items.
I suggest moving the section on relation locks to the end and
rewriting the text here as follows: Only one process can extend a
relation at a time; we use a specialized lock manager for this
purpose, which is much simpler than the regular lock manager. It is
similar to the lightweight lock mechanism, but is ever simpler because
there is only one lock mode and only one lock can be taken at a time.
A process holding a relation extension lock is interruptible, unlike a
process holding an LWLock.
+#define RelExtLockTargetTagToIndex(relextlock_tag) \
+ (tag_hash((const void *) relextlock_tag, sizeof(RelExtLockTag)) \
+ % N_RELEXTLOCK_ENTS)
How about using a static inline function for this?
+#define SET_RELEXTLOCK_TAG(locktag, d, r) \
+ ((locktag).dbid = (d), \
+ (locktag).relid = (r))
How about getting rid of this and just doing the assignments instead?
+#define RELEXT_VAL_LOCK ((uint32) ((1 << 25)))
+#define RELEXT_LOCK_MASK ((uint32) ((1 << 25)))
It seems confusing to have two macros for the same value and an
almost-interchangeable purpose. Maybe just call it RELEXT_LOCK_BIT?
+RelationExtensionLockWaiterCount(Relation relation)
Hmm. This is sort of problematic, because with the new design we
have no guarantee that the return value is actually accurate. I don't
think that's a functional problem, but the optics aren't great.
+ if (held_relextlock.nLocks > 0)
+ {
+ RelExtLockRelease(held_relextlock.relid, true);
+ }
Excess braces.
+int
+RelExtLockHoldingLockCount(void)
+{
+ return held_relextlock.nLocks;
+}
Maybe IsAnyRelationExtensionLockHeld(), returning bool?
+ /* If the lock is held by me, no need to wait */
If we already hold the lock, no need to wait.
+ * Luckily if we're trying to acquire the same lock as what we
+ * had held just before, we don't need to get the entry from the
+ * array by hashing.
We're not trying to acquire a lock here. "If the last relation
extension lock we touched is the same one for which we now need to
wait, we can use our cached pointer to the lock instead of recomputing
it."
+ registered_wait_list = true;
Isn't it really registered_wait_count? The only list here is
encapsulated in the CV.
+ /* Before retuning, decrement the wait count if we had been waiting */
returning -> returning, but I'd rewrite this as "Release any wait
count we hold."
+ * Acquire the relation extension lock. If we're trying to acquire the same
+ * lock as what already held, we just increment nLock locally and return
+ * without touching the RelExtLock array.
"Acquire a relation extension lock." I think you can forget the rest
of this; it duplicates comments in the function body.
+ * Since we don't support dead lock detection for relation extension
+ * lock and don't control the order of lock acquisition, it cannot not
+ * happen that trying to take a new lock while holding an another lock.
Since we don't do deadlock detection, caller must not try to take a
new relation extension lock while already holding them.
+ if (relid == held_relextlock.relid)
+ {
+ held_relextlock.nLocks++;
+ return true;
+ }
+ else
+ elog(ERROR,
+ "cannot acquire relation extension locks for multiple relations at the same");
I'd prefer if (relid != held_relextlock.relid) elog(ERROR, ...) to
save a level of indentation for the rest.
+ * If we're trying to acquire the same lock as what we just released
+ * we don't need to get the entry from the array by hashing. we expect
+ * to happen this case because it's a common case in acquisition of
+ * relation extension locks.
"If the last relation extension lock we touched is the same one we
now need to acquire, we can use our cached pointer to the lock instead
of recomputing it. This is likely to be a common case in practice."
+ /* Could not got the lock, return iff in conditional locking */
"locking conditionally"
+ ConditionVariableSleep(&(relextlock->cv),
WAIT_EVENT_RELATION_EXTENSION_LOCK);
Break line at comma
+ /* Decrement wait count if we had been waiting */
"Release any wait count we hold."
+ /* Always return true if not conditional lock */
"We got the lock!"
+ /* If force releasing, release all locks we're holding */
+ if (force)
+ held_relextlock.nLocks = 0;
+ else
+ held_relextlock.nLocks--;
+
+ Assert(held_relextlock.nLocks >= 0);
+
+ /* Return if we're still holding the lock even after computation */
+ if (held_relextlock.nLocks > 0)
+ return;
I thought you were going to have the caller adjust nLocks?
+ /* Get RelExtLock entry from the array */
+ SET_RELEXTLOCK_TAG(tag, MyDatabaseId, relid);
+ relextlock = &RelExtLockArray[RelExtLockTargetTagToIndex(&tag)];
This seems to make no sense in RelExtLockRelease -- isn't the cache
guaranteed valid?
+ /* Wake up waiters if there is someone looking at this lock */
"If there may be waiters, wake them up."
+ * We allow to take a relation extension lock after took a
+ * heavy-weight lock. However, since we don't have dead lock
+ * detection mechanism between heavy-weight lock and relation
+ * extension lock it's not allowed taking an another heavy-weight
+ * lock while holding a relation extension lock.
"Relation extension locks don't participate in deadlock detection, so
make sure we don't try to acquire a heavyweight lock while holding
one."
+ /* Release relation extension locks */
"If we hold a relation extension lock, release it."
+/* Number of partitions the shared relation extension lock tables are divided into */
+#define LOG2_NUM_RELEXTLOCK_PARTITIONS 4
+#define NUM_RELEXTLOCK_PARTITIONS (1 << LOG2_NUM_RELEXTLOCK_PARTITIONS)
Dead code.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Dec 1, 2017 at 1:28 PM, Robert Haas <robertmhaas@gmail.com> wrote:
[ lots of minor comments ]
When I took a break from sitting at the computer, I realized that I
think this has a more serious problem: won't it permanently leak
reference counts if someone hits ^C or an error occurs while the lock
is held? I think it will -- it probably needs to do cleanup at the
places where we do LWLockReleaseAll() that includes decrementing the
shared refcount if necessary, rather than doing cleanup at the places
we release heavyweight locks.
I might be wrong about the details here -- this is off the top of my head.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Dec 2, 2017 at 3:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Dec 1, 2017 at 10:14 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
The patched version is 10 times faster than current HEAD.
Nifty.
Thank you for your thorough review of the patch.
The first hunk in monitoring.sgml looks unnecessary.
You meant the following hunk?
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 8d461c8..7aa7981 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -669,8 +669,8 @@ postgres 27093 0.0 0.0 30096 2752 ?
Ss 11:34 0:00 postgres: ser
Heavyweight locks, also known as lock manager locks or simply locks,
primarily protect SQL-visible objects such as tables. However,
they are also used to ensure mutual exclusion for certain internal
- operations such as relation extension.
<literal>wait_event</literal> will
- identify the type of lock awaited.
+ operations such as waiting for a transaction to finish.
+ <literal>wait_event</literal> will identify the type of lock awaited.
</para>
</listitem>
<listitem>
I think that since the extension locks are no longer a part of
heavyweight locks we should change the explanation.
The second hunk breaks the formatting of the documentation; you need
to adjust the "morerows" value from 9 to 8 here:

<entry morerows="9"><literal>Lock</literal></entry>
And similarly make this one 18:
<entry morerows="17"><literal>IPC</literal></entry>
Fixed.
+* Relation extension locks. The relation extension lock manager is
+specialized in relation extensions. In PostgreSQL 11 relation extension
+lock has been moved out of regular lock. It's similar to regular locks
+but doesn't have full dead lock detection, group locking and multiple
+lock modes. When conflicts occur, lock waits are implemented using
+condition variables.

Higher up, it says that "Postgres uses four types of interprocess
locks", but because you added this, it's now a list of five items.
Fixed.
I suggest moving the section on relation locks to the end and
rewriting the text here as follows: Only one process can extend a
relation at a time; we use a specialized lock manager for this
purpose, which is much simpler than the regular lock manager. It is
similar to the lightweight lock mechanism, but is even simpler because
there is only one lock mode and only one lock can be taken at a time.
A process holding a relation extension lock is interruptible, unlike a
process holding an LWLock.
Agreed and fixed.
+#define RelExtLockTargetTagToIndex(relextlock_tag) \
+	(tag_hash((const void *) relextlock_tag, sizeof(RelExtLockTag)) \
+	 % N_RELEXTLOCK_ENTS)

How about using a static inline function for this?
Fixed.
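For illustration, the macro-to-static-inline conversion might look like this sketch. `tag_hash_stub` stands in for the backend's `tag_hash()` (not reproduced here); the constant and type names follow the patch, but treat the whole block as a sketch rather than the committed code.

```c
#include <stdint.h>

#define N_RELEXTLOCK_ENTS 1024

typedef struct RelExtLockTag
{
	uint32_t	dbid;
	uint32_t	relid;
} RelExtLockTag;

/*
 * Stand-in for the backend's tag_hash(); any reasonable mixing
 * function is fine for illustration purposes.
 */
static inline uint32_t
tag_hash_stub(const RelExtLockTag *tag)
{
	uint32_t	h = tag->dbid * 0x9e3779b1u;

	h ^= tag->relid * 0x85ebca6bu;
	h ^= h >> 16;
	return h;
}

/*
 * Static inline replacement for the RelExtLockTargetTagToIndex macro:
 * map a (dbid, relid) tag to a slot in RelExtLockArray.
 */
static inline uint32_t
RelExtLockTargetTagToIndex(const RelExtLockTag *tag)
{
	return tag_hash_stub(tag) % N_RELEXTLOCK_ENTS;
}
```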
+#define SET_RELEXTLOCK_TAG(locktag, d, r) \
+	((locktag).dbid = (d), \
+	 (locktag).relid = (r))

How about getting rid of this and just doing the assignments instead?
Fixed.
+#define RELEXT_VAL_LOCK ((uint32) ((1 << 25)))
+#define RELEXT_LOCK_MASK ((uint32) ((1 << 25)))

It seems confusing to have two macros for the same value and an
almost-interchangeable purpose.  Maybe just call it RELEXT_LOCK_BIT?
Fixed.
+RelationExtensionLockWaiterCount(Relation relation)
Hmm.  This is sort of problematic, because with the new design we
have no guarantee that the return value is actually accurate. I don't
think that's a functional problem, but the optics aren't great.
Yeah, with this patch we could overestimate it and then add extra
blocks to the relation. But since the number of extra blocks is capped
at 512, I don't think it will become a serious problem.
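For context, the waiter count feeds a heuristic along these lines: extend by a fixed number of pages per waiting backend, capped at 512 pages. The sketch below mirrors the shape of `RelationAddExtraBlocks()` in `hio.c`, but treat the exact multiplier as illustrative.

```c
/*
 * Sketch of the extra-block heuristic that consumes the waiter count:
 * extend by 20 pages per waiting backend, capped at 512 pages.  An
 * overestimated waiter count therefore costs at most 512 extra pages.
 */
static int
extra_blocks_for_waiters(int lockWaiters)
{
	int			extraBlocks = lockWaiters * 20;

	return (extraBlocks > 512) ? 512 : extraBlocks;
}
```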
+	if (held_relextlock.nLocks > 0)
+	{
+		RelExtLockRelease(held_relextlock.relid, true);
+	}

Excess braces.
Fixed.
+int
+RelExtLockHoldingLockCount(void)
+{
+	return held_relextlock.nLocks;
+}

Maybe IsAnyRelationExtensionLockHeld(), returning bool?
Fixed.
+ /* If the lock is held by me, no need to wait */
If we already hold the lock, no need to wait.
Fixed.
+	 * Luckily if we're trying to acquire the same lock as what we
+	 * had held just before, we don't need to get the entry from the
+	 * array by hashing.

We're not trying to acquire a lock here.  "If the last relation
extension lock we touched is the same one for which we now need to
wait, we can use our cached pointer to the lock instead of recomputing
it."
Fixed.
+ registered_wait_list = true;
Isn't it really registered_wait_count? The only list here is
encapsulated in the CV.
Changed to "waiting".
+ /* Before retuning, decrement the wait count if we had been waiting */
retuning -> returning, but I'd rewrite this as "Release any wait
count we hold."
Fixed.
+ * Acquire the relation extension lock. If we're trying to acquire the same
+ * lock as what already held, we just increment nLock locally and return
+ * without touching the RelExtLock array.

"Acquire a relation extension lock."  I think you can forget the rest
of this; it duplicates comments in the function body.
Fixed.
+	 * Since we don't support dead lock detection for relation extension
+	 * lock and don't control the order of lock acquisition, it cannot not
+	 * happen that trying to take a new lock while holding an another lock.

Since we don't do deadlock detection, caller must not try to take a
new relation extension lock while already holding them.
Fixed.
+	if (relid == held_relextlock.relid)
+	{
+		held_relextlock.nLocks++;
+		return true;
+	}
+	else
+		elog(ERROR,
+			 "cannot acquire relation extension locks for multiple relations at the same");

I'd prefer if (relid != held_relextlock.relid) elog(ERROR, ...) to
save a level of indentation for the rest.
Fixed.
+	 * If we're trying to acquire the same lock as what we just released
+	 * we don't need to get the entry from the array by hashing. we expect
+	 * to happen this case because it's a common case in acquisition of
+	 * relation extension locks.

"If the last relation extension lock we touched is the same one for which we
now need to acquire, we can use our cached pointer to the lock instead
of recomputing it. This is likely to be a common case in practice."
Fixed.
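The cached-pointer fast path being discussed might be sketched as follows. `lock_for_relid` stands in for the tag-hash lookup, and `hash_computations` is instrumentation added only for this sketch; the `held_relextlock` name follows the patch but the body is simplified.

```c
#include <stddef.h>
#include <stdint.h>

#define N_RELEXTLOCK_ENTS 1024

typedef struct RelExtLock
{
	uint32_t	state;			/* stands in for pg_atomic_uint32 */
} RelExtLock;

static RelExtLock RelExtLockArray[N_RELEXTLOCK_ENTS];

/* Per-backend cache of the last relation extension lock we touched. */
static struct
{
	uint32_t	relid;
	RelExtLock *lock;
} held_relextlock;

static int	hash_computations = 0;	/* instrumentation for this sketch */

/* Simplified stand-in for hashing the (dbid, relid) tag to a slot. */
static RelExtLock *
lock_for_relid(uint32_t relid)
{
	hash_computations++;
	return &RelExtLockArray[relid % N_RELEXTLOCK_ENTS];
}

/*
 * If the last relation extension lock we touched is the same one we
 * now need, reuse the cached pointer instead of recomputing the hash.
 */
static RelExtLock *
get_relextlock(uint32_t relid)
{
	if (held_relextlock.lock != NULL && held_relextlock.relid == relid)
		return held_relextlock.lock;	/* fast path: cache hit */

	held_relextlock.relid = relid;
	held_relextlock.lock = lock_for_relid(relid);
	return held_relextlock.lock;
}
```

Since a backend typically extends the same relation repeatedly, the cache hit path avoids the hash computation in the common case.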
+ /* Could not got the lock, return iff in conditional locking */
"locking conditionally"
Fixed.
+ ConditionVariableSleep(&(relextlock->cv),
WAIT_EVENT_RELATION_EXTENSION_LOCK);
Break line at comma
Fixed.
+ /* Decrement wait count if we had been waiting */
"Release any wait count we hold."
Fixed.
+ /* Always return true if not conditional lock */
"We got the lock!"
Fixed.
+	/* If force releasing, release all locks we're holding */
+	if (force)
+		held_relextlock.nLocks = 0;
+	else
+		held_relextlock.nLocks--;
+
+	Assert(held_relextlock.nLocks >= 0);
+
+	/* Return if we're still holding the lock even after computation */
+	if (held_relextlock.nLocks > 0)
+		return;

I thought you were going to have the caller adjust nLocks?

Yeah, I was going to change it that way, but since we always release
either one lock or all relext locks I thought it'd be better to pass a
bool rather than an int.
+	/* Get RelExtLock entry from the array */
+	SET_RELEXTLOCK_TAG(tag, MyDatabaseId, relid);
+	relextlock = &RelExtLockArray[RelExtLockTargetTagToIndex(&tag)];

This seems to make no sense in RelExtLockRelease -- isn't the cache
guaranteed valid?
Right, fixed.
+ /* Wake up waiters if there is someone looking at this lock */
"If there may be waiters, wake them up."
Fixed.
+	 * We allow to take a relation extension lock after took a
+	 * heavy-weight lock. However, since we don't have dead lock
+	 * detection mechanism between heavy-weight lock and relation
+	 * extension lock it's not allowed taking an another heavy-weight
+	 * lock while holding a relation extension lock.

"Relation extension locks don't participate in deadlock detection, so
make sure we don't try to acquire a heavyweight lock while holding
one."
Fixed.
+ /* Release relation extension locks */
"If we hold a relation extension lock, release it."
Fixed.
+/* Number of partitions the shared relation extension lock tables are divided into */
+#define LOG2_NUM_RELEXTLOCK_PARTITIONS	4
+#define NUM_RELEXTLOCK_PARTITIONS	(1 << LOG2_NUM_RELEXTLOCK_PARTITIONS)

Dead code.
Fixed.
When I took a break from sitting at the computer, I realized that I
think this has a more serious problem: won't it permanently leak
reference counts if someone hits ^C or an error occurs while the lock
is held? I think it will -- it probably needs to do cleanup at the
places where we do LWLockReleaseAll() that includes decrementing the
shared refcount if necessary, rather than doing cleanup at the places
we release heavyweight locks.
I might be wrong about the details here -- this is off the top of my head.
Good catch. It can leak reference counts if someone hits ^C or an
error occurs while waiting. Fixed in the latest patch. But since
RelExtLockReleaseAll() is called even in such situations, I don't
think we need to change the place where we release all relext locks;
we just moved it from the heavyweight lock cleanup. Am I missing
something?
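The error-path cleanup being discussed might be sketched as follows. The names loosely follow the patch (`held_relextlock`, a shared wait count), but the shared counter here is a plain `int` standing in for the atomic state word in shared memory, so treat this purely as a sketch of the semantics.

```c
#include <stdbool.h>
#include <stdint.h>

/* Per-backend state, simplified from the patch. */
static struct
{
	uint32_t	relid;
	int			nLocks;
	bool		waiting;		/* do we hold a shared wait count? */
} held_relextlock;

static int	shared_wait_count = 0;	/* stands in for the atomic in shmem */

/*
 * Sketch of RelExtLockReleaseAll()-style cleanup run on error or ^C:
 * drop any wait count we registered and forget all held locks, so
 * nothing leaks in shared memory.
 */
static void
relext_lock_release_all(void)
{
	if (held_relextlock.waiting)
	{
		shared_wait_count--;	/* release our wait count */
		held_relextlock.waiting = false;
	}
	held_relextlock.nLocks = 0;
}
```

The key point is that this routine must be reached from the same error-recovery paths that call LWLockReleaseAll(), so an interrupted waiter cannot leave a stale count behind.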
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachments:
Moving_extension_lock_out_of_heavyweight_lock_v10.patchapplication/octet-stream; name=Moving_extension_lock_out_of_heavyweight_lock_v10.patchDownload
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index b6f80d9..07dd3f7 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -669,8 +669,8 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
Heavyweight locks, also known as lock manager locks or simply locks,
primarily protect SQL-visible objects such as tables. However,
they are also used to ensure mutual exclusion for certain internal
- operations such as relation extension. <literal>wait_event</literal> will
- identify the type of lock awaited.
+ operations such as waiting for a transaction to finish.
+ <literal>wait_event</literal> will identify the type of lock awaited.
</para>
</listitem>
<listitem>
@@ -1122,15 +1122,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
execution.</entry>
</row>
<row>
- <entry morerows="9"><literal>Lock</literal></entry>
+ <entry morerows="8"><literal>Lock</literal></entry>
<entry><literal>relation</literal></entry>
<entry>Waiting to acquire a lock on a relation.</entry>
</row>
<row>
- <entry><literal>extend</literal></entry>
- <entry>Waiting to extend a relation.</entry>
- </row>
- <row>
<entry><literal>page</literal></entry>
<entry>Waiting to acquire a lock on page of a relation.</entry>
</row>
@@ -1263,7 +1259,7 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting in an extension.</entry>
</row>
<row>
- <entry morerows="17"><literal>IPC</literal></entry>
+ <entry morerows="18"><literal>IPC</literal></entry>
<entry><literal>BgWorkerShutdown</literal></entry>
<entry>Waiting for background worker to shut down.</entry>
</row>
@@ -1320,6 +1316,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting for group leader to update transaction status at transaction end.</entry>
</row>
<row>
+ <entry><literal>RelationExtensionLock</literal></entry>
+ <entry>Waiting to extend a relation.</entry>
+ </row>
+ <row>
<entry><literal>ReplicationOriginDrop</literal></entry>
<entry>Waiting for a replication origin to become inactive to be dropped.</entry>
</row>
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 09db5c6..05cca9d 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -17,6 +17,7 @@
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/freespace.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
@@ -623,8 +624,7 @@ brin_page_cleanup(Relation idxrel, Buffer buf)
*/
if (PageIsNew(page))
{
- LockRelationForExtension(idxrel, ShareLock);
- UnlockRelationForExtension(idxrel, ShareLock);
+ WaitForRelationExtensionLockToBeFree(idxrel);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
if (PageIsNew(page))
@@ -716,7 +716,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
*/
if (!RELATION_IS_LOCAL(irel))
{
- LockRelationForExtension(irel, ExclusiveLock);
+ LockRelationForExtension(irel);
extensionLockHeld = true;
}
buf = ReadBuffer(irel, P_NEW);
@@ -768,7 +768,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
}
if (extensionLockHeld)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel);
ReleaseBuffer(buf);
return InvalidBuffer;
@@ -778,7 +778,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
if (extensionLockHeld)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel);
page = BufferGetPage(buf);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index 03e53ce..af8f5ce 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -29,6 +29,7 @@
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/lmgr.h"
#include "utils/rel.h"
@@ -570,7 +571,7 @@ revmap_physical_extend(BrinRevmap *revmap)
else
{
if (needLock)
- LockRelationForExtension(irel, ExclusiveLock);
+ LockRelationForExtension(irel);
buf = ReadBuffer(irel, P_NEW);
if (BufferGetBlockNumber(buf) != mapBlk)
@@ -582,7 +583,7 @@ revmap_physical_extend(BrinRevmap *revmap)
* page from under whoever is using it.
*/
if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel);
LockBuffer(revmap->rm_metaBuf, BUFFER_LOCK_UNLOCK);
ReleaseBuffer(buf);
return;
@@ -591,7 +592,7 @@ revmap_physical_extend(BrinRevmap *revmap)
page = BufferGetPage(buf);
if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel);
}
/* Check that it's a regular block (or an empty page) */
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index d9c6483..8d35918 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -21,6 +21,7 @@
#include "catalog/pg_collation.h"
#include "catalog/pg_type.h"
#include "miscadmin.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/builtins.h"
@@ -325,13 +326,13 @@ GinNewBuffer(Relation index)
/* Must extend the file */
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
buffer = ReadBuffer(index, P_NEW);
LockBuffer(buffer, GIN_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
return buffer;
}
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 394bc83..d769a76 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -20,6 +20,7 @@
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "postmaster/autovacuum.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/memutils.h"
@@ -716,10 +717,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
npages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
totFreePages = 0;
@@ -766,10 +767,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
stats->pages_free = totFreePages;
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
stats->num_pages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
return stats;
}
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index d8d1c0a..76171a5 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -20,6 +20,7 @@
#include "access/htup_details.h"
#include "access/reloptions.h"
#include "catalog/pg_opclass.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/builtins.h"
@@ -821,13 +822,13 @@ gistNewBuffer(Relation r)
needLock = !RELATION_IS_LOCAL(r);
if (needLock)
- LockRelationForExtension(r, ExclusiveLock);
+ LockRelationForExtension(r);
buffer = ReadBuffer(r, P_NEW);
LockBuffer(buffer, GIST_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(r, ExclusiveLock);
+ UnlockRelationForExtension(r);
return buffer;
}
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 77d9d12..42ef36a 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -18,6 +18,7 @@
#include "access/gist_private.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
@@ -59,10 +60,10 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
/* try to find deleted pages */
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
npages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
totFreePages = 0;
for (blkno = GIST_ROOT_BLKNO + 1; blkno < npages; blkno++)
@@ -91,10 +92,10 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
/* return statistics */
stats->pages_free = totFreePages;
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
stats->num_pages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
return stats;
}
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 13e3bdc..9287f2d 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -20,6 +20,7 @@
#include "access/htup_details.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/freespace.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
@@ -519,11 +520,11 @@ loop:
if (needLock)
{
if (!use_fsm)
- LockRelationForExtension(relation, ExclusiveLock);
- else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock))
+ LockRelationForExtension(relation);
+ else if (!ConditionalLockRelationForExtension(relation))
{
/* Couldn't get the lock immediately; wait for it. */
- LockRelationForExtension(relation, ExclusiveLock);
+ LockRelationForExtension(relation);
/*
* Check if some other backend has extended a block for us while
@@ -537,7 +538,7 @@ loop:
*/
if (targetBlock != InvalidBlockNumber)
{
- UnlockRelationForExtension(relation, ExclusiveLock);
+ UnlockRelationForExtension(relation);
goto loop;
}
@@ -576,7 +577,7 @@ loop:
* against vacuumlazy.c --- see comments therein.
*/
if (needLock)
- UnlockRelationForExtension(relation, ExclusiveLock);
+ UnlockRelationForExtension(relation);
/*
* We need to initialize the empty new page. Double-check that it really
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 4c2a13a..2efee68 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -90,6 +90,7 @@
#include "access/xlog.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
#include "utils/inval.h"
@@ -641,7 +642,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
* Note that another backend might have extended or created the relation
* by the time we get the lock.
*/
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
@@ -679,7 +680,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
/* Update local cache with the up-to-date size */
rel->rd_smgr->smgr_vm_nblocks = vm_nblocks_now;
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
pfree(pg);
}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index c774349..7824c92 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -28,6 +28,7 @@
#include "access/xlog.h"
#include "access/xloginsert.h"
#include "miscadmin.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
@@ -659,7 +660,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
needLock = !RELATION_IS_LOCAL(rel);
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
buf = ReadBuffer(rel, P_NEW);
@@ -673,7 +674,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
* condition against btvacuumscan --- see comments therein.
*/
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
/* Initialize the new page before returning it */
page = BufferGetPage(buf);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 399e6a1..5af1c21 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -25,6 +25,7 @@
#include "commands/vacuum.h"
#include "pgstat.h"
#include "storage/condition_variable.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
@@ -1058,10 +1059,10 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
{
/* Get the current relation length */
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
num_pages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
/* Quit if we've scanned the whole relation */
if (blkno >= num_pages)
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index bd5301f..0ff53a3 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -23,6 +23,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/builtins.h"
+#include "storage/extension_lock.h"
#include "utils/index_selfuncs.h"
#include "utils/lsyscache.h"
@@ -230,13 +231,13 @@ SpGistNewBuffer(Relation index)
/* Must extend the file */
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
buffer = ReadBuffer(index, P_NEW);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
return buffer;
}
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index d7d5e90..385d1cb 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -24,6 +24,7 @@
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/snapmgr.h"
@@ -824,10 +825,10 @@ spgvacuumscan(spgBulkDeleteState *bds)
{
/* Get the current relation length */
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
num_pages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
/* Quit if we've scanned the whole relation */
if (blkno >= num_pages)
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 20ce431..4a72223 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -54,6 +54,7 @@
#include "portability/instr_time.h"
#include "postmaster/autovacuum.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/freespace.h"
#include "storage/lmgr.h"
#include "utils/lsyscache.h"
@@ -860,8 +861,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* it's got exclusive lock on the whole relation.
*/
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
- LockRelationForExtension(onerel, ExclusiveLock);
- UnlockRelationForExtension(onerel, ExclusiveLock);
+ WaitForRelationExtensionLockToBeFree(onerel);
LockBufferForCleanup(buf);
if (PageIsNew(page))
{
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 5c256ff..210552f 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3616,6 +3616,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_CLOG_GROUP_UPDATE:
event_name = "ClogGroupUpdate";
break;
+ case WAIT_EVENT_RELATION_EXTENSION_LOCK:
+ event_name = "RelationExtensionLock";
+ break;
case WAIT_EVENT_REPLICATION_ORIGIN_DROP:
event_name = "ReplicationOriginDrop";
break;
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 4648473..172a48c 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -26,6 +26,7 @@
#include "access/htup_details.h"
#include "access/xlogutils.h"
#include "miscadmin.h"
+#include "storage/extension_lock.h"
#include "storage/freespace.h"
#include "storage/fsm_internals.h"
#include "storage/lmgr.h"
@@ -624,7 +625,7 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
* Note that another backend might have extended or created the relation
* by the time we get the lock.
*/
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
@@ -652,7 +653,7 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
/* Update local cache with the up-to-date size */
rel->rd_smgr->smgr_fsm_nblocks = fsm_nblocks_now;
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
pfree(pg);
}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2d1ed14..3b6a6f7 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -35,6 +35,7 @@
#include "replication/origin.h"
#include "storage/bufmgr.h"
#include "storage/dsm.h"
+#include "storage/extension_lock.h"
#include "storage/ipc.h"
#include "storage/pg_shmem.h"
#include "storage/pmsignal.h"
@@ -133,6 +134,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
size = add_size(size, BackgroundWorkerShmemSize());
size = add_size(size, MultiXactShmemSize());
size = add_size(size, LWLockShmemSize());
+ size = add_size(size, RelExtLockShmemSize());
size = add_size(size, ProcArrayShmemSize());
size = add_size(size, BackendStatusShmemSize());
size = add_size(size, SInvalShmemSize());
@@ -235,6 +237,11 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
InitPredicateLocks();
/*
+ * Set up relation extension lock manager
+ */
+ InitRelExtLocks();
+
+ /*
* Set up process table
*/
if (!IsUnderPostmaster)
diff --git a/src/backend/storage/lmgr/Makefile b/src/backend/storage/lmgr/Makefile
index e1b787e..2334a40 100644
--- a/src/backend/storage/lmgr/Makefile
+++ b/src/backend/storage/lmgr/Makefile
@@ -13,7 +13,7 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = lmgr.o lock.o proc.o deadlock.o lwlock.o lwlocknames.o spin.o \
- s_lock.o predicate.o condition_variable.o
+ s_lock.o predicate.o condition_variable.o extension_lock.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 56b0a12..960d1f3 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -3,7 +3,7 @@ src/backend/storage/lmgr/README
Locking Overview
================
-Postgres uses four types of interprocess locks:
+Postgres uses five types of interprocess locks:
* Spinlocks. These are intended for *very* short-term locks. If a lock
is to be held more than a few dozen instructions, or across any sort of
@@ -36,13 +36,21 @@ Regular locks should be used for all user-driven lock requests.
* SIReadLock predicate locks. See separate README-SSI file for details.
+* Relation extension locks. Only one process can extend a relation at
+a time; we use a specialized lock manager for this purpose, which is
+much simpler than the regular lock manager. It is similar to the
+lightweight lock mechanism, but is even simpler because there is only
+one lock mode and only one lock can be taken at a time. A process holding
+a relation extension lock is interruptible, unlike a process holding an
+LWLock.
+
Acquisition of either a spinlock or a lightweight lock causes query
cancel and die() interrupts to be held off until all such locks are
released. No such restriction exists for regular locks, however. Also
note that we can accept query cancel and die() interrupts while waiting
-for a regular lock, but we will not accept them while waiting for
-spinlocks or LW locks. It is therefore not a good idea to use LW locks
-when the wait time might exceed a few seconds.
+for a relation extension lock or a regular lock, but we will not accept
+them while waiting for spinlocks or LW locks. It is therefore not a good
+idea to use LW locks when the wait time might exceed a few seconds.
The rest of this README file discusses the regular lock manager in detail.
diff --git a/src/backend/storage/lmgr/extension_lock.c b/src/backend/storage/lmgr/extension_lock.c
new file mode 100644
index 0000000..1299c0e
--- /dev/null
+++ b/src/backend/storage/lmgr/extension_lock.c
@@ -0,0 +1,483 @@
+/*-------------------------------------------------------------------------
+ *
+ * extension_lock.c
+ * Relation extension lock manager
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/lmgr/extension_lock.c
+ *
+ * NOTES:
+ *
+ * This lock manager is specialized for relation extension locks; it
+ * is a lightweight, interruptible lock manager.  It is similar to the
+ * heavyweight lock manager but has no deadlock detection, group
+ * locking, or multiple lock modes.
+ *
+ * The entries for relation extension locks are allocated in shared
+ * memory as an array.  The pair of database id and relation id maps
+ * to one of the entries by hashing.
+ *
+ * For lock acquisition we use an atomic compare-and-exchange on the
+ * state variable.  When a process tries to acquire a lock that
+ * conflicts with an existing lock, it is put to sleep on a condition
+ * variable unless conditional locking was requested.  To release the
+ * lock, we use an atomic decrement.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/extension_lock.h"
+#include "utils/rel.h"
+
+/* The total number of relation extension lock entries in shared memory */
+#define N_RELEXTLOCK_ENTS 1024
+
+#define RELEXT_LOCK_BIT ((uint32) ((1 << 25)))
+
+/* Must be greater than MAX_BACKENDS - which is 2^23-1, so we're fine. */
+#define RELEXT_WAIT_COUNT_MASK ((uint32) ((1 << 24) - 1))
+
+/* This tag maps to one of the RelExtLockArray entries by hashing */
+typedef struct RelExtLockTag
+{
+ Oid dbid;
+ Oid relid;
+} RelExtLockTag;
+
+typedef struct RelExtLock
+{
+ pg_atomic_uint32 state; /* state of exclusive lock */
+ ConditionVariable cv;
+} RelExtLock;
+
+/*
+ * This structure holds information about the relation extension lock
+ * this backend is holding, is waiting for, or last held. The "lock"
+ * field points to the RelExtLockArray entry. If we're holding a
+ * relation extension lock on a relation, nLocks > 0; nLocks == 0
+ * means that we don't hold any locks. We use this structure both to
+ * keep track of held relation extension locks and as a one-entry
+ * cache: we don't invalidate the "lock" field on release, so if the
+ * next relation extension lock we need is the same one we just
+ * touched, we can reuse it without recomputing its place in
+ * RelExtLockArray.
+ *
+ * At most one lock can be held at once. Note that sometimes we
+ * could try to acquire a lock for the additional forks while holding
+ * the lock for the main fork; for example, adding extra relation
+ * blocks for both a relation and its free space map. But since this
+ * lock manager doesn't distinguish between the forks, we just
+ * increment nLocks in that case.
+ */
+typedef struct relextlock_handle
+{
+ Oid relid;
+ RelExtLock *lock;
+ int nLocks; /* > 0 means holding it */
+ bool waiting; /* true if we're waiting for it */
+} relextlock_handle;
+
+static relextlock_handle held_relextlock;
+
+/* Pointer to array containing relation extension lock states */
+static RelExtLock *RelExtLockArray;
+
+static bool RelExtLockAcquire(Oid relid, bool conditional);
+static void RelExtLockRelease(Oid relid, bool force);
+static bool RelExtLockAttemptLock(RelExtLock *relextlock);
+static inline uint32 RelExtLockTargetTagToIndex(RelExtLockTag *locktag);
+
+Size
+RelExtLockShmemSize(void)
+{
+ /* Relation extension locks array */
+ return mul_size(N_RELEXTLOCK_ENTS, sizeof(RelExtLock));
+}
+
+/*
+ * InitRelExtLocks
+ * Initialize the relation extension lock manager's data structures.
+ */
+void
+InitRelExtLocks(void)
+{
+ Size size;
+ bool found;
+ int i;
+
+ size = mul_size(N_RELEXTLOCK_ENTS, sizeof(RelExtLock));
+ RelExtLockArray = (RelExtLock *)
+ ShmemInitStruct("Relation Extension Lock", size, &found);
+
+ /* we're the first - initialize */
+ if (!found)
+ {
+ for (i = 0; i < N_RELEXTLOCK_ENTS; i++)
+ {
+ RelExtLock *relextlock = &RelExtLockArray[i];
+
+ pg_atomic_init_u32(&(relextlock->state), 0);
+ ConditionVariableInit(&(relextlock->cv));
+ }
+ }
+}
+
+/*
+ * LockRelationForExtension
+ *
+ * This lock is used to interlock addition of pages to relations.
+ * We need such locking because bufmgr/smgr definition of P_NEW is not
+ * race-condition-proof.
+ *
+ * We assume the caller is already holding some type of regular lock on
+ * the relation, so no AcceptInvalidationMessages call is needed here.
+ */
+void
+LockRelationForExtension(Relation relation)
+{
+ RelExtLockAcquire(relation->rd_id, false);
+}
+
+/*
+ * ConditionalLockRelationForExtension
+ *
+ * As above, but only lock if we can get the lock without blocking.
+ * Returns TRUE iff the lock was acquired.
+ */
+bool
+ConditionalLockRelationForExtension(Relation relation)
+{
+ return RelExtLockAcquire(relation->rd_id, true);
+}
+
+/*
+ * RelationExtensionLockWaiterCount
+ *
+ * Count the number of processes waiting for the given relation extension
+ * lock. Note that since multiple relations can map to the same
+ * RelExtLock entry, the returned value might not be accurate.
+ */
+int
+RelationExtensionLockWaiterCount(Relation relation)
+{
+ RelExtLockTag tag;
+ RelExtLock *relextlock;
+ uint32 state;
+
+ /* Make a lock tag */
+ tag.dbid = MyDatabaseId;
+ tag.relid = RelationGetRelid(relation);
+
+ relextlock = &RelExtLockArray[RelExtLockTargetTagToIndex(&tag)];
+ state = pg_atomic_read_u32(&(relextlock->state));
+
+ return (state & RELEXT_WAIT_COUNT_MASK);
+}
+
+/*
+ * UnlockRelationForExtension
+ */
+void
+UnlockRelationForExtension(Relation relation)
+{
+ RelExtLockRelease(relation->rd_id, false);
+}
+
+/*
+ * RelExtLockReleaseAll
+ *
+ * release all currently-held relation extension locks
+ */
+void
+RelExtLockReleaseAll(void)
+{
+ if (held_relextlock.nLocks > 0)
+ RelExtLockRelease(held_relextlock.relid, true);
+ else if (held_relextlock.waiting)
+ {
+ /*
+ * Decrement the wait count if we don't hold the lock but
+ * were waiting for it.
+ */
+ pg_atomic_sub_fetch_u32(&(held_relextlock.lock->state), 1);
+ }
+}
+
+/*
+ * IsAnyRelationExtensionLockHeld
+ *
+ * Return true if we're holding relation extension locks.
+ */
+bool
+IsAnyRelationExtensionLockHeld(void)
+{
+ return held_relextlock.nLocks > 0;
+}
+
+/*
+ * WaitForRelationExtensionLockToBeFree
+ *
+ * Wait for the relation extension lock on the given relation to
+ * be free without acquiring it.
+ */
+void
+WaitForRelationExtensionLockToBeFree(Relation relation)
+{
+ RelExtLock *relextlock;
+ Oid relid;
+
+ relid = RelationGetRelid(relation);
+
+ /* If we already hold the lock, no need to wait */
+ if (held_relextlock.nLocks > 0 && relid == held_relextlock.relid)
+ return;
+
+ /*
+ * If the last relation extension lock we touched is the same
+ * one for which we now need to wait, we can use our cached
+ * pointer to the lock instead of recomputing it.
+ */
+ if (relid == held_relextlock.relid)
+ relextlock = held_relextlock.lock;
+ else
+ {
+ RelExtLockTag tag;
+
+ /* Make a lock tag */
+ tag.dbid = MyDatabaseId;
+ tag.relid = relid;
+
+ relextlock = &RelExtLockArray[RelExtLockTargetTagToIndex(&tag)];
+
+ /* Remember the lock we're interested in */
+ held_relextlock.relid = relid;
+ held_relextlock.lock = relextlock;
+ }
+
+ for (;;)
+ {
+ uint32 state;
+
+ state = pg_atomic_read_u32(&(relextlock)->state);
+
+ /* Break if nobody is holding the lock on this relation */
+ if ((state & RELEXT_LOCK_BIT) == 0)
+ break;
+
+ /* Could not get the lock, prepare to wait */
+ if (!held_relextlock.waiting)
+ {
+ pg_atomic_add_fetch_u32(&(relextlock->state), 1);
+ held_relextlock.waiting = true;
+ }
+
+ /* Sleep until the lock is released */
+ ConditionVariableSleep(&(relextlock->cv),
+ WAIT_EVENT_RELATION_EXTENSION_LOCK);
+ }
+
+ ConditionVariableCancelSleep();
+
+ /* Release any wait count we hold */
+ if (held_relextlock.waiting)
+ {
+ pg_atomic_sub_fetch_u32(&(relextlock->state), 1);
+ held_relextlock.waiting = false;
+ }
+
+ return;
+}
+
+/*
+ * Compute the hash code associated with a RelExtLock.
+ *
+ * To avoid unnecessary recomputations of the hash code, we try to do this
+ * just once per function, and then pass it around as needed. we can
+ * extract the index number of RelExtLockArray.
+ */
+static inline uint32
+RelExtLockTargetTagToIndex(RelExtLockTag *locktag)
+{
+ return (tag_hash((const void *) locktag, sizeof(RelExtLockTag))
+ % N_RELEXTLOCK_ENTS);
+}
+
+/*
+ * Acquire a relation extension lock.
+ */
+static bool
+RelExtLockAcquire(Oid relid, bool conditional)
+{
+ RelExtLock *relextlock;
+ bool mustwait;
+
+ /*
+ * If we already hold the lock, we can just increase the count locally.
+ * Since we don't do deadlock detection, caller must not try to take a
+ * new relation extension lock while already holding one.
+ */
+ if (held_relextlock.nLocks > 0)
+ {
+ if (relid != held_relextlock.relid)
+ elog(ERROR,
+ "cannot acquire relation extension locks for multiple relations at the same");
+
+ held_relextlock.nLocks++;
+ return true;
+ }
+
+ /*
+ * If the last relation extension lock we touched is the same one we
+ * now need to acquire, we can use our cached pointer to the lock
+ * instead of recomputing it. This is likely to be a common case in
+ * practice.
+ */
+ if (relid == held_relextlock.relid)
+ relextlock = held_relextlock.lock;
+ else
+ {
+ RelExtLockTag tag;
+
+ /* Make a lock tag */
+ tag.dbid = MyDatabaseId;
+ tag.relid = relid;
+
+ relextlock = &RelExtLockArray[RelExtLockTargetTagToIndex(&tag)];
+
+ /* Remember the lock we're interested in */
+ held_relextlock.relid = relid;
+ held_relextlock.lock = relextlock;
+ }
+
+ held_relextlock.waiting = false;
+ for (;;)
+ {
+ mustwait = RelExtLockAttemptLock(relextlock);
+
+ if (!mustwait)
+ break; /* got the lock */
+
+ /* Could not get the lock; if locking conditionally, give up */
+ if (conditional)
+ return false;
+
+ /* Could not get the lock, prepare to wait */
+ if (!held_relextlock.waiting)
+ {
+ pg_atomic_add_fetch_u32(&(relextlock->state), 1);
+ held_relextlock.waiting = true;
+ }
+
+ /* Sleep until the lock is released */
+ ConditionVariableSleep(&(relextlock->cv),
+ WAIT_EVENT_RELATION_EXTENSION_LOCK);
+ }
+
+ ConditionVariableCancelSleep();
+
+ /* Release any wait count we hold */
+ if (held_relextlock.waiting)
+ {
+ pg_atomic_sub_fetch_u32(&(relextlock->state), 1);
+ held_relextlock.waiting = false;
+ }
+
+ Assert(!mustwait);
+
+ /* Remember lock held by this backend */
+ held_relextlock.relid = relid;
+ held_relextlock.lock = relextlock;
+ held_relextlock.nLocks = 1;
+
+ /* We got the lock! */
+ return true;
+}
+
+/*
+ * RelExtLockRelease
+ *
+ * Release a previously acquired relation extension lock. If force is
+ * true, we release all locks we hold on the given relation.
+ */
+static void
+RelExtLockRelease(Oid relid, bool force)
+{
+ RelExtLock *relextlock;
+ uint32 state;
+ uint32 wait_counts;
+
+ /* We should have acquired a lock before releasing */
+ Assert(held_relextlock.nLocks > 0);
+
+ if (relid != held_relextlock.relid)
+ ereport(ERROR,
+ (errcode(ERRCODE_LOCK_NOT_AVAILABLE),
+ errmsg("relation extension lock for %u is not held",
+ relid)));
+
+ /* If force releasing, release all locks we're holding */
+ if (force)
+ held_relextlock.nLocks = 0;
+ else
+ held_relextlock.nLocks--;
+
+ Assert(held_relextlock.nLocks >= 0);
+
+ /* Return if we're still holding the lock even after computation */
+ if (held_relextlock.nLocks > 0)
+ return;
+
+ relextlock = held_relextlock.lock;
+
+ /* Release the lock */
+ state = pg_atomic_sub_fetch_u32(&(relextlock->state), RELEXT_LOCK_BIT);
+
+ /* If there may be waiters, wake them up */
+ wait_counts = state & RELEXT_WAIT_COUNT_MASK;
+
+ if (wait_counts > 0)
+ ConditionVariableBroadcast(&(relextlock->cv));
+}
+
+/*
+ * Internal function that attempts to atomically acquire the relation
+ * extension lock.
+ *
+ * Returns true if the lock isn't free and we need to wait.
+ */
+static bool
+RelExtLockAttemptLock(RelExtLock *relextlock)
+{
+ uint32 oldstate;
+
+ oldstate = pg_atomic_read_u32(&relextlock->state);
+
+ while (true)
+ {
+ uint32 desired_state;
+ bool lock_free;
+
+ desired_state = oldstate;
+
+ lock_free = (oldstate & RELEXT_LOCK_BIT) == 0;
+ if (lock_free)
+ desired_state += RELEXT_LOCK_BIT;
+
+ if (pg_atomic_compare_exchange_u32(&relextlock->state,
+ &oldstate, desired_state))
+ {
+ if (lock_free)
+ return false;
+ else
+ return true;
+ }
+ }
+ pg_unreachable();
+}
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index da5679b..4fbc0c4 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -319,78 +319,6 @@ UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode)
}
/*
- * LockRelationForExtension
- *
- * This lock tag is used to interlock addition of pages to relations.
- * We need such locking because bufmgr/smgr definition of P_NEW is not
- * race-condition-proof.
- *
- * We assume the caller is already holding some type of regular lock on
- * the relation, so no AcceptInvalidationMessages call is needed here.
- */
-void
-LockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- (void) LockAcquire(&tag, lockmode, false, false);
-}
-
-/*
- * ConditionalLockRelationForExtension
- *
- * As above, but only lock if we can get the lock without blocking.
- * Returns true iff the lock was acquired.
- */
-bool
-ConditionalLockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- return (LockAcquire(&tag, lockmode, false, true) != LOCKACQUIRE_NOT_AVAIL);
-}
-
-/*
- * RelationExtensionLockWaiterCount
- *
- * Count the number of processes waiting for the given relation extension lock.
- */
-int
-RelationExtensionLockWaiterCount(Relation relation)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- return LockWaiterCount(&tag);
-}
-
-/*
- * UnlockRelationForExtension
- */
-void
-UnlockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- LockRelease(&tag, lockmode, false);
-}
-
-/*
* LockPage
*
* Obtain a page-level lock. This is currently used by some index access
@@ -961,12 +889,6 @@ DescribeLockTag(StringInfo buf, const LOCKTAG *tag)
tag->locktag_field2,
tag->locktag_field1);
break;
- case LOCKTAG_RELATION_EXTEND:
- appendStringInfo(buf,
- _("extension of relation %u of database %u"),
- tag->locktag_field2,
- tag->locktag_field1);
- break;
case LOCKTAG_PAGE:
appendStringInfo(buf,
_("page %u of relation %u of database %u"),
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 5833086..3eaf8cb 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -40,6 +40,7 @@
#include "miscadmin.h"
#include "pg_trace.h"
#include "pgstat.h"
+#include "storage/extension_lock.h"
#include "storage/proc.h"
#include "storage/procarray.h"
#include "storage/sinvaladt.h"
@@ -717,6 +718,13 @@ LockAcquireExtended(const LOCKTAG *locktag,
int status;
bool log_lock = false;
+ /*
+ * Relation extension locks don't participate in deadlock detection,
+ * so make sure we don't try to acquire a heavyweight lock while
+ * holding one.
+ */
+ Assert(IsAnyRelationExtensionLockHeld() == 0);
+
if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
elog(ERROR, "unrecognized lock method: %d", lockmethodid);
lockMethodTable = LockMethods[lockmethodid];
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 5f6727d..b1b0c63 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -44,6 +44,7 @@
#include "replication/slot.h"
#include "replication/syncrep.h"
#include "storage/condition_variable.h"
+#include "storage/extension_lock.h"
#include "storage/standby.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
@@ -765,6 +766,8 @@ ProcReleaseLocks(bool isCommit)
return;
/* If waiting, get off wait queue (should only be needed after error) */
LockErrorCleanup();
+ /* If we hold a relation extension lock, release it */
+ RelExtLockReleaseAll();
/* Release standard locks, including session-level if aborting */
LockReleaseAll(DEFAULT_LOCKMETHOD, !isCommit);
/* Release transaction-level advisory locks */
diff --git a/src/backend/utils/adt/lockfuncs.c b/src/backend/utils/adt/lockfuncs.c
index 9e0a8ab..6d8916c 100644
--- a/src/backend/utils/adt/lockfuncs.c
+++ b/src/backend/utils/adt/lockfuncs.c
@@ -25,7 +25,6 @@
/* This must match enum LockTagType! */
const char *const LockTagTypeNames[] = {
"relation",
- "extend",
"page",
"tuple",
"transactionid",
@@ -234,7 +233,6 @@ pg_lock_status(PG_FUNCTION_ARGS)
switch ((LockTagType) instance->locktag.locktag_type)
{
case LOCKTAG_RELATION:
- case LOCKTAG_RELATION_EXTEND:
values[1] = ObjectIdGetDatum(instance->locktag.locktag_field1);
values[2] = ObjectIdGetDatum(instance->locktag.locktag_field2);
nulls[3] = true;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 089b7c3..b3611c3 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -813,6 +813,7 @@ typedef enum
WAIT_EVENT_PARALLEL_BITMAP_SCAN,
WAIT_EVENT_PROCARRAY_GROUP_UPDATE,
WAIT_EVENT_CLOG_GROUP_UPDATE,
+ WAIT_EVENT_RELATION_EXTENSION_LOCK,
WAIT_EVENT_REPLICATION_ORIGIN_DROP,
WAIT_EVENT_REPLICATION_SLOT_DROP,
WAIT_EVENT_SAFE_SNAPSHOT,
diff --git a/src/include/storage/extension_lock.h b/src/include/storage/extension_lock.h
new file mode 100644
index 0000000..27fda42
--- /dev/null
+++ b/src/include/storage/extension_lock.h
@@ -0,0 +1,38 @@
+/*-------------------------------------------------------------------------
+ *
+ * extension_lock.h
+ * Relation extension lock manager
+ *
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/extension_lock.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef EXTENSION_LOCK_H
+#define EXTENSION_LOCK_H
+
+#ifdef FRONTEND
+#error "extension_lock.h may not be included from frontend code"
+#endif
+
+#include "port/atomics.h"
+#include "storage/s_lock.h"
+#include "storage/condition_variable.h"
+#include "storage/proclist_types.h"
+
+/* Lock a relation for extension */
+extern Size RelExtLockShmemSize(void);
+extern void InitRelExtLocks(void);
+extern void LockRelationForExtension(Relation relation);
+extern void UnlockRelationForExtension(Relation relation);
+extern bool ConditionalLockRelationForExtension(Relation relation);
+extern int RelationExtensionLockWaiterCount(Relation relation);
+extern void WaitForRelationExtensionLockToBeFree(Relation relation);
+extern void RelExtLockReleaseAll(void);
+extern bool IsAnyRelationExtensionLockHeld(void);
+
+#endif /* EXTENSION_LOCK_H */
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 0b92322..7e6b80c 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -50,13 +50,6 @@ extern bool LockHasWaitersRelation(Relation relation, LOCKMODE lockmode);
extern void LockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
-/* Lock a relation for extension */
-extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
-extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
-extern bool ConditionalLockRelationForExtension(Relation relation,
- LOCKMODE lockmode);
-extern int RelationExtensionLockWaiterCount(Relation relation);
-
/* Lock a page (currently only used within indexes) */
extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
extern bool ConditionalLockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 765431e..3be18ea 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -138,8 +138,6 @@ typedef uint16 LOCKMETHODID;
typedef enum LockTagType
{
LOCKTAG_RELATION, /* whole relation */
- /* ID info for a relation is DB OID + REL OID; DB OID = 0 if shared */
- LOCKTAG_RELATION_EXTEND, /* the right to extend a relation */
/* same ID info as RELATION */
LOCKTAG_PAGE, /* one page of a relation */
/* ID info for a page is RELATION info + BlockNumber */
@@ -198,14 +196,6 @@ typedef struct LOCKTAG
(locktag).locktag_type = LOCKTAG_RELATION, \
(locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
-#define SET_LOCKTAG_RELATION_EXTEND(locktag,dboid,reloid) \
- ((locktag).locktag_field1 = (dboid), \
- (locktag).locktag_field2 = (reloid), \
- (locktag).locktag_field3 = 0, \
- (locktag).locktag_field4 = 0, \
- (locktag).locktag_type = LOCKTAG_RELATION_EXTEND, \
- (locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
-
#define SET_LOCKTAG_PAGE(locktag,dboid,reloid,blocknum) \
((locktag).locktag_field1 = (dboid), \
(locktag).locktag_field2 = (reloid), \
On Fri, Dec 8, 2017 at 3:20 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
The first hunk in monitoring.sgml looks unnecessary.
You meant the following hunk?
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 8d461c8..7aa7981 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -669,8 +669,8 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
      Heavyweight locks, also known as lock manager locks or simply locks,
      primarily protect SQL-visible objects such as tables. However,
      they are also used to ensure mutual exclusion for certain internal
-     operations such as relation extension. <literal>wait_event</literal> will
-     identify the type of lock awaited.
+     operations such as waiting for a transaction to finish.
+     <literal>wait_event</literal> will identify the type of lock awaited.
     </para>
    </listitem>
    <listitem>

I think that since the extension locks are no longer a part of
heavyweight locks we should change the explanation.
Yes, you are right.
+RelationExtensionLockWaiterCount(Relation relation)
Hmm. This is sort of problematic, because with the new design we
have no guarantee that the return value is actually accurate. I don't
think that's a functional problem, but the optics aren't great.

Yeah, with this patch we could overestimate it and then add extra
blocks to the relation. Since the number of extra blocks is capped at
512 I think it would not become a serious problem.
How about renaming it EstimateNumberOfExtensionLockWaiters?
+	/* If force releasing, release all locks we're holding */
+	if (force)
+		held_relextlock.nLocks = 0;
+	else
+		held_relextlock.nLocks--;
+
+	Assert(held_relextlock.nLocks >= 0);
+
+	/* Return if we're still holding the lock even after computation */
+	if (held_relextlock.nLocks > 0)
+		return;

I thought you were going to have the caller adjust nLocks?
Yeah, I was going to change it that way, but since we always release either
one lock or all relext locks I thought it'd be better to pass a bool
rather than an int.
I don't see why you need to pass either one. The caller can set
held_relextlock.nLocks either with -- or = 0, and then call
RelExtLockRelease() only if the resulting value is 0.
When I took a break from sitting at the computer, I realized that I
think this has a more serious problem: won't it permanently leak
reference counts if someone hits ^C or an error occurs while the lock
is held? I think it will -- it probably needs to do cleanup at the
places where we do LWLockReleaseAll() that includes decrementing the
shared refcount if necessary, rather than doing cleanup at the places
we release heavyweight locks.
I might be wrong about the details here -- this is off the top of my head.

Good catch. It can leak reference counts if someone hits ^C or an
error occurs while waiting. Fixed in the latest patch. But since
RelExtLockReleaseAll() is called even in such situations, I think we
don't need to change the place where we release all the relext locks. We
just moved it from heavyweight locks. Am I missing something?
Hmm, that might be an OK way to handle it. I don't see a problem off
the top of my head. It might be clearer to rename it to
RelExtLockCleanup() though, since it is not just releasing the lock
but also any wait count we hold.
+/* Must be greater than MAX_BACKENDS - which is 2^23-1, so we're fine. */
+#define RELEXT_WAIT_COUNT_MASK ((uint32) ((1 << 24) - 1))
Let's drop the comment here and instead add a StaticAssertStmt() that
checks this.
I am slightly puzzled, though. If I read this correctly, bits 0-23
are used for the waiter count, bit 24 is always 0, bit 25 indicates
the presence or absence of an exclusive lock, and bits 26+ are always
0. That seems slightly odd. Shouldn't we either use the highest
available bit for the locker (bit 31) or the lowest one (bit 24)? The
former seems better, in case MAX_BACKENDS changes later. We could
make RELEXT_WAIT_COUNT_MASK bigger too, just in case.
+ /* Make a lock tag */
+ tag.dbid = MyDatabaseId;
+ tag.relid = relid;
What about shared relations? I bet we need to use 0 in that case.
Otherwise, if backends in two different databases try to extend the
same shared relation at the same time, we'll (probably) fail to notice
that they conflict.
+ * To avoid unnecessary recomputations of the hash code, we try to do this
+ * just once per function, and then pass it around as needed. we can
+ * extract the index number of RelExtLockArray.
This is just a copy-and-paste from lock.c, but actually we have a more
sophisticated scheme here. I think you can just drop this comment
altogether, really.
+ return (tag_hash((const void *) locktag, sizeof(RelExtLockTag))
+ % N_RELEXTLOCK_ENTS);
I would drop the outermost set of parentheses. Is the cast to (const
void *) really doing anything?
+ "cannot acquire relation extension locks for
multiple relations at the same");
cannot simultaneously acquire more than one distinct relation lock?
As you have it, you'd have to add the word "time" at the end, but my
version is shorter.
+ /* Sleep until the lock is released */
Really, there's no guarantee that the lock will be released when we
wake up. I think just /* Sleep until something happens, then recheck
*/
+ lock_free = (oldstate & RELEXT_LOCK_BIT) == 0;
+ if (lock_free)
+ desired_state += RELEXT_LOCK_BIT;
+
+ if (pg_atomic_compare_exchange_u32(&relextlock->state,
+ &oldstate, desired_state))
+ {
+ if (lock_free)
+ return false;
+ else
+ return true;
+ }
Hmm. If the lock is not free, we attempt to compare-and-swap anyway,
but then return false? Why not just:

    lock_free = (oldstate & RELEXT_LOCK_BIT) == 0;
    if (!lock_free)
        return true;
    if (pg_atomic_compare_exchange(&relextlock->state, &oldstate,
                                   oldstate | RELEXT_LOCK_BIT))
        return false;
+ Assert(IsAnyRelationExtensionLockHeld() == 0);
Since this returns bool now, it should just be
Assert(!IsAnyRelationExtensionLockHeld()).
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Dec 9, 2017 at 2:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Dec 8, 2017 at 3:20 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
The first hunk in monitoring.sgml looks unnecessary.
You meant the following hunk?
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 8d461c8..7aa7981 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -669,8 +669,8 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
      Heavyweight locks, also known as lock manager locks or simply locks,
      primarily protect SQL-visible objects such as tables. However,
      they are also used to ensure mutual exclusion for certain internal
-     operations such as relation extension. <literal>wait_event</literal> will
-     identify the type of lock awaited.
+     operations such as waiting for a transaction to finish.
+     <literal>wait_event</literal> will identify the type of lock awaited.
     </para>
    </listitem>
    <listitem>

I think that since the extension locks are no longer a part of
heavyweight locks we should change the explanation.

Yes, you are right.
+RelationExtensionLockWaiterCount(Relation relation)
Hmm. This is sort of problematic, because with the new design we
have no guarantee that the return value is actually accurate. I don't
think that's a functional problem, but the optics aren't great.

Yeah, with this patch we could overestimate it and then add extra
blocks to the relation. Since the number of extra blocks is capped at
512 I think it would not become a serious problem.

How about renaming it EstimateNumberOfExtensionLockWaiters?
Agreed, fixed.
+	/* If force releasing, release all locks we're holding */
+	if (force)
+		held_relextlock.nLocks = 0;
+	else
+		held_relextlock.nLocks--;
+
+	Assert(held_relextlock.nLocks >= 0);
+
+	/* Return if we're still holding the lock even after computation */
+	if (held_relextlock.nLocks > 0)
+		return;

I thought you were going to have the caller adjust nLocks?
Yeah, I was going to change it that way, but since we always release either
one lock or all relext locks I thought it'd be better to pass a bool
rather than an int.

I don't see why you need to pass either one. The caller can set
held_relextlock.nLocks either with -- or = 0, and then call
RelExtLockRelease() only if the resulting value is 0.
Fixed.
When I took a break from sitting at the computer, I realized that I
think this has a more serious problem: won't it permanently leak
reference counts if someone hits ^C or an error occurs while the lock
is held? I think it will -- it probably needs to do cleanup at the
places where we do LWLockReleaseAll() that includes decrementing the
shared refcount if necessary, rather than doing cleanup at the places
we release heavyweight locks.
I might be wrong about the details here -- this is off the top of my head.

Good catch. It can leak reference counts if someone hits ^C or an
error occurs while waiting. Fixed in the latest patch. But since
RelExtLockReleaseAll() is called even in such situations, I think we
don't need to change the place where we release all the relext locks. We
just moved it from heavyweight locks. Am I missing something?

Hmm, that might be an OK way to handle it. I don't see a problem off
the top of my head. It might be clearer to rename it to
RelExtLockCleanup() though, since it is not just releasing the lock
but also any wait count we hold.
Yeah, it seems better. Fixed.
+/* Must be greater than MAX_BACKENDS - which is 2^23-1, so we're fine. */
+#define RELEXT_WAIT_COUNT_MASK ((uint32) ((1 << 24) - 1))

Let's drop the comment here and instead add a StaticAssertStmt() that
checks this.
Fixed. I added StaticAssertStmt() to InitRelExtLocks().
I am slightly puzzled, though. If I read this correctly, bits 0-23
are used for the waiter count, bit 24 is always 0, bit 25 indicates
the presence or absence of an exclusive lock, and bits 26+ are always
0. That seems slightly odd. Shouldn't we either use the highest
available bit for the locker (bit 31) or the lowest one (bit 24)? The
former seems better, in case MAX_BACKENDS changes later. We could
make RELEXT_WAIT_COUNT_MASK bigger too, just in case.
I agree with the former. Fixed.
+	/* Make a lock tag */
+	tag.dbid = MyDatabaseId;
+	tag.relid = relid;

What about shared relations? I bet we need to use 0 in that case.
Otherwise, if backends in two different databases try to extend the
same shared relation at the same time, we'll (probably) fail to notice
that they conflict.
You're right. I changed it so that we set InvalidOid as tag.dbid if
the relation is a shared relation.
+ * To avoid unnecessary recomputations of the hash code, we try to do this
+ * just once per function, and then pass it around as needed. we can
+ * extract the index number of RelExtLockArray.

This is just a copy-and-paste from lock.c, but actually we have a more
sophisticated scheme here. I think you can just drop this comment
altogether, really.
Fixed.
+	return (tag_hash((const void *) locktag, sizeof(RelExtLockTag))
+			% N_RELEXTLOCK_ENTS);

I would drop the outermost set of parentheses. Is the cast to (const
void *) really doing anything?
Fixed.
+ "cannot acquire relation extension locks for
multiple relations at the same");

cannot simultaneously acquire more than one distinct relation lock?
As you have it, you'd have to add the word "time" at the end, but my
version is shorter.
I wanted to mean "cannot acquire relation extension locks for multiple
relations at the same time". Fixed.
+ /* Sleep until the lock is released */
Really, there's no guarantee that the lock will be released when we
wake up. I think just /* Sleep until something happens, then recheck */
Fixed.
+ lock_free = (oldstate & RELEXT_LOCK_BIT) == 0;
+ if (lock_free)
+ desired_state += RELEXT_LOCK_BIT;
+
+ if (pg_atomic_compare_exchange_u32(&relextlock->state,
+ &oldstate, desired_state))
+ {
+ if (lock_free)
+ return false;
+ else
+ return true;
+ }

Hmm. If the lock is not free, we attempt to compare-and-swap anyway,
but then return false? Why not just

lock_free = (oldstate & RELEXT_LOCK_BIT) == 0;
if (!lock_free)
    return true;
if (pg_atomic_compare_exchange(&relextlock->state, &oldstate,
                               oldstate | RELEXT_LOCK_BIT))
    return false;
Fixed.
+ Assert(IsAnyRelationExtensionLockHeld() == 0);
Since this is return bool now, it should just be
Assert(!IsAnyRelationExtensionLockHeld()).
Fixed.
Attached is an updated version of the patch. Please review it.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachments:
Moving_extension_lock_out_of_heavyweight_lock_v11.patchapplication/octet-stream; name=Moving_extension_lock_out_of_heavyweight_lock_v11.patchDownload
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index b6f80d9..07dd3f7 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -669,8 +669,8 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
Heavyweight locks, also known as lock manager locks or simply locks,
primarily protect SQL-visible objects such as tables. However,
they are also used to ensure mutual exclusion for certain internal
- operations such as relation extension. <literal>wait_event</literal> will
- identify the type of lock awaited.
+ operations such as waiting for a transaction to finish.
+ <literal>wait_event</literal> will identify the type of lock awaited.
</para>
</listitem>
<listitem>
@@ -1122,15 +1122,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
execution.</entry>
</row>
<row>
- <entry morerows="9"><literal>Lock</literal></entry>
+ <entry morerows="8"><literal>Lock</literal></entry>
<entry><literal>relation</literal></entry>
<entry>Waiting to acquire a lock on a relation.</entry>
</row>
<row>
- <entry><literal>extend</literal></entry>
- <entry>Waiting to extend a relation.</entry>
- </row>
- <row>
<entry><literal>page</literal></entry>
<entry>Waiting to acquire a lock on page of a relation.</entry>
</row>
@@ -1263,7 +1259,7 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting in an extension.</entry>
</row>
<row>
- <entry morerows="17"><literal>IPC</literal></entry>
+ <entry morerows="18"><literal>IPC</literal></entry>
<entry><literal>BgWorkerShutdown</literal></entry>
<entry>Waiting for background worker to shut down.</entry>
</row>
@@ -1320,6 +1316,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting for group leader to update transaction status at transaction end.</entry>
</row>
<row>
+ <entry><literal>RelationExtensionLock</literal></entry>
+ <entry>Waiting to extend a relation.</entry>
+ </row>
+ <row>
<entry><literal>ReplicationOriginDrop</literal></entry>
<entry>Waiting for a replication origin to become inactive to be dropped.</entry>
</row>
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 09db5c6..05cca9d 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -17,6 +17,7 @@
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/freespace.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
@@ -623,8 +624,7 @@ brin_page_cleanup(Relation idxrel, Buffer buf)
*/
if (PageIsNew(page))
{
- LockRelationForExtension(idxrel, ShareLock);
- UnlockRelationForExtension(idxrel, ShareLock);
+ WaitForRelationExtensionLockToBeFree(idxrel);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
if (PageIsNew(page))
@@ -716,7 +716,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
*/
if (!RELATION_IS_LOCAL(irel))
{
- LockRelationForExtension(irel, ExclusiveLock);
+ LockRelationForExtension(irel);
extensionLockHeld = true;
}
buf = ReadBuffer(irel, P_NEW);
@@ -768,7 +768,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
}
if (extensionLockHeld)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel);
ReleaseBuffer(buf);
return InvalidBuffer;
@@ -778,7 +778,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
if (extensionLockHeld)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel);
page = BufferGetPage(buf);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index 03e53ce..af8f5ce 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -29,6 +29,7 @@
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/lmgr.h"
#include "utils/rel.h"
@@ -570,7 +571,7 @@ revmap_physical_extend(BrinRevmap *revmap)
else
{
if (needLock)
- LockRelationForExtension(irel, ExclusiveLock);
+ LockRelationForExtension(irel);
buf = ReadBuffer(irel, P_NEW);
if (BufferGetBlockNumber(buf) != mapBlk)
@@ -582,7 +583,7 @@ revmap_physical_extend(BrinRevmap *revmap)
* page from under whoever is using it.
*/
if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel);
LockBuffer(revmap->rm_metaBuf, BUFFER_LOCK_UNLOCK);
ReleaseBuffer(buf);
return;
@@ -591,7 +592,7 @@ revmap_physical_extend(BrinRevmap *revmap)
page = BufferGetPage(buf);
if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel);
}
/* Check that it's a regular block (or an empty page) */
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index d9c6483..8d35918 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -21,6 +21,7 @@
#include "catalog/pg_collation.h"
#include "catalog/pg_type.h"
#include "miscadmin.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/builtins.h"
@@ -325,13 +326,13 @@ GinNewBuffer(Relation index)
/* Must extend the file */
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
buffer = ReadBuffer(index, P_NEW);
LockBuffer(buffer, GIN_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
return buffer;
}
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 394bc83..d769a76 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -20,6 +20,7 @@
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "postmaster/autovacuum.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/memutils.h"
@@ -716,10 +717,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
npages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
totFreePages = 0;
@@ -766,10 +767,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
stats->pages_free = totFreePages;
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
stats->num_pages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
return stats;
}
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index d8d1c0a..76171a5 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -20,6 +20,7 @@
#include "access/htup_details.h"
#include "access/reloptions.h"
#include "catalog/pg_opclass.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/builtins.h"
@@ -821,13 +822,13 @@ gistNewBuffer(Relation r)
needLock = !RELATION_IS_LOCAL(r);
if (needLock)
- LockRelationForExtension(r, ExclusiveLock);
+ LockRelationForExtension(r);
buffer = ReadBuffer(r, P_NEW);
LockBuffer(buffer, GIST_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(r, ExclusiveLock);
+ UnlockRelationForExtension(r);
return buffer;
}
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 77d9d12..42ef36a 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -18,6 +18,7 @@
#include "access/gist_private.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
@@ -59,10 +60,10 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
/* try to find deleted pages */
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
npages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
totFreePages = 0;
for (blkno = GIST_ROOT_BLKNO + 1; blkno < npages; blkno++)
@@ -91,10 +92,10 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
/* return statistics */
stats->pages_free = totFreePages;
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
stats->num_pages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
return stats;
}
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 13e3bdc..fc2c9b4 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -20,6 +20,7 @@
#include "access/htup_details.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/freespace.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
@@ -186,7 +187,7 @@ RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
Buffer buffer;
/* Use the length of the lock wait queue to judge how much to extend. */
- lockWaiters = RelationExtensionLockWaiterCount(relation);
+ lockWaiters = EstimateNumberOfExtensionLockWaiters(relation);
if (lockWaiters <= 0)
return;
@@ -519,11 +520,11 @@ loop:
if (needLock)
{
if (!use_fsm)
- LockRelationForExtension(relation, ExclusiveLock);
- else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock))
+ LockRelationForExtension(relation);
+ else if (!ConditionalLockRelationForExtension(relation))
{
/* Couldn't get the lock immediately; wait for it. */
- LockRelationForExtension(relation, ExclusiveLock);
+ LockRelationForExtension(relation);
/*
* Check if some other backend has extended a block for us while
@@ -537,7 +538,7 @@ loop:
*/
if (targetBlock != InvalidBlockNumber)
{
- UnlockRelationForExtension(relation, ExclusiveLock);
+ UnlockRelationForExtension(relation);
goto loop;
}
@@ -576,7 +577,7 @@ loop:
* against vacuumlazy.c --- see comments therein.
*/
if (needLock)
- UnlockRelationForExtension(relation, ExclusiveLock);
+ UnlockRelationForExtension(relation);
/*
* We need to initialize the empty new page. Double-check that it really
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 4c2a13a..2efee68 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -90,6 +90,7 @@
#include "access/xlog.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
#include "utils/inval.h"
@@ -641,7 +642,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
* Note that another backend might have extended or created the relation
* by the time we get the lock.
*/
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
@@ -679,7 +680,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
/* Update local cache with the up-to-date size */
rel->rd_smgr->smgr_vm_nblocks = vm_nblocks_now;
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
pfree(pg);
}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index c774349..7824c92 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -28,6 +28,7 @@
#include "access/xlog.h"
#include "access/xloginsert.h"
#include "miscadmin.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
@@ -659,7 +660,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
needLock = !RELATION_IS_LOCAL(rel);
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
buf = ReadBuffer(rel, P_NEW);
@@ -673,7 +674,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
* condition against btvacuumscan --- see comments therein.
*/
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
/* Initialize the new page before returning it */
page = BufferGetPage(buf);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 399e6a1..5af1c21 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -25,6 +25,7 @@
#include "commands/vacuum.h"
#include "pgstat.h"
#include "storage/condition_variable.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
@@ -1058,10 +1059,10 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
{
/* Get the current relation length */
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
num_pages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
/* Quit if we've scanned the whole relation */
if (blkno >= num_pages)
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index bd5301f..0ff53a3 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -23,6 +23,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/builtins.h"
+#include "storage/extension_lock.h"
#include "utils/index_selfuncs.h"
#include "utils/lsyscache.h"
@@ -230,13 +231,13 @@ SpGistNewBuffer(Relation index)
/* Must extend the file */
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
buffer = ReadBuffer(index, P_NEW);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
return buffer;
}
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index d7d5e90..385d1cb 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -24,6 +24,7 @@
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/snapmgr.h"
@@ -824,10 +825,10 @@ spgvacuumscan(spgBulkDeleteState *bds)
{
/* Get the current relation length */
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
num_pages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
/* Quit if we've scanned the whole relation */
if (blkno >= num_pages)
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 20ce431..4a72223 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -54,6 +54,7 @@
#include "portability/instr_time.h"
#include "postmaster/autovacuum.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/freespace.h"
#include "storage/lmgr.h"
#include "utils/lsyscache.h"
@@ -860,8 +861,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* it's got exclusive lock on the whole relation.
*/
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
- LockRelationForExtension(onerel, ExclusiveLock);
- UnlockRelationForExtension(onerel, ExclusiveLock);
+ WaitForRelationExtensionLockToBeFree(onerel);
LockBufferForCleanup(buf);
if (PageIsNew(page))
{
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 5c256ff..210552f 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3616,6 +3616,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_CLOG_GROUP_UPDATE:
event_name = "ClogGroupUpdate";
break;
+ case WAIT_EVENT_RELATION_EXTENSION_LOCK:
+ event_name = "RelationExtensionLock";
+ break;
case WAIT_EVENT_REPLICATION_ORIGIN_DROP:
event_name = "ReplicationOriginDrop";
break;
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 4648473..172a48c 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -26,6 +26,7 @@
#include "access/htup_details.h"
#include "access/xlogutils.h"
#include "miscadmin.h"
+#include "storage/extension_lock.h"
#include "storage/freespace.h"
#include "storage/fsm_internals.h"
#include "storage/lmgr.h"
@@ -624,7 +625,7 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
* Note that another backend might have extended or created the relation
* by the time we get the lock.
*/
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
@@ -652,7 +653,7 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
/* Update local cache with the up-to-date size */
rel->rd_smgr->smgr_fsm_nblocks = fsm_nblocks_now;
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
pfree(pg);
}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2d1ed14..3b6a6f7 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -35,6 +35,7 @@
#include "replication/origin.h"
#include "storage/bufmgr.h"
#include "storage/dsm.h"
+#include "storage/extension_lock.h"
#include "storage/ipc.h"
#include "storage/pg_shmem.h"
#include "storage/pmsignal.h"
@@ -133,6 +134,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
size = add_size(size, BackgroundWorkerShmemSize());
size = add_size(size, MultiXactShmemSize());
size = add_size(size, LWLockShmemSize());
+ size = add_size(size, RelExtLockShmemSize());
size = add_size(size, ProcArrayShmemSize());
size = add_size(size, BackendStatusShmemSize());
size = add_size(size, SInvalShmemSize());
@@ -235,6 +237,11 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
InitPredicateLocks();
/*
+ * Set up relation extension lock manager
+ */
+ InitRelExtLocks();
+
+ /*
* Set up process table
*/
if (!IsUnderPostmaster)
diff --git a/src/backend/storage/lmgr/Makefile b/src/backend/storage/lmgr/Makefile
index e1b787e..2334a40 100644
--- a/src/backend/storage/lmgr/Makefile
+++ b/src/backend/storage/lmgr/Makefile
@@ -13,7 +13,7 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = lmgr.o lock.o proc.o deadlock.o lwlock.o lwlocknames.o spin.o \
- s_lock.o predicate.o condition_variable.o
+ s_lock.o predicate.o condition_variable.o extension_lock.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 56b0a12..960d1f3 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -3,7 +3,7 @@ src/backend/storage/lmgr/README
Locking Overview
================
-Postgres uses four types of interprocess locks:
+Postgres uses five types of interprocess locks:
* Spinlocks. These are intended for *very* short-term locks. If a lock
is to be held more than a few dozen instructions, or across any sort of
@@ -36,13 +36,21 @@ Regular locks should be used for all user-driven lock requests.
* SIReadLock predicate locks. See separate README-SSI file for details.
+* Relation extension locks. Only one process can extend a relation at
+a time; we use a specialized lock manager for this purpose, which is
+much simpler than the regular lock manager. It is similar to the
+lightweight lock mechanism, but is even simpler because there is only
+one lock mode and only one lock can be taken at a time. A process holding
+a relation extension lock is interruptible, unlike a process holding an
+LWLock.
+
Acquisition of either a spinlock or a lightweight lock causes query
cancel and die() interrupts to be held off until all such locks are
released. No such restriction exists for regular locks, however. Also
note that we can accept query cancel and die() interrupts while waiting
-for a regular lock, but we will not accept them while waiting for
-spinlocks or LW locks. It is therefore not a good idea to use LW locks
-when the wait time might exceed a few seconds.
+for a relation extension lock or a regular lock, but we will not accept
+them while waiting for spinlocks or LW locks. It is therefore not a good
+idea to use LW locks when the wait time might exceed a few seconds.
The rest of this README file discusses the regular lock manager in detail.
diff --git a/src/backend/storage/lmgr/extension_lock.c b/src/backend/storage/lmgr/extension_lock.c
new file mode 100644
index 0000000..368c96a
--- /dev/null
+++ b/src/backend/storage/lmgr/extension_lock.c
@@ -0,0 +1,494 @@
+/*-------------------------------------------------------------------------
+ *
+ * extension_lock.c
+ * Relation extension lock manager
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/lmgr/extension_lock.c
+ *
+ * NOTES:
+ *
+ * This lock manager is specialized for relation extension locks; it is
+ * lightweight and interruptible. It is similar to the heavyweight lock
+ * manager but has no deadlock detection, no group locking, and only a
+ * single lock mode.
+ *
+ * The entries for relation extension locks are allocated in shared
+ * memory as an array. Each pair of database OID and relation OID maps
+ * to one of them by hashing.
+ *
+ * For lock acquisition we use an atomic compare-and-exchange on the
+ * state variable. When a process tries to acquire a lock that conflicts
+ * with an existing lock, it is put to sleep on a condition variable,
+ * unless a conditional lock was requested. To release the lock, we
+ * atomically clear the lock bit in the state variable.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "pgstat.h"
+
+#include "catalog/catalog.h"
+#include "postmaster/postmaster.h"
+#include "storage/extension_lock.h"
+#include "utils/rel.h"
+
+/* The total number of relation extension lock entries in shared memory */
+#define N_RELEXTLOCK_ENTS 1024
+
+#define RELEXT_LOCK_BIT ((uint32) ((1 << 30)))
+#define RELEXT_WAIT_COUNT_MASK ((uint32) ((1 << 24) - 1))
+
+/* This tag maps to one of the entries of RelExtLockArray by hashing */
+typedef struct RelExtLockTag
+{
+ Oid dbid; /* InvalidOid if the relation is a shared relation */
+ Oid relid;
+} RelExtLockTag;
+
+typedef struct RelExtLock
+{
+ pg_atomic_uint32 state; /* state of exclusive lock */
+ ConditionVariable cv;
+} RelExtLock;
+
+/*
+ * This structure holds information per-object relation extension
+ * lock. "lock" variable represents the RelExtLockArray we are
+ * holding, waiting for or had been holding before. If we're holding
+ * a relation extension lock on a relation, nLocks > 0. nLocks == 0
+ * means that we don't hold any locks. We use this structure to keep
+ * track of the relation extension lock we hold, and also as a cache.
+ * When releasing the lock we therefore don't invalidate the lock
+ * variable; on the next acquisition we check the cache first, and
+ * reuse it without recomputing the RelExtLockArray index when the
+ * target is the same lock we touched last.
+ *
+ * At most one lock can be held at once. Note that sometimes we
+ * could try to acquire a lock for the additional forks while holding
+ * the lock for the main fork; for example, adding extra relation
+ * blocks for both relation and its free space map. But since this
+ * lock manager doesn't distinguish between the forks, we just
+ * increment nLocks in that case.
+ */
+typedef struct relextlock_handle
+{
+ Oid relid;
+ RelExtLock *lock;
+ int nLocks; /* > 0 means holding it */
+ bool waiting; /* true if we're waiting for it */
+} relextlock_handle;
+
+static relextlock_handle held_relextlock;
+
+/* Pointer to array containing relation extension lock states */
+static RelExtLock *RelExtLockArray;
+
+static bool RelExtLockAcquire(Oid relid, bool conditional);
+static void RelExtLockRelease(Oid relid);
+static bool RelExtLockAttemptLock(RelExtLock *relextlock);
+static inline uint32 RelExtLockTargetTagToIndex(RelExtLockTag *locktag);
+
+Size
+RelExtLockShmemSize(void)
+{
+ /* Relation extension locks array */
+ return mul_size(N_RELEXTLOCK_ENTS, sizeof(RelExtLock));
+}
+
+/*
+ * InitRelExtLock
+ * Initialize the relation extension lock manager's data structures.
+ */
+void
+InitRelExtLocks(void)
+{
+ Size size;
+ bool found;
+ int i;
+
+ /*
+ * This static assertion verifies that we have enough bit space for
+ * the waiter count of a relation extension lock.
+ */
+ StaticAssertStmt(RELEXT_WAIT_COUNT_MASK >= MAX_BACKENDS,
+ "maximum waiter count of relation extension lock exceeds MAX_BACKENDS");
+
+ size = mul_size(N_RELEXTLOCK_ENTS, sizeof(RelExtLock));
+ RelExtLockArray = (RelExtLock *)
+ ShmemInitStruct("Relation Extension Lock", size, &found);
+
+ /* we're the first - initialize */
+ if (!found)
+ {
+ for (i = 0; i < N_RELEXTLOCK_ENTS; i++)
+ {
+ RelExtLock *relextlock = &RelExtLockArray[i];
+
+ pg_atomic_init_u32(&(relextlock->state), 0);
+ ConditionVariableInit(&(relextlock->cv));
+ }
+ }
+}
+
+/*
+ * LockRelationForExtension
+ *
+ * This lock is used to interlock addition of pages to relations.
+ * We need such locking because bufmgr/smgr definition of P_NEW is not
+ * race-condition-proof.
+ *
+ * We assume the caller is already holding some type of regular lock on
+ * the relation, so no AcceptInvalidationMessages call is needed here.
+ */
+void
+LockRelationForExtension(Relation relation)
+{
+ RelExtLockAcquire(RelationGetRelid(relation), false);
+}
+
+/*
+ * ConditionalLockRelationForExtension
+ *
+ * As above, but only lock if we can get the lock without blocking.
+ * Returns TRUE iff the lock was acquired.
+ */
+bool
+ConditionalLockRelationForExtension(Relation relation)
+{
+ return RelExtLockAcquire(RelationGetRelid(relation), true);
+}
+
+/*
+ * EstimateNumberOfExtensionLockWaiters
+ *
+ * Estimate the number of processes waiting for the given relation
+ * extension lock. Since locks on multiple relations can map to the
+ * same RelExtLock entry, the return value might not be
+ * accurate.
+ */
+int
+EstimateNumberOfExtensionLockWaiters(Relation relation)
+{
+ RelExtLockTag tag;
+ RelExtLock *relextlock;
+ uint32 state;
+ Oid relid = RelationGetRelid(relation);
+
+ /* Make a lock tag */
+ tag.dbid = IsSharedRelation(relid) ? InvalidOid : MyDatabaseId;
+ tag.relid = relid;
+
+ relextlock = &RelExtLockArray[RelExtLockTargetTagToIndex(&tag)];
+ state = pg_atomic_read_u32(&(relextlock->state));
+
+ return (state & RELEXT_WAIT_COUNT_MASK);
+}
+
+/*
+ * UnlockRelationForExtension
+ */
+void
+UnlockRelationForExtension(Relation relation)
+{
+ Oid relid = RelationGetRelid(relation);
+
+ if (held_relextlock.nLocks > 0)
+ {
+ if (relid != held_relextlock.relid)
+ ereport(ERROR,
+ (errcode(ERRCODE_LOCK_NOT_AVAILABLE),
+ errmsg("relation extension lock for %u is not held",
+ relid)));
+
+ /* Decrement lock counts locally */
+ held_relextlock.nLocks--;
+
+ if (held_relextlock.nLocks == 0)
+ RelExtLockRelease(relid);
+ }
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_LOCK_NOT_AVAILABLE),
+ errmsg("attempted to release relation extension lock for %u while holding no relation extension locks",
+ relid)));
+}
+
+/*
+ * RelExtLockCleanup
+ *
+ * Release all currently-held relation extension locks and wait counts.
+ */
+void
+RelExtLockCleanup(void)
+{
+ if (held_relextlock.nLocks > 0)
+ {
+ /* Forcibly release all locks */
+ held_relextlock.nLocks = 0;
+
+ RelExtLockRelease(held_relextlock.relid);
+ }
+ else if (held_relextlock.waiting)
+ {
+ /*
+ * Decrement the waiter count if we don't hold the lock but
+ * were waiting for it. This can happen if the query is
+ * cancelled or an error occurs while waiting for the lock.
+ */
+ pg_atomic_sub_fetch_u32(&(held_relextlock.lock->state), 1);
+ }
+}
+
+/*
+ * IsAnyRelationExtensionLockHeld
+ *
+ * Return true if we hold any relation extension lock.
+ */
+bool
+IsAnyRelationExtensionLockHeld(void)
+{
+ return held_relextlock.nLocks > 0;
+}
+
+/*
+ * WaitForRelationExtensionLockToBeFree
+ *
+ * Wait for the relation extension lock on the given relation to
+ * be free without acquiring it.
+ */
+void
+WaitForRelationExtensionLockToBeFree(Relation relation)
+{
+ RelExtLock *relextlock;
+ Oid relid;
+
+ relid = RelationGetRelid(relation);
+
+ /* If we already hold the lock, no need to wait */
+ if (held_relextlock.nLocks > 0 && relid == held_relextlock.relid)
+ return;
+
+ /*
+ * If the last relation extension lock we touched is the same
+ * one for which we now need to wait, we can use our cached
+ * pointer to the lock instead of recomputing it.
+ */
+ if (relid == held_relextlock.relid)
+ relextlock = held_relextlock.lock;
+ else
+ {
+ RelExtLockTag tag;
+
+ /* Make a lock tag */
+ tag.dbid = IsSharedRelation(relid) ? InvalidOid : MyDatabaseId;
+ tag.relid = relid;
+
+ relextlock = &RelExtLockArray[RelExtLockTargetTagToIndex(&tag)];
+
+ /* Remember the lock we're interested in */
+ held_relextlock.relid = relid;
+ held_relextlock.lock = relextlock;
+ }
+
+ for (;;)
+ {
+ uint32 state;
+
+ state = pg_atomic_read_u32(&(relextlock)->state);
+
+ /* Break if nobody is holding the lock on this relation */
+ if ((state & RELEXT_LOCK_BIT) == 0)
+ break;
+
+ /* Could not get the lock, prepare to wait */
+ if (!held_relextlock.waiting)
+ {
+ pg_atomic_add_fetch_u32(&(relextlock->state), 1);
+ held_relextlock.waiting = true;
+ }
+
+ /* Sleep until something happens, then recheck */
+ ConditionVariableSleep(&(relextlock->cv),
+ WAIT_EVENT_RELATION_EXTENSION_LOCK);
+ }
+
+ ConditionVariableCancelSleep();
+
+ /* Release any wait count we hold */
+ if (held_relextlock.waiting)
+ {
+ pg_atomic_sub_fetch_u32(&(relextlock->state), 1);
+ held_relextlock.waiting = false;
+ }
+
+ return;
+}
+
+/*
+ * Compute the hash code associated with a RelExtLock.
+ */
+static inline uint32
+RelExtLockTargetTagToIndex(RelExtLockTag *locktag)
+{
+ return tag_hash(locktag, sizeof(RelExtLockTag)) % N_RELEXTLOCK_ENTS;
+}
+
+/*
+ * Acquire a relation extension lock.
+ */
+static bool
+RelExtLockAcquire(Oid relid, bool conditional)
+{
+ RelExtLock *relextlock;
+ bool mustwait;
+
+ /*
+ * If we already hold the lock, we can just increase the count locally.
+ * Since we don't do deadlock detection, caller must not try to take a
+ * new relation extension lock while already holding one.
+ */
+ if (held_relextlock.nLocks > 0)
+ {
+ if (relid != held_relextlock.relid)
+ elog(ERROR,
"cannot acquire relation extension locks for multiple relations at the same time");
+
+ held_relextlock.nLocks++;
+ return true;
+ }
+
+ /*
+ * If the last relation extension lock we touched is the same one we
+ * now need to acquire, we can use our cached pointer to the lock
+ * instead of recomputing it. This is likely to be a common case in
+ * practice.
+ */
+ if (relid == held_relextlock.relid)
+ relextlock = held_relextlock.lock;
+ else
+ {
+ RelExtLockTag tag;
+
+ /* Make a lock tag */
+ tag.dbid = IsSharedRelation(relid) ? InvalidOid : MyDatabaseId;
+ tag.relid = relid;
+
+ relextlock = &RelExtLockArray[RelExtLockTargetTagToIndex(&tag)];
+
+ /* Remember the lock we're interested in */
+ held_relextlock.relid = relid;
+ held_relextlock.lock = relextlock;
+ }
+
+ held_relextlock.waiting = false;
+ for (;;)
+ {
+ mustwait = RelExtLockAttemptLock(relextlock);
+
+ if (!mustwait)
+ break; /* got the lock */
+
+ /* Could not get the lock; if locking conditionally, give up */
+ if (conditional)
+ return false;
+
+ /* Could not get the lock, prepare to wait */
+ if (!held_relextlock.waiting)
+ {
+ pg_atomic_add_fetch_u32(&(relextlock->state), 1);
+ held_relextlock.waiting = true;
+ }
+
+ /* Sleep until something happens, then recheck */
+ ConditionVariableSleep(&(relextlock->cv),
+ WAIT_EVENT_RELATION_EXTENSION_LOCK);
+ }
+
+ ConditionVariableCancelSleep();
+
+ /* Release any wait count we hold */
+ if (held_relextlock.waiting)
+ {
+ pg_atomic_sub_fetch_u32(&(relextlock->state), 1);
+ held_relextlock.waiting = false;
+ }
+
+ Assert(!mustwait);
+
+ /* Remember lock held by this backend */
+ held_relextlock.relid = relid;
+ held_relextlock.lock = relextlock;
+ held_relextlock.nLocks = 1;
+
+ /* We got the lock! */
+ return true;
+}
+
+/*
+ * RelExtLockRelease
+ *
+ * Release a previously acquired relation extension lock.
+ */
+static void
+RelExtLockRelease(Oid relid)
+{
+ RelExtLock *relextlock;
+ uint32 state;
+ uint32 wait_counts;
+
+ Assert(held_relextlock.nLocks == 0);
+
+ if (relid != held_relextlock.relid)
+ ereport(ERROR,
+ (errcode(ERRCODE_LOCK_NOT_AVAILABLE),
+ errmsg("relation extension lock for %u is not held",
+ relid)));
+
+ relextlock = held_relextlock.lock;
+
+ /* Release the lock */
+ state = pg_atomic_sub_fetch_u32(&(relextlock->state), RELEXT_LOCK_BIT);
+
+ /* If there may be waiters, wake them up */
+ wait_counts = state & RELEXT_WAIT_COUNT_MASK;
+
+ if (wait_counts > 0)
+ ConditionVariableBroadcast(&(relextlock->cv));
+}
+
+/*
+ * Internal function that attempts to atomically acquire the relation
+ * extension lock.
+ *
+ * Returns true if the lock isn't free and we need to wait.
+ */
+static bool
+RelExtLockAttemptLock(RelExtLock *relextlock)
+{
+ uint32 oldstate;
+
+ oldstate = pg_atomic_read_u32(&relextlock->state);
+
+ while (true)
+ {
+ bool lock_free;
+
+ lock_free = (oldstate & RELEXT_LOCK_BIT) == 0;
+
+ if (!lock_free)
+ return true;
+
+ if (pg_atomic_compare_exchange_u32(&relextlock->state,
+ &oldstate, oldstate | RELEXT_LOCK_BIT))
+ return false;
+ }
+
+ pg_unreachable();
+}
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index da5679b..4fbc0c4 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -319,78 +319,6 @@ UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode)
}
/*
- * LockRelationForExtension
- *
- * This lock tag is used to interlock addition of pages to relations.
- * We need such locking because bufmgr/smgr definition of P_NEW is not
- * race-condition-proof.
- *
- * We assume the caller is already holding some type of regular lock on
- * the relation, so no AcceptInvalidationMessages call is needed here.
- */
-void
-LockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- (void) LockAcquire(&tag, lockmode, false, false);
-}
-
-/*
- * ConditionalLockRelationForExtension
- *
- * As above, but only lock if we can get the lock without blocking.
- * Returns true iff the lock was acquired.
- */
-bool
-ConditionalLockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- return (LockAcquire(&tag, lockmode, false, true) != LOCKACQUIRE_NOT_AVAIL);
-}
-
-/*
- * RelationExtensionLockWaiterCount
- *
- * Count the number of processes waiting for the given relation extension lock.
- */
-int
-RelationExtensionLockWaiterCount(Relation relation)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- return LockWaiterCount(&tag);
-}
-
-/*
- * UnlockRelationForExtension
- */
-void
-UnlockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- LockRelease(&tag, lockmode, false);
-}
-
-/*
* LockPage
*
* Obtain a page-level lock. This is currently used by some index access
@@ -961,12 +889,6 @@ DescribeLockTag(StringInfo buf, const LOCKTAG *tag)
tag->locktag_field2,
tag->locktag_field1);
break;
- case LOCKTAG_RELATION_EXTEND:
- appendStringInfo(buf,
- _("extension of relation %u of database %u"),
- tag->locktag_field2,
- tag->locktag_field1);
- break;
case LOCKTAG_PAGE:
appendStringInfo(buf,
_("page %u of relation %u of database %u"),
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 5833086..5a623fa 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -40,6 +40,7 @@
#include "miscadmin.h"
#include "pg_trace.h"
#include "pgstat.h"
+#include "storage/extension_lock.h"
#include "storage/proc.h"
#include "storage/procarray.h"
#include "storage/sinvaladt.h"
@@ -717,6 +718,13 @@ LockAcquireExtended(const LOCKTAG *locktag,
int status;
bool log_lock = false;
+ /*
+ * Relation extension locks don't participate in deadlock detection,
+ * so make sure we don't try to acquire a heavyweight lock while
+ * holding one.
+ */
+ Assert(!IsAnyRelationExtensionLockHeld());
+
if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
elog(ERROR, "unrecognized lock method: %d", lockmethodid);
lockMethodTable = LockMethods[lockmethodid];
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 5f6727d..efa811a 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -44,6 +44,7 @@
#include "replication/slot.h"
#include "replication/syncrep.h"
#include "storage/condition_variable.h"
+#include "storage/extension_lock.h"
#include "storage/standby.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
@@ -765,6 +766,8 @@ ProcReleaseLocks(bool isCommit)
return;
/* If waiting, get off wait queue (should only be needed after error) */
LockErrorCleanup();
+ /* If we hold a relation extension lock, release it */
+ RelExtLockCleanup();
/* Release standard locks, including session-level if aborting */
LockReleaseAll(DEFAULT_LOCKMETHOD, !isCommit);
/* Release transaction-level advisory locks */
diff --git a/src/backend/utils/adt/lockfuncs.c b/src/backend/utils/adt/lockfuncs.c
index 9e0a8ab..6d8916c 100644
--- a/src/backend/utils/adt/lockfuncs.c
+++ b/src/backend/utils/adt/lockfuncs.c
@@ -25,7 +25,6 @@
/* This must match enum LockTagType! */
const char *const LockTagTypeNames[] = {
"relation",
- "extend",
"page",
"tuple",
"transactionid",
@@ -234,7 +233,6 @@ pg_lock_status(PG_FUNCTION_ARGS)
switch ((LockTagType) instance->locktag.locktag_type)
{
case LOCKTAG_RELATION:
- case LOCKTAG_RELATION_EXTEND:
values[1] = ObjectIdGetDatum(instance->locktag.locktag_field1);
values[2] = ObjectIdGetDatum(instance->locktag.locktag_field2);
nulls[3] = true;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 089b7c3..b3611c3 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -813,6 +813,7 @@ typedef enum
WAIT_EVENT_PARALLEL_BITMAP_SCAN,
WAIT_EVENT_PROCARRAY_GROUP_UPDATE,
WAIT_EVENT_CLOG_GROUP_UPDATE,
+ WAIT_EVENT_RELATION_EXTENSION_LOCK,
WAIT_EVENT_REPLICATION_ORIGIN_DROP,
WAIT_EVENT_REPLICATION_SLOT_DROP,
WAIT_EVENT_SAFE_SNAPSHOT,
diff --git a/src/include/storage/extension_lock.h b/src/include/storage/extension_lock.h
new file mode 100644
index 0000000..850f19c
--- /dev/null
+++ b/src/include/storage/extension_lock.h
@@ -0,0 +1,38 @@
+/*-------------------------------------------------------------------------
+ *
+ * extension_lock.h
+ * Relation extension lock manager
+ *
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/extension_lock.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef EXTENSION_LOCK_H
+#define EXTENSION_LOCK_H
+
+#ifdef FRONTEND
+#error "extension_lock.h may not be included from frontend code"
+#endif
+
+#include "port/atomics.h"
+#include "storage/s_lock.h"
+#include "storage/condition_variable.h"
+#include "storage/proclist_types.h"
+
+/* Lock a relation for extension */
+extern Size RelExtLockShmemSize(void);
+extern void InitRelExtLocks(void);
+extern void LockRelationForExtension(Relation relation);
+extern void UnlockRelationForExtension(Relation relation);
+extern bool ConditionalLockRelationForExtension(Relation relation);
+extern int EstimateNumberOfExtensionLockWaiters(Relation relation);
+extern void WaitForRelationExtensionLockToBeFree(Relation relation);
+extern void RelExtLockCleanup(void);
+extern bool IsAnyRelationExtensionLockHeld(void);
+
+#endif /* EXTENSION_LOCK_H */
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 0b92322..7e6b80c 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -50,13 +50,6 @@ extern bool LockHasWaitersRelation(Relation relation, LOCKMODE lockmode);
extern void LockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
-/* Lock a relation for extension */
-extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
-extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
-extern bool ConditionalLockRelationForExtension(Relation relation,
- LOCKMODE lockmode);
-extern int RelationExtensionLockWaiterCount(Relation relation);
-
/* Lock a page (currently only used within indexes) */
extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
extern bool ConditionalLockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 765431e..3be18ea 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -138,8 +138,6 @@ typedef uint16 LOCKMETHODID;
typedef enum LockTagType
{
LOCKTAG_RELATION, /* whole relation */
- /* ID info for a relation is DB OID + REL OID; DB OID = 0 if shared */
- LOCKTAG_RELATION_EXTEND, /* the right to extend a relation */
/* same ID info as RELATION */
LOCKTAG_PAGE, /* one page of a relation */
/* ID info for a page is RELATION info + BlockNumber */
@@ -198,14 +196,6 @@ typedef struct LOCKTAG
(locktag).locktag_type = LOCKTAG_RELATION, \
(locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
-#define SET_LOCKTAG_RELATION_EXTEND(locktag,dboid,reloid) \
- ((locktag).locktag_field1 = (dboid), \
- (locktag).locktag_field2 = (reloid), \
- (locktag).locktag_field3 = 0, \
- (locktag).locktag_field4 = 0, \
- (locktag).locktag_type = LOCKTAG_RELATION_EXTEND, \
- (locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
-
#define SET_LOCKTAG_PAGE(locktag,dboid,reloid,blocknum) \
((locktag).locktag_field1 = (dboid), \
(locktag).locktag_field2 = (reloid), \
On Sun, Dec 10, 2017 at 11:51 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Attached updated version patch. Please review it.
I went over this today; please find attached an updated version which
I propose to commit.
Changes:
- Various formatting fixes, including running pgindent.
- Various comment updates.
- Make RELEXT_WAIT_COUNT_MASK equal RELEXT_LOCK_BIT - 1 rather than
some unnecessarily smaller number.
- In InitRelExtLocks, don't bother using mul_size; we already know it
won't overflow, because we did the same thing in RelExtLockShmemSize.
- When we run into an error trying to release a lock, log it as a
WARNING and don't mark it as translatable. Follows lock.c. An ERROR
here probably just recurses infinitely.
- Don't bother passing OID to RelExtLockRelease.
- Reorder functions a bit for (IMHO) better clarity.
- Make UnlockRelationForExtension just use a single message for both
failure modes. They are closely-enough related that I think that's
fine.
- Make WaitForRelationExtensionLockToBeFree complain if we already
hold an extension lock.
- In RelExtLockCleanup, clear held_relextlock.waiting. This would've
made for a nasty bug.
- Also in that function, assert that we don't hold both a lock and a wait count.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
extension-lock-v12.patch (application/octet-stream)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index b6f80d9708..07dd3f7082 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -669,8 +669,8 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
Heavyweight locks, also known as lock manager locks or simply locks,
primarily protect SQL-visible objects such as tables. However,
they are also used to ensure mutual exclusion for certain internal
- operations such as relation extension. <literal>wait_event</literal> will
- identify the type of lock awaited.
+ operations such as waiting for a transaction to finish.
+ <literal>wait_event</literal> will identify the type of lock awaited.
</para>
</listitem>
<listitem>
@@ -1122,15 +1122,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
execution.</entry>
</row>
<row>
- <entry morerows="9"><literal>Lock</literal></entry>
+ <entry morerows="8"><literal>Lock</literal></entry>
<entry><literal>relation</literal></entry>
<entry>Waiting to acquire a lock on a relation.</entry>
</row>
<row>
- <entry><literal>extend</literal></entry>
- <entry>Waiting to extend a relation.</entry>
- </row>
- <row>
<entry><literal>page</literal></entry>
<entry>Waiting to acquire a lock on page of a relation.</entry>
</row>
@@ -1263,7 +1259,7 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting in an extension.</entry>
</row>
<row>
- <entry morerows="17"><literal>IPC</literal></entry>
+ <entry morerows="18"><literal>IPC</literal></entry>
<entry><literal>BgWorkerShutdown</literal></entry>
<entry>Waiting for background worker to shut down.</entry>
</row>
@@ -1320,6 +1316,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting for group leader to update transaction status at transaction end.</entry>
</row>
<row>
+ <entry><literal>RelationExtensionLock</literal></entry>
+ <entry>Waiting to extend a relation.</entry>
+ </row>
+ <row>
<entry><literal>ReplicationOriginDrop</literal></entry>
<entry>Waiting for a replication origin to become inactive to be dropped.</entry>
</row>
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 09db5c6f8f..05cca9d293 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -17,6 +17,7 @@
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/freespace.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
@@ -623,8 +624,7 @@ brin_page_cleanup(Relation idxrel, Buffer buf)
*/
if (PageIsNew(page))
{
- LockRelationForExtension(idxrel, ShareLock);
- UnlockRelationForExtension(idxrel, ShareLock);
+ WaitForRelationExtensionLockToBeFree(idxrel);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
if (PageIsNew(page))
@@ -716,7 +716,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
*/
if (!RELATION_IS_LOCAL(irel))
{
- LockRelationForExtension(irel, ExclusiveLock);
+ LockRelationForExtension(irel);
extensionLockHeld = true;
}
buf = ReadBuffer(irel, P_NEW);
@@ -768,7 +768,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
}
if (extensionLockHeld)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel);
ReleaseBuffer(buf);
return InvalidBuffer;
@@ -778,7 +778,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
if (extensionLockHeld)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel);
page = BufferGetPage(buf);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index 03e53ce43e..af8f5ce21b 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -29,6 +29,7 @@
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/lmgr.h"
#include "utils/rel.h"
@@ -570,7 +571,7 @@ revmap_physical_extend(BrinRevmap *revmap)
else
{
if (needLock)
- LockRelationForExtension(irel, ExclusiveLock);
+ LockRelationForExtension(irel);
buf = ReadBuffer(irel, P_NEW);
if (BufferGetBlockNumber(buf) != mapBlk)
@@ -582,7 +583,7 @@ revmap_physical_extend(BrinRevmap *revmap)
* page from under whoever is using it.
*/
if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel);
LockBuffer(revmap->rm_metaBuf, BUFFER_LOCK_UNLOCK);
ReleaseBuffer(buf);
return;
@@ -591,7 +592,7 @@ revmap_physical_extend(BrinRevmap *revmap)
page = BufferGetPage(buf);
if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel);
}
/* Check that it's a regular block (or an empty page) */
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index d9c6483437..8d35918409 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -21,6 +21,7 @@
#include "catalog/pg_collation.h"
#include "catalog/pg_type.h"
#include "miscadmin.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/builtins.h"
@@ -325,13 +326,13 @@ GinNewBuffer(Relation index)
/* Must extend the file */
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
buffer = ReadBuffer(index, P_NEW);
LockBuffer(buffer, GIN_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
return buffer;
}
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 394bc832a4..d769a76bda 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -20,6 +20,7 @@
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "postmaster/autovacuum.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/memutils.h"
@@ -716,10 +717,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
npages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
totFreePages = 0;
@@ -766,10 +767,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
stats->pages_free = totFreePages;
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
stats->num_pages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
return stats;
}
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index d8d1c0acfc..76171a54b9 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -20,6 +20,7 @@
#include "access/htup_details.h"
#include "access/reloptions.h"
#include "catalog/pg_opclass.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/builtins.h"
@@ -821,13 +822,13 @@ gistNewBuffer(Relation r)
needLock = !RELATION_IS_LOCAL(r);
if (needLock)
- LockRelationForExtension(r, ExclusiveLock);
+ LockRelationForExtension(r);
buffer = ReadBuffer(r, P_NEW);
LockBuffer(buffer, GIST_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(r, ExclusiveLock);
+ UnlockRelationForExtension(r);
return buffer;
}
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 77d9d12f0b..42ef36a57e 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -18,6 +18,7 @@
#include "access/gist_private.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
@@ -59,10 +60,10 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
/* try to find deleted pages */
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
npages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
totFreePages = 0;
for (blkno = GIST_ROOT_BLKNO + 1; blkno < npages; blkno++)
@@ -91,10 +92,10 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
/* return statistics */
stats->pages_free = totFreePages;
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
stats->num_pages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
return stats;
}
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 13e3bdca50..fc2c9b4028 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -20,6 +20,7 @@
#include "access/htup_details.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/freespace.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
@@ -186,7 +187,7 @@ RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
Buffer buffer;
/* Use the length of the lock wait queue to judge how much to extend. */
- lockWaiters = RelationExtensionLockWaiterCount(relation);
+ lockWaiters = EstimateNumberOfExtensionLockWaiters(relation);
if (lockWaiters <= 0)
return;
@@ -519,11 +520,11 @@ loop:
if (needLock)
{
if (!use_fsm)
- LockRelationForExtension(relation, ExclusiveLock);
- else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock))
+ LockRelationForExtension(relation);
+ else if (!ConditionalLockRelationForExtension(relation))
{
/* Couldn't get the lock immediately; wait for it. */
- LockRelationForExtension(relation, ExclusiveLock);
+ LockRelationForExtension(relation);
/*
* Check if some other backend has extended a block for us while
@@ -537,7 +538,7 @@ loop:
*/
if (targetBlock != InvalidBlockNumber)
{
- UnlockRelationForExtension(relation, ExclusiveLock);
+ UnlockRelationForExtension(relation);
goto loop;
}
@@ -576,7 +577,7 @@ loop:
* against vacuumlazy.c --- see comments therein.
*/
if (needLock)
- UnlockRelationForExtension(relation, ExclusiveLock);
+ UnlockRelationForExtension(relation);
/*
* We need to initialize the empty new page. Double-check that it really
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 4c2a13aeba..2efee686e3 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -90,6 +90,7 @@
#include "access/xlog.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
#include "utils/inval.h"
@@ -641,7 +642,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
* Note that another backend might have extended or created the relation
* by the time we get the lock.
*/
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
@@ -679,7 +680,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
/* Update local cache with the up-to-date size */
rel->rd_smgr->smgr_vm_nblocks = vm_nblocks_now;
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
pfree(pg);
}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index c77434904e..7824c925ca 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -28,6 +28,7 @@
#include "access/xlog.h"
#include "access/xloginsert.h"
#include "miscadmin.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
@@ -659,7 +660,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
needLock = !RELATION_IS_LOCAL(rel);
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
buf = ReadBuffer(rel, P_NEW);
@@ -673,7 +674,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
* condition against btvacuumscan --- see comments therein.
*/
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
/* Initialize the new page before returning it */
page = BufferGetPage(buf);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 399e6a1ae5..5af1c21d19 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -25,6 +25,7 @@
#include "commands/vacuum.h"
#include "pgstat.h"
#include "storage/condition_variable.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
@@ -1058,10 +1059,10 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
{
/* Get the current relation length */
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
num_pages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
/* Quit if we've scanned the whole relation */
if (blkno >= num_pages)
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index bd5301f383..0ff53a3c7d 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -23,6 +23,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/builtins.h"
+#include "storage/extension_lock.h"
#include "utils/index_selfuncs.h"
#include "utils/lsyscache.h"
@@ -230,13 +231,13 @@ SpGistNewBuffer(Relation index)
/* Must extend the file */
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
buffer = ReadBuffer(index, P_NEW);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
return buffer;
}
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index d7d5e90ef3..385d1cb8a2 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -24,6 +24,7 @@
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/snapmgr.h"
@@ -824,10 +825,10 @@ spgvacuumscan(spgBulkDeleteState *bds)
{
/* Get the current relation length */
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
num_pages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
/* Quit if we've scanned the whole relation */
if (blkno >= num_pages)
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 20ce431e46..4a722230f1 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -54,6 +54,7 @@
#include "portability/instr_time.h"
#include "postmaster/autovacuum.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/freespace.h"
#include "storage/lmgr.h"
#include "utils/lsyscache.h"
@@ -860,8 +861,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* it's got exclusive lock on the whole relation.
*/
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
- LockRelationForExtension(onerel, ExclusiveLock);
- UnlockRelationForExtension(onerel, ExclusiveLock);
+ WaitForRelationExtensionLockToBeFree(onerel);
LockBufferForCleanup(buf);
if (PageIsNew(page))
{
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 5c256ff8ab..210552ff10 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3616,6 +3616,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_CLOG_GROUP_UPDATE:
event_name = "ClogGroupUpdate";
break;
+ case WAIT_EVENT_RELATION_EXTENSION_LOCK:
+ event_name = "RelationExtensionLock";
+ break;
case WAIT_EVENT_REPLICATION_ORIGIN_DROP:
event_name = "ReplicationOriginDrop";
break;
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 4648473523..172a48c788 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -26,6 +26,7 @@
#include "access/htup_details.h"
#include "access/xlogutils.h"
#include "miscadmin.h"
+#include "storage/extension_lock.h"
#include "storage/freespace.h"
#include "storage/fsm_internals.h"
#include "storage/lmgr.h"
@@ -624,7 +625,7 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
* Note that another backend might have extended or created the relation
* by the time we get the lock.
*/
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
@@ -652,7 +653,7 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
/* Update local cache with the up-to-date size */
rel->rd_smgr->smgr_fsm_nblocks = fsm_nblocks_now;
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
pfree(pg);
}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2d1ed143e0..3b6a6f7cc2 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -35,6 +35,7 @@
#include "replication/origin.h"
#include "storage/bufmgr.h"
#include "storage/dsm.h"
+#include "storage/extension_lock.h"
#include "storage/ipc.h"
#include "storage/pg_shmem.h"
#include "storage/pmsignal.h"
@@ -133,6 +134,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
size = add_size(size, BackgroundWorkerShmemSize());
size = add_size(size, MultiXactShmemSize());
size = add_size(size, LWLockShmemSize());
+ size = add_size(size, RelExtLockShmemSize());
size = add_size(size, ProcArrayShmemSize());
size = add_size(size, BackendStatusShmemSize());
size = add_size(size, SInvalShmemSize());
@@ -235,6 +237,11 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
InitPredicateLocks();
/*
+ * Set up relation extension lock manager
+ */
+ InitRelExtLocks();
+
+ /*
* Set up process table
*/
if (!IsUnderPostmaster)
diff --git a/src/backend/storage/lmgr/Makefile b/src/backend/storage/lmgr/Makefile
index e1b787e838..2334a40875 100644
--- a/src/backend/storage/lmgr/Makefile
+++ b/src/backend/storage/lmgr/Makefile
@@ -13,7 +13,7 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = lmgr.o lock.o proc.o deadlock.o lwlock.o lwlocknames.o spin.o \
- s_lock.o predicate.o condition_variable.o
+ s_lock.o predicate.o condition_variable.o extension_lock.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 56b0a12a3e..960d1f3f09 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -3,7 +3,7 @@ src/backend/storage/lmgr/README
Locking Overview
================
-Postgres uses four types of interprocess locks:
+Postgres uses five types of interprocess locks:
* Spinlocks. These are intended for *very* short-term locks. If a lock
is to be held more than a few dozen instructions, or across any sort of
@@ -36,13 +36,21 @@ Regular locks should be used for all user-driven lock requests.
* SIReadLock predicate locks. See separate README-SSI file for details.
+* Relation extension locks. Only one process can extend a relation at
+a time; we use a specialized lock manager for this purpose, which is
+much simpler than the regular lock manager. It is similar to the
+lightweight lock mechanism, but is even simpler because there is only
+one lock mode and only one lock can be taken at a time. A process holding
+a relation extension lock is interruptible, unlike a process holding an
+LWLock.
+
Acquisition of either a spinlock or a lightweight lock causes query
cancel and die() interrupts to be held off until all such locks are
released. No such restriction exists for regular locks, however. Also
note that we can accept query cancel and die() interrupts while waiting
-for a regular lock, but we will not accept them while waiting for
-spinlocks or LW locks. It is therefore not a good idea to use LW locks
-when the wait time might exceed a few seconds.
+for a relation extension lock or a regular lock, but we will not accept
+them while waiting for spinlocks or LW locks. It is therefore not a good
+idea to use LW locks when the wait time might exceed a few seconds.
The rest of this README file discusses the regular lock manager in detail.
diff --git a/src/backend/storage/lmgr/extension_lock.c b/src/backend/storage/lmgr/extension_lock.c
new file mode 100644
index 0000000000..67ca763ce4
--- /dev/null
+++ b/src/backend/storage/lmgr/extension_lock.c
@@ -0,0 +1,469 @@
+/*-------------------------------------------------------------------------
+ *
+ * extension_lock.c
+ * Relation extension lock manager
+ *
+ * This specialized lock manager is used only for relation extension
+ * locks. Unlike the heavyweight lock manager, it doesn't provide
+ * deadlock detection or group locking. Unlike lwlock.c, extension lock
+ * waits are interruptible. Unlike both systems, there is only one lock
+ * mode.
+ *
+ * False sharing is possible. We have a fixed-size array of locks, and
+ * every database OID/relation OID combination is mapped to a slot in
+ * the array. Therefore, if two processes try to extend relations that
+ * map to the same array slot, they will contend even though it would
+ * be OK to let both proceed at once. Since these locks are typically
+ * taken only for very short periods of time, this doesn't seem likely
+ * to be a big problem in practice. If it is, we could make the array
+ * bigger.
+ *
+ * The extension lock manager is much faster than the regular heavyweight
+ * lock manager. The lack of group locking is a feature, not a bug,
+ * because while cooperating backends can all (for example) access a
+ * relation on which they jointly hold AccessExclusiveLock at the same time,
+ * it's not safe for them to extend the relation at the same time.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/lmgr/extension_lock.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "pgstat.h"
+
+#include "catalog/catalog.h"
+#include "postmaster/postmaster.h"
+#include "storage/extension_lock.h"
+#include "utils/rel.h"
+
+#define N_RELEXTLOCK_ENTS 1024
+
+/*
+ * We can't use bit 31 as the lock bit because pg_atomic_sub_fetch_u32 can't
+ * handle an attempt to subtract INT_MIN.
+ */
+#define RELEXT_LOCK_BIT ((uint32) 1 << 30)
+#define RELEXT_WAIT_COUNT_MASK (RELEXT_LOCK_BIT - 1)
+
+typedef struct RelExtLockTag
+{
+ Oid dbid; /* InvalidOid for a shared relation */
+ Oid relid;
+} RelExtLockTag;
+
+typedef struct RelExtLock
+{
+ pg_atomic_uint32 state;
+ ConditionVariable cv;
+} RelExtLock;
+
+/*
+ * Backend-private state for relation extension locks. "relid" is the last
+ * relation whose RelExtLock we looked up, and "lock" is a pointer to the
+ * RelExtLock to which it mapped. This speeds up the fairly common case where
+ * we acquire the same relation extension lock repeatedly. nLocks is 0 is the
+ * number of times we've acquired that lock; 0 means we don't hold it, while
+ * any value >0 means we do.
+ *
+ * A backend can't hold more than one relation extension lock at the same
+ * time, although it can hold the same lock more than once. Sometimes we try
+ * to acquire a lock for additional forks while already holding the lock for
+ * the main fork; for example, this might happen when adding extra relation
+ * blocks for both relation and its free space map. But since this lock
+ * manager doesn't distinguish between the forks, we just increment nLocks in
+ * the case.
+ */
+typedef struct relextlock_handle
+{
+ Oid relid;
+ RelExtLock *lock;
+ int nLocks; /* > 0 means holding it */
+ bool waiting; /* true if we're waiting for it */
+} relextlock_handle;
+
+static relextlock_handle held_relextlock;
+static RelExtLock *RelExtLockArray;
+
+static bool RelExtLockAcquire(Oid relid, bool conditional);
+static bool RelExtLockAttemptLock(RelExtLock *relextlock);
+static void RelExtLockRelease(void);
+static inline uint32 RelExtLockTargetTagToIndex(RelExtLockTag *locktag);
+
+/*
+ * Estimate space required for a fixed-size array of RelExtLock structures.
+ */
+Size
+RelExtLockShmemSize(void)
+{
+ return mul_size(N_RELEXTLOCK_ENTS, sizeof(RelExtLock));
+}
+
+/*
+ * Initialize extension lock manager.
+ */
+void
+InitRelExtLocks(void)
+{
+ bool found;
+ int i;
+
+ /* Verify that we have enough bits for maximum possible waiter count. */
+ StaticAssertStmt(RELEXT_WAIT_COUNT_MASK >= MAX_BACKENDS,
+ "maximum waiter count of relation extension lock exceeds MAX_BACKENDS");
+
+ RelExtLockArray = (RelExtLock *)
+ ShmemInitStruct("Relation Extension Lock",
+ N_RELEXTLOCK_ENTS * sizeof(RelExtLock),
+ &found);
+
+ /* we're the first - initialize */
+ if (!found)
+ {
+ for (i = 0; i < N_RELEXTLOCK_ENTS; i++)
+ {
+ RelExtLock *relextlock = &RelExtLockArray[i];
+
+ pg_atomic_init_u32(&(relextlock->state), 0);
+ ConditionVariableInit(&(relextlock->cv));
+ }
+ }
+}
+
+/*
+ * This lock is used to interlock addition of pages to relations.
+ * We need such locking because bufmgr/smgr definition of P_NEW is not
+ * race-condition-proof.
+ *
+ * We assume the caller is already holding some type of regular lock on
+ * the relation, so no AcceptInvalidationMessages call is needed here.
+ */
+void
+LockRelationForExtension(Relation relation)
+{
+ RelExtLockAcquire(RelationGetRelid(relation), false);
+}
+
+/*
+ * As above, but only lock if we can get the lock without blocking.
+ * Returns TRUE iff the lock was acquired.
+ */
+bool
+ConditionalLockRelationForExtension(Relation relation)
+{
+ return RelExtLockAcquire(RelationGetRelid(relation), true);
+}
+
+/*
+ * Estimate the number of processes waiting for the given relation extension
+ * lock. Note that since multiple relations hash to the same RelExtLock entry,
+ * the return value might be inflated.
+ */
+int
+EstimateNumberOfExtensionLockWaiters(Relation relation)
+{
+ RelExtLockTag tag;
+ RelExtLock *relextlock;
+ uint32 state;
+ Oid relid = RelationGetRelid(relation);
+
+ /* Make a lock tag */
+ tag.dbid = IsSharedRelation(relid) ? InvalidOid : MyDatabaseId;
+ tag.relid = relid;
+
+ relextlock = &RelExtLockArray[RelExtLockTargetTagToIndex(&tag)];
+ state = pg_atomic_read_u32(&(relextlock->state));
+
+ return (state & RELEXT_WAIT_COUNT_MASK);
+}
+
+/*
+ * Release a previously-acquired extension lock.
+ */
+void
+UnlockRelationForExtension(Relation relation)
+{
+ Oid relid = RelationGetRelid(relation);
+
+ if (held_relextlock.nLocks <= 0 || relid != held_relextlock.relid)
+ {
+ elog(WARNING,
+ "relation extension lock for %u is not held",
+ relid);
+ return;
+ }
+
+ /*
+ * If we acquired it multiple times, only change shared state when we have
+ * released it as many times as we acquired it.
+ */
+ if (--held_relextlock.nLocks == 0)
+ RelExtLockRelease();
+}
+
+/*
+ * Release any extension lock held, and any wait count for an extension lock.
+ * This is intended to be invoked during error cleanup.
+ */
+void
+RelExtLockCleanup(void)
+{
+ if (held_relextlock.nLocks > 0)
+ {
+ /* Release the lock even if we acquired it multiple times. */
+ held_relextlock.nLocks = 0;
+ RelExtLockRelease();
+ Assert(!held_relextlock.waiting);
+ }
+ else if (held_relextlock.waiting)
+ {
+ /* We were waiting for the lock; release the wait count we held. */
+ held_relextlock.waiting = false;
+ pg_atomic_sub_fetch_u32(&(held_relextlock.lock->state), 1);
+ }
+}
+
+/*
+ * Are we holding any extension lock?
+ */
+bool
+IsAnyRelationExtensionLockHeld(void)
+{
+ return held_relextlock.nLocks > 0;
+}
+
+/*
+ * WaitForRelationExtensionLockToBeFree
+ *
+ * Wait for the relation extension lock on the given relation to
+ * be free without acquiring it.
+ */
+void
+WaitForRelationExtensionLockToBeFree(Relation relation)
+{
+ RelExtLock *relextlock;
+ Oid relid;
+
+ relid = RelationGetRelid(relation);
+
+ if (held_relextlock.nLocks > 0)
+ {
+ /*
+ * If we already hold the lock, nobody else does, so we can return
+ * immediately.
+ */
+ if (relid == held_relextlock.relid)
+ return;
+ elog(ERROR,
+ "can only manipulate one relation extension lock at a time");
+ }
+
+ /*
+ * If the last relation extension lock we touched is the same one for
+ * which we now need to wait, we can use our cached pointer to the lock
+ * instead of recomputing it.
+ */
+ if (relid == held_relextlock.relid)
+ relextlock = held_relextlock.lock;
+ else
+ {
+ RelExtLockTag tag;
+
+ tag.dbid = IsSharedRelation(relid) ? InvalidOid : MyDatabaseId;
+ tag.relid = relid;
+ relextlock = &RelExtLockArray[RelExtLockTargetTagToIndex(&tag)];
+ held_relextlock.relid = relid;
+ held_relextlock.lock = relextlock;
+ }
+
+ for (;;)
+ {
+ uint32 state;
+
+ state = pg_atomic_read_u32(&(relextlock)->state);
+
+ /* Break if nobody is holding the lock on this relation */
+ if ((state & RELEXT_LOCK_BIT) == 0)
+ break;
+
+ /* Could not get the lock, prepare to wait */
+ if (!held_relextlock.waiting)
+ {
+ pg_atomic_add_fetch_u32(&(relextlock->state), 1);
+ held_relextlock.waiting = true;
+ }
+
+ /* Sleep until something happens, then recheck */
+ ConditionVariableSleep(&(relextlock->cv),
+ WAIT_EVENT_RELATION_EXTENSION_LOCK);
+ }
+
+ ConditionVariableCancelSleep();
+
+ /* Release any wait count we hold */
+ if (held_relextlock.waiting)
+ {
+ pg_atomic_sub_fetch_u32(&(relextlock->state), 1);
+ held_relextlock.waiting = false;
+ }
+}
+
+/*
+ * Compute the hash code associated with a RelExtLock.
+ */
+static inline uint32
+RelExtLockTargetTagToIndex(RelExtLockTag *locktag)
+{
+ return tag_hash(locktag, sizeof(RelExtLockTag)) % N_RELEXTLOCK_ENTS;
+}
+
+/*
+ * Acquire a relation extension lock.
+ */
+static bool
+RelExtLockAcquire(Oid relid, bool conditional)
+{
+ RelExtLock *relextlock;
+ bool mustwait;
+
+ /*
+ * If we already hold the lock, we can just increase the count locally.
+ * Since we don't do deadlock detection, caller must not try to take a new
+ * relation extension lock while already holding one.
+ */
+ if (held_relextlock.nLocks > 0)
+ {
+ if (relid != held_relextlock.relid)
+ elog(ERROR,
+ "can only acquire one relation extension lock at a time");
+
+ held_relextlock.nLocks++;
+ return true;
+ }
+
+ /*
+ * If the last relation extension lock we touched is the same one for which we
+ * now need to acquire, we can use our cached pointer to the lock instead
+ * of recomputing it. This is likely to be a common case in practice.
+ */
+ if (relid == held_relextlock.relid)
+ relextlock = held_relextlock.lock;
+ else
+ {
+ RelExtLockTag tag;
+
+ /* Make a lock tag */
+ tag.dbid = IsSharedRelation(relid) ? InvalidOid : MyDatabaseId;
+ tag.relid = relid;
+
+ relextlock = &RelExtLockArray[RelExtLockTargetTagToIndex(&tag)];
+
+ /* Remember the lock we're interested in */
+ held_relextlock.relid = relid;
+ held_relextlock.lock = relextlock;
+ }
+
+ held_relextlock.waiting = false;
+ for (;;)
+ {
+ mustwait = RelExtLockAttemptLock(relextlock);
+
+ if (!mustwait)
+ break; /* got the lock */
+
+ /* Could not get the lock; if locking conditionally, give up */
+ if (conditional)
+ return false;
+
+ /* Could not get the lock, prepare to wait */
+ if (!held_relextlock.waiting)
+ {
+ pg_atomic_add_fetch_u32(&(relextlock->state), 1);
+ held_relextlock.waiting = true;
+ }
+
+ /* Sleep until something happens, then recheck */
+ ConditionVariableSleep(&(relextlock->cv),
+ WAIT_EVENT_RELATION_EXTENSION_LOCK);
+ }
+
+ ConditionVariableCancelSleep();
+
+ /* Release any wait count we hold */
+ if (held_relextlock.waiting)
+ {
+ pg_atomic_sub_fetch_u32(&(relextlock->state), 1);
+ held_relextlock.waiting = false;
+ }
+
+ Assert(!mustwait);
+
+ /* Remember lock held by this backend */
+ held_relextlock.relid = relid;
+ held_relextlock.lock = relextlock;
+ held_relextlock.nLocks = 1;
+
+ /* We got the lock! */
+ return true;
+}
+
+/*
+ * Attempt to atomically acquire the relation extension lock.
+ *
+ * Returns true if the lock isn't free and we need to wait.
+ */
+static bool
+RelExtLockAttemptLock(RelExtLock *relextlock)
+{
+ uint32 oldstate;
+
+ oldstate = pg_atomic_read_u32(&relextlock->state);
+
+ while (true)
+ {
+ bool lock_free;
+
+ lock_free = (oldstate & RELEXT_LOCK_BIT) == 0;
+
+ if (!lock_free)
+ return true;
+
+ if (pg_atomic_compare_exchange_u32(&relextlock->state,
+ &oldstate,
+ oldstate | RELEXT_LOCK_BIT))
+ return false;
+ }
+
+ pg_unreachable();
+}
+
+/*
+ * Release extension lock in shared memory. Should be called when our local
+ * lock count drops to 0.
+ */
+static void
+RelExtLockRelease(void)
+{
+ RelExtLock *relextlock;
+ uint32 state;
+ uint32 wait_counts;
+
+ Assert(held_relextlock.nLocks == 0);
+
+ relextlock = held_relextlock.lock;
+
+ /* Release the lock */
+ state = pg_atomic_sub_fetch_u32(&(relextlock->state), RELEXT_LOCK_BIT);
+
+ /* If there may be waiters, wake them up */
+ wait_counts = state & RELEXT_WAIT_COUNT_MASK;
+
+ if (wait_counts > 0)
+ ConditionVariableBroadcast(&(relextlock->cv));
+}
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index da5679b7a3..4fbc0c4a67 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -319,78 +319,6 @@ UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode)
}
/*
- * LockRelationForExtension
- *
- * This lock tag is used to interlock addition of pages to relations.
- * We need such locking because bufmgr/smgr definition of P_NEW is not
- * race-condition-proof.
- *
- * We assume the caller is already holding some type of regular lock on
- * the relation, so no AcceptInvalidationMessages call is needed here.
- */
-void
-LockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- (void) LockAcquire(&tag, lockmode, false, false);
-}
-
-/*
- * ConditionalLockRelationForExtension
- *
- * As above, but only lock if we can get the lock without blocking.
- * Returns true iff the lock was acquired.
- */
-bool
-ConditionalLockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- return (LockAcquire(&tag, lockmode, false, true) != LOCKACQUIRE_NOT_AVAIL);
-}
-
-/*
- * RelationExtensionLockWaiterCount
- *
- * Count the number of processes waiting for the given relation extension lock.
- */
-int
-RelationExtensionLockWaiterCount(Relation relation)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- return LockWaiterCount(&tag);
-}
-
-/*
- * UnlockRelationForExtension
- */
-void
-UnlockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- LockRelease(&tag, lockmode, false);
-}
-
-/*
* LockPage
*
* Obtain a page-level lock. This is currently used by some index access
@@ -961,12 +889,6 @@ DescribeLockTag(StringInfo buf, const LOCKTAG *tag)
tag->locktag_field2,
tag->locktag_field1);
break;
- case LOCKTAG_RELATION_EXTEND:
- appendStringInfo(buf,
- _("extension of relation %u of database %u"),
- tag->locktag_field2,
- tag->locktag_field1);
- break;
case LOCKTAG_PAGE:
appendStringInfo(buf,
_("page %u of relation %u of database %u"),
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 5833086c62..5a623fa1ed 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -40,6 +40,7 @@
#include "miscadmin.h"
#include "pg_trace.h"
#include "pgstat.h"
+#include "storage/extension_lock.h"
#include "storage/proc.h"
#include "storage/procarray.h"
#include "storage/sinvaladt.h"
@@ -717,6 +718,13 @@ LockAcquireExtended(const LOCKTAG *locktag,
int status;
bool log_lock = false;
+ /*
+ * Relation extension locks don't participate in deadlock detection,
+ * so make sure we don't try to acquire a heavyweight lock while
+ * holding one.
+ */
+ Assert(!IsAnyRelationExtensionLockHeld());
+
if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
elog(ERROR, "unrecognized lock method: %d", lockmethodid);
lockMethodTable = LockMethods[lockmethodid];
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 5f6727d501..9167767a24 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -44,6 +44,7 @@
#include "replication/slot.h"
#include "replication/syncrep.h"
#include "storage/condition_variable.h"
+#include "storage/extension_lock.h"
#include "storage/standby.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
@@ -765,6 +766,8 @@ ProcReleaseLocks(bool isCommit)
return;
/* If waiting, get off wait queue (should only be needed after error) */
LockErrorCleanup();
+ /* Release any relation extension lock or wait counts */
+ RelExtLockCleanup();
/* Release standard locks, including session-level if aborting */
LockReleaseAll(DEFAULT_LOCKMETHOD, !isCommit);
/* Release transaction-level advisory locks */
diff --git a/src/backend/utils/adt/lockfuncs.c b/src/backend/utils/adt/lockfuncs.c
index 9e0a8ab79d..6d8916c200 100644
--- a/src/backend/utils/adt/lockfuncs.c
+++ b/src/backend/utils/adt/lockfuncs.c
@@ -25,7 +25,6 @@
/* This must match enum LockTagType! */
const char *const LockTagTypeNames[] = {
"relation",
- "extend",
"page",
"tuple",
"transactionid",
@@ -234,7 +233,6 @@ pg_lock_status(PG_FUNCTION_ARGS)
switch ((LockTagType) instance->locktag.locktag_type)
{
case LOCKTAG_RELATION:
- case LOCKTAG_RELATION_EXTEND:
values[1] = ObjectIdGetDatum(instance->locktag.locktag_field1);
values[2] = ObjectIdGetDatum(instance->locktag.locktag_field2);
nulls[3] = true;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 089b7c3a10..b3611c3e22 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -813,6 +813,7 @@ typedef enum
WAIT_EVENT_PARALLEL_BITMAP_SCAN,
WAIT_EVENT_PROCARRAY_GROUP_UPDATE,
WAIT_EVENT_CLOG_GROUP_UPDATE,
+ WAIT_EVENT_RELATION_EXTENSION_LOCK,
WAIT_EVENT_REPLICATION_ORIGIN_DROP,
WAIT_EVENT_REPLICATION_SLOT_DROP,
WAIT_EVENT_SAFE_SNAPSHOT,
diff --git a/src/include/storage/extension_lock.h b/src/include/storage/extension_lock.h
new file mode 100644
index 0000000000..850f19c316
--- /dev/null
+++ b/src/include/storage/extension_lock.h
@@ -0,0 +1,38 @@
+/*-------------------------------------------------------------------------
+ *
+ * extension_lock.h
+ * Relation extension lock manager
+ *
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/extension_lock.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef EXTENSION_LOCK_H
+#define EXTENSION_LOCK_H
+
+#ifdef FRONTEND
+#error "extension_lock.h may not be included from frontend code"
+#endif
+
+#include "port/atomics.h"
+#include "storage/s_lock.h"
+#include "storage/condition_variable.h"
+#include "storage/proclist_types.h"
+
+/* Lock a relation for extension */
+extern Size RelExtLockShmemSize(void);
+extern void InitRelExtLocks(void);
+extern void LockRelationForExtension(Relation relation);
+extern void UnlockRelationForExtension(Relation relation);
+extern bool ConditionalLockRelationForExtension(Relation relation);
+extern int EstimateNumberOfExtensionLockWaiters(Relation relation);
+extern void WaitForRelationExtensionLockToBeFree(Relation relation);
+extern void RelExtLockCleanup(void);
+extern bool IsAnyRelationExtensionLockHeld(void);
+
+#endif /* EXTENSION_LOCK_H */
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 0b923227a2..7e6b80c78a 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -50,13 +50,6 @@ extern bool LockHasWaitersRelation(Relation relation, LOCKMODE lockmode);
extern void LockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
-/* Lock a relation for extension */
-extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
-extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
-extern bool ConditionalLockRelationForExtension(Relation relation,
- LOCKMODE lockmode);
-extern int RelationExtensionLockWaiterCount(Relation relation);
-
/* Lock a page (currently only used within indexes) */
extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
extern bool ConditionalLockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 765431e299..3be18ea1c5 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -138,8 +138,6 @@ typedef uint16 LOCKMETHODID;
typedef enum LockTagType
{
LOCKTAG_RELATION, /* whole relation */
- /* ID info for a relation is DB OID + REL OID; DB OID = 0 if shared */
- LOCKTAG_RELATION_EXTEND, /* the right to extend a relation */
/* same ID info as RELATION */
LOCKTAG_PAGE, /* one page of a relation */
/* ID info for a page is RELATION info + BlockNumber */
@@ -198,14 +196,6 @@ typedef struct LOCKTAG
(locktag).locktag_type = LOCKTAG_RELATION, \
(locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
-#define SET_LOCKTAG_RELATION_EXTEND(locktag,dboid,reloid) \
- ((locktag).locktag_field1 = (dboid), \
- (locktag).locktag_field2 = (reloid), \
- (locktag).locktag_field3 = 0, \
- (locktag).locktag_field4 = 0, \
- (locktag).locktag_type = LOCKTAG_RELATION_EXTEND, \
- (locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
-
#define SET_LOCKTAG_PAGE(locktag,dboid,reloid,blocknum) \
((locktag).locktag_field1 = (dboid), \
(locktag).locktag_field2 = (reloid), \
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 72eb9fd390..4fb5b91441 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1834,6 +1834,8 @@ RegisNode
RegisteredBgWorker
ReindexObjectType
ReindexStmt
+RelExtLock
+RelExtLockTag
RelFileNode
RelFileNodeBackend
RelIdCacheEnt
@@ -2956,6 +2958,7 @@ registered_buffer
regmatch_t
regoff_t
regproc
+relextlock_handle
relopt_bool
relopt_gen
relopt_int
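The heart of the patch above is the per-slot atomic state word: bit 30 is the lock bit and the low 30 bits count waiters, so acquisition, waiter registration, and release are each a single atomic operation. Here is a minimal standalone sketch of that protocol, rewritten with C11 atomics rather than PostgreSQL's pg_atomic_* API; the helper names are illustrative, not taken from the patch:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Mirrors the patch's encoding: bit 30 is the lock bit, bits 0-29 count
 * waiters.  Bit 31 is left unused so that subtracting the lock bit can
 * never look like subtracting INT_MIN, per the comment in extension_lock.c.
 */
#define RELEXT_LOCK_BIT        ((uint32_t) 1 << 30)
#define RELEXT_WAIT_COUNT_MASK (RELEXT_LOCK_BIT - 1)

/* Try to set the lock bit with a CAS loop; returns true if we must wait. */
static bool
relext_attempt_lock(_Atomic uint32_t *state)
{
    uint32_t oldstate = atomic_load(state);

    for (;;)
    {
        if (oldstate & RELEXT_LOCK_BIT)
            return true;        /* someone else holds the lock */
        if (atomic_compare_exchange_weak(state, &oldstate,
                                         oldstate | RELEXT_LOCK_BIT))
            return false;       /* got it */
        /* CAS failed; oldstate was refreshed, loop and retry */
    }
}

/* Register as a waiter, done once before sleeping on the CV. */
static uint32_t
relext_add_waiter(_Atomic uint32_t *state)
{
    return atomic_fetch_add(state, 1) + 1;
}

/*
 * Release: clear the lock bit and report how many waiters remain, so the
 * caller knows whether a condition-variable broadcast is needed.
 */
static uint32_t
relext_release(_Atomic uint32_t *state)
{
    uint32_t newstate =
        atomic_fetch_sub(state, RELEXT_LOCK_BIT) - RELEXT_LOCK_BIT;

    return newstate & RELEXT_WAIT_COUNT_MASK;
}
```

The release path returning the remaining waiter count mirrors RelExtLockRelease, which broadcasts the condition variable only when that count is nonzero.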
Hi,
On 2017-12-11 15:15:50 -0500, Robert Haas wrote:
+* Relation extension locks. Only one process can extend a relation at
+a time; we use a specialized lock manager for this purpose, which is
+much simpler than the regular lock manager. It is similar to the
+lightweight lock mechanism, but is even simpler because there is only
+one lock mode and only one lock can be taken at a time. A process holding
+a relation extension lock is interruptible, unlike a process holding an
+LWLock.
+/*-------------------------------------------------------------------------
+ *
+ * extension_lock.c
+ * Relation extension lock manager
+ *
+ * This specialized lock manager is used only for relation extension
+ * locks. Unlike the heavyweight lock manager, it doesn't provide
+ * deadlock detection or group locking. Unlike lwlock.c, extension lock
+ * waits are interruptible. Unlike both systems, there is only one lock
+ * mode.
+ *
+ * False sharing is possible. We have a fixed-size array of locks, and
+ * every database OID/relation OID combination is mapped to a slot in
+ * the array. Therefore, if two processes try to extend relations that
+ * map to the same array slot, they will contend even though it would
+ * be OK to let both proceed at once. Since these locks are typically
+ * taken only for very short periods of time, this doesn't seem likely
+ * to be a big problem in practice. If it is, we could make the array
+ * bigger.
For me "very short periods of time" and journaled metadata-changing
filesystem operations don't quite mesh. Language lawyering aside, this
seems quite likely to bite us down the road.
It's imo perfectly fine to say that there's only a limited number of
file extension locks, but that there's a far from negligible chance of
conflict even without the array being full doesn't seem nice. Think this
needs to use some open addressing like conflict handling or something
alike.
Greetings,
Andres Freund
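The collision risk being debated here can be put into numbers with a birthday-style estimate. The sketch below assumes tag_hash spreads (dbid, relid) tags uniformly over the N_RELEXTLOCK_ENTS = 1024 slots; it is an approximation of the patch's mapping, not code from it:

```c
/*
 * Probability that at least two of k concurrent relation extenders land
 * in the same slot of an n-entry array, assuming a uniform hash:
 * 1 - (n-1)/n * (n-2)/n * ... * (n-k+1)/n.
 */
static double
collision_probability(int k, int n)
{
    double p_all_distinct = 1.0;
    int i;

    for (i = 1; i < k; i++)
        p_all_distinct *= (double) (n - i) / n;
    return 1.0 - p_all_distinct;
}
```

With n = 1024, two specific relations collide with probability about 0.1%, 16 concurrent extenders see some collision roughly 11% of the time, and at 64 the chance of at least one shared slot exceeds 85% — though any given pair still collides rarely, which is the crux of the disagreement.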
On Mon, Dec 11, 2017 at 3:25 PM, Andres Freund <andres@anarazel.de> wrote:
For me "very short periods of time" and journaled metadata-changing
filesystem operations don't quite mesh. Language lawyering aside, this
seems quite likely to bite us down the road.

It's imo perfectly fine to say that there's only a limited number of
file extension locks, but that there's a far from negligible chance of
conflict even without the array being full doesn't seem nice. Think this
needs to use some open addressing like conflict handling or something
alike.
I guess we could consider that, but I'm not really convinced that it's
solving a real problem. Right now, you start having meaningful chance
of lock-manager lock contention when the number of concurrent
processes in the system requesting heavyweight locks is still in the
single digits, because there are only 16 lock-manager locks. With
this, there are effectively 1024 partitions.
Now I realize you're going to point out, not wrongly, that we're
contending on the locks themselves rather than the locks protecting
the locks, and that this makes everything worse because the hold time
is much longer. Fair enough. On the other hand, what workload would
actually be harmed? I think you basically have to imagine a lot of
relations being extended simultaneously, like a parallel bulk load,
and an underlying filesystem which performs individual operations
slowly but scales really well. I'm slightly skeptical that's how
real-world filesystems behave.
It might be a good idea, though, to test how parallel bulk loading
behaves with this patch applied, maybe even after reducing
N_RELEXTLOCK_ENTS to simulate an unfortunate number of collisions.
This isn't a zero-sum game. If we add collision resolution, we're
going to slow down the ordinary uncontended case; the bookkeeping will
get significantly more complicated. That is only worth doing if the
current behavior produces pathological cases on workloads that are
actually somewhat realistic.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Dec 12, 2017 at 5:15 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Dec 10, 2017 at 11:51 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Attached updated version patch. Please review it.
I went over this today; please find attached an updated version which
I propose to commit. Changes:
- Various formatting fixes, including running pgindent.
- Various comment updates.
- Make RELEXT_WAIT_COUNT_MASK equal RELEXT_LOCK_BIT - 1 rather than
some unnecessarily smaller number.
- In InitRelExtLocks, don't bother using mul_size; we already know it
won't overflow, because we did the same thing in RelExtLockShmemSize.
- When we run into an error trying to release a lock, log it as a
WARNING and don't mark it as translatable. Follows lock.c. An ERROR
here probably just recurses infinitely.
- Don't bother passing OID to RelExtLockRelease.
- Reorder functions a bit for (IMHO) better clarity.
- Make UnlockRelationForExtension just use a single message for both
failure modes. They are closely-enough related that I think that's fine.
- Make WaitForRelationExtensionLockToBeFree complain if we already
hold an extension lock.
- In RelExtLockCleanup, clear held_relextlock.waiting. This would've
made for a nasty bug.
- Also in that function, assert that we don't hold both a lock and a wait count.
Thank you for updating the patch. Here are two minor comments.
+ * we acquire the same relation extension lock repeatedly. nLocks is 0 is the
+ * number of times we've acquired that lock;
Should it be "nLocks is the number of times we've acquired that lock:"?
+ /* Remember lock held by this backend */
+ held_relextlock.relid = relid;
+ held_relextlock.lock = relextlock;
+ held_relextlock.nLocks = 1;
We set held_relextlock.relid and held_relextlock.lock again. Can we remove them?
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On 2017-12-11 15:55:42 -0500, Robert Haas wrote:
On Mon, Dec 11, 2017 at 3:25 PM, Andres Freund <andres@anarazel.de> wrote:
For me "very short periods of time" and journaled metadata-changing
filesystem operations don't quite mesh. Language lawyering aside, this
seems quite likely to bite us down the road.

It's imo perfectly fine to say that there's only a limited number of
file extension locks, but that there's a far from negligible chance of
conflict even without the array being full doesn't seem nice. Think this
needs to use some open addressing like conflict handling or something
alike.

I guess we could consider that, but I'm not really convinced that it's
solving a real problem. Right now, you start having meaningful chance
of lock-manager lock contention when the number of concurrent
processes in the system requesting heavyweight locks is still in the
single digits, because there are only 16 lock-manager locks. With
this, there are effectively 1024 partitions.

Now I realize you're going to point out, not wrongly, that we're
contending on the locks themselves rather than the locks protecting
the locks, and that this makes everything worse because the hold time
is much longer.
Indeed.
Fair enough. On the other hand, what workload would actually be
harmed? I think you basically have to imagine a lot of relations
being extended simultaneously, like a parallel bulk load, and an
underlying filesystem which performs individual operations slowly but
scales really well. I'm slightly skeptical that's how real-world
filesystems behave.
Or just two independent relations on two different filesystems.
It might be a good idea, though, to test how parallel bulk loading
behaves with this patch applied, maybe even after reducing
N_RELEXTLOCK_ENTS to simulate an unfortunate number of collisions.
Yea, that sounds like a good plan. Measure two COPYs to relations on
different filesystems, reduce N_RELEXTLOCK_ENTS to 1, and measure
performance. Then increase the concurrency of the copies to each
relation.
This isn't a zero-sum game. If we add collision resolution, we're
going to slow down the ordinary uncontended case; the bookkeeping will
get significantly more complicated. That is only worth doing if the
current behavior produces pathological cases on workloads that are
actually somewhat realistic.
Yea, measuring sounds like a good plan.
Greetings,
Andres Freund
On Mon, Dec 11, 2017 at 4:10 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Thank you for updating the patch. Here are two minor comments.

+ * we acquire the same relation extension lock repeatedly. nLocks is 0 is the
+ * number of times we've acquired that lock;

Should it be "nLocks is the number of times we've acquired that lock"?

Yes.

+ /* Remember lock held by this backend */
+ held_relextlock.relid = relid;
+ held_relextlock.lock = relextlock;
+ held_relextlock.nLocks = 1;

We set held_relextlock.relid and held_relextlock.lock again. Can we remove them?
Yes.
Can you also try the experiment Andres mentions: "Measure two COPYs to
relations on different filesystems, reduce N_RELEXTLOCK_ENTS to 1, and
measure performance. Then increase the concurrency of the copies to
each relation." We want to see whether and how much this regresses
performance in that case. It simulates the case of a hash collision.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Dec 13, 2017 at 12:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Dec 11, 2017 at 4:10 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Thank you for updating the patch. Here are two minor comments.

+ * we acquire the same relation extension lock repeatedly. nLocks is 0 is the
+ * number of times we've acquired that lock;

Should it be "nLocks is the number of times we've acquired that lock"?

Yes.

+ /* Remember lock held by this backend */
+ held_relextlock.relid = relid;
+ held_relextlock.lock = relextlock;
+ held_relextlock.nLocks = 1;

We set held_relextlock.relid and held_relextlock.lock again. Can we remove them?
Yes.
Can you also try the experiment Andres mentions: "Measure two COPYs to
relations on different filesystems, reduce N_RELEXTLOCK_ENTS to 1, and
measure performance.
Yes. I'll measure the performance in such an environment.
Then increase the concurrency of the copies to
each relation." We want to see whether and how much this regresses
performance in that case. It simulates the case of a hash collision.
When we add extra blocks to a relation, do we access the disk? I
guess we just call lseek and write and don't access the disk. If so,
the performance degradation might not be much.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On 2017-12-13 16:02:45 +0900, Masahiko Sawada wrote:
When we add extra blocks to a relation, do we access the disk? I
guess we just call lseek and write and don't access the disk. If so,
the performance degradation might not be much.
Usually changes in the file size require the filesystem to perform
metadata operations, which in turn requires journaling on most
FSs. Which'll often result in synchronous disk writes.
Greetings,
Andres Freund
On Wed, Dec 13, 2017 at 4:30 PM, Andres Freund <andres@anarazel.de> wrote:
On 2017-12-13 16:02:45 +0900, Masahiko Sawada wrote:
When we add extra blocks to a relation, do we access the disk? I
guess we just call lseek and write and don't access the disk. If so,
the performance degradation might not be much.

Usually changes in the file size require the filesystem to perform
metadata operations, which in turn requires journaling on most
FSs. Which'll often result in synchronous disk writes.
Thank you. I understood the reason why this measurement should use two
different filesystems.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Wed, Dec 13, 2017 at 5:57 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Dec 13, 2017 at 4:30 PM, Andres Freund <andres@anarazel.de> wrote:
On 2017-12-13 16:02:45 +0900, Masahiko Sawada wrote:
When we add extra blocks to a relation, do we access the disk? I
guess we just call lseek and write and don't access the disk. If so,
the performance degradation might not be much.

Usually changes in the file size require the filesystem to perform
metadata operations, which in turn requires journaling on most
FSs. Which'll often result in synchronous disk writes.

Thank you. I understood the reason why this measurement should use two
different filesystems.
Here is the result.
I've measured the throughput for some cases on my virtual machine.
Each client loads a 48k file into its own relation, located on
either an xfs or an ext4 filesystem, for 30 sec.
Case 1: COPYs to relations on different filesystems (xfs and ext4) and
N_RELEXTLOCK_ENTS is 1024
clients = 2, avg = 296.2068
clients = 5, avg = 372.0707
clients = 10, avg = 389.8850
clients = 50, avg = 428.8050
Case 2: COPYs to relations on different filesystems (xfs and ext4) and
N_RELEXTLOCK_ENTS is 1
clients = 2, avg = 294.3633
clients = 5, avg = 358.9364
clients = 10, avg = 383.6945
clients = 50, avg = 424.3687
And the result of current HEAD is following.
clients = 2, avg = 284.9976
clients = 5, avg = 356.1726
clients = 10, avg = 375.9856
clients = 50, avg = 429.5745
In case 2, the throughput decreased compared to case 1, but it seems
to be almost the same as current HEAD. Because acquiring and
releasing the extension lock is about 10x faster than on current HEAD,
as I mentioned before, the performance degradation may be smaller
than I expected even in case 2.
Since my machine doesn't have enough resources, the result for clients =
50 might not be valid.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Thu, Dec 14, 2017 at 5:45 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Here is the result.
I've measured the throughput for some cases on my virtual machine.
Each client loads a 48k file into its own relation, located on
either an xfs or an ext4 filesystem, for 30 sec.

Case 1: COPYs to relations on different filesystems (xfs and ext4) and
N_RELEXTLOCK_ENTS is 1024

clients = 2, avg = 296.2068
clients = 5, avg = 372.0707
clients = 10, avg = 389.8850
clients = 50, avg = 428.8050

Case 2: COPYs to relations on different filesystems (xfs and ext4) and
N_RELEXTLOCK_ENTS is 1

clients = 2, avg = 294.3633
clients = 5, avg = 358.9364
clients = 10, avg = 383.6945
clients = 50, avg = 424.3687

And the result of current HEAD is following.

clients = 2, avg = 284.9976
clients = 5, avg = 356.1726
clients = 10, avg = 375.9856
clients = 50, avg = 429.5745

In case 2, the throughput decreased compared to case 1, but it seems
to be almost the same as current HEAD. Because acquiring and
releasing the extension lock is about 10x faster than on current HEAD,
as I mentioned before, the performance degradation may be smaller
than I expected even in case 2.
Since my machine doesn't have enough resources, the result for clients =
50 might not be valid.
I have to admit that result is surprising to me.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Dec 17, 2017 at 12:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Dec 14, 2017 at 5:45 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Here is the result.
I've measured the throughput for some cases on my virtual machine.
Each client loads a 48k file into its own relation, located on
either an xfs or an ext4 filesystem, for 30 sec.

Case 1: COPYs to relations on different filesystems (xfs and ext4) and
N_RELEXTLOCK_ENTS is 1024

clients = 2, avg = 296.2068
clients = 5, avg = 372.0707
clients = 10, avg = 389.8850
clients = 50, avg = 428.8050

Case 2: COPYs to relations on different filesystems (xfs and ext4) and
N_RELEXTLOCK_ENTS is 1

clients = 2, avg = 294.3633
clients = 5, avg = 358.9364
clients = 10, avg = 383.6945
clients = 50, avg = 424.3687

And the result of current HEAD is following.

clients = 2, avg = 284.9976
clients = 5, avg = 356.1726
clients = 10, avg = 375.9856
clients = 50, avg = 429.5745

In case 2, the throughput decreased compared to case 1, but it seems
to be almost the same as current HEAD. Because acquiring and
releasing the extension lock is about 10x faster than on current HEAD,
as I mentioned before, the performance degradation may be smaller
than I expected even in case 2.
Since my machine doesn't have enough resources, the result for clients =
50 might not be valid.

I have to admit that result is surprising to me.
I think the environment I used for performance measurement did not
have enough resources. I will do the same benchmark on another
environment to see if it was a valid result, and will share it.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Mon, Dec 18, 2017 at 2:04 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sun, Dec 17, 2017 at 12:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Dec 14, 2017 at 5:45 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Here is the result.
I've measured the throughput for some cases on my virtual machine.
Each client loads a 48k file into its own relation, located on
either an xfs or an ext4 filesystem, for 30 sec.

Case 1: COPYs to relations on different filesystems (xfs and ext4) and
N_RELEXTLOCK_ENTS is 1024

clients = 2, avg = 296.2068
clients = 5, avg = 372.0707
clients = 10, avg = 389.8850
clients = 50, avg = 428.8050

Case 2: COPYs to relations on different filesystems (xfs and ext4) and
N_RELEXTLOCK_ENTS is 1

clients = 2, avg = 294.3633
clients = 5, avg = 358.9364
clients = 10, avg = 383.6945
clients = 50, avg = 424.3687

And the result of current HEAD is following.

clients = 2, avg = 284.9976
clients = 5, avg = 356.1726
clients = 10, avg = 375.9856
clients = 50, avg = 429.5745

In case 2, the throughput decreased compared to case 1, but it seems
to be almost the same as current HEAD. Because acquiring and
releasing the extension lock is about 10x faster than on current HEAD,
as I mentioned before, the performance degradation may be smaller
than I expected even in case 2.
Since my machine doesn't have enough resources, the result for clients =
50 might not be valid.

I have to admit that result is surprising to me.
I think the environment I used for performance measurement did not
have enough resources. I will do the same benchmark on another
environment to see if it was a valid result, and will share it.
I did the performance measurement on a different environment which has 4
cores and two physically separated disk volumes. Also I've changed the
benchmark so that each COPY loads only 300 integer tuples, which do not
fit within a single page, and changed the tables to unlogged tables to
observe the overhead of locking/unlocking relext locks.
Case 1: COPYs to relations on different filesystems (xfs and ext4) and
N_RELEXTLOCK_ENTS is 1024
clients = 1, avg = 3033.8933
clients = 2, avg = 5992.9077
clients = 4, avg = 8055.9515
clients = 8, avg = 8468.9306
clients = 16, avg = 7718.6879
Case 2: COPYs to relations on different filesystems (xfs and ext4) and
N_RELEXTLOCK_ENTS is 1
clients = 1, avg = 3012.4993
clients = 2, avg = 5854.9966
clients = 4, avg = 7380.6082
clients = 8, avg = 7091.8367
clients = 16, avg = 7573.2904
And the result of current HEAD is following.
clients = 1, avg = 2962.2416
clients = 2, avg = 5856.9774
clients = 4, avg = 7561.1376
clients = 8, avg = 7252.0192
clients = 16, avg = 7916.7651
As per the above results, compared with current HEAD the throughput
of case 1 increased by up to 17%. On the other hand, the throughput
of case 2 decreased by 2%~5%.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Tue, Dec 19, 2017 at 5:52 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Dec 18, 2017 at 2:04 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sun, Dec 17, 2017 at 12:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I have to admit that result is surprising to me.
I think the environment I used for performance measurement did not
have enough resources. I will do the same benchmark on another
environment to see if it was a valid result, and will share it.

I did the performance measurement on a different environment which has 4
cores and two physically separated disk volumes. Also I've changed the
benchmark so that each COPY loads only 300 integer tuples, which do not
fit within a single page, and changed the tables to unlogged tables to
observe the overhead of locking/unlocking relext locks.
I ran the same test as asked by Robert; it was just an extension of the
tests in [1].
Machine : cthulhu
------------------------
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 8
NUMA node(s): 8
Vendor ID: GenuineIntel
CPU family: 6
Model: 47
Model name: Intel(R) Xeon(R) CPU E7- 8830 @ 2.13GHz
Stepping: 2
CPU MHz: 1064.000
CPU max MHz: 2129.0000
CPU min MHz: 1064.0000
BogoMIPS: 4266.59
Virtualization: VT-x
Hypervisor vendor: vertical
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 24576K
NUMA node0 CPU(s): 0-7,64-71
NUMA node1 CPU(s): 8-15,72-79
NUMA node2 CPU(s): 16-23,80-87
NUMA node3 CPU(s): 24-31,88-95
NUMA node4 CPU(s): 32-39,96-103
NUMA node5 CPU(s): 40-47,104-111
NUMA node6 CPU(s): 48-55,112-119
NUMA node7 CPU(s): 56-63,120-127
It has 2 disks with different filesystems as below
/dev/mapper/vg_mag-data2 ext4 5.1T 3.6T 1.2T 76% /mnt/data-mag2
/dev/mapper/vg_mag-data1 xfs 5.1T 1.6T 3.6T 31% /mnt/data-mag
I have created 2 tables each one on above filesystem.
test_size_copy.sh --> automated script to run copy test.
copy_script1, copy_script2 -> copy pg_bench script's used by
test_size_copy.sh to load to 2 different tables.
To run the above copy scripts in parallel, I have run them with equal weights as below.
./pgbench -c $threads -j $threads -f copy_script1@1 -f copy_script2@1
-T 120 postgres >> test_results.txt
Results :
-----------
Clients HEAD-TPS
--------- ---------------
1 84.460734
2 121.359035
4 175.886335
8 268.764828
16 369.996667
32 439.032756
64 482.185392
Clients N_RELEXTLOCK_ENTS = 1024 %diff with HEAD
----------------------------------------------------------------------------------
1 87.165777 3.20272258112273
2 131.094037 8.02165409439848
4 181.667104 3.2866504381935
8 267.412856 -0.503031594595423
16 376.118671 1.65461058058666
32 460.756357 4.94805927419228
64 492.723975 2.18558736428913
Not much of an improvement from HEAD
Clients N_RELEXTLOCK_ENTS = 1 %diff with HEAD
-----------------------------------------------------------------------------
1 86.288574 2.16412990206786
2 131.398667 8.27266960387414
4 168.681079 -4.09654109854526
8 245.841999 -8.52895416806549
16 321.972147 -12.9797169226933
32 375.783299 -14.4065462395703
64 360.134531 -25.3120196142317
So in the case of N_RELEXTLOCK_ENTS = 1 we can see a regression as high as 25%.
[1]: /messages/by-id/CAFiTN-tkX6gs-jL8VrPxg6OG9VUAKnObUq7r7pWQqASzdF5OwA@mail.gmail.com
--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com
On Tue, Jan 2, 2018 at 1:09 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
So in the case of N_RELEXTLOCK_ENTS = 1 we can see a regression as high as 25%.
So now the question is: what do these results mean for this patch?
I think that the chances of someone simultaneously bulk-loading 16 or
more relations that all happen to hash to the same relation extension
lock bucket is pretty darn small. Most people aren't going to be
running 16 bulk loads at the same time in the first place, and if they
are, then there's a good chance that at least some of those loads are
either actually to the same relation, or that many or all of the loads
are targeting the same filesystem and the bottleneck will occur at
that level, or that the loads are to relations which hash to different
buckets. Now, if we want to reduce the chances of hash collisions, we
could boost the default value of N_RELEXTLOCK_ENTS to 2048 or 4096.
However, if we take the position that no hash collision probability is
low enough and that we must eliminate all chance of false collisions,
except perhaps when the table is full, then we have to make this
locking mechanism a whole lot more complicated. We can no longer
compute the location of the lock we need without first taking some
other kind of lock that protects the mapping from {db_oid, rel_oid} ->
{memory address of the relevant lock}. We can no longer cache the
location where we found the lock last time so that we can retake it.
If we do that, we're adding extra cycles and extra atomics and extra
code that can harbor bugs to every relation extension to guard against
something which I'm not sure is really going to happen. Something
that's 3-8% faster in a case that occurs all the time and as much as
25% slower in a case that virtually never arises seems like it might
be a win overall.
However, it's quite possible that I'm not seeing the whole picture
here. Thoughts?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jan 5, 2018 at 1:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Jan 2, 2018 at 1:09 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
So in the case of N_RELEXTLOCK_ENTS = 1 we can see a regression as high as 25%.
Thank you for the performance measurement!
So now the question is: what do these results mean for this patch?
I think that the chances of someone simultaneously bulk-loading 16 or
more relations that all happen to hash to the same relation extension
lock bucket is pretty darn small. Most people aren't going to be
running 16 bulk loads at the same time in the first place, and if they
are, then there's a good chance that at least some of those loads are
either actually to the same relation, or that many or all of the loads
are targeting the same filesystem and the bottleneck will occur at
that level, or that the loads are to relations which hash to different
buckets. Now, if we want to reduce the chances of hash collisions, we
could boost the default value of N_RELEXTLOCK_ENTS to 2048 or 4096.

However, if we take the position that no hash collision probability is
low enough and that we must eliminate all chance of false collisions,
except perhaps when the table is full, then we have to make this
locking mechanism a whole lot more complicated. We can no longer
compute the location of the lock we need without first taking some
other kind of lock that protects the mapping from {db_oid, rel_oid} ->
{memory address of the relevant lock}. We can no longer cache the
location where we found the lock last time so that we can retake it.
If we do that, we're adding extra cycles and extra atomics and extra
code that can harbor bugs to every relation extension to guard against
something which I'm not sure is really going to happen. Something
that's 3-8% faster in a case that occurs all the time and as much as
25% slower in a case that virtually never arises seems like it might
be a win overall.

However, it's quite possible that I'm not seeing the whole picture
here. Thoughts?
I agree that the chances of the case where throughput gets worse are
pretty small and that we can get a performance improvement in common
cases. Also, false collisions could make us mistakenly overestimate the
number of blocks we need to add; thereby the performance might get worse
and we might extend a relation more than necessary, but I think the
chances are small. Considering further parallel operations (e.g. parallel
loading, parallel index creation, etc.), multiple processes will be
taking the relext lock of the same relation. Thinking of that, the
benefit of this patch, which improves the speed of acquiring/releasing
the lock, would be effective.
In short, I personally think the current patch is simple and the result
is not bad. But if the community cannot accept these degradations, we
have to deal with the problem. For example, we could make the length
of the relext lock array configurable by users. That way, users can
reduce the possibility of collisions. Or we could improve the relext
lock manager to eliminate false collisions by changing it to an
open-addressing hash table. The code would get complex, but false
collisions wouldn't happen unless the array is full.
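The open-addressing idea can be sketched roughly as follows. This is purely illustrative, not code from the patch: the names (`ExtLockSlot`, `lookup_slot`) and the toy hash are invented here, and a real version would key on {db_oid, rel_oid}, use atomics, and reclaim unused entries. Still, it shows why distinct relations would never share a lock entry until the array fills up.

```c
#include <stdbool.h>
#include <stdint.h>

#define N_SLOTS 8               /* tiny, to make probing visible; the patch uses 1024 */

typedef struct
{
    bool     used;              /* slot claimed? */
    uint32_t reloid;            /* tag; a real version also needs db_oid */
} ExtLockSlot;

static ExtLockSlot slots[N_SLOTS];

/*
 * Find the slot for reloid, probing linearly from its hash position and
 * claiming the first free slot on first use.  Distinct relations get
 * distinct slots (no false collisions); -1 means the array is full and
 * we would have to fall back to some sharing scheme.
 */
static int
lookup_slot(uint32_t reloid)
{
    uint32_t start = (reloid * 2654435761u) % N_SLOTS;  /* toy hash */

    for (int i = 0; i < N_SLOTS; i++)
    {
        uint32_t idx = (start + i) % N_SLOTS;

        if (!slots[idx].used)
        {
            slots[idx].used = true;
            slots[idx].reloid = reloid;
            return (int) idx;
        }
        if (slots[idx].reloid == reloid)
            return (int) idx;
    }
    return -1;
}
```

The cost Robert worries about upthread is visible even in this sketch: a lookup is no longer a single hash-and-modulo, and a cached slot index can go stale once entries can move or be reclaimed.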
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Sun, Jan 7, 2018 at 11:26 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Jan 5, 2018 at 1:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Jan 2, 2018 at 1:09 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
So in the case of N_RELEXTLOCK_ENTS = 1 we can see a regression as high as 25%.
Thank you for the performance measurement!
So now the question is: what do these results mean for this patch?
I think that the chances of someone simultaneously bulk-loading 16 or
more relations that all happen to hash to the same relation extension
lock bucket is pretty darn small. Most people aren't going to be
running 16 bulk loads at the same time in the first place, and if they
are, then there's a good chance that at least some of those loads are
either actually to the same relation, or that many or all of the loads
are targeting the same filesystem and the bottleneck will occur at
that level, or that the loads are to relations which hash to different
buckets. Now, if we want to reduce the chances of hash collisions, we
could boost the default value of N_RELEXTLOCK_ENTS to 2048 or 4096.

However, if we take the position that no hash collision probability is
low enough and that we must eliminate all chance of false collisions,
except perhaps when the table is full, then we have to make this
locking mechanism a whole lot more complicated. We can no longer
compute the location of the lock we need without first taking some
other kind of lock that protects the mapping from {db_oid, rel_oid} ->
{memory address of the relevant lock}. We can no longer cache the
location where we found the lock last time so that we can retake it.
If we do that, we're adding extra cycles and extra atomics and extra
code that can harbor bugs to every relation extension to guard against
something which I'm not sure is really going to happen. Something
that's 3-8% faster in a case that occurs all the time and as much as
25% slower in a case that virtually never arises seems like it might
be a win overall.

However, it's quite possible that I'm not seeing the whole picture
here. Thoughts?

I agree that the chances of the case where throughput gets worse are
pretty small and that we can get a performance improvement in common
cases. Also, false collisions could make us mistakenly overestimate the
number of blocks we need to add; thereby the performance might get worse
and we might extend a relation more than necessary, but I think the
chances are small. Considering further parallel operations (e.g. parallel
loading, parallel index creation, etc.), multiple processes will be
taking the relext lock of the same relation. Thinking of that, the
benefit of this patch, which improves the speed of acquiring/releasing
the lock, would be effective.

In short, I personally think the current patch is simple and the result
is not bad. But if the community cannot accept these degradations, we
have to deal with the problem. For example, we could make the length
of the relext lock array configurable by users. That way, users can
reduce the possibility of collisions. Or we could improve the relext
lock manager to eliminate false collisions by changing it to an
open-addressing hash table. The code would get complex, but false
collisions wouldn't happen unless the array is full.
On second thought, perhaps we should also do a performance measurement
with the patch that uses an HTAB instead of a fixed array. Probably the
performance with that patch will be equal to or slightly greater than
current HEAD, hopefully not worse. In addition to that, if the
performance degradation from false collisions doesn't happen, or we can
avoid it by increasing a GUC parameter, I think it's better than the
current fixed-array approach. Thoughts?
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Hi,
On 2018-01-04 11:39:40 -0500, Robert Haas wrote:
On Tue, Jan 2, 2018 at 1:09 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
So in case of N_RELEXTLOCK_ENTS = 1 we can see regression as high 25%. ?
So now the question is: what do these results mean for this patch?
I think that the chances of someone simultaneously bulk-loading 16 or
more relations that all happen to hash to the same relation extension
lock bucket is pretty darn small.
I'm not convinced that that's true. Especially with partitioning in the
mix.
Also, the birthday paradox and all that make collisions not that
unlikely. And you really don't need a 16-way conflict to feel pain,
you'll imo feel it earlier.
I think bumping up the size a bit would make that less likely. Not sure
it actually addresses the issue.
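To put rough numbers on the birthday-paradox point: with k relations being extended concurrently and N buckets, the chance that at least two share a bucket is 1 minus the probability that all k hash distinctly. A back-of-the-envelope helper (illustrative arithmetic only, not code from the patch; assumes a uniform hash):

```c
/*
 * P(at least one pair of the k concurrently-extended relations hashes
 * to the same of n_buckets buckets), assuming a uniform hash.
 */
static double
collision_prob(int n_buckets, int k)
{
    double p_all_distinct = 1.0;

    for (int i = 0; i < k; i++)
        p_all_distinct *= (double) (n_buckets - i) / (double) n_buckets;
    return 1.0 - p_all_distinct;
}
```

With n_buckets = 1024 this gives roughly 0.1% for k = 2, about 11% for k = 16, and around 50% somewhere near k = 38; so some pair sharing a bucket becomes likely well before any 16-way pile-up, though a single shared pair also contends far less than a 16-way conflict.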
However, if we take the position that no hash collision probability is
low enough and that we must eliminate all chance of false collisions,
except perhaps when the table is full, then we have to make this
locking mechanism a whole lot more complicated. We can no longer
compute the location of the lock we need without first taking some
other kind of lock that protects the mapping from {db_oid, rel_oid} ->
{memory address of the relevant lock}.
Hm, that's not necessarily true, is it? While not trivial, it also
doesn't seem impossible?
Greetings,
Andres Freund
On Thu, Mar 1, 2018 at 2:17 PM, Andres Freund <andres@anarazel.de> wrote:
However, if we take the position that no hash collision probability is
low enough and that we must eliminate all chance of false collisions,
except perhaps when the table is full, then we have to make this
locking mechanism a whole lot more complicated. We can no longer
compute the location of the lock we need without first taking some
other kind of lock that protects the mapping from {db_oid, rel_oid} ->
{memory address of the relevant lock}.

Hm, that's not necessarily true, is it? While not trivial, it also
doesn't seem impossible?
You can't both store every lock at a fixed address and at the same
time put locks at a different address if the one they would have used
is already occupied.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2018-03-01 15:37:17 -0500, Robert Haas wrote:
On Thu, Mar 1, 2018 at 2:17 PM, Andres Freund <andres@anarazel.de> wrote:
However, if we take the position that no hash collision probability is
low enough and that we must eliminate all chance of false collisions,
except perhaps when the table is full, then we have to make this
locking mechanism a whole lot more complicated. We can no longer
compute the location of the lock we need without first taking some
other kind of lock that protects the mapping from {db_oid, rel_oid} ->
{memory address of the relevant lock}.

Hm, that's not necessarily true, is it? While not trivial, it also
doesn't seem impossible?

You can't both store every lock at a fixed address and at the same
time put locks at a different address if the one they would have used
is already occupied.
Right, but why does that require a lock?
Greetings,
Andres Freund
On Thu, Mar 1, 2018 at 3:40 PM, Andres Freund <andres@anarazel.de> wrote:
You can't both store every lock at a fixed address and at the same
time put locks at a different address if the one they would have used
is already occupied.

Right, but why does that require a lock?
Maybe I'm being dense here but ... how could it not?
If the lock for relation X is always at pointer P, then I can compute
the address for the lock and assume it will be there, because that's
where it *always is*.
If the lock for relation X can be at any of various addresses
depending on other system activity, then I cannot assume that an
address that I compute for it remains valid except for so long as I
hold a lock strong enough to keep it from being moved.
Concretely, I imagine that if you put the lock at different addresses
at different times, you would implement that by reclaiming unused
entries to make room for new entries that you need to allocate. So if
I hold the lock at 0x1000, I can probably assume it will stay
there for as long as I hold it. But the instant I release it, even
for a moment, somebody might garbage-collect the entry and reallocate
it for something else. Now the next time I need it, it will be
elsewhere. I'll have to search for it, I presume, while holding some
analogue of the buffer-mapping lock. In the patch as proposed, that's
not needed. Once you know that the lock for relation 123 is at
0x1000, you can just keep locking it at that same address without
checking anything, which is quite appealing given that the same
backend extending the same relation many times in a row is a pretty
common pattern.
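The fixed-address property being defended here can be sketched in a few lines. The function name and the mixing constants below are stand-ins (the actual patch would derive the slot from the lock tag using PostgreSQL's own hash support), but the essential point survives: the slot is a pure function of {db_oid, rel_oid}, so a backend can compute it once, cache it, and relock without consulting any mapping structure.

```c
#include <stdint.h>

#define N_RELEXTLOCK_ENTS 1024  /* default array size in the patch under discussion */

/*
 * Slot index for a relation's extension lock.  Because it depends only
 * on the tag, the lock never moves: no mapping lock, no search, and the
 * result can be cached safely.  The price is that two relations whose
 * tags hash to the same slot share a lock (a "false collision").
 */
static uint32_t
relextlock_slot(uint32_t dboid, uint32_t reloid)
{
    uint32_t h = dboid * 0x9E3779B1u ^ reloid * 0x85EBCA77u;    /* toy mix */

    h ^= h >> 16;
    return h % N_RELEXTLOCK_ENTS;
}
```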
If you have a clever idea how to make this work with as few atomic
operations as the current patch uses while at the same time reducing
the possibility of contention, I'm all ears. But I don't see how to
do that.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Mar 01, 2018 at 04:01:28PM -0500, Robert Haas wrote:
If you have a clever idea how to make this work with as few atomic
operations as the current patch uses while at the same time reducing
the possibility of contention, I'm all ears. But I don't see how to
do that.
This thread has had no activity since the beginning of the commit fest, and
it seems that it would be hard to reach something committable for v11,
so I am marking it as returned with feedback.
--
Michael
On Fri, Mar 30, 2018 at 4:43 PM, Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Mar 01, 2018 at 04:01:28PM -0500, Robert Haas wrote:
If you have a clever idea how to make this work with as few atomic
operations as the current patch uses while at the same time reducing
the possibility of contention, I'm all ears. But I don't see how to
do that.

This thread has had no activity since the beginning of the commit fest, and
it seems that it would be hard to reach something committable for v11,
so I am marking it as returned with feedback.
Thank you.
The probability of performance degradation can be reduced by
increasing N_RELEXTLOCK_ENTS. But as Robert mentioned, while keeping
a fast and simple implementation that acquires the lock with a few
atomic operations, it's hard to improve, or at least keep, the current
performance in all cases. I was thinking that this patch is needed
by parallel DML operations and vacuum, but if the community cannot
accept this approach it might be better to mark it as "Rejected" and
then I should reconsider the design of parallel vacuum.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Tue, Apr 10, 2018 at 5:40 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
The probability of performance degradation can be reduced by
increasing N_RELEXTLOCK_ENTS. But as Robert mentioned, while keeping
a fast and simple implementation that acquires the lock with a few
atomic operations, it's hard to improve, or at least keep, the current
performance in all cases. I was thinking that this patch is needed
by parallel DML operations and vacuum, but if the community cannot
accept this approach it might be better to mark it as "Rejected" and
then I should reconsider the design of parallel vacuum.
I'm sorry that I didn't get time to work further on this during the
CommitFest. In terms of moving forward, I'd still like to hear what
Andres has to say about the comments I made on March 1st.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Apr 11, 2018 at 1:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Apr 10, 2018 at 5:40 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
The probability of performance degradation can be reduced by
increasing N_RELEXTLOCK_ENTS. But as Robert mentioned, it's hard to
improve, or even just preserve, the current performance in all cases
while keeping the implementation fast and simple (acquiring the lock
with a few atomic operations). I was thinking that this patch would be
needed by parallel DML operations and parallel vacuum, but if the
community cannot accept this approach it might be better to mark it as
"Rejected", and then I should reconsider the design of parallel vacuum.

I'm sorry that I didn't get time to work further on this during the
CommitFest.
Not a problem. There were a lot of items, especially at the last CommitFest.
In terms of moving forward, I'd still like to hear what
Andres has to say about the comments I made on March 1st.
Yeah, agreed.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Tue, Apr 10, 2018 at 9:08 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Not a problem. There were a lot of items, especially at the last CommitFest.
In terms of moving forward, I'd still like to hear what
Andres has to say about the comments I made on March 1st.

Yeah, agreed.
$ ping -n andres.freund
Request timeout for icmp_seq 0
Request timeout for icmp_seq 1
Request timeout for icmp_seq 2
Request timeout for icmp_seq 3
Request timeout for icmp_seq 4
^C
--- andres.freund ping statistics ---
6 packets transmitted, 0 packets received, 100.0% packet loss
Meanwhile, /messages/by-id/4c171ffe-e3ee-acc5-9066-a40d52bc5ae9@postgrespro.ru
shows that this patch has some benefits for other cases, which is a
point in favor IMHO.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Apr 26, 2018 at 3:30 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Apr 10, 2018 at 9:08 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Not a problem. There were a lot of items, especially at the last CommitFest.
In terms of moving forward, I'd still like to hear what
Andres has to say about the comments I made on March 1st.

Yeah, agreed.
$ ping -n andres.freund
Request timeout for icmp_seq 0
Request timeout for icmp_seq 1
Request timeout for icmp_seq 2
Request timeout for icmp_seq 3
Request timeout for icmp_seq 4
^C
--- andres.freund ping statistics ---
6 packets transmitted, 0 packets received, 100.0% packet loss

Meanwhile, /messages/by-id/4c171ffe-e3ee-acc5-9066-a40d52bc5ae9@postgrespro.ru
shows that this patch has some benefits for other cases, which is a
point in favor IMHO.
Thank you for sharing. That's good to know.
Andres pointed out performance degradation due to hash collisions when
multiple processes are bulk-loading. I think the point is that this
happens where users cannot see it. Therefore, even if we made
N_RELEXTLOCK_ENTS a configurable parameter, users wouldn't know when
to tune it, since they can't observe the hash collisions.
So, just an idea, but how about adding an SQL-callable function that
returns the estimated number of lock waiters for a given relation?
Since users know how many processes are loading into the relation, if
the value returned by the function is greater than expected, they can
detect the hash collision and consider increasing N_RELEXTLOCK_ENTS.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Thu, Apr 26, 2018 at 2:10 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Thank you for sharing. That's good to know.
Andres pointed out performance degradation due to hash collisions when
multiple processes are bulk-loading. I think the point is that this
happens where users cannot see it. Therefore, even if we made
N_RELEXTLOCK_ENTS a configurable parameter, users wouldn't know when
to tune it, since they can't observe the hash collisions.

So, just an idea, but how about adding an SQL-callable function that
returns the estimated number of lock waiters for a given relation?
Since users know how many processes are loading into the relation, if
the value returned by the function is greater than expected, they can
detect the hash collision and consider increasing N_RELEXTLOCK_ENTS.
I don't think that's a very useful suggestion. Changing
N_RELEXTLOCK_ENTS requires a recompile, which is going to be
impractical for most users. Even if we made it a GUC, we don't want
users to have to tune stuff like this. If we actually think this is
going to be a problem, we'd probably better rethink the design.
I think the real question is whether the scenario is common enough to
worry about. In practice, you'd have to be extremely unlucky to be
doing many bulk loads at the same time that all happened to hash to
the same bucket.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2018-04-26 15:08:24 -0400, Robert Haas wrote:
I don't think that's a very useful suggestion. Changing
N_RELEXTLOCK_ENTS requires a recompile, which is going to be
impractical for most users. Even if we made it a GUC, we don't want
users to have to tune stuff like this. If we actually think this is
going to be a problem, we'd probably better rethink the design.
Agreed.
I think the real question is whether the scenario is common enough to
worry about. In practice, you'd have to be extremely unlucky to be
doing many bulk loads at the same time that all happened to hash to
the same bucket.
With a bunch of parallel bulkloads into partitioned tables that really
doesn't seem that unlikely?
Greetings,
Andres Freund
On Thu, Apr 26, 2018 at 3:10 PM, Andres Freund <andres@anarazel.de> wrote:
I think the real question is whether the scenario is common enough to
worry about. In practice, you'd have to be extremely unlucky to be
doing many bulk loads at the same time that all happened to hash to
the same bucket.

With a bunch of parallel bulkloads into partitioned tables that really
doesn't seem that unlikely?
It increases the likelihood of collisions, but probably decreases the
number of cases where the contention gets really bad.
For example, suppose each table has 100 partitions and you are
bulk-loading 10 of them at a time. It's virtually certain that you
will have some collisions, but the amount of contention within each
bucket will remain fairly low because each backend spends only 1% of
its time in the bucket corresponding to any given partition.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
-----Original Message-----
From: Robert Haas <robertmhaas@gmail.com>
Sent: Thursday, April 26, 2018 10:25 PM
To: Andres Freund <andres@anarazel.de>
Cc: Masahiko Sawada <sawada.mshk@gmail.com>; Michael Paquier <michael@paquier.xyz>; Mithun Cy <mithun.cy@enterprisedb.com>; Tom Lane <tgl@sss.pgh.pa.us>; Thomas Munro <thomas.munro@enterprisedb.com>; Amit Kapila <amit.kapila16@gmail.com>; PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
On Thu, Apr 26, 2018 at 3:10 PM, Andres Freund <andres@anarazel.de> wrote:
I think the real question is whether the scenario is common enough to
worry about. In practice, you'd have to be extremely unlucky to be
doing many bulk loads at the same time that all happened to hash to
the same bucket.

With a bunch of parallel bulkloads into partitioned tables that really
doesn't seem that unlikely?
It increases the likelihood of collisions, but probably decreases the number of cases where the contention gets really bad.
For example, suppose each table has 100 partitions and you are bulk-loading 10 of them at a time. It's virtually certain that you will have some collisions, but the amount of contention within each bucket will remain fairly low because each backend spends only 1% of its time in the bucket corresponding to any given partition.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hello!
I want to test this patch on a 302-core (704 with hyper-threading) machine.
Patching against master (commit 81256cd05f0745353c6572362155b57250a0d2a0) is OK,
but I got some errors while compiling:
gistvacuum.c: In function ‘gistvacuumcleanup’:
gistvacuum.c:92:3: error: too many arguments to function ‘LockRelationForExtension’
LockRelationForExtension(rel, ExclusiveLock);
^
In file included from gistvacuum.c:21:0:
../../../../src/include/storage/extension_lock.h:30:13: note: declared here
extern void LockRelationForExtension(Relation relation);
^
gistvacuum.c:95:3: error: too many arguments to function ‘UnlockRelationForExtension’
UnlockRelationForExtension(rel, ExclusiveLock);
^
In file included from gistvacuum.c:21:0:
../../../../src/include/storage/extension_lock.h:31:13: note: declared here
extern void UnlockRelationForExtension(Relation relation);
--
Alex Ignatov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
-----Original Message-----
From: Alex Ignatov <a.ignatov@postgrespro.ru>
Sent: Monday, May 21, 2018 6:00 PM
To: 'Robert Haas' <robertmhaas@gmail.com>; 'Andres Freund' <andres@anarazel.de>
Cc: 'Masahiko Sawada' <sawada.mshk@gmail.com>; 'Michael Paquier' <michael@paquier.xyz>; 'Mithun Cy' <mithun.cy@enterprisedb.com>; 'Tom Lane' <tgl@sss.pgh.pa.us>; 'Thomas Munro' <thomas.munro@enterprisedb.com>; 'Amit Kapila' <amit.kapila16@gmail.com>; 'PostgreSQL-development' <pgsql-hackers@postgresql.org>
Subject: RE: [HACKERS] Moving relation extension locks out of heavyweight lock manager
-----Original Message-----
From: Robert Haas <robertmhaas@gmail.com>
Sent: Thursday, April 26, 2018 10:25 PM
To: Andres Freund <andres@anarazel.de>
Cc: Masahiko Sawada <sawada.mshk@gmail.com>; Michael Paquier <michael@paquier.xyz>; Mithun Cy <mithun.cy@enterprisedb.com>; Tom Lane <tgl@sss.pgh.pa.us>; Thomas Munro <thomas.munro@enterprisedb.com>; Amit Kapila <amit.kapila16@gmail.com>; PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
On Thu, Apr 26, 2018 at 3:10 PM, Andres Freund <andres@anarazel.de> wrote:
I think the real question is whether the scenario is common enough to
worry about. In practice, you'd have to be extremely unlucky to be
doing many bulk loads at the same time that all happened to hash to
the same bucket.

With a bunch of parallel bulkloads into partitioned tables that really
doesn't seem that unlikely?
It increases the likelihood of collisions, but probably decreases the number of cases where the contention gets really bad.
For example, suppose each table has 100 partitions and you are bulk-loading 10 of them at a time. It's virtually certain that you will have some collisions, but the amount of contention within each bucket will remain fairly low because each backend spends only 1% of its time in the bucket corresponding to any given partition.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hello!
I want to test this patch on a 302-core (704 with hyper-threading) machine.
Patching against master (commit 81256cd05f0745353c6572362155b57250a0d2a0) is OK, but I got some errors while compiling:
gistvacuum.c: In function ‘gistvacuumcleanup’:
gistvacuum.c:92:3: error: too many arguments to function ‘LockRelationForExtension’
LockRelationForExtension(rel, ExclusiveLock);
^
In file included from gistvacuum.c:21:0:
../../../../src/include/storage/extension_lock.h:30:13: note: declared here
extern void LockRelationForExtension(Relation relation);
^
gistvacuum.c:95:3: error: too many arguments to function ‘UnlockRelationForExtension’
UnlockRelationForExtension(rel, ExclusiveLock);
^
In file included from gistvacuum.c:21:0:
../../../../src/include/storage/extension_lock.h:31:13: note: declared here
extern void UnlockRelationForExtension(Relation relation);
Sorry, forgot to mention that patch version is extension-lock-v12.patch
--
Alex Ignatov
Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Tue, May 22, 2018 at 12:05 AM, Alex Ignatov <a.ignatov@postgrespro.ru> wrote:
--
Alex Ignatov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

-----Original Message-----
From: Alex Ignatov <a.ignatov@postgrespro.ru>
Sent: Monday, May 21, 2018 6:00 PM
To: 'Robert Haas' <robertmhaas@gmail.com>; 'Andres Freund' <andres@anarazel.de>
Cc: 'Masahiko Sawada' <sawada.mshk@gmail.com>; 'Michael Paquier' <michael@paquier.xyz>; 'Mithun Cy' <mithun.cy@enterprisedb.com>; 'Tom Lane' <tgl@sss.pgh.pa.us>; 'Thomas Munro' <thomas.munro@enterprisedb.com>; 'Amit Kapila' <amit.kapila16@gmail.com>; 'PostgreSQL-development' <pgsql-hackers@postgresql.org>
Subject: RE: [HACKERS] Moving relation extension locks out of heavyweight lock manager

-----Original Message-----
From: Robert Haas <robertmhaas@gmail.com>
Sent: Thursday, April 26, 2018 10:25 PM
To: Andres Freund <andres@anarazel.de>
Cc: Masahiko Sawada <sawada.mshk@gmail.com>; Michael Paquier <michael@paquier.xyz>; Mithun Cy <mithun.cy@enterprisedb.com>; Tom Lane <tgl@sss.pgh.pa.us>; Thomas Munro <thomas.munro@enterprisedb.com>; Amit Kapila <amit.kapila16@gmail.com>; PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

On Thu, Apr 26, 2018 at 3:10 PM, Andres Freund <andres@anarazel.de> wrote:
I think the real question is whether the scenario is common enough to
worry about. In practice, you'd have to be extremely unlucky to be
doing many bulk loads at the same time that all happened to hash to
the same bucket.

With a bunch of parallel bulkloads into partitioned tables that really
doesn't seem that unlikely?

It increases the likelihood of collisions, but probably decreases the number of cases where the contention gets really bad.
For example, suppose each table has 100 partitions and you are bulk-loading 10 of them at a time. It's virtually certain that you will have some collisions, but the amount of contention within each bucket will remain fairly low because each backend spends only 1% of its time in the bucket corresponding to any given partition.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

Hello!
I want to try to test this patch on 302(704 ht) core machine.

Patching on master (commit 81256cd05f0745353c6572362155b57250a0d2a0) is ok but got some error while compiling :
Thank you for reporting.
Attached is a rebased patch against current HEAD.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachments:
extension-lock-v13.patchapplication/octet-stream; name=extension-lock-v13.patchDownload
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c278076..152fe59 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -669,8 +669,8 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
Heavyweight locks, also known as lock manager locks or simply locks,
primarily protect SQL-visible objects such as tables. However,
they are also used to ensure mutual exclusion for certain internal
- operations such as relation extension. <literal>wait_event</literal> will
- identify the type of lock awaited.
+ operations such as waiting for a transaction to finish.
+ <literal>wait_event</literal> will identify the type of lock awaited.
</para>
</listitem>
<listitem>
@@ -1127,15 +1127,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
counters during Parallel Hash plan execution.</entry>
</row>
<row>
- <entry morerows="9"><literal>Lock</literal></entry>
+ <entry morerows="8"><literal>Lock</literal></entry>
<entry><literal>relation</literal></entry>
<entry>Waiting to acquire a lock on a relation.</entry>
</row>
<row>
- <entry><literal>extend</literal></entry>
- <entry>Waiting to extend a relation.</entry>
- </row>
- <row>
<entry><literal>page</literal></entry>
<entry>Waiting to acquire a lock on page of a relation.</entry>
</row>
@@ -1268,7 +1264,7 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting in an extension.</entry>
</row>
<row>
- <entry morerows="33"><literal>IPC</literal></entry>
+ <entry morerows="34"><literal>IPC</literal></entry>
<entry><literal>BgWorkerShutdown</literal></entry>
<entry>Waiting for background worker to shut down.</entry>
</row>
@@ -1389,6 +1385,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting for group leader to update transaction status at transaction end.</entry>
</row>
<row>
+ <entry><literal>RelationExtensionLock</literal></entry>
+ <entry>Waiting to extend a relation.</entry>
+ </row>
+ <row>
<entry><literal>ReplicationOriginDrop</literal></entry>
<entry>Waiting for a replication origin to become inactive to be dropped.</entry>
</row>
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 040cb62..2af6aa5 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -17,6 +17,7 @@
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/freespace.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
@@ -634,8 +635,7 @@ brin_page_cleanup(Relation idxrel, Buffer buf)
*/
if (PageIsNew(page))
{
- LockRelationForExtension(idxrel, ShareLock);
- UnlockRelationForExtension(idxrel, ShareLock);
+ WaitForRelationExtensionLockToBeFree(idxrel);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
if (PageIsNew(page))
@@ -729,7 +729,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
*/
if (!RELATION_IS_LOCAL(irel))
{
- LockRelationForExtension(irel, ExclusiveLock);
+ LockRelationForExtension(irel);
extensionLockHeld = true;
}
buf = ReadBuffer(irel, P_NEW);
@@ -777,7 +777,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
brin_initialize_empty_new_buffer(irel, buf);
if (extensionLockHeld)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel);
ReleaseBuffer(buf);
@@ -795,7 +795,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
if (extensionLockHeld)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel);
page = BufferGetPage(buf);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index f0dd72a..4a3ae19 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -29,6 +29,7 @@
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/lmgr.h"
#include "utils/rel.h"
@@ -570,7 +571,7 @@ revmap_physical_extend(BrinRevmap *revmap)
else
{
if (needLock)
- LockRelationForExtension(irel, ExclusiveLock);
+ LockRelationForExtension(irel);
buf = ReadBuffer(irel, P_NEW);
if (BufferGetBlockNumber(buf) != mapBlk)
@@ -582,7 +583,7 @@ revmap_physical_extend(BrinRevmap *revmap)
* page from under whoever is using it.
*/
if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel);
LockBuffer(revmap->rm_metaBuf, BUFFER_LOCK_UNLOCK);
ReleaseBuffer(buf);
return;
@@ -591,7 +592,7 @@ revmap_physical_extend(BrinRevmap *revmap)
page = BufferGetPage(buf);
if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel);
}
/* Check that it's a regular block (or an empty page) */
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 0a32182..b7e1fc1 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -21,6 +21,7 @@
#include "catalog/pg_collation.h"
#include "catalog/pg_type.h"
#include "miscadmin.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
@@ -327,13 +328,13 @@ GinNewBuffer(Relation index)
/* Must extend the file */
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
buffer = ReadBuffer(index, P_NEW);
LockBuffer(buffer, GIN_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
return buffer;
}
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 3104bc1..82232fd 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -20,6 +20,7 @@
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "postmaster/autovacuum.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
@@ -723,10 +724,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
npages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
totFreePages = 0;
@@ -773,10 +774,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
stats->pages_free = totFreePages;
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
stats->num_pages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
return stats;
}
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 55cccd2..c96798c 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -20,6 +20,7 @@
#include "access/htup_details.h"
#include "access/reloptions.h"
#include "catalog/pg_opclass.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/builtins.h"
@@ -821,13 +822,13 @@ gistNewBuffer(Relation r)
needLock = !RELATION_IS_LOCAL(r);
if (needLock)
- LockRelationForExtension(r, ExclusiveLock);
+ LockRelationForExtension(r);
buffer = ReadBuffer(r, P_NEW);
LockBuffer(buffer, GIST_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(r, ExclusiveLock);
+ UnlockRelationForExtension(r);
return buffer;
}
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 5948218..a6c50c2 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -18,6 +18,7 @@
#include "access/gist_private.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
@@ -50,10 +51,10 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
/* try to find deleted pages */
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
npages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
totFreePages = 0;
tuplesCount = 0;
@@ -88,10 +89,10 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
/* return statistics */
stats->pages_free = totFreePages;
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
stats->num_pages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
stats->num_index_tuples = tuplesCount;
stats->estimated_count = false;
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index b8b5871..135ddca 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -20,6 +20,7 @@
#include "access/htup_details.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/freespace.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
@@ -183,7 +184,7 @@ RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
int lockWaiters;
/* Use the length of the lock wait queue to judge how much to extend. */
- lockWaiters = RelationExtensionLockWaiterCount(relation);
+ lockWaiters = EstimateNumberOfExtensionLockWaiters(relation);
if (lockWaiters <= 0)
return;
@@ -535,11 +536,11 @@ loop:
if (needLock)
{
if (!use_fsm)
- LockRelationForExtension(relation, ExclusiveLock);
- else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock))
+ LockRelationForExtension(relation);
+ else if (!ConditionalLockRelationForExtension(relation))
{
/* Couldn't get the lock immediately; wait for it. */
- LockRelationForExtension(relation, ExclusiveLock);
+ LockRelationForExtension(relation);
/*
* Check if some other backend has extended a block for us while
@@ -553,7 +554,7 @@ loop:
*/
if (targetBlock != InvalidBlockNumber)
{
- UnlockRelationForExtension(relation, ExclusiveLock);
+ UnlockRelationForExtension(relation);
goto loop;
}
@@ -592,7 +593,7 @@ loop:
* against vacuumlazy.c --- see comments therein.
*/
if (needLock)
- UnlockRelationForExtension(relation, ExclusiveLock);
+ UnlockRelationForExtension(relation);
/*
* We need to initialize the empty new page. Double-check that it really
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index b251e69..890e062 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -90,6 +90,7 @@
#include "access/xlog.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
#include "utils/inval.h"
@@ -641,7 +642,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
* Note that another backend might have extended or created the relation
* by the time we get the lock.
*/
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
@@ -679,7 +680,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
/* Update local cache with the up-to-date size */
rel->rd_smgr->smgr_vm_nblocks = vm_nblocks_now;
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
pfree(pg);
}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 22b4a75..02f9e96 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -28,6 +28,7 @@
#include "access/xlog.h"
#include "access/xloginsert.h"
#include "miscadmin.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
@@ -817,7 +818,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
needLock = !RELATION_IS_LOCAL(rel);
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
buf = ReadBuffer(rel, P_NEW);
@@ -831,7 +832,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
* condition against btvacuumscan --- see comments therein.
*/
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
/* Initialize the new page before returning it */
page = BufferGetPage(buf);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 27a3032..8b22b2e 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -28,6 +28,7 @@
#include "pgstat.h"
#include "postmaster/autovacuum.h"
#include "storage/condition_variable.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
@@ -1014,10 +1015,10 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
{
/* Get the current relation length */
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
num_pages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
/* Quit if we've scanned the whole relation */
if (blkno >= num_pages)
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 4a9b5da..dee54e1 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -23,6 +23,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/builtins.h"
+#include "storage/extension_lock.h"
#include "utils/index_selfuncs.h"
#include "utils/lsyscache.h"
@@ -248,13 +249,13 @@ SpGistNewBuffer(Relation index)
/* Must extend the file */
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
buffer = ReadBuffer(index, P_NEW);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
return buffer;
}
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index a83a4b5..88e1a94 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -24,6 +24,7 @@
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/snapmgr.h"
@@ -824,10 +825,10 @@ spgvacuumscan(spgBulkDeleteState *bds)
{
/* Get the current relation length */
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
num_pages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
/* Quit if we've scanned the whole relation */
if (blkno >= num_pages)
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 5649a70..c4fafd0 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -53,6 +53,7 @@
#include "portability/instr_time.h"
#include "postmaster/autovacuum.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/freespace.h"
#include "storage/lmgr.h"
#include "utils/lsyscache.h"
@@ -872,8 +873,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* it's got exclusive lock on the whole relation.
*/
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
- LockRelationForExtension(onerel, ExclusiveLock);
- UnlockRelationForExtension(onerel, ExclusiveLock);
+ WaitForRelationExtensionLockToBeFree(onerel);
LockBufferForCleanup(buf);
if (PageIsNew(page))
{
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 084573e..adfceff 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3671,6 +3671,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_CLOG_GROUP_UPDATE:
event_name = "ClogGroupUpdate";
break;
+ case WAIT_EVENT_RELATION_EXTENSION_LOCK:
+ event_name = "RelationExtensionLock";
+ break;
case WAIT_EVENT_REPLICATION_ORIGIN_DROP:
event_name = "ReplicationOriginDrop";
break;
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 65c4e74..0706dd7 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -26,6 +26,7 @@
#include "access/htup_details.h"
#include "access/xlogutils.h"
#include "miscadmin.h"
+#include "storage/extension_lock.h"
#include "storage/freespace.h"
#include "storage/fsm_internals.h"
#include "storage/lmgr.h"
@@ -611,7 +612,7 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
* Note that another backend might have extended or created the relation
* by the time we get the lock.
*/
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
@@ -639,7 +640,7 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
/* Update local cache with the up-to-date size */
rel->rd_smgr->smgr_fsm_nblocks = fsm_nblocks_now;
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
pfree(pg);
}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 0c86a58..521c485 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -35,6 +35,7 @@
#include "replication/origin.h"
#include "storage/bufmgr.h"
#include "storage/dsm.h"
+#include "storage/extension_lock.h"
#include "storage/ipc.h"
#include "storage/pg_shmem.h"
#include "storage/pmsignal.h"
@@ -133,6 +134,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
size = add_size(size, BackgroundWorkerShmemSize());
size = add_size(size, MultiXactShmemSize());
size = add_size(size, LWLockShmemSize());
+ size = add_size(size, RelExtLockShmemSize());
size = add_size(size, ProcArrayShmemSize());
size = add_size(size, BackendStatusShmemSize());
size = add_size(size, SInvalShmemSize());
@@ -235,6 +237,11 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
InitPredicateLocks();
/*
+ * Set up relation extension lock manager
+ */
+ InitRelExtLocks();
+
+ /*
* Set up process table
*/
if (!IsUnderPostmaster)
diff --git a/src/backend/storage/lmgr/Makefile b/src/backend/storage/lmgr/Makefile
index 8179f6d..fcb6d46 100644
--- a/src/backend/storage/lmgr/Makefile
+++ b/src/backend/storage/lmgr/Makefile
@@ -13,7 +13,7 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = lmgr.o lock.o proc.o deadlock.o lwlock.o lwlocknames.o spin.o \
- s_lock.o predicate.o condition_variable.o
+ s_lock.o predicate.o condition_variable.o extension_lock.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 56b0a12..960d1f3 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -3,7 +3,7 @@ src/backend/storage/lmgr/README
Locking Overview
================
-Postgres uses four types of interprocess locks:
+Postgres uses five types of interprocess locks:
* Spinlocks. These are intended for *very* short-term locks. If a lock
is to be held more than a few dozen instructions, or across any sort of
@@ -36,13 +36,21 @@ Regular locks should be used for all user-driven lock requests.
* SIReadLock predicate locks. See separate README-SSI file for details.
+* Relation extension locks. Only one process can extend a relation at
+a time; we use a specialized lock manager for this purpose, which is
+much simpler than the regular lock manager. It is similar to the
+lightweight lock mechanism, but is even simpler because there is only
+one lock mode and only one lock can be taken at a time. A process holding
+a relation extension lock is interruptible, unlike a process holding an
+LWLock.
+
Acquisition of either a spinlock or a lightweight lock causes query
cancel and die() interrupts to be held off until all such locks are
released. No such restriction exists for regular locks, however. Also
note that we can accept query cancel and die() interrupts while waiting
-for a regular lock, but we will not accept them while waiting for
-spinlocks or LW locks. It is therefore not a good idea to use LW locks
-when the wait time might exceed a few seconds.
+for a relation extension lock or a regular lock, but we will not accept
+them while waiting for spinlocks or LW locks. It is therefore not a good
+idea to use LW locks when the wait time might exceed a few seconds.
The rest of this README file discusses the regular lock manager in detail.
diff --git a/src/backend/storage/lmgr/extension_lock.c b/src/backend/storage/lmgr/extension_lock.c
new file mode 100644
index 0000000..5a3bf5e
--- /dev/null
+++ b/src/backend/storage/lmgr/extension_lock.c
@@ -0,0 +1,469 @@
+/*-------------------------------------------------------------------------
+ *
+ * extension_lock.c
+ * Relation extension lock manager
+ *
+ * This specialized lock manager is used only for relation extension
+ * locks. Unlike the heavyweight lock manager, it doesn't provide
+ * deadlock detection or group locking. Unlike lwlock.c, extension lock
+ * waits are interruptible. Unlike both systems, there is only one lock
+ * mode.
+ *
+ * False sharing is possible. We have a fixed-size array of locks, and
+ * every database OID/relation OID combination is mapped to a slot in
+ * the array. Therefore, if two processes try to extend relations that
+ * map to the same array slot, they will contend even though it would
+ * be OK to let both proceed at once. Since these locks are typically
+ * taken only for very short periods of time, this doesn't seem likely
+ * to be a big problem in practice. If it is, we could make the array
+ * bigger.
+ *
+ * The extension lock manager is much faster than the regular heavyweight
+ * lock manager. The lack of group locking is a feature, not a bug,
+ * because while cooperating backends can all (for example) access a
+ * relation on which they jointly hold AccessExclusiveLock at the same time,
+ * it's not safe for them to extend the relation at the same time.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/lmgr/extension_lock.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "pgstat.h"
+
+#include "catalog/catalog.h"
+#include "postmaster/postmaster.h"
+#include "storage/extension_lock.h"
+#include "utils/rel.h"
+
+#define N_RELEXTLOCK_ENTS 1024
+
+/*
+ * We can't use bit 31 as the lock bit because pg_atomic_sub_fetch_u32 can't
+ * handle an attempt to subtract INT_MIN.
+ */
+#define RELEXT_LOCK_BIT ((uint32) 1 << 30)
+#define RELEXT_WAIT_COUNT_MASK (RELEXT_LOCK_BIT - 1)
+
+typedef struct RelExtLockTag
+{
+ Oid dbid; /* InvalidOid for a shared relation */
+ Oid relid;
+} RelExtLockTag;
+
+typedef struct RelExtLock
+{
+ pg_atomic_uint32 state;
+ ConditionVariable cv;
+} RelExtLock;
+
+/*
+ * Backend-private state for relation extension locks. "relid" is the last
+ * relation whose RelExtLock we looked up, and "lock" is a pointer to the
+ * RelExtLock to which it mapped. This speeds up the fairly common case where
+ * we acquire the same relation extension lock repeatedly. nLocks is the
+ * number of times we've acquired that lock; 0 means we don't hold it, while
+ * any value >0 means we do.
+ *
+ * A backend can't hold more than one relation extension lock at the same
+ * time, although it can hold the same lock more than once. Sometimes we try
+ * to acquire a lock for additional forks while already holding the lock for
+ * the main fork; for example, this might happen when adding extra relation
+ * blocks for both a relation and its free space map. But since this lock
+ * manager doesn't distinguish between the forks, we just increment nLocks
+ * in that case.
+ */
+typedef struct relextlock_handle
+{
+ Oid relid;
+ RelExtLock *lock;
+ int nLocks; /* > 0 means holding it */
+ bool waiting; /* true if we're waiting for it */
+} relextlock_handle;
+
+static relextlock_handle held_relextlock;
+static RelExtLock *RelExtLockArray;
+
+static bool RelExtLockAcquire(Oid relid, bool conditional);
+static bool RelExtLockAttemptLock(RelExtLock *relextlock);
+static void RelExtLockRelease(void);
+static inline uint32 RelExtLockTargetTagToIndex(RelExtLockTag *locktag);
+
+/*
+ * Estimate space required for a fixed-size array of RelExtLock structures.
+ */
+Size
+RelExtLockShmemSize(void)
+{
+ return mul_size(N_RELEXTLOCK_ENTS, sizeof(RelExtLock));
+}
+
+/*
+ * Initialize extension lock manager.
+ */
+void
+InitRelExtLocks(void)
+{
+ bool found;
+ int i;
+
+ /* Verify that we have enough bits for maximum possible waiter count. */
+ StaticAssertStmt(RELEXT_WAIT_COUNT_MASK >= MAX_BACKENDS,
+ "maximum waiter count of relation extension lock exceeds MAX_BACKENDS");
+
+ RelExtLockArray = (RelExtLock *)
+ ShmemInitStruct("Relation Extension Lock",
+ N_RELEXTLOCK_ENTS * sizeof(RelExtLock),
+ &found);
+
+ /* we're the first - initialize */
+ if (!found)
+ {
+ for (i = 0; i < N_RELEXTLOCK_ENTS; i++)
+ {
+ RelExtLock *relextlock = &RelExtLockArray[i];
+
+ pg_atomic_init_u32(&(relextlock->state), 0);
+ ConditionVariableInit(&(relextlock->cv));
+ }
+ }
+}
+
+/*
+ * This lock is used to interlock addition of pages to relations.
+ * We need such locking because bufmgr/smgr definition of P_NEW is not
+ * race-condition-proof.
+ *
+ * We assume the caller is already holding some type of regular lock on
+ * the relation, so no AcceptInvalidationMessages call is needed here.
+ */
+void
+LockRelationForExtension(Relation relation)
+{
+ RelExtLockAcquire(RelationGetRelid(relation), false);
+}
+
+/*
+ * As above, but only lock if we can get the lock without blocking.
+ * Returns TRUE iff the lock was acquired.
+ */
+bool
+ConditionalLockRelationForExtension(Relation relation)
+{
+ return RelExtLockAcquire(RelationGetRelid(relation), true);
+}
+
+/*
+ * Estimate the number of processes waiting for the given relation extension
+ * lock. Note that since multiple relations can hash to the same RelExtLock
+ * the return value might be inflated.
+ */
+int
+EstimateNumberOfExtensionLockWaiters(Relation relation)
+{
+ RelExtLockTag tag;
+ RelExtLock *relextlock;
+ uint32 state;
+ Oid relid = RelationGetRelid(relation);
+
+ /* Make a lock tag */
+ tag.dbid = IsSharedRelation(relid) ? InvalidOid : MyDatabaseId;
+ tag.relid = relid;
+
+ relextlock = &RelExtLockArray[RelExtLockTargetTagToIndex(&tag)];
+ state = pg_atomic_read_u32(&(relextlock->state));
+
+ return (state & RELEXT_WAIT_COUNT_MASK);
+}
+
+/*
+ * Release a previously-acquired extension lock.
+ */
+void
+UnlockRelationForExtension(Relation relation)
+{
+ Oid relid = RelationGetRelid(relation);
+
+ if (held_relextlock.nLocks <= 0 || relid != held_relextlock.relid)
+ {
+ elog(WARNING,
+ "relation extension lock for %u is not held",
+ relid);
+ return;
+ }
+
+ /*
+ * If we acquired it multiple times, only change shared state when we have
+ * released it as many times as we acquired it.
+ */
+ if (--held_relextlock.nLocks == 0)
+ RelExtLockRelease();
+}
+
+/*
+ * Release any extension lock held, and any wait count for an extension lock.
+ * This is intended to be invoked during error cleanup.
+ */
+void
+RelExtLockCleanup(void)
+{
+ if (held_relextlock.nLocks > 0)
+ {
+ /* Release the lock even if we acquired it multiple times. */
+ held_relextlock.nLocks = 0;
+ RelExtLockRelease();
+ Assert(!held_relextlock.waiting);
+ }
+ else if (held_relextlock.waiting)
+ {
+ /* We were waiting for the lock; release the wait count we held. */
+ held_relextlock.waiting = false;
+ pg_atomic_sub_fetch_u32(&(held_relextlock.lock->state), 1);
+ }
+}
+
+/*
+ * Are we holding any extension lock?
+ */
+bool
+IsAnyRelationExtensionLockHeld(void)
+{
+ return held_relextlock.nLocks > 0;
+}
+
+/*
+ * WaitForRelationExtensionLockToBeFree
+ *
+ * Wait for the relation extension lock on the given relation to
+ * be free without acquiring it.
+ */
+void
+WaitForRelationExtensionLockToBeFree(Relation relation)
+{
+ RelExtLock *relextlock;
+ Oid relid;
+
+ relid = RelationGetRelid(relation);
+
+ if (held_relextlock.nLocks > 0)
+ {
+ /*
+ * If we already hold the lock, nobody else does, so we can return
+ * immediately.
+ */
+ if (relid == held_relextlock.relid)
+ return;
+ elog(ERROR,
+ "can only manipulate one relation extension lock at a time");
+ }
+
+ /*
+ * If the last relation extension lock we touched is the same one for
+ * which we now need to wait, we can use our cached pointer to the lock
+ * instead of recomputing it.
+ */
+ if (relid == held_relextlock.relid)
+ relextlock = held_relextlock.lock;
+ else
+ {
+ RelExtLockTag tag;
+
+ tag.dbid = IsSharedRelation(relid) ? InvalidOid : MyDatabaseId;
+ tag.relid = relid;
+ relextlock = &RelExtLockArray[RelExtLockTargetTagToIndex(&tag)];
+ held_relextlock.relid = relid;
+ held_relextlock.lock = relextlock;
+ }
+
+ for (;;)
+ {
+ uint32 state;
+
+ state = pg_atomic_read_u32(&(relextlock->state));
+
+ /* Break if nobody is holding the lock on this relation */
+ if ((state & RELEXT_LOCK_BIT) == 0)
+ break;
+
+ /* Could not get the lock, prepare to wait */
+ if (!held_relextlock.waiting)
+ {
+ pg_atomic_add_fetch_u32(&(relextlock->state), 1);
+ held_relextlock.waiting = true;
+ }
+
+ /* Sleep until something happens, then recheck */
+ ConditionVariableSleep(&(relextlock->cv),
+ WAIT_EVENT_RELATION_EXTENSION_LOCK);
+ }
+
+ ConditionVariableCancelSleep();
+
+ /* Release any wait count we hold */
+ if (held_relextlock.waiting)
+ {
+ pg_atomic_sub_fetch_u32(&(relextlock->state), 1);
+ held_relextlock.waiting = false;
+ }
+}
+
+/*
+ * Compute the hash code associated with a RelExtLock.
+ */
+static inline uint32
+RelExtLockTargetTagToIndex(RelExtLockTag *locktag)
+{
+ return tag_hash(locktag, sizeof(RelExtLockTag)) % N_RELEXTLOCK_ENTS;
+}
+
+/*
+ * Acquire a relation extension lock.
+ */
+static bool
+RelExtLockAcquire(Oid relid, bool conditional)
+{
+ RelExtLock *relextlock;
+ bool mustwait;
+
+ /*
+ * If we already hold the lock, we can just increase the count locally.
+ * Since we don't do deadlock detection, caller must not try to take a new
+ * relation extension lock while already holding one.
+ */
+ if (held_relextlock.nLocks > 0)
+ {
+ if (relid != held_relextlock.relid)
+ elog(ERROR,
+ "can only acquire one relation extension lock at a time");
+
+ held_relextlock.nLocks++;
+ return true;
+ }
+
+ /*
+ * If the last relation extension lock we touched is the same one for which we
+ * now need to acquire, we can use our cached pointer to the lock instead
+ * of recomputing it. This is likely to be a common case in practice.
+ */
+ if (relid == held_relextlock.relid)
+ relextlock = held_relextlock.lock;
+ else
+ {
+ RelExtLockTag tag;
+
+ /* Make a lock tag */
+ tag.dbid = IsSharedRelation(relid) ? InvalidOid : MyDatabaseId;
+ tag.relid = relid;
+
+ relextlock = &RelExtLockArray[RelExtLockTargetTagToIndex(&tag)];
+
+ /* Remember the lock we're interested in */
+ held_relextlock.relid = relid;
+ held_relextlock.lock = relextlock;
+ }
+
+ held_relextlock.waiting = false;
+ for (;;)
+ {
+ mustwait = RelExtLockAttemptLock(relextlock);
+
+ if (!mustwait)
+ break; /* got the lock */
+
+ /* Could not get the lock; if locking conditionally, give up */
+ if (conditional)
+ return false;
+
+ /* Could not get the lock, prepare to wait */
+ if (!held_relextlock.waiting)
+ {
+ pg_atomic_add_fetch_u32(&(relextlock->state), 1);
+ held_relextlock.waiting = true;
+ }
+
+ /* Sleep until something happens, then recheck */
+ ConditionVariableSleep(&(relextlock->cv),
+ WAIT_EVENT_RELATION_EXTENSION_LOCK);
+ }
+
+ ConditionVariableCancelSleep();
+
+ /* Release any wait count we hold */
+ if (held_relextlock.waiting)
+ {
+ pg_atomic_sub_fetch_u32(&(relextlock->state), 1);
+ held_relextlock.waiting = false;
+ }
+
+ Assert(!mustwait);
+
+ /* Remember lock held by this backend */
+ held_relextlock.relid = relid;
+ held_relextlock.lock = relextlock;
+ held_relextlock.nLocks = 1;
+
+ /* We got the lock! */
+ return true;
+}
+
+/*
+ * Attempt to atomically acquire the relation extension lock.
+ *
+ * Returns true if the lock isn't free and we need to wait.
+ */
+static bool
+RelExtLockAttemptLock(RelExtLock *relextlock)
+{
+ uint32 oldstate;
+
+ oldstate = pg_atomic_read_u32(&relextlock->state);
+
+ while (true)
+ {
+ bool lock_free;
+
+ lock_free = (oldstate & RELEXT_LOCK_BIT) == 0;
+
+ if (!lock_free)
+ return true;
+
+ if (pg_atomic_compare_exchange_u32(&relextlock->state,
+ &oldstate,
+ oldstate | RELEXT_LOCK_BIT))
+ return false;
+ }
+
+ pg_unreachable();
+}
+
+/*
+ * Release extension lock in shared memory. Should be called when our local
+ * lock count drops to 0.
+ */
+static void
+RelExtLockRelease(void)
+{
+ RelExtLock *relextlock;
+ uint32 state;
+ uint32 wait_counts;
+
+ Assert(held_relextlock.nLocks == 0);
+
+ relextlock = held_relextlock.lock;
+
+ /* Release the lock */
+ state = pg_atomic_sub_fetch_u32(&(relextlock->state), RELEXT_LOCK_BIT);
+
+ /* If there may be waiters, wake them up */
+ wait_counts = state & RELEXT_WAIT_COUNT_MASK;
+
+ if (wait_counts > 0)
+ ConditionVariableBroadcast(&(relextlock->cv));
+}
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 7b2dcb6..712511c 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -319,78 +319,6 @@ UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode)
}
/*
- * LockRelationForExtension
- *
- * This lock tag is used to interlock addition of pages to relations.
- * We need such locking because bufmgr/smgr definition of P_NEW is not
- * race-condition-proof.
- *
- * We assume the caller is already holding some type of regular lock on
- * the relation, so no AcceptInvalidationMessages call is needed here.
- */
-void
-LockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- (void) LockAcquire(&tag, lockmode, false, false);
-}
-
-/*
- * ConditionalLockRelationForExtension
- *
- * As above, but only lock if we can get the lock without blocking.
- * Returns true iff the lock was acquired.
- */
-bool
-ConditionalLockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- return (LockAcquire(&tag, lockmode, false, true) != LOCKACQUIRE_NOT_AVAIL);
-}
-
-/*
- * RelationExtensionLockWaiterCount
- *
- * Count the number of processes waiting for the given relation extension lock.
- */
-int
-RelationExtensionLockWaiterCount(Relation relation)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- return LockWaiterCount(&tag);
-}
-
-/*
- * UnlockRelationForExtension
- */
-void
-UnlockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- LockRelease(&tag, lockmode, false);
-}
-
-/*
* LockPage
*
* Obtain a page-level lock. This is currently used by some index access
@@ -987,12 +915,6 @@ DescribeLockTag(StringInfo buf, const LOCKTAG *tag)
tag->locktag_field2,
tag->locktag_field1);
break;
- case LOCKTAG_RELATION_EXTEND:
- appendStringInfo(buf,
- _("extension of relation %u of database %u"),
- tag->locktag_field2,
- tag->locktag_field1);
- break;
case LOCKTAG_PAGE:
appendStringInfo(buf,
_("page %u of relation %u of database %u"),
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index dc3d8d9..ce3ab61 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -40,6 +40,7 @@
#include "miscadmin.h"
#include "pg_trace.h"
#include "pgstat.h"
+#include "storage/extension_lock.h"
#include "storage/proc.h"
#include "storage/procarray.h"
#include "storage/sinvaladt.h"
@@ -717,6 +718,13 @@ LockAcquireExtended(const LOCKTAG *locktag,
int status;
bool log_lock = false;
+ /*
+ * Relation extension locks don't participate in deadlock detection,
+ * so make sure we don't try to acquire a heavyweight lock while
+ * holding one.
+ */
+ Assert(!IsAnyRelationExtensionLockHeld());
+
if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
elog(ERROR, "unrecognized lock method: %d", lockmethodid);
lockMethodTable = LockMethods[lockmethodid];
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 6f30e08..8e4cb5b 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -44,6 +44,7 @@
#include "replication/slot.h"
#include "replication/syncrep.h"
#include "storage/condition_variable.h"
+#include "storage/extension_lock.h"
#include "storage/standby.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
@@ -765,6 +766,8 @@ ProcReleaseLocks(bool isCommit)
return;
/* If waiting, get off wait queue (should only be needed after error) */
LockErrorCleanup();
+ /* Release any relation extension lock or wait counts */
+ RelExtLockCleanup();
/* Release standard locks, including session-level if aborting */
LockReleaseAll(DEFAULT_LOCKMETHOD, !isCommit);
/* Release transaction-level advisory locks */
diff --git a/src/backend/utils/adt/lockfuncs.c b/src/backend/utils/adt/lockfuncs.c
index 66c09a1..b2531f3 100644
--- a/src/backend/utils/adt/lockfuncs.c
+++ b/src/backend/utils/adt/lockfuncs.c
@@ -25,7 +25,6 @@
/* This must match enum LockTagType! */
const char *const LockTagTypeNames[] = {
"relation",
- "extend",
"page",
"tuple",
"transactionid",
@@ -234,7 +233,6 @@ pg_lock_status(PG_FUNCTION_ARGS)
switch ((LockTagType) instance->locktag.locktag_type)
{
case LOCKTAG_RELATION:
- case LOCKTAG_RELATION_EXTEND:
values[1] = ObjectIdGetDatum(instance->locktag.locktag_field1);
values[2] = ObjectIdGetDatum(instance->locktag.locktag_field2);
nulls[3] = true;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index be2f592..b1958e8 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -829,6 +829,7 @@ typedef enum
WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN,
WAIT_EVENT_PROCARRAY_GROUP_UPDATE,
WAIT_EVENT_CLOG_GROUP_UPDATE,
+ WAIT_EVENT_RELATION_EXTENSION_LOCK,
WAIT_EVENT_REPLICATION_ORIGIN_DROP,
WAIT_EVENT_REPLICATION_SLOT_DROP,
WAIT_EVENT_SAFE_SNAPSHOT,
diff --git a/src/include/storage/extension_lock.h b/src/include/storage/extension_lock.h
new file mode 100644
index 0000000..0b26fa5
--- /dev/null
+++ b/src/include/storage/extension_lock.h
@@ -0,0 +1,38 @@
+/*-------------------------------------------------------------------------
+ *
+ * extension_lock.h
+ * Relation extension lock manager
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/extension_lock.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef EXTENSION_LOCK_H
+#define EXTENSION_LOCK_H
+
+#ifdef FRONTEND
+#error "extension_lock.h may not be included from frontend code"
+#endif
+
+#include "port/atomics.h"
+#include "storage/s_lock.h"
+#include "storage/condition_variable.h"
+#include "storage/proclist_types.h"
+
+/* Lock a relation for extension */
+extern Size RelExtLockShmemSize(void);
+extern void InitRelExtLocks(void);
+extern void LockRelationForExtension(Relation relation);
+extern void UnlockRelationForExtension(Relation relation);
+extern bool ConditionalLockRelationForExtension(Relation relation);
+extern int EstimateNumberOfExtensionLockWaiters(Relation relation);
+extern void WaitForRelationExtensionLockToBeFree(Relation relation);
+extern void RelExtLockCleanup(void);
+extern bool IsAnyRelationExtensionLockHeld(void);
+
+#endif /* EXTENSION_LOCK_H */
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index a217de9..2036af1 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -50,13 +50,6 @@ extern bool LockHasWaitersRelation(Relation relation, LOCKMODE lockmode);
extern void LockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
-/* Lock a relation for extension */
-extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
-extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
-extern bool ConditionalLockRelationForExtension(Relation relation,
- LOCKMODE lockmode);
-extern int RelationExtensionLockWaiterCount(Relation relation);
-
/* Lock a page (currently only used within indexes) */
extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
extern bool ConditionalLockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 777da71..74450f7 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -138,8 +138,6 @@ typedef uint16 LOCKMETHODID;
typedef enum LockTagType
{
LOCKTAG_RELATION, /* whole relation */
- /* ID info for a relation is DB OID + REL OID; DB OID = 0 if shared */
- LOCKTAG_RELATION_EXTEND, /* the right to extend a relation */
/* same ID info as RELATION */
LOCKTAG_PAGE, /* one page of a relation */
/* ID info for a page is RELATION info + BlockNumber */
@@ -198,14 +196,6 @@ typedef struct LOCKTAG
(locktag).locktag_type = LOCKTAG_RELATION, \
(locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
-#define SET_LOCKTAG_RELATION_EXTEND(locktag,dboid,reloid) \
- ((locktag).locktag_field1 = (dboid), \
- (locktag).locktag_field2 = (reloid), \
- (locktag).locktag_field3 = 0, \
- (locktag).locktag_field4 = 0, \
- (locktag).locktag_type = LOCKTAG_RELATION_EXTEND, \
- (locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
-
#define SET_LOCKTAG_PAGE(locktag,dboid,reloid,blocknum) \
((locktag).locktag_field1 = (dboid), \
(locktag).locktag_field2 = (reloid), \
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 54850ee..71321c0 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1913,6 +1913,8 @@ RegisNode
RegisteredBgWorker
ReindexObjectType
ReindexStmt
+RelExtLock
+RelExtLockTag
RelFileNode
RelFileNodeBackend
RelIdCacheEnt
@@ -3063,6 +3065,7 @@ registered_buffer
regmatch_t
regoff_t
regproc
+relextlock_handle
relopt_bool
relopt_gen
relopt_int
On 26.04.2018 09:10, Masahiko Sawada wrote:
On Thu, Apr 26, 2018 at 3:30 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Apr 10, 2018 at 9:08 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Never mind. There was a lot of items especially at the last CommitFest.
In terms of moving forward, I'd still like to hear what
Andres has to say about the comments I made on March 1st.

Yeah, agreed.

$ ping -n andres.freund
Request timeout for icmp_seq 0
Request timeout for icmp_seq 1
Request timeout for icmp_seq 2
Request timeout for icmp_seq 3
Request timeout for icmp_seq 4
^C
--- andres.freund ping statistics ---
6 packets transmitted, 0 packets received, 100.0% packet loss

Meanwhile, /messages/by-id/4c171ffe-e3ee-acc5-9066-a40d52bc5ae9@postgrespro.ru
shows that this patch has some benefits for other cases, which is a
point in favor IMHO.

Thank you for sharing. That's good to know.
Andres pointed out the performance degradation due to hash collisions
under concurrent loading. I think the point is that the collisions
happen where users can't see them. Therefore, even if we make
N_RELEXTLOCK_ENTS a configurable parameter, users who don't know about
the hash collisions won't know when they should tune it.

So it's just an idea, but how about adding an SQL-callable function
that returns the estimated number of lock waiters for the given
relation? Since the user knows how many processes are loading into the
relation, if the value returned by the function is greater than the
expected one, the user can detect the hash collisions and start
considering an increase of N_RELEXTLOCK_ENTS.

Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
We at PostgresPro were faced with the relation extension lock
contention problem at two more customers and tried to use this patch
(v13) to address the issue.
Unfortunately, replacing the heavyweight lock with an LWLock-style lock
couldn't completely eliminate the contention; now most backends are
blocked on a condition variable:
0x00007fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6
#0 0x00007fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1 0x00000000007024ee in WaitEventSetWait ()
#2 0x0000000000718fa6 in ConditionVariableSleep ()
#3 0x000000000071954d in RelExtLockAcquire ()
#4 0x00000000004ba99d in RelationGetBufferForTuple ()
#5 0x00000000004b3f18 in heap_insert ()
#6 0x00000000006109c8 in ExecInsert ()
#7 0x0000000000611a49 in ExecModifyTable ()
#8 0x00000000005ef97a in standard_ExecutorRun ()
#9 0x000000000072440a in ProcessQuery ()
#10 0x0000000000724631 in PortalRunMulti ()
#11 0x00000000007250ec in PortalRun ()
#12 0x0000000000721287 in exec_simple_query ()
#13 0x0000000000722532 in PostgresMain ()
#14 0x000000000047a9eb in ServerLoop ()
#15 0x00000000006b9fe9 in PostmasterMain ()
#16 0x000000000047b431 in main ()
Obviously there is nothing surprising here: if a lot of processes try to
acquire the same exclusive lock, then high contention is expected.
I just want to note that this patch is not able to completely
eliminate the problem with a large number of concurrent inserts into
the same table.
The second problem we observed was even more critical: if a backend is
granted the relation extension lock and then hits some error before
releasing the lock, the abort of the current transaction doesn't
release it (unlike a heavyweight lock) and the relation is kept locked.
So the database is effectively stalled and the server has to be
restarted.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi,
On 2018-06-04 16:47:29 +0300, Konstantin Knizhnik wrote:
We in PostgresProc were faced with lock extension contention problem at two
more customers and tried to use this patch (v13) to address this issue.
Unfortunately replacing heavy lock with lwlock couldn't completely eliminate
contention; now most backends are blocked on a condition variable:

0x00007fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6
#0 0x00007fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1 0x00000000007024ee in WaitEventSetWait ()
#2 0x0000000000718fa6 in ConditionVariableSleep ()
#3 0x000000000071954d in RelExtLockAcquire ()
That doesn't necessarily mean that the postgres code is at fault
here. It's entirely possible that the filesystem or storage is the
bottleneck. Could you briefly describe workload & hardware?
The second problem we observed was even more critical: if a backend is granted the
relation extension lock and then got some error before releasing this lock,
then abort of the current transaction doesn't release this lock (unlike
heavy weight lock) and the relation is kept locked.
So database is actually stalled and server has to be restarted.
That obviously needs to be fixed...
Greetings,
Andres Freund
On Mon, Jun 4, 2018 at 10:47 PM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
On 26.04.2018 09:10, Masahiko Sawada wrote:
On Thu, Apr 26, 2018 at 3:30 AM, Robert Haas <robertmhaas@gmail.com>
wrote:On Tue, Apr 10, 2018 at 9:08 PM, Masahiko Sawada <sawada.mshk@gmail.com>
wrote:Never mind. There was a lot of items especially at the last CommitFest.
In terms of moving forward, I'd still like to hear what
Andres has to say about the comments I made on March 1st.Yeah, agreed.
$ ping -n andres.freund Request timeout for icmp_seq 0 Request timeout for icmp_seq 1 Request timeout for icmp_seq 2 Request timeout for icmp_seq 3 Request timeout for icmp_seq 4 ^C --- andres.freund ping statistics --- 6 packets transmitted, 0 packets received, 100.0% packet lossMeanwhile,
/messages/by-id/4c171ffe-e3ee-acc5-9066-a40d52bc5ae9@postgrespro.ru
shows that this patch has some benefits for other cases, which is a
point in favor IMHO.Thank you for sharing. That's good to know.
Andres pointed out the performance degradation due to hash collisions
during concurrent bulk loading. I think the point is that collisions
happen where users can't see them. Therefore, even if we make
N_RELEXTLOCK_ENTS a configurable parameter, users won't know when they
should tune it, because they can't observe the hash collisions.
So it's just an idea, but how about adding an SQL-callable function
that returns the estimated number of lock waiters for a given
relation? Since the user knows how many processes are loading into the
relation, a returned value greater than the expected one would reveal
a hash collision, and the user could start considering an increase of
N_RELEXTLOCK_ENTS.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
We in PostgresPro were faced with the lock extension contention problem at two
more customers and tried to use this patch (v13) to address this issue.
Unfortunately, replacing the heavyweight lock with an lwlock couldn't completely eliminate
the contention; now most of the backends are blocked on a condition variable:
#0 0x00007fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1 0x00000000007024ee in WaitEventSetWait ()
#2 0x0000000000718fa6 in ConditionVariableSleep ()
#3 0x000000000071954d in RelExtLockAcquire ()
#4 0x00000000004ba99d in RelationGetBufferForTuple ()
#5 0x00000000004b3f18 in heap_insert ()
#6 0x00000000006109c8 in ExecInsert ()
#7 0x0000000000611a49 in ExecModifyTable ()
#8 0x00000000005ef97a in standard_ExecutorRun ()
#9 0x000000000072440a in ProcessQuery ()
#10 0x0000000000724631 in PortalRunMulti ()
#11 0x00000000007250ec in PortalRun ()
#12 0x0000000000721287 in exec_simple_query ()
#13 0x0000000000722532 in PostgresMain ()
#14 0x000000000047a9eb in ServerLoop ()
#15 0x00000000006b9fe9 in PostmasterMain ()
#16 0x000000000047b431 in main ()
Obviously there is nothing surprising here: if a lot of processes try to
acquire the same exclusive lock, then high contention is expected.
I just want to note that this patch is not able to completely eliminate
the problem with a large number of concurrent inserts into the same table.
The second problem we observed was even more critical: if a backend is granted
a relation extension lock and then gets some error before releasing this lock,
then abort of the current transaction doesn't release this lock (unlike
a heavyweight lock) and the relation is kept locked.
Thank you for reporting.
Regarding the second problem, I tried to reproduce that bug with the
latest version of the patch (v13) but could not. When a transaction aborts, we
call ResourceOwnerRelease()->ResourceOwnerReleaseInternal()->ProcReleaseLocks()->RelExtLockCleanup()
and clear any relext lock bits we are holding or waiting on. If we
raised an error after adding a relext lock bit but before
incrementing its holding count, the relext lock would remain, but I
couldn't see any code that raises an error between them. Could you please
share the concrete reproduction steps for the database stall,
if possible?
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On 04.06.2018 21:42, Andres Freund wrote:
Hi,
On 2018-06-04 16:47:29 +0300, Konstantin Knizhnik wrote:
We in PostgresPro were faced with the lock extension contention problem at two
more customers and tried to use this patch (v13) to address this issue.
Unfortunately, replacing the heavyweight lock with an lwlock couldn't completely eliminate
the contention; now most of the backends are blocked on a condition variable:
#0 0x00007fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1 0x00000000007024ee in WaitEventSetWait ()
#2 0x0000000000718fa6 in ConditionVariableSleep ()
#3 0x000000000071954d in RelExtLockAcquire ()
That doesn't necessarily mean that the postgres code is at fault
here. It's entirely possible that the filesystem or storage is the
bottleneck. Could you briefly describe workload & hardware?
The workload is a combination of inserts and selects.
It looks like shared locks obtained by selects cause starvation of inserts
trying to get the exclusive relation extension lock.
The problem is fixed by the fair lwlock patch, implemented by Alexander
Korotkov. This patch prevents granting of a shared lock if the wait queue is
not empty.
Maybe we should use this patch or find some other way to prevent
starvation of writers on relation extension locks for such workloads.
The second problem we observed was even more critical: if a backend is granted
a relation extension lock and then gets some error before releasing this lock,
then abort of the current transaction doesn't release this lock (unlike
a heavyweight lock) and the relation is kept locked.
So the database is actually stalled and the server has to be restarted.
That obviously needs to be fixed...
Sorry, looks like the problem is more obscure than I expected.
What we have observed is that all backends are blocked in lwlock (sorry
stack trace is not complete):
#0 0x00007ff5a9c566d6 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7ff3c57b9b38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1 do_futex_wait (sem=sem@entry=0x7ff3c57b9b38, abstime=0x0) at sem_waitcommon.c:111
#2 0x00007ff5a9c567c8 in __new_sem_wait_slow (sem=sem@entry=0x7ff3c57b9b38, abstime=0x0) at sem_waitcommon.c:181
#3 0x00007ff5a9c56839 in __new_sem_wait (sem=sem@entry=0x7ff3c57b9b38) at sem_wait.c:42
#4 0x000056290c901582 in PGSemaphoreLock (sema=0x7ff3c57b9b38) at pg_sema.c:310
#5 0x000056290c97923c in LWLockAcquire (lock=0x7ff3c7038c64, mode=LW_SHARED) at ./build/../src/backend/storage/lmgr/lwlock.c:1233
It happens after an error in a disk write operation. Unfortunately we do not have core files and are not able to reproduce the problem.
All LW locks should be cleared by LWLockReleaseAll but ... for some reason it doesn't happen.
We will continue investigation and try to reproduce the problem.
I will let you know if we find the reason of the problem.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 05.06.2018 07:22, Masahiko Sawada wrote:
On Mon, Jun 4, 2018 at 10:47 PM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
[...]
Thank you for reporting.
Regarding the second problem, I tried to reproduce that bug with the
latest version of the patch (v13) but could not. When a transaction aborts, we
call ResourceOwnerRelease()->ResourceOwnerReleaseInternal()->ProcReleaseLocks()->RelExtLockCleanup()
and clear any relext lock bits we are holding or waiting on. If we
raised an error after adding a relext lock bit but before
incrementing its holding count, the relext lock would remain, but I
couldn't see any code that raises an error between them. Could you please
share the concrete reproduction steps
Sorry, my original guess that LW-locks are not released in case of
transaction abort is not correct.
There really was a situation when all backends were blocked on a relation
extension lock, and it looks like it happened after a disk write error,
but as it happened at the customer's site, we had no time for
investigation and are not able to reproduce this problem locally.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Tue, Jun 5, 2018 at 12:48 PM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
The workload is a combination of inserts and selects.
It looks like shared locks obtained by selects cause starvation of inserts trying to get the exclusive relation extension lock.
The problem is fixed by the fair lwlock patch, implemented by Alexander Korotkov. This patch prevents granting of a shared lock if the wait queue is not empty.
Maybe we should use this patch or find some other way to prevent starvation of writers on relation extension locks for such workloads.
The fair lwlock patch really fixed starvation of exclusive lwlock waiters.
But that starvation doesn't happen on the relation extension lock – selects
don't take a shared relation extension lock. The real issue there was
not the relation extension lock itself, but the time spent inside this
lock. It appears that the buffer replacement happening inside the relation
extension lock is affected by starvation on exclusive buffer mapping
lwlocks and buffer content lwlocks, caused by many concurrent shared
lockers. So, the fair lwlock patch has no direct influence on the relation
extension lock, which is naturally not even an lwlock...
I'll post the fair lwlock patch in a separate thread. It requires detailed
consideration and benchmarking, because there is a risk of regression
on specific workloads.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Tue, Jun 5, 2018 at 6:47 PM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
On 05.06.2018 07:22, Masahiko Sawada wrote:
[...]
Sorry, my original guess that LW-locks are not released in case of
transaction abort is not correct.
There was really situation when all backends were blocked in relation
extension lock and looks like it happens after disk write error,
You're saying that it is not correct that LWLocks are not released, but
that it is correct that all backends were blocked on a relext lock; in
your other mail, though, you say the opposite. Which is correct?
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On 05.06.2018 13:29, Masahiko Sawada wrote:
On Tue, Jun 5, 2018 at 6:47 PM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
[...]
You're saying that it is not correct that LWlock are not released but
it's correct that all backends were blocked in relext lock, but in
other your mail you're saying something opposite. Which is correct?
I am sorry for the confusion. I have not investigated the core files myself and
just shared information received from our engineer.
It looks like this problem may not be related to relation extension locks at all.
Sorry for the false alarm.
On 2018-06-05 13:09:08 +0300, Alexander Korotkov wrote:
On Tue, Jun 5, 2018 at 12:48 PM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
Workload is combination of inserts and selects.
Looks like shared locks obtained by select cause starvation of inserts, trying to get exclusive relation extension lock.
The problem is fixed by fair lwlock patch, implemented by Alexander Korotkov. This patch prevents granting of shared lock if wait queue is not empty.
May be we should use this patch or find some other way to prevent starvation of writers on relation extension locks for such workloads.
Fair lwlock patch really fixed starvation of exclusive lwlock waiters.
But that starvation happens not on relation extension lock – selects
don't get shared relation extension lock. The real issue there was
not relation extension lock itself, but the time spent inside this
lock.
Yea, that makes a lot more sense to me.
It appears that buffer replacement happening inside relation
extension lock is affected by starvation on exclusive buffer mapping
lwlocks and buffer content lwlocks, caused by many concurrent shared
lockers. So, fair lwlock patch have no direct influence to relation
extension lock, which is naturally not even lwlock...
Yea, that makes sense. I wonder how much the fix here is to "pre-clear"
a victim buffer, and how much is a saner buffer replacement
implementation (either by going away from O(NBuffers), or by having a
queue of clean victim buffers like my bgwriter replacement).
I'll post the fair lwlock patch in a separate thread. It requires detailed
consideration and benchmarking, because there is a risk of regression
on specific workloads.
I bet that doing it naively will regress massively in a number of cases.
Greetings,
Andres Freund
On Tue, Jun 5, 2018 at 4:02 PM Andres Freund <andres@anarazel.de> wrote:
On 2018-06-05 13:09:08 +0300, Alexander Korotkov wrote:
It appears that buffer replacement happening inside relation
extension lock is affected by starvation on exclusive buffer mapping
lwlocks and buffer content lwlocks, caused by many concurrent shared
lockers. So, fair lwlock patch have no direct influence to relation
extension lock, which is naturally not even lwlock...
Yea, that makes sense. I wonder how much the fix here is to "pre-clear"
a victim buffer, and how much is a saner buffer replacement
implementation (either by going away from O(NBuffers), or by having a
queue of clean victim buffers like my bgwriter replacement).
The particular thing I observed in our environment is BufferAlloc()
waiting hours on a buffer partition lock. Increasing NUM_BUFFER_PARTITIONS
didn't give any significant help. It appears that a very hot page (the root page of
some frequently used index) resided on that partition, so the partition was
continuously under shared lock. So, in order to resolve this without changing
LWLock, we probably should move our buffer hash table to something
lockless.
I'll post the fair lwlock patch in a separate thread. It requires detailed
consideration and benchmarking, because there is a risk of regression
on specific workloads.
I bet that doing it naively will regress massively in a number of cases.
Yes, I suspect the same. However, I tend to think that something is wrong
with LWLock itself. It seems to be the only one of our locks which lets
some lockers starve almost indefinitely under certain workloads. In contrast,
even our SpinLock gives all the waiting processes nearly the same chance to
acquire it. So, I think the idea of improving LWLock in this aspect deserves
at least further investigation.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Tue, Jun 5, 2018 at 7:35 PM, Alexander Korotkov <
a.korotkov@postgrespro.ru> wrote:
On Tue, Jun 5, 2018 at 4:02 PM Andres Freund <andres@anarazel.de> wrote:
On 2018-06-05 13:09:08 +0300, Alexander Korotkov wrote:
It appears that buffer replacement happening inside relation
extension lock is affected by starvation on exclusive buffer mapping
lwlocks and buffer content lwlocks, caused by many concurrent shared
lockers. So, fair lwlock patch have no direct influence to relation
extension lock, which is naturally not even lwlock...
Yea, that makes sense. I wonder how much the fix here is to "pre-clear"
a victim buffer, and how much is a saner buffer replacement
implementation (either by going away from O(NBuffers), or by having a
queue of clean victim buffers like my bgwriter replacement).
The particular thing I observed on our environment is BufferAlloc()
waiting hours on buffer partition lock. Increasing NUM_BUFFER_PARTITIONS
didn't give any significant help. It appears that very hot page (root
page of
some frequently used index) reside on that partition, so this partition was
continuously under shared lock. So, in order to resolve without changing
LWLock, we probably should move our buffers hash table to something
lockless.
I think Robert's chash stuff [1] might be helpful to reduce the contention
you are seeing.
[1]: /messages/by-id/CA+TgmoYE4t-Pt+v08kMO5u_XN-HNKBWtfMgcUXEGBrQiVgdV9Q@mail.gmail.com
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Apr 27, 2018 at 4:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Apr 26, 2018 at 3:10 PM, Andres Freund <andres@anarazel.de> wrote:
I think the real question is whether the scenario is common enough to
worry about. In practice, you'd have to be extremely unlucky to be
doing many bulk loads at the same time that all happened to hash to
the same bucket.
With a bunch of parallel bulkloads into partitioned tables that really
doesn't seem that unlikely?
It increases the likelihood of collisions, but probably decreases the
number of cases where the contention gets really bad.
For example, suppose each table has 100 partitions and you are
bulk-loading 10 of them at a time. It's virtually certain that you
will have some collisions, but the amount of contention within each
bucket will remain fairly low because each backend spends only 1% of
its time in the bucket corresponding to any given partition.
I share another result of a performance evaluation comparing current HEAD
and current HEAD with the v13 patch (N_RELEXTLOCK_ENTS = 1024).
Type of table: normal table, unlogged table
Number of child tables : 16, 64 (all tables are located on the same tablespace)
Number of clients : 32
Number of trials : 100
Duration: 180 seconds for each trial
The hardware spec of the server is Intel Xeon 2.4GHz (HT 160 cores), 256GB
RAM, NVMe SSD 1.5TB.
Each client loads 10kB of random data across all partitioned tables.
Here is the result.
childs | type | target | avg_tps | diff with HEAD
--------+----------+---------+------------+------------------
16 | normal | HEAD | 1643.833 |
16 | normal | Patched | 1619.5404 | 0.985222
16 | unlogged | HEAD | 9069.3543 |
16 | unlogged | Patched | 9368.0263 | 1.032932
64 | normal | HEAD | 1598.698 |
64 | normal | Patched | 1587.5906 | 0.993052
64 | unlogged | HEAD | 9629.7315 |
64 | unlogged | Patched | 10208.2196 | 1.060073
(8 rows)
For normal tables, loading tps decreased 1% ~ 2% with this patch
whereas it increased 3% ~ 6% for unlogged tables. There were
collisions at 0 ~ 5 relation extension lock slots between 2 relations
in the 64 child tables case but it didn't seem to affect the tps.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Wed, Jun 06, 2018 at 07:03:47PM +0530, Amit Kapila wrote:
I think Robert's chash stuff [1] might be helpful to reduce the contention
you are seeing.
Latest patch available does not apply, so I moved it to next CF. The
thread has died a bit as well...
--
Michael
On Mon, Oct 1, 2018 at 8:54 AM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Jun 06, 2018 at 07:03:47PM +0530, Amit Kapila wrote:
I think Robert's chash stuff [1] might be helpful to reduce the contention
you are seeing.
Latest patch available does not apply, so I moved it to next CF. The
thread has died a bit as well...
Unfortunately, the patch still needs to be rebased. Could you do this? Are there
any plans for the patch?
On Fri, Nov 30, 2018 at 1:17 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
On Mon, Oct 1, 2018 at 8:54 AM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Jun 06, 2018 at 07:03:47PM +0530, Amit Kapila wrote:
I think Robert's chash stuff [1] might be helpful to reduce the contention
you are seeing.
Latest patch available does not apply, so I moved it to next CF. The
thread has died a bit as well...
Unfortunately, the patch still needs to be rebased. Could you do this? Are there
any plans for the patch?
I have a plan, but it's a future plan. This patch is for the parallel
vacuum patch. As I mentioned in that thread [1], I'm focusing only on
parallel index vacuum, which would not require the relation extension
lock improvements for now. Therefore, I want to withdraw this patch
and reactivate it when we need this enhancement.
So I think we can mark it as 'Returned with feedback'.
[1]: /messages/by-id/CAD21AoDhAutvKbQ37Btf4taMVbQaOaSvOpxpLgu814T1-OqYGg@mail.gmail.com
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Tue, Jun 26, 2018 at 12:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Apr 27, 2018 at 4:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Apr 26, 2018 at 3:10 PM, Andres Freund <andres@anarazel.de> wrote:
I think the real question is whether the scenario is common enough to
worry about. In practice, you'd have to be extremely unlucky to be
doing many bulk loads at the same time that all happened to hash to
the same bucket.With a bunch of parallel bulkloads into partitioned tables that really
doesn't seem that unlikely?It increases the likelihood of collisions, but probably decreases the
number of cases where the contention gets really bad.For example, suppose each table has 100 partitions and you are
bulk-loading 10 of them at a time. It's virtually certain that you
will have some collisions, but the amount of contention within each
bucket will remain fairly low because each backend spends only 1% of
its time in the bucket corresponding to any given partition.

I share another result of a performance evaluation between current HEAD
and current HEAD with the v13 patch (N_RELEXTLOCK_ENTS = 1024).

Type of table: normal table, unlogged table
Number of child tables: 16, 64 (all tables are located on the same tablespace)
Number of clients: 32
Number of trials: 100
Duration: 180 seconds for each trial

The hardware spec of the server is Intel Xeon 2.4GHz (HT 160 cores), 256GB
RAM, NVMe SSD 1.5TB.
Each client loads 10kB of random data across all partitioned tables.

Here is the result.
childs | type | target | avg_tps | diff with HEAD
--------+----------+---------+------------+------------------
16 | normal | HEAD | 1643.833 |
16 | normal | Patched | 1619.5404 | 0.985222
16 | unlogged | HEAD | 9069.3543 |
16 | unlogged | Patched | 9368.0263 | 1.032932
64 | normal | HEAD | 1598.698 |
64 | normal | Patched | 1587.5906 | 0.993052
64 | unlogged | HEAD | 9629.7315 |
64 | unlogged | Patched | 10208.2196 | 1.060073
(8 rows)

For normal tables, loading tps decreased 1% ~ 2% with this patch
whereas it increased 3% ~ 6% for unlogged tables. There were
collisions at 0 ~ 5 relation extension lock slots between 2 relations
in the 64 child tables case but it didn't seem to affect the tps.
AFAIU, this resembles the workload that Andres was worried about. I
think we should run this test once in a different environment, but
assuming this result is correct and repeatable, where do we go with
this patch, especially when we know it improves many workloads [1] as
well? We know that on a pathological case constructed by Mithun [2],
this causes regression as well. I am not sure if the test done by
Mithun really mimics any real-world workload as he has tested by
making N_RELEXTLOCK_ENTS = 1 to hit the worst case.
Sawada-San, if you have a script or data for the test done by you,
then please share it so that others can also try to reproduce it.
[1]: /messages/by-id/4c171ffe-e3ee-acc5-9066-a40d52bc5ae9@postgrespro.ru
[2]: /messages/by-id/CAD__Oug52j=DQMoP2b=VY7wZb0S9wMNu4irXOH3-ZjFkzWZPGg@mail.gmail.com
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Feb 3, 2020 at 8:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
[...]

Sawada-San, if you have a script or data for the test done by you,
then please share it so that others can also try to reproduce it.
Unfortunately the environment I used for performance verification is
no longer available.
I agree to run this test in a different environment. I've attached the
rebased version of the patch. I'm measuring the performance with and
without the patch, and will share the results.
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
v14-0001-Move-relation-extension-locks-out-of-heavyweigth.patch
From aa9648afcc0131358b18f0b111515e77af8dfe92 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 5 Feb 2020 11:47:30 +0900
Subject: [PATCH v14] Move relation extension locks out of heavyweight lock.
---
doc/src/sgml/monitoring.sgml | 16 +-
src/backend/access/brin/brin_pageops.c | 10 +-
src/backend/access/brin/brin_revmap.c | 7 +-
src/backend/access/gin/ginutil.c | 5 +-
src/backend/access/gin/ginvacuum.c | 9 +-
src/backend/access/gist/gistutil.c | 5 +-
src/backend/access/gist/gistvacuum.c | 5 +-
src/backend/access/heap/hio.c | 13 +-
src/backend/access/heap/vacuumlazy.c | 1 +
src/backend/access/heap/visibilitymap.c | 5 +-
src/backend/access/nbtree/nbtpage.c | 5 +-
src/backend/access/nbtree/nbtree.c | 5 +-
src/backend/access/spgist/spgutils.c | 5 +-
src/backend/access/spgist/spgvacuum.c | 5 +-
src/backend/postmaster/pgstat.c | 3 +
src/backend/storage/freespace/freespace.c | 5 +-
src/backend/storage/ipc/ipci.c | 7 +
src/backend/storage/lmgr/Makefile | 1 +
src/backend/storage/lmgr/README | 16 +-
src/backend/storage/lmgr/extension_lock.c | 469 ++++++++++++++++++++++
src/backend/storage/lmgr/lmgr.c | 78 ----
src/backend/storage/lmgr/lock.c | 8 +
src/backend/storage/lmgr/proc.c | 3 +
src/backend/utils/adt/lockfuncs.c | 2 -
src/include/pgstat.h | 1 +
src/include/storage/extension_lock.h | 38 ++
src/include/storage/lmgr.h | 7 -
src/include/storage/lock.h | 10 -
src/tools/pgindent/typedefs.list | 1 +
29 files changed, 600 insertions(+), 145 deletions(-)
create mode 100644 src/backend/storage/lmgr/extension_lock.c
create mode 100644 src/include/storage/extension_lock.h
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 8839699079..13283e6d67 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -714,8 +714,8 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
Heavyweight locks, also known as lock manager locks or simply locks,
primarily protect SQL-visible objects such as tables. However,
they are also used to ensure mutual exclusion for certain internal
- operations such as relation extension. <literal>wait_event</literal> will
- identify the type of lock awaited.
+ operations such as waiting for a transaction to finish.
+ <literal>wait_event</literal> will identify the type of lock awaited.
</para>
</listitem>
<listitem>
@@ -1178,14 +1178,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
counters during Parallel Hash plan execution.</entry>
</row>
<row>
- <entry morerows="9"><literal>Lock</literal></entry>
+ <entry morerows="8"><literal>Lock</literal></entry>
<entry><literal>relation</literal></entry>
<entry>Waiting to acquire a lock on a relation.</entry>
</row>
- <row>
- <entry><literal>extend</literal></entry>
- <entry>Waiting to extend a relation.</entry>
- </row>
<row>
<entry><literal>page</literal></entry>
<entry>Waiting to acquire a lock on page of a relation.</entry>
@@ -1319,7 +1315,7 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting in an extension.</entry>
</row>
<row>
- <entry morerows="36"><literal>IPC</literal></entry>
+ <entry morerows="37"><literal>IPC</literal></entry>
<entry><literal>BgWorkerShutdown</literal></entry>
<entry>Waiting for background worker to shut down.</entry>
</row>
@@ -1451,6 +1447,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry><literal>Promote</literal></entry>
<entry>Waiting for standby promotion.</entry>
</row>
+ <row>
+ <entry><literal>RelationExtensionLock</literal></entry>
+ <entry>Waiting to extend a relation.</entry>
+ </row>
<row>
<entry><literal>ReplicationOriginDrop</literal></entry>
<entry>Waiting for a replication origin to become inactive to be dropped.</entry>
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 87de0b855b..f62c9e2fe3 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -17,6 +17,7 @@
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/freespace.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
@@ -633,8 +634,7 @@ brin_page_cleanup(Relation idxrel, Buffer buf)
*/
if (PageIsNew(page))
{
- LockRelationForExtension(idxrel, ShareLock);
- UnlockRelationForExtension(idxrel, ShareLock);
+ WaitForRelationExtensionLockToBeFree(idxrel);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
if (PageIsNew(page))
@@ -728,7 +728,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
*/
if (!RELATION_IS_LOCAL(irel))
{
- LockRelationForExtension(irel, ExclusiveLock);
+ LockRelationForExtension(irel);
extensionLockHeld = true;
}
buf = ReadBuffer(irel, P_NEW);
@@ -776,7 +776,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
brin_initialize_empty_new_buffer(irel, buf);
if (extensionLockHeld)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel);
ReleaseBuffer(buf);
@@ -794,7 +794,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
if (extensionLockHeld)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel);
page = BufferGetPage(buf);
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index 9c4b3e2202..75dd1b805d 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -29,6 +29,7 @@
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/lmgr.h"
#include "utils/rel.h"
@@ -571,7 +572,7 @@ revmap_physical_extend(BrinRevmap *revmap)
else
{
if (needLock)
- LockRelationForExtension(irel, ExclusiveLock);
+ LockRelationForExtension(irel);
buf = ReadBuffer(irel, P_NEW);
if (BufferGetBlockNumber(buf) != mapBlk)
@@ -583,7 +584,7 @@ revmap_physical_extend(BrinRevmap *revmap)
* page from under whoever is using it.
*/
if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel);
LockBuffer(revmap->rm_metaBuf, BUFFER_LOCK_UNLOCK);
ReleaseBuffer(buf);
return;
@@ -592,7 +593,7 @@ revmap_physical_extend(BrinRevmap *revmap)
page = BufferGetPage(buf);
if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
+ UnlockRelationForExtension(irel);
}
/* Check that it's a regular block (or an empty page) */
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index a7e55caf28..59f1323e1a 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -22,6 +22,7 @@
#include "catalog/pg_type.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
@@ -327,13 +328,13 @@ GinNewBuffer(Relation index)
/* Must extend the file */
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
buffer = ReadBuffer(index, P_NEW);
LockBuffer(buffer, GIN_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
return buffer;
}
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 260cedff88..64e82434c4 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -20,6 +20,7 @@
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "postmaster/autovacuum.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
@@ -733,10 +734,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
npages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
totFreePages = 0;
@@ -783,10 +784,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
stats->pages_free = totFreePages;
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
stats->num_pages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
return stats;
}
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index dd975b164c..c9c6854323 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -19,6 +19,7 @@
#include "access/htup_details.h"
#include "access/reloptions.h"
#include "catalog/pg_opclass.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/float.h"
@@ -866,13 +867,13 @@ gistNewBuffer(Relation r)
needLock = !RELATION_IS_LOCAL(r);
if (needLock)
- LockRelationForExtension(r, ExclusiveLock);
+ LockRelationForExtension(r);
buffer = ReadBuffer(r, P_NEW);
LockBuffer(buffer, GIST_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(r, ExclusiveLock);
+ UnlockRelationForExtension(r);
return buffer;
}
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index a9c616c772..34e8759632 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -20,6 +20,7 @@
#include "commands/vacuum.h"
#include "lib/integerset.h"
#include "miscadmin.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/memutils.h"
@@ -195,10 +196,10 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
{
/* Get the current relation length */
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
num_pages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
/* Quit if we've scanned the whole relation */
if (blkno >= num_pages)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index aa3f14c019..5df754aa21 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -20,6 +20,7 @@
#include "access/htup_details.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/freespace.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
@@ -191,7 +192,7 @@ RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
int lockWaiters;
/* Use the length of the lock wait queue to judge how much to extend. */
- lockWaiters = RelationExtensionLockWaiterCount(relation);
+ lockWaiters = EstimateNumberOfExtensionLockWaiters(relation);
if (lockWaiters <= 0)
return;
@@ -555,11 +556,11 @@ loop:
if (needLock)
{
if (!use_fsm)
- LockRelationForExtension(relation, ExclusiveLock);
- else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock))
+ LockRelationForExtension(relation);
+ else if (!ConditionalLockRelationForExtension(relation))
{
/* Couldn't get the lock immediately; wait for it. */
- LockRelationForExtension(relation, ExclusiveLock);
+ LockRelationForExtension(relation);
/*
* Check if some other backend has extended a block for us while
@@ -573,7 +574,7 @@ loop:
*/
if (targetBlock != InvalidBlockNumber)
{
- UnlockRelationForExtension(relation, ExclusiveLock);
+ UnlockRelationForExtension(relation);
goto loop;
}
@@ -613,7 +614,7 @@ loop:
* the relation some more.
*/
if (needLock)
- UnlockRelationForExtension(relation, ExclusiveLock);
+ UnlockRelationForExtension(relation);
/*
* Lock the other buffer. It's guaranteed to be of a lower page number
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8ce501151e..d448c448aa 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -71,6 +71,7 @@
#include "portability/instr_time.h"
#include "postmaster/autovacuum.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/freespace.h"
#include "storage/lmgr.h"
#include "tcop/tcopprot.h"
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 0a51678c40..e4389b6961 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -92,6 +92,7 @@
#include "miscadmin.h"
#include "port/pg_bitutils.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
#include "utils/inval.h"
@@ -632,7 +633,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
* Note that another backend might have extended or created the relation
* by the time we get the lock.
*/
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
@@ -670,5 +671,5 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
/* Update local cache with the up-to-date size */
rel->rd_smgr->smgr_vm_nblocks = vm_nblocks_now;
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index f05cbe7467..defa9ad163 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -28,6 +28,7 @@
#include "access/xlog.h"
#include "access/xloginsert.h"
#include "miscadmin.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
@@ -850,7 +851,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
needLock = !RELATION_IS_LOCAL(rel);
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
buf = ReadBuffer(rel, P_NEW);
@@ -864,7 +865,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
* condition against btvacuumscan --- see comments therein.
*/
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
/* Initialize the new page before returning it */
page = BufferGetPage(buf);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 5254bc7ef5..ecf3ecef64 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -29,6 +29,7 @@
#include "pgstat.h"
#include "postmaster/autovacuum.h"
#include "storage/condition_variable.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
@@ -1017,10 +1018,10 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
{
/* Get the current relation length */
if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
num_pages = RelationGetNumberOfBlocks(rel);
if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
if (info->report_progress)
pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 4924ae1c59..a57845d615 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -23,6 +23,7 @@
#include "access/xact.h"
#include "catalog/pg_amop.h"
#include "commands/vacuum.h"
+#include "storage/extension_lock.h"
#include "storage/bufmgr.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
@@ -258,13 +259,13 @@ SpGistNewBuffer(Relation index)
/* Must extend the file */
needLock = !RELATION_IS_LOCAL(index);
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
buffer = ReadBuffer(index, P_NEW);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
return buffer;
}
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index bd98707f3c..f6c1156bf9 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -24,6 +24,7 @@
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/extension_lock.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/snapmgr.h"
@@ -824,10 +825,10 @@ spgvacuumscan(spgBulkDeleteState *bds)
{
/* Get the current relation length */
if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
+ LockRelationForExtension(index);
num_pages = RelationGetNumberOfBlocks(index);
if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ UnlockRelationForExtension(index);
/* Quit if we've scanned the whole relation */
if (blkno >= num_pages)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7169509a79..81acdf3a26 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3836,6 +3836,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_PROMOTE:
event_name = "Promote";
break;
+ case WAIT_EVENT_RELATION_EXTENSION_LOCK:
+ event_name = "RelationExtensionLock";
+ break;
case WAIT_EVENT_REPLICATION_ORIGIN_DROP:
event_name = "ReplicationOriginDrop";
break;
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index a5083db02b..6e363adecc 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -26,6 +26,7 @@
#include "access/htup_details.h"
#include "access/xlogutils.h"
#include "miscadmin.h"
+#include "storage/extension_lock.h"
#include "storage/freespace.h"
#include "storage/fsm_internals.h"
#include "storage/lmgr.h"
@@ -613,7 +614,7 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
* Note that another backend might have extended or created the relation
* by the time we get the lock.
*/
- LockRelationForExtension(rel, ExclusiveLock);
+ LockRelationForExtension(rel);
/* Might have to re-open if a cache flush happened */
RelationOpenSmgr(rel);
@@ -641,7 +642,7 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
/* Update local cache with the up-to-date size */
rel->rd_smgr->smgr_fsm_nblocks = fsm_nblocks_now;
- UnlockRelationForExtension(rel, ExclusiveLock);
+ UnlockRelationForExtension(rel);
}
/*
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d59cd..6b76301af9 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -35,6 +35,7 @@
#include "replication/walsender.h"
#include "storage/bufmgr.h"
#include "storage/dsm.h"
+#include "storage/extension_lock.h"
#include "storage/ipc.h"
#include "storage/pg_shmem.h"
#include "storage/pmsignal.h"
@@ -131,6 +132,7 @@ CreateSharedMemoryAndSemaphores(void)
size = add_size(size, BackgroundWorkerShmemSize());
size = add_size(size, MultiXactShmemSize());
size = add_size(size, LWLockShmemSize());
+ size = add_size(size, RelExtLockShmemSize());
size = add_size(size, ProcArrayShmemSize());
size = add_size(size, BackendStatusShmemSize());
size = add_size(size, SInvalShmemSize());
@@ -228,6 +230,11 @@ CreateSharedMemoryAndSemaphores(void)
*/
InitPredicateLocks();
+ /*
+ * Set up relation extension lock manager
+ */
+ InitRelExtLocks();
+
/*
* Set up process table
*/
diff --git a/src/backend/storage/lmgr/Makefile b/src/backend/storage/lmgr/Makefile
index 829b792fcb..1aed82703c 100644
--- a/src/backend/storage/lmgr/Makefile
+++ b/src/backend/storage/lmgr/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
condition_variable.o \
deadlock.o \
+ extension_lock.o \
lmgr.o \
lock.o \
lwlock.o \
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 56b0a12a3e..960d1f3f09 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -3,7 +3,7 @@ src/backend/storage/lmgr/README
Locking Overview
================
-Postgres uses four types of interprocess locks:
+Postgres uses five types of interprocess locks:
* Spinlocks. These are intended for *very* short-term locks. If a lock
is to be held more than a few dozen instructions, or across any sort of
@@ -36,13 +36,21 @@ Regular locks should be used for all user-driven lock requests.
* SIReadLock predicate locks. See separate README-SSI file for details.
+* Relation extension locks. Only one process can extend a relation at
+a time; we use a specialized lock manager for this purpose, which is
+much simpler than the regular lock manager. It is similar to the
+lightweight lock mechanism, but is even simpler because there is only
+one lock mode and only one lock can be taken at a time. A process holding
+a relation extension lock is interruptible, unlike a process holding an
+LWLock.
+
Acquisition of either a spinlock or a lightweight lock causes query
cancel and die() interrupts to be held off until all such locks are
released. No such restriction exists for regular locks, however. Also
note that we can accept query cancel and die() interrupts while waiting
-for a regular lock, but we will not accept them while waiting for
-spinlocks or LW locks. It is therefore not a good idea to use LW locks
-when the wait time might exceed a few seconds.
+for a relation extension lock or a regular lock, but we will not accept
+them while waiting for spinlocks or LW locks. It is therefore not a good
+idea to use LW locks when the wait time might exceed a few seconds.
The rest of this README file discusses the regular lock manager in detail.
diff --git a/src/backend/storage/lmgr/extension_lock.c b/src/backend/storage/lmgr/extension_lock.c
new file mode 100644
index 0000000000..5a3bf5ea59
--- /dev/null
+++ b/src/backend/storage/lmgr/extension_lock.c
@@ -0,0 +1,469 @@
+/*-------------------------------------------------------------------------
+ *
+ * extension_lock.c
+ * Relation extension lock manager
+ *
+ * This specialized lock manager is used only for relation extension
+ * locks. Unlike the heavyweight lock manager, it doesn't provide
+ * deadlock detection or group locking. Unlike lwlock.c, extension lock
+ * waits are interruptible. Unlike both systems, there is only one lock
+ * mode.
+ *
+ * False sharing is possible. We have a fixed-size array of locks, and
+ * every database OID/relation OID combination is mapped to a slot in
+ * the array. Therefore, if two processes try to extend relations that
+ * map to the same array slot, they will contend even though it would
+ * be OK to let both proceed at once. Since these locks are typically
+ * taken only for very short periods of time, this doesn't seem likely
+ * to be a big problem in practice. If it is, we could make the array
+ * bigger.
+ *
+ * The extension lock manager is much faster than the regular heavyweight
+ * lock manager. The lack of group locking is a feature, not a bug,
+ * because while cooperating backends can all (for example) access a
+ * relation on which they jointly hold AccessExclusiveLock at the same time,
+ * it's not safe for them to extend the relation at the same time.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/lmgr/extension_lock.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "pgstat.h"
+
+#include "catalog/catalog.h"
+#include "postmaster/postmaster.h"
+#include "storage/extension_lock.h"
+#include "utils/rel.h"
+
+#define N_RELEXTLOCK_ENTS 1024
+
+/*
+ * We can't use bit 31 as the lock bit because pg_atomic_sub_fetch_u32 can't
+ * handle an attempt to subtract INT_MIN.
+ */
+#define RELEXT_LOCK_BIT ((uint32) 1 << 30)
+#define RELEXT_WAIT_COUNT_MASK (RELEXT_LOCK_BIT - 1)
+
+typedef struct RelExtLockTag
+{
+ Oid dbid; /* InvalidOid for a shared relation */
+ Oid relid;
+} RelExtLockTag;
+
+typedef struct RelExtLock
+{
+ pg_atomic_uint32 state;
+ ConditionVariable cv;
+} RelExtLock;
+
+/*
+ * Backend-private state for relation extension locks. "relid" is the last
+ * relation whose RelExtLock we looked up, and "lock" is a pointer to the
+ * RelExtLock to which it mapped. This speeds up the fairly common case where
+ * we acquire the same relation extension lock repeatedly. nLocks is the
+ * number of times we've acquired that lock; 0 means we don't hold it, while
+ * any value >0 means we do.
+ *
+ * A backend can't hold more than one relation extension lock at the same
+ * time, although it can hold the same lock more than once. Sometimes we try
+ * to acquire a lock for additional forks while already holding the lock for
+ * the main fork; for example, this might happen when adding extra relation
+ * blocks for both a relation and its free space map. But since this lock
+ * manager doesn't distinguish between the forks, we just increment nLocks in
+ * that case.
+ */
+typedef struct relextlock_handle
+{
+ Oid relid;
+ RelExtLock *lock;
+ int nLocks; /* > 0 means holding it */
+ bool waiting; /* true if we're waiting for it */
+} relextlock_handle;
+
+static relextlock_handle held_relextlock;
+static RelExtLock *RelExtLockArray;
+
+static bool RelExtLockAcquire(Oid relid, bool conditional);
+static bool RelExtLockAttemptLock(RelExtLock *relextlock);
+static void RelExtLockRelease(void);
+static inline uint32 RelExtLockTargetTagToIndex(RelExtLockTag *locktag);
+
+/*
+ * Estimate space required for a fixed-size array of RelExtLock structures.
+ */
+Size
+RelExtLockShmemSize(void)
+{
+ return mul_size(N_RELEXTLOCK_ENTS, sizeof(RelExtLock));
+}
+
+/*
+ * Initialize extension lock manager.
+ */
+void
+InitRelExtLocks(void)
+{
+ bool found;
+ int i;
+
+ /* Verify that we have enough bits for maximum possible waiter count. */
+ StaticAssertStmt(RELEXT_WAIT_COUNT_MASK >= MAX_BACKENDS,
+ "maximum waiter count of relation extension lock exceeds MAX_BACKENDS");
+
+ RelExtLockArray = (RelExtLock *)
+ ShmemInitStruct("Relation Extension Lock",
+ N_RELEXTLOCK_ENTS * sizeof(RelExtLock),
+ &found);
+
+ /* we're the first - initialize */
+ if (!found)
+ {
+ for (i = 0; i < N_RELEXTLOCK_ENTS; i++)
+ {
+ RelExtLock *relextlock = &RelExtLockArray[i];
+
+ pg_atomic_init_u32(&(relextlock->state), 0);
+ ConditionVariableInit(&(relextlock->cv));
+ }
+ }
+}
+
+/*
+ * This lock is used to interlock addition of pages to relations.
+ * We need such locking because bufmgr/smgr definition of P_NEW is not
+ * race-condition-proof.
+ *
+ * We assume the caller is already holding some type of regular lock on
+ * the relation, so no AcceptInvalidationMessages call is needed here.
+ */
+void
+LockRelationForExtension(Relation relation)
+{
+ RelExtLockAcquire(RelationGetRelid(relation), false);
+}
+
+/*
+ * As above, but only lock if we can get the lock without blocking.
+ * Returns true iff the lock was acquired.
+ */
+bool
+ConditionalLockRelationForExtension(Relation relation)
+{
+ return RelExtLockAcquire(RelationGetRelid(relation), true);
+}
+
+/*
+ * Estimate the number of processes waiting for the given relation extension
+ * lock. Note that since multiple relations can hash to the same RelExtLock
+ * entry, the return value might be inflated.
+ */
+int
+EstimateNumberOfExtensionLockWaiters(Relation relation)
+{
+ RelExtLockTag tag;
+ RelExtLock *relextlock;
+ uint32 state;
+ Oid relid = RelationGetRelid(relation);
+
+ /* Make a lock tag */
+ tag.dbid = IsSharedRelation(relid) ? InvalidOid : MyDatabaseId;
+ tag.relid = relid;
+
+ relextlock = &RelExtLockArray[RelExtLockTargetTagToIndex(&tag)];
+ state = pg_atomic_read_u32(&(relextlock->state));
+
+ return (state & RELEXT_WAIT_COUNT_MASK);
+}
+
+/*
+ * Release a previously-acquired extension lock.
+ */
+void
+UnlockRelationForExtension(Relation relation)
+{
+ Oid relid = RelationGetRelid(relation);
+
+ if (held_relextlock.nLocks <= 0 || relid != held_relextlock.relid)
+ {
+ elog(WARNING,
+ "relation extension lock for %u is not held",
+ relid);
+ return;
+ }
+
+ /*
+ * If we acquired it multiple times, only change shared state when we have
+ * released it as many times as we acquired it.
+ */
+ if (--held_relextlock.nLocks == 0)
+ RelExtLockRelease();
+}
+
+/*
+ * Release any extension lock held, and any wait count for an extension lock.
+ * This is intended to be invoked during error cleanup.
+ */
+void
+RelExtLockCleanup(void)
+{
+ if (held_relextlock.nLocks > 0)
+ {
+ /* Release the lock even if we acquired it multiple times. */
+ held_relextlock.nLocks = 0;
+ RelExtLockRelease();
+ Assert(!held_relextlock.waiting);
+ }
+ else if (held_relextlock.waiting)
+ {
+ /* We were waiting for the lock; release the wait count we held. */
+ held_relextlock.waiting = false;
+ pg_atomic_sub_fetch_u32(&(held_relextlock.lock->state), 1);
+ }
+}
+
+/*
+ * Are we holding any extension lock?
+ */
+bool
+IsAnyRelationExtensionLockHeld(void)
+{
+ return held_relextlock.nLocks > 0;
+}
+
+/*
+ * WaitForRelationExtensionLockToBeFree
+ *
+ * Wait for the relation extension lock on the given relation to
+ * be free without acquiring it.
+ */
+void
+WaitForRelationExtensionLockToBeFree(Relation relation)
+{
+ RelExtLock *relextlock;
+ Oid relid;
+
+ relid = RelationGetRelid(relation);
+
+ if (held_relextlock.nLocks > 0)
+ {
+ /*
+ * If we already hold the lock, nobody else does, so we can return
+ * immediately.
+ */
+ if (relid == held_relextlock.relid)
+ return;
+ elog(ERROR,
+ "can only manipulate one relation extension lock at a time");
+ }
+
+ /*
+ * If the last relation extension lock we touched is the same one for
+ * which we now need to wait, we can use our cached pointer to the lock
+ * instead of recomputing it.
+ */
+ if (relid == held_relextlock.relid)
+ relextlock = held_relextlock.lock;
+ else
+ {
+ RelExtLockTag tag;
+
+ tag.dbid = IsSharedRelation(relid) ? InvalidOid : MyDatabaseId;
+ tag.relid = relid;
+ relextlock = &RelExtLockArray[RelExtLockTargetTagToIndex(&tag)];
+ held_relextlock.relid = relid;
+ held_relextlock.lock = relextlock;
+ }
+
+ for (;;)
+ {
+ uint32 state;
+
+ state = pg_atomic_read_u32(&(relextlock->state));
+
+ /* Break if nobody is holding the lock on this relation */
+ if ((state & RELEXT_LOCK_BIT) == 0)
+ break;
+
+ /* Could not get the lock, prepare to wait */
+ if (!held_relextlock.waiting)
+ {
+ pg_atomic_add_fetch_u32(&(relextlock->state), 1);
+ held_relextlock.waiting = true;
+ }
+
+ /* Sleep until something happens, then recheck */
+ ConditionVariableSleep(&(relextlock->cv),
+ WAIT_EVENT_RELATION_EXTENSION_LOCK);
+ }
+
+ ConditionVariableCancelSleep();
+
+ /* Release any wait count we hold */
+ if (held_relextlock.waiting)
+ {
+ pg_atomic_sub_fetch_u32(&(relextlock->state), 1);
+ held_relextlock.waiting = false;
+ }
+}
+
+/*
+ * Compute the hash code associated with a RelExtLock.
+ */
+static inline uint32
+RelExtLockTargetTagToIndex(RelExtLockTag *locktag)
+{
+ return tag_hash(locktag, sizeof(RelExtLockTag)) % N_RELEXTLOCK_ENTS;
+}
+
+/*
+ * Acquire a relation extension lock.
+ */
+static bool
+RelExtLockAcquire(Oid relid, bool conditional)
+{
+ RelExtLock *relextlock;
+ bool mustwait;
+
+ /*
+ * If we already hold the lock, we can just increase the count locally.
+ * Since we don't do deadlock detection, the caller must not try to take a
+ * new relation extension lock while already holding one.
+ */
+ if (held_relextlock.nLocks > 0)
+ {
+ if (relid != held_relextlock.relid)
+ elog(ERROR,
+ "can only acquire one relation extension lock at a time");
+
+ held_relextlock.nLocks++;
+ return true;
+ }
+
+ /*
+ * If the last relation extension lock we touched is the same one that we
+ * now need to acquire, we can use our cached pointer to the lock instead
+ * of recomputing it. This is likely to be a common case in practice.
+ */
+ if (relid == held_relextlock.relid)
+ relextlock = held_relextlock.lock;
+ else
+ {
+ RelExtLockTag tag;
+
+ /* Make a lock tag */
+ tag.dbid = IsSharedRelation(relid) ? InvalidOid : MyDatabaseId;
+ tag.relid = relid;
+
+ relextlock = &RelExtLockArray[RelExtLockTargetTagToIndex(&tag)];
+
+ /* Remember the lock we're interested in */
+ held_relextlock.relid = relid;
+ held_relextlock.lock = relextlock;
+ }
+
+ held_relextlock.waiting = false;
+ for (;;)
+ {
+ mustwait = RelExtLockAttemptLock(relextlock);
+
+ if (!mustwait)
+ break; /* got the lock */
+
+ /* Could not get the lock; if locking conditionally, give up */
+ if (conditional)
+ return false;
+
+ /* Could not get the lock, prepare to wait */
+ if (!held_relextlock.waiting)
+ {
+ pg_atomic_add_fetch_u32(&(relextlock->state), 1);
+ held_relextlock.waiting = true;
+ }
+
+ /* Sleep until something happens, then recheck */
+ ConditionVariableSleep(&(relextlock->cv),
+ WAIT_EVENT_RELATION_EXTENSION_LOCK);
+ }
+
+ ConditionVariableCancelSleep();
+
+ /* Release any wait count we hold */
+ if (held_relextlock.waiting)
+ {
+ pg_atomic_sub_fetch_u32(&(relextlock->state), 1);
+ held_relextlock.waiting = false;
+ }
+
+ Assert(!mustwait);
+
+ /* Remember lock held by this backend */
+ held_relextlock.relid = relid;
+ held_relextlock.lock = relextlock;
+ held_relextlock.nLocks = 1;
+
+ /* We got the lock! */
+ return true;
+}
+
+/*
+ * Attempt to atomically acquire the relation extension lock.
+ *
+ * Returns true if the lock isn't free and we need to wait.
+ */
+static bool
+RelExtLockAttemptLock(RelExtLock *relextlock)
+{
+ uint32 oldstate;
+
+ oldstate = pg_atomic_read_u32(&relextlock->state);
+
+ while (true)
+ {
+ bool lock_free;
+
+ lock_free = (oldstate & RELEXT_LOCK_BIT) == 0;
+
+ if (!lock_free)
+ return true;
+
+ if (pg_atomic_compare_exchange_u32(&relextlock->state,
+ &oldstate,
+ oldstate | RELEXT_LOCK_BIT))
+ return false;
+ }
+
+ pg_unreachable();
+}
+
+/*
+ * Release extension lock in shared memory. Should be called when our local
+ * lock count drops to 0.
+ */
+static void
+RelExtLockRelease(void)
+{
+ RelExtLock *relextlock;
+ uint32 state;
+ uint32 wait_counts;
+
+ Assert(held_relextlock.nLocks == 0);
+
+ relextlock = held_relextlock.lock;
+
+ /* Release the lock */
+ state = pg_atomic_sub_fetch_u32(&(relextlock->state), RELEXT_LOCK_BIT);
+
+ /* If there may be waiters, wake them up */
+ wait_counts = state & RELEXT_WAIT_COUNT_MASK;
+
+ if (wait_counts > 0)
+ ConditionVariableBroadcast(&(relextlock->cv));
+}
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 2010320095..174e0b051c 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -388,78 +388,6 @@ UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode)
LockRelease(&tag, lockmode, true);
}
-/*
- * LockRelationForExtension
- *
- * This lock tag is used to interlock addition of pages to relations.
- * We need such locking because bufmgr/smgr definition of P_NEW is not
- * race-condition-proof.
- *
- * We assume the caller is already holding some type of regular lock on
- * the relation, so no AcceptInvalidationMessages call is needed here.
- */
-void
-LockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- (void) LockAcquire(&tag, lockmode, false, false);
-}
-
-/*
- * ConditionalLockRelationForExtension
- *
- * As above, but only lock if we can get the lock without blocking.
- * Returns true iff the lock was acquired.
- */
-bool
-ConditionalLockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- return (LockAcquire(&tag, lockmode, false, true) != LOCKACQUIRE_NOT_AVAIL);
-}
-
-/*
- * RelationExtensionLockWaiterCount
- *
- * Count the number of processes waiting for the given relation extension lock.
- */
-int
-RelationExtensionLockWaiterCount(Relation relation)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- return LockWaiterCount(&tag);
-}
-
-/*
- * UnlockRelationForExtension
- */
-void
-UnlockRelationForExtension(Relation relation, LOCKMODE lockmode)
-{
- LOCKTAG tag;
-
- SET_LOCKTAG_RELATION_EXTEND(tag,
- relation->rd_lockInfo.lockRelId.dbId,
- relation->rd_lockInfo.lockRelId.relId);
-
- LockRelease(&tag, lockmode, false);
-}
-
/*
* LockPage
*
@@ -1092,12 +1020,6 @@ DescribeLockTag(StringInfo buf, const LOCKTAG *tag)
tag->locktag_field2,
tag->locktag_field1);
break;
- case LOCKTAG_RELATION_EXTEND:
- appendStringInfo(buf,
- _("extension of relation %u of database %u"),
- tag->locktag_field2,
- tag->locktag_field1);
- break;
case LOCKTAG_PAGE:
appendStringInfo(buf,
_("page %u of relation %u of database %u"),
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 56dba09299..1bd970a907 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -40,6 +40,7 @@
#include "miscadmin.h"
#include "pg_trace.h"
#include "pgstat.h"
+#include "storage/extension_lock.h"
#include "storage/proc.h"
#include "storage/procarray.h"
#include "storage/sinvaladt.h"
@@ -749,6 +750,13 @@ LockAcquireExtended(const LOCKTAG *locktag,
bool found_conflict;
bool log_lock = false;
+ /*
+ * Relation extension locks don't participate in deadlock detection,
+ * so make sure we don't try to acquire a heavyweight lock while
+ * holding one.
+ */
+ Assert(!IsAnyRelationExtensionLockHeld());
+
if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
elog(ERROR, "unrecognized lock method: %d", lockmethodid);
lockMethodTable = LockMethods[lockmethodid];
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 32df8c85a1..883cc0d4cc 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -45,6 +45,7 @@
#include "replication/syncrep.h"
#include "replication/walsender.h"
#include "storage/condition_variable.h"
+#include "storage/extension_lock.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/pmsignal.h"
@@ -790,6 +791,8 @@ ProcReleaseLocks(bool isCommit)
return;
/* If waiting, get off wait queue (should only be needed after error) */
LockErrorCleanup();
+ /* Release any relation extension lock or wait counts */
+ RelExtLockCleanup();
/* Release standard locks, including session-level if aborting */
LockReleaseAll(DEFAULT_LOCKMETHOD, !isCommit);
/* Release transaction-level advisory locks */
diff --git a/src/backend/utils/adt/lockfuncs.c b/src/backend/utils/adt/lockfuncs.c
index 7e47ebeb6f..86d961dc31 100644
--- a/src/backend/utils/adt/lockfuncs.c
+++ b/src/backend/utils/adt/lockfuncs.c
@@ -25,7 +25,6 @@
/* This must match enum LockTagType! */
const char *const LockTagTypeNames[] = {
"relation",
- "extend",
"page",
"tuple",
"transactionid",
@@ -240,7 +239,6 @@ pg_lock_status(PG_FUNCTION_ARGS)
switch ((LockTagType) instance->locktag.locktag_type)
{
case LOCKTAG_RELATION:
- case LOCKTAG_RELATION_EXTEND:
values[1] = ObjectIdGetDatum(instance->locktag.locktag_field1);
values[2] = ObjectIdGetDatum(instance->locktag.locktag_field2);
nulls[3] = true;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index aecb6013f0..12e6240825 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -848,6 +848,7 @@ typedef enum
WAIT_EVENT_PARALLEL_BITMAP_SCAN,
WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN,
WAIT_EVENT_PARALLEL_FINISH,
+ WAIT_EVENT_RELATION_EXTENSION_LOCK,
WAIT_EVENT_PROCARRAY_GROUP_UPDATE,
WAIT_EVENT_PROMOTE,
WAIT_EVENT_REPLICATION_ORIGIN_DROP,
diff --git a/src/include/storage/extension_lock.h b/src/include/storage/extension_lock.h
new file mode 100644
index 0000000000..0b26fa5716
--- /dev/null
+++ b/src/include/storage/extension_lock.h
@@ -0,0 +1,38 @@
+/*-------------------------------------------------------------------------
+ *
+ * extension_lock.h
+ * Relation extension lock manager
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/extension_lock.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef EXTENSION_LOCK_H
+#define EXTENSION_LOCK_H
+
+#ifdef FRONTEND
+#error "extension_lock.h may not be included from frontend code"
+#endif
+
+#include "port/atomics.h"
+#include "storage/s_lock.h"
+#include "storage/condition_variable.h"
+#include "storage/proclist_types.h"
+
+/* Lock a relation for extension */
+extern Size RelExtLockShmemSize(void);
+extern void InitRelExtLocks(void);
+extern void LockRelationForExtension(Relation relation);
+extern void UnlockRelationForExtension(Relation relation);
+extern bool ConditionalLockRelationForExtension(Relation relation);
+extern int EstimateNumberOfExtensionLockWaiters(Relation relation);
+extern void WaitForRelationExtensionLockToBeFree(Relation relation);
+extern void RelExtLockCleanup(void);
+extern bool IsAnyRelationExtensionLockHeld(void);
+
+#endif /* EXTENSION_LOCK_H */
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 3acc11aa5a..7609c1a58c 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -52,13 +52,6 @@ extern bool LockHasWaitersRelation(Relation relation, LOCKMODE lockmode);
extern void LockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
-/* Lock a relation for extension */
-extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
-extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
-extern bool ConditionalLockRelationForExtension(Relation relation,
- LOCKMODE lockmode);
-extern int RelationExtensionLockWaiterCount(Relation relation);
-
/* Lock a page (currently only used within indexes) */
extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
extern bool ConditionalLockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index bb8e4e6e5b..be17db8c55 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -137,7 +137,6 @@ typedef uint16 LOCKMETHODID;
typedef enum LockTagType
{
LOCKTAG_RELATION, /* whole relation */
- LOCKTAG_RELATION_EXTEND, /* the right to extend a relation */
LOCKTAG_PAGE, /* one page of a relation */
LOCKTAG_TUPLE, /* one physical tuple */
LOCKTAG_TRANSACTION, /* transaction (for waiting for xact done) */
@@ -185,15 +184,6 @@ typedef struct LOCKTAG
(locktag).locktag_type = LOCKTAG_RELATION, \
(locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
-/* same ID info as RELATION */
-#define SET_LOCKTAG_RELATION_EXTEND(locktag,dboid,reloid) \
- ((locktag).locktag_field1 = (dboid), \
- (locktag).locktag_field2 = (reloid), \
- (locktag).locktag_field3 = 0, \
- (locktag).locktag_field4 = 0, \
- (locktag).locktag_type = LOCKTAG_RELATION_EXTEND, \
- (locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
-
/* ID info for a page is RELATION info + BlockNumber */
#define SET_LOCKTAG_PAGE(locktag,dboid,reloid,blocknum) \
((locktag).locktag_field1 = (dboid), \
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e216de9570..aa3da09e9b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3181,6 +3181,7 @@ registered_buffer
regmatch_t
regoff_t
regproc
+relextlock_handle
relopt_bool
relopt_gen
relopt_int
--
2.23.0
On Wed, 5 Feb 2020 at 12:07, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Feb 3, 2020 at 8:03 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:
On Tue, Jun 26, 2018 at 12:47 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Fri, Apr 27, 2018 at 4:25 AM, Robert Haas <robertmhaas@gmail.com>
wrote:
On Thu, Apr 26, 2018 at 3:10 PM, Andres Freund <andres@anarazel.de>
wrote:
I think the real question is whether the scenario is common
enough to
worry about. In practice, you'd have to be extremely unlucky to
be
doing many bulk loads at the same time that all happened to hash
to
the same bucket.
With a bunch of parallel bulkloads into partitioned tables that
really
doesn't seem that unlikely?
It increases the likelihood of collisions, but probably decreases
the
number of cases where the contention gets really bad.
For example, suppose each table has 100 partitions and you are
bulk-loading 10 of them at a time. It's virtually certain that you
will have some collisions, but the amount of contention within each
bucket will remain fairly low because each backend spends only 1% of
its time in the bucket corresponding to any given partition.

I share another result of performance evaluation between current HEAD
and current HEAD with v13 patch (N_RELEXTLOCK_ENTS = 1024).

Type of table: normal table, unlogged table
Number of child tables : 16, 64 (all tables are located on the same
tablespace)
Number of clients : 32
Number of trials : 100
Duration: 180 seconds for each trial

The hardware spec of the server is Intel Xeon 2.4GHz (HT 160 cores), 256GB
RAM, NVMe SSD 1.5TB.
Each client loads 10kB random data across all partitioned tables.

Here is the result.
childs | type | target | avg_tps | diff with HEAD
--------+----------+---------+------------+------------------
16 | normal | HEAD | 1643.833 |
16 | normal | Patched | 1619.5404 | 0.985222
16 | unlogged | HEAD | 9069.3543 |
16 | unlogged | Patched | 9368.0263 | 1.032932
64 | normal | HEAD | 1598.698 |
64 | normal | Patched | 1587.5906 | 0.993052
64 | unlogged | HEAD | 9629.7315 |
64 | unlogged | Patched | 10208.2196 | 1.060073
(8 rows)

For normal tables, loading tps decreased 1% ~ 2% with this patch
whereas it increased 3% ~ 6% for unlogged tables. There were
collisions at 0 ~ 5 relation extension lock slots between 2 relations
in the 64 child tables case but it didn't seem to affect the tps.

AFAIU, this resembles the workload that Andres was worried about. I
think we should once run this test in a different environment, but
considering this to be correct and repeatable, where do we go with
this patch especially when we know it improves many workloads [1] as
well. We know that on a pathological case constructed by Mithun [2],
this causes regression as well. I am not sure if the test done by
Mithun really mimics any real-world workload as he has tested by
making N_RELEXTLOCK_ENTS = 1 to hit the worst case.

Sawada-San, if you have a script or data for the test done by you,
then please share it so that others can also try to reproduce it.

Unfortunately the environment I used for performance verification is
no longer available.

I agree to run this test in a different environment. I've attached the
rebased version patch. I'm measuring the performance with/without
patch, so will share the results.
Thanks Sawada-san for the patch.
For the last few days, I have been reading this thread and reviewing the v13
patch. To debug and test, I did a re-base of the v13 patch. I compared my
re-based patch and the v14 patch. I think the ordering of header files is not
alphabetical in the v14 patch. (I haven't reviewed the v14 patch fully
because, before reviewing, I wanted to test false sharing.) While debugging,
I didn't notice any hang or lock related issue.
I did some testing to test false sharing (bulk insert, COPY data, bulk
inserts into partitioned tables). Below is the testing summary.
Test setup (bulk insert into partition tables):
autovacuum=off
shared_buffers=512MB -c max_wal_size=20GB -c checkpoint_timeout=12min
Basically, I created a table with 13 partitions. Using pgbench, I inserted
bulk data. I used below pgbench command:
./pgbench -c $threads -j $threads -T 180 -f insert1.sql@1 -f insert2.sql@1
-f insert3.sql@1 -f insert4.sql@1 postgres

I took scripts from previous mails and modified them. For reference, I am
attaching test scripts. I tested with the default 1024 slots
(N_RELEXTLOCK_ENTS = 1024).
Clients    HEAD (tps)    With v14 patch (tps)    %change (time: 180s)
1          92.979796     100.877446              +8.49 %
32         392.881863    388.470622              -1.12 %
56         551.753235    528.018852              -4.30 %
60         648.273767    653.251507              +0.76 %
64         645.975124    671.322140              +3.92 %
66         662.728010    673.399762              +1.61 %
70         647.103183    660.694914              +2.10 %
74         648.824027    676.487622              +4.26 %
From the above results, we can see that in most cases, TPS is slightly
increased with the v14 patch. I am still testing and will post my results.
I want to test the extension lock by blocking use of the fsm (use_fsm=false
in code). I think that if we block use of the fsm, then the load on the
extension lock will increase. Is this the correct way to test?
Please let me know if you have any specific testing scenario.
--
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com
On Thu, Feb 6, 2020 at 1:57 AM Mahendra Singh Thalor <mahi6run@gmail.com> wrote:
On Wed, 5 Feb 2020 at 12:07, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Feb 3, 2020 at 8:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Jun 26, 2018 at 12:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Apr 27, 2018 at 4:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Apr 26, 2018 at 3:10 PM, Andres Freund <andres@anarazel.de> wrote:
I think the real question is whether the scenario is common enough to
worry about. In practice, you'd have to be extremely unlucky to be
doing many bulk loads at the same time that all happened to hash to
the same bucket.

With a bunch of parallel bulkloads into partitioned tables that really
doesn't seem that unlikely?

It increases the likelihood of collisions, but probably decreases the
number of cases where the contention gets really bad.

For example, suppose each table has 100 partitions and you are
bulk-loading 10 of them at a time. It's virtually certain that you
will have some collisions, but the amount of contention within each
bucket will remain fairly low because each backend spends only 1% of
its time in the bucket corresponding to any given partition.I share another result of performance evaluation between current HEAD
and current HEAD with v13 patch (N_RELEXTLOCK_ENTS = 1024).

Type of table: normal table, unlogged table
Number of child tables : 16, 64 (all tables are located on the same tablespace)
Number of clients : 32
Number of trials : 100
Duration: 180 seconds for each trial

The hardware spec of the server is Intel Xeon 2.4GHz (HT 160 cores), 256GB
RAM, NVMe SSD 1.5TB.
Each client loads 10kB random data across all partitioned tables.

Here is the result.
childs | type | target | avg_tps | diff with HEAD
--------+----------+---------+------------+------------------
16 | normal | HEAD | 1643.833 |
16 | normal | Patched | 1619.5404 | 0.985222
16 | unlogged | HEAD | 9069.3543 |
16 | unlogged | Patched | 9368.0263 | 1.032932
64 | normal | HEAD | 1598.698 |
64 | normal | Patched | 1587.5906 | 0.993052
64 | unlogged | HEAD | 9629.7315 |
64 | unlogged | Patched | 10208.2196 | 1.060073
(8 rows)

For normal tables, loading tps decreased 1% ~ 2% with this patch
whereas it increased 3% ~ 6% for unlogged tables. There were
collisions at 0 ~ 5 relation extension lock slots between 2 relations
in the 64 child tables case but it didn't seem to affect the tps.

AFAIU, this resembles the workload that Andres was worried about. I
think we should once run this test in a different environment, but
considering this to be correct and repeatable, where do we go with
this patch especially when we know it improves many workloads [1] as
well. We know that on a pathological case constructed by Mithun [2],
this causes regression as well. I am not sure if the test done by
Mithun really mimics any real-world workload as he has tested by
making N_RELEXTLOCK_ENTS = 1 to hit the worst case.

Sawada-San, if you have a script or data for the test done by you,
then please share it so that others can also try to reproduce it.

Unfortunately the environment I used for performance verification is
no longer available.

I agree to run this test in a different environment. I've attached the
rebased version patch. I'm measuring the performance with/without
patch, so will share the results.

Thanks Sawada-san for the patch.
For the last few days, I have been reading this thread and reviewing the v13 patch. To debug and test, I did a re-base of the v13 patch. I compared my re-based patch and the v14 patch. I think the ordering of header files is not alphabetical in the v14 patch. (I haven't reviewed the v14 patch fully because, before reviewing, I wanted to test false sharing.) While debugging, I didn't notice any hang or lock related issue.
I did some testing to test false sharing (bulk insert, COPY data, bulk inserts into partitioned tables). Below is the testing summary.
Test setup(Bulk insert into partition tables):
autovacuum=off
shared_buffers=512MB -c max_wal_size=20GB -c checkpoint_timeout=12min

Basically, I created a table with 13 partitions. Using pgbench, I inserted bulk data. I used below pgbench command:
./pgbench -c $threads -j $threads -T 180 -f insert1.sql@1 -f insert2.sql@1 -f insert3.sql@1 -f insert4.sql@1 postgres

I took scripts from previous mails and modified them. For reference, I am attaching test scripts. I tested with the default 1024 slots (N_RELEXTLOCK_ENTS = 1024).
Clients HEAD (tps) With v14 patch (tps) %change (time: 180s)
1 92.979796 100.877446 +8.49 %
32 392.881863 388.470622 -1.12 %
56 551.753235 528.018852 -4.30 %
60 648.273767 653.251507 +0.76 %
64 645.975124 671.322140 +3.92 %
66 662.728010 673.399762 +1.61 %
70 647.103183 660.694914 +2.10 %
74 648.824027 676.487622 +4.26 %

From the above results, we can see that in most cases, TPS is slightly increased with the v14 patch. I am still testing and will post my results.
The numbers at the 56 and 74 client counts seem slightly suspicious. Can
you please repeat those tests? Basically, I am not able to come up
with a theory for why at 56 clients the performance with the patch is a
bit lower and then at 74 it is higher.
I want to test extension lock by blocking use of fsm(use_fsm=false in code). I think, if we block use of fsm, then load will increase into extension lock. Is this correct way to test?
Hmm, I think instead of directly hacking the code, you might want to
use the operation (probably cluster or vacuum full) where we set
HEAP_INSERT_SKIP_FSM. I think along with this you can try with
unlogged tables because that might stress the extension lock.
In the above test, you might want to test with a higher number of
partitions (say up to 100) as well. Also, see if you want to use the
Copy command.
Please let me know if you have any specific testing scenario.
Can you test the scenario mentioned by Konstantin Knizhnik [1] where
this patch has shown significant gain? You might want to use a higher
core count machine to test it.
One thing we can do is to somehow measure the collisions on each bucket.
[1]: /messages/by-id/ef81da49-d491-db86-3ef6-5138d091fe91@postgrespro.ru
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Jun 26, 2018 at 12:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Type of table: normal table, unlogged table
Number of child tables : 16, 64 (all tables are located on the same tablespace)
Number of clients : 32
Number of trials : 100
Duration: 180 seconds for each trial

The hardware spec of the server is Intel Xeon 2.4GHz (HT 160 cores), 256GB
RAM, NVMe SSD 1.5TB.
Each client loads 10kB random data across all partitioned tables.

Here is the result.
childs | type | target | avg_tps | diff with HEAD
--------+----------+---------+------------+------------------
16 | normal | HEAD | 1643.833 |
16 | normal | Patched | 1619.5404 | 0.985222
16 | unlogged | HEAD | 9069.3543 |
16 | unlogged | Patched | 9368.0263 | 1.032932
64 | normal | HEAD | 1598.698 |
64 | normal | Patched | 1587.5906 | 0.993052
64 | unlogged | HEAD | 9629.7315 |
64 | unlogged | Patched | 10208.2196 | 1.060073
(8 rows)

For normal tables, loading tps decreased 1% ~ 2% with this patch
whereas it increased 3% ~ 6% for unlogged tables. There were
collisions at 0 ~ 5 relation extension lock slots between 2 relations
in the 64 child tables case but it didn't seem to affect the tps.
How did you measure the collisions in this test? I think it is better
if Mahendra can also use the same technique in measuring that count.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, 6 Feb 2020 at 13:16, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Jun 26, 2018 at 12:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Type of table: normal table, unlogged table
Number of child tables : 16, 64 (all tables are located on the same tablespace)
Number of clients : 32
Number of trials : 100
Duration: 180 seconds for each trial

The hardware spec of the server is Intel Xeon 2.4GHz (HT 160 cores), 256GB
RAM, NVMe SSD 1.5TB.
Each client loads 10kB random data across all partitioned tables.

Here is the result.
childs | type | target | avg_tps | diff with HEAD
--------+----------+---------+------------+------------------
16 | normal | HEAD | 1643.833 |
16 | normal | Patched | 1619.5404 | 0.985222
16 | unlogged | HEAD | 9069.3543 |
16 | unlogged | Patched | 9368.0263 | 1.032932
64 | normal | HEAD | 1598.698 |
64 | normal | Patched | 1587.5906 | 0.993052
64 | unlogged | HEAD | 9629.7315 |
64 | unlogged | Patched | 10208.2196 | 1.060073
(8 rows)

For normal tables, loading tps decreased 1% ~ 2% with this patch
whereas it increased 3% ~ 6% for unlogged tables. There were
collisions at 0 ~ 5 relation extension lock slots between 2 relations
in the 64 child tables case but it didn't seem to affect the tps.

How did you measure the collisions in this test? I think it is better
if Mahendra can also use the same technique in measuring that count.
I created a SQL function that returns the hash value of the
lock tag, which is tag_hash(locktag, sizeof(RelExtLockTag)) %
N_RELEXTLOCK_ENTS, and examined the hash values of all tables.
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, 6 Feb 2020 at 09:44, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Feb 6, 2020 at 1:57 AM Mahendra Singh Thalor <mahi6run@gmail.com>
wrote:
On Wed, 5 Feb 2020 at 12:07, Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Mon, Feb 3, 2020 at 8:03 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:
On Tue, Jun 26, 2018 at 12:47 PM Masahiko Sawada <
sawada.mshk@gmail.com> wrote:
On Fri, Apr 27, 2018 at 4:25 AM, Robert Haas <
robertmhaas@gmail.com> wrote:
On Thu, Apr 26, 2018 at 3:10 PM, Andres Freund <
andres@anarazel.de> wrote:
I think the real question is whether the scenario is common
enough to
worry about. In practice, you'd have to be extremely unlucky
to be
doing many bulk loads at the same time that all happened to
hash to
the same bucket.
With a bunch of parallel bulkloads into partitioned tables
that really
doesn't seem that unlikely?
It increases the likelihood of collisions, but probably
decreases the
number of cases where the contention gets really bad.
For example, suppose each table has 100 partitions and you are
bulk-loading 10 of them at a time. It's virtually certain that
you
will have some collisions, but the amount of contention within
each
bucket will remain fairly low because each backend spends only
1% of
its time in the bucket corresponding to any given partition.
I share another result of performance evaluation between current
HEAD
and current HEAD with v13 patch(N_RELEXTLOCK_ENTS = 1024).
Type of table: normal table, unlogged table
Number of child tables : 16, 64 (all tables are located on the
same tablespace)
Number of clients : 32
Number of trials : 100
Duration: 180 seconds for each trial
The hardware spec of the server is Intel Xeon 2.4GHz (HT 160 cores),
256GB
RAM, NVMe SSD 1.5TB.
Each client loads 10kB of random data across all partitioned tables.
Here is the result.
childs | type | target | avg_tps | diff with HEAD
--------+----------+---------+------------+------------------
16 | normal | HEAD | 1643.833 |
16 | normal | Patched | 1619.5404 | 0.985222
16 | unlogged | HEAD | 9069.3543 |
16 | unlogged | Patched | 9368.0263 | 1.032932
64 | normal | HEAD | 1598.698 |
64 | normal | Patched | 1587.5906 | 0.993052
64 | unlogged | HEAD | 9629.7315 |
64 | unlogged | Patched | 10208.2196 | 1.060073
(8 rows)
For normal tables, loading tps decreased 1% ~ 2% with this patch
whereas it increased 3% ~ 6% for unlogged tables. There were
collisions at 0 ~ 5 relation extension lock slots between 2
relations
in the 64 child tables case but it didn't seem to affect the tps.
AFAIU, this resembles the workload that Andres was worried about.
I
think we should once run this test in a different environment, but
considering this to be correct and repeatable, where do we go with
this patch especially when we know it improves many workloads [1] as
well. We know that on a pathological case constructed by Mithun
[2],
this causes regression as well. I am not sure if the test done by
Mithun really mimics any real-world workload as he has tested by
making N_RELEXTLOCK_ENTS = 1 to hit the worst case.
Sawada-San, if you have a script or data for the test done by you,
then please share it so that others can also try to reproduce it.
Unfortunately the environment I used for performance verification is
no longer available.
I agree to run this test in a different environment. I've attached the
rebased version patch. I'm measuring the performance with/without
patch, so will share the results.
Thanks Sawada-san for the patch.
For the last few days, I have been reading this thread and reviewing the v13
patch. To debug and test, I rebased the v13 patch. I compared my
rebased patch and the v14 patch. I think the ordering of header files is not
alphabetical in the v14 patch. (I haven't reviewed the v14 patch fully because
before reviewing, I wanted to test false sharing.) While debugging, I didn't
notice any hang or lock-related issue.
I did some testing to test false sharing (bulk insert, COPY data, bulk
insert into partitioned tables). Below is the testing summary.
Test setup(Bulk insert into partition tables):
autovacuum=off
shared_buffers=512MB -c max_wal_size=20GB -c checkpoint_timeout=12min
Basically, I created a table with 13 partitions. Using pgbench, I
inserted bulk data. I used below pgbench command:
./pgbench -c $threads -j $threads -T 180 -f insert1.sql@1 -f
insert2.sql@1 -f insert3.sql@1 -f insert4.sql@1 postgres
I took scripts from previous mails and modified them. For reference, I am
attaching the test scripts. I tested with the default 1024 slots
(N_RELEXTLOCK_ENTS = 1024).
Clients  HEAD (tps)   With v14 patch (tps)  %change (time: 180s)
1        92.979796    100.877446            +8.49%
32       392.881863   388.470622            -1.12%
56       551.753235   528.018852            -4.30%
60       648.273767   653.251507            +0.76%
64       645.975124   671.322140            +3.92%
66       662.728010   673.399762            +1.61%
70       647.103183   660.694914            +2.10%
74       648.824027   676.487622            +4.26%
From the above results, we can see that in most cases, TPS is slightly
increased with the v14 patch. I am still testing and will post my results.
The numbers at the 56 and 74 client counts seem slightly suspicious. Can
you please repeat those tests? Basically, I am not able to come up
with a theory why at 56 clients the performance with the patch is a
bit lower and then at 74 it is higher.
Okay. I will repeat test.
I want to test the extension lock by blocking use of the FSM (use_fsm=false in
the code). I think if we block use of the FSM, then more load will fall on the
extension lock. Is this the correct way to test?
Hmm, I think instead of directly hacking the code, you might want to
use the operation (probably cluster or vacuum full) where we set
HEAP_INSERT_SKIP_FSM. I think along with this you can try with
unlogged tables because that might stress the extension lock.
Okay. I will test.
In the above test, you might want to test with a higher number of
partitions (say up to 100) as well. Also, see if you want to use the
Copy command.
Okay. I will test.
Please let me know if you have any specific testing scenario.
Can you test the scenario mentioned by Konstantin Knizhnik [1] where
this patch has shown significant gain? You might want to use a higher
core count machine to test it.
I followed Konstantin Knizhnik's steps and tested inserts with a high core count.
Below is the test summary:
*Test setup:*
autovacuum = off
max_connections = 1000
*My testing machine:*
$ lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Thread(s) per core: 8
Core(s) per socket: 1
Socket(s): 24
NUMA node(s): 4
Model: IBM,8286-42A
L1d cache: 64K
L1i cache: 32K
L2 cache: 512K
L3 cache: 8192K
NUMA node0 CPU(s): 0-47
NUMA node1 CPU(s): 48-95
NUMA node2 CPU(s): 96-143
NUMA node3 CPU(s): 144-191
*create table test (i int, md5 text);*
*insert.sql:*
begin;
insert into test select i, md5(i::text) from generate_series(1,1000) AS i;
end;
*pgbench command:*
./pgbench postgres -c 1000 -j 36 -T 180 -P 10 -f insert.sql >> results.txt
I tested with 1000 clients. Below is the tps:
TPS on HEAD:
Run 1) : 608.908721
Run 2) : 599.962863
Run 3) : 606.378819
Run 4) : 607.174076
Run 5) : 598.531958
TPS with v14 patch: ( N_RELEXTLOCK_ENTS = 1024)
Run 1) : 649.488472
Run 2) : 657.902261
Run 3) : 654.478580
Run 4) : 648.085126
Run 5) : 647.171482
%change = +7.10 %
Apart from above test, I did some more tests (N_RELEXTLOCK_ENTS = 1024):
1) bulk insert into 1 table for T = 180s, 3600s; clients = 100, 1000; table =
logged, unlogged
2) copy command
3) bulk load into table having 13 partitions
In all the cases, I can see 4-9% improvement in TPS as compared to HEAD.
@Konstantin Knizhnik, if you remember, then please let me know that how
much tps gain was observed in your insert test? Is it close to my results?
--
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com
On Sat, 8 Feb 2020 at 00:27, Mahendra Singh Thalor <mahi6run@gmail.com>
wrote:
On Thu, 6 Feb 2020 at 09:44, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Feb 6, 2020 at 1:57 AM Mahendra Singh Thalor <mahi6run@gmail.com>
wrote:
On Wed, 5 Feb 2020 at 12:07, Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Mon, Feb 3, 2020 at 8:03 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:
On Tue, Jun 26, 2018 at 12:47 PM Masahiko Sawada <
sawada.mshk@gmail.com> wrote:
On Fri, Apr 27, 2018 at 4:25 AM, Robert Haas <
robertmhaas@gmail.com> wrote:
On Thu, Apr 26, 2018 at 3:10 PM, Andres Freund <
andres@anarazel.de> wrote:
I think the real question is whether the scenario is common
enough to
worry about. In practice, you'd have to be extremely
unlucky to be
doing many bulk loads at the same time that all happened to
hash to
the same bucket.
With a bunch of parallel bulkloads into partitioned tables
that really
doesn't seem that unlikely?
It increases the likelihood of collisions, but probably
decreases the
number of cases where the contention gets really bad.
For example, suppose each table has 100 partitions and you are
bulk-loading 10 of them at a time. It's virtually certain
that you
will have some collisions, but the amount of contention
within each
bucket will remain fairly low because each backend spends
only 1% of
its time in the bucket corresponding to any given partition.
I share another result of performance evaluation between
current HEAD
and current HEAD with v13 patch(N_RELEXTLOCK_ENTS = 1024).
Type of table: normal table, unlogged table
Number of child tables : 16, 64 (all tables are located on the
same tablespace)
Number of clients : 32
Number of trials : 100
Duration: 180 seconds for each trial
The hardware spec of the server is Intel Xeon 2.4GHz (HT 160 cores),
256GB
RAM, NVMe SSD 1.5TB.
Each client loads 10kB of random data across all partitioned
tables.
Here is the result.
childs | type | target | avg_tps | diff with HEAD
--------+----------+---------+------------+------------------
16 | normal | HEAD | 1643.833 |
16 | normal | Patched | 1619.5404 | 0.985222
16 | unlogged | HEAD | 9069.3543 |
16 | unlogged | Patched | 9368.0263 | 1.032932
64 | normal | HEAD | 1598.698 |
64 | normal | Patched | 1587.5906 | 0.993052
64 | unlogged | HEAD | 9629.7315 |
64 | unlogged | Patched | 10208.2196 | 1.060073
(8 rows)
For normal tables, loading tps decreased 1% ~ 2% with this patch
whereas it increased 3% ~ 6% for unlogged tables. There were
collisions at 0 ~ 5 relation extension lock slots between 2
relations
in the 64 child tables case but it didn't seem to affect the
tps.
AFAIU, this resembles the workload that Andres was worried about.
I
think we should once run this test in a different environment, but
considering this to be correct and repeatable, where do we go with
this patch especially when we know it improves many workloads [1]
as
well. We know that on a pathological case constructed by Mithun
[2],
this causes regression as well. I am not sure if the test done by
Mithun really mimics any real-world workload as he has tested by
making N_RELEXTLOCK_ENTS = 1 to hit the worst case.
Sawada-San, if you have a script or data for the test done by you,
then please share it so that others can also try to reproduce it.
Unfortunately the environment I used for performance verification is
no longer available.
I agree to run this test in a different environment. I've attached
the
rebased version patch. I'm measuring the performance with/without
patch, so will share the results.
Thanks Sawada-san for the patch.
For the last few days, I have been reading this thread and reviewing the v13
patch. To debug and test, I rebased the v13 patch. I compared my
rebased patch and the v14 patch. I think the ordering of header files is not
alphabetical in the v14 patch. (I haven't reviewed the v14 patch fully because
before reviewing, I wanted to test false sharing.) While debugging, I didn't
notice any hang or lock-related issue.
I did some testing to test false sharing (bulk insert, COPY data, bulk
insert into partitioned tables). Below is the testing summary.
Test setup(Bulk insert into partition tables):
autovacuum=off
shared_buffers=512MB -c max_wal_size=20GB -c checkpoint_timeout=12min
Basically, I created a table with 13 partitions. Using pgbench, I
inserted bulk data. I used below pgbench command:
./pgbench -c $threads -j $threads -T 180 -f insert1.sql@1 -f
insert2.sql@1 -f insert3.sql@1 -f insert4.sql@1 postgres
I took scripts from previous mails and modified them. For reference, I am
attaching the test scripts. I tested with the default 1024 slots
(N_RELEXTLOCK_ENTS = 1024).
Clients  HEAD (tps)   With v14 patch (tps)  %change (time: 180s)
1        92.979796    100.877446            +8.49%
32       392.881863   388.470622            -1.12%
56       551.753235   528.018852            -4.30%
60       648.273767   653.251507            +0.76%
64       645.975124   671.322140            +3.92%
66       662.728010   673.399762            +1.61%
70       647.103183   660.694914            +2.10%
74       648.824027   676.487622            +4.26%
From the above results, we can see that in most cases, TPS is slightly
increased with the v14 patch. I am still testing and will post my results.
The numbers at the 56 and 74 client counts seem slightly suspicious. Can
you please repeat those tests? Basically, I am not able to come up
with a theory why at 56 clients the performance with the patch is a
bit lower and then at 74 it is higher.
Okay. I will repeat the test.
I re-tested on a different machine because on the previous machine, results
were inconsistent.
*My testing machine:*
$ lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Thread(s) per core: 8
Core(s) per socket: 1
Socket(s): 24
NUMA node(s): 4
Model: IBM,8286-42A
L1d cache: 64K
L1i cache: 32K
L2 cache: 512K
L3 cache: 8192K
NUMA node0 CPU(s): 0-47
NUMA node1 CPU(s): 48-95
NUMA node2 CPU(s): 96-143
NUMA node3 CPU(s): 144-191
./pgbench -c $threads -j $threads -T 180 -f insert1.sql@1 -f insert2.sql@1
-f insert3.sql@1 -f insert4.sql@1 postgres
Clients  HEAD (tps)   With v14 patch (tps)  %change (time: 180s)
1 41.491486 41.375532 -0.27%
32 335.138568 330.028739 -1.52%
56 353.783930 360.883710 +2.00%
60 341.741925 359.028041 +5.05%
64 338.521730 356.511423 +5.13%
66 339.838921 352.761766 +3.80%
70 339.305454 353.658425 +4.23%
74 332.016217 348.809042 +5.05%
From the above results, it seems that there is very little regression with the
patch (+-5%), which can be run-to-run variation.
I want to test the extension lock by blocking use of the FSM (use_fsm=false in
the code). I think if we block use of the FSM, then more load will fall on the
extension lock. Is this the correct way to test?
Hmm, I think instead of directly hacking the code, you might want to
use the operation (probably cluster or vacuum full) where we set
HEAP_INSERT_SKIP_FSM. I think along with this you can try with
unlogged tables because that might stress the extension lock.
Okay. I will test.
I tested with unlogged tables also. There also I was getting 3-6% gain in
tps.
In the above test, you might want to test with a higher number of
partitions (say up to 100) as well. Also, see if you want to use the
Copy command.
Okay. I will test.
I tested with 500, 1000, and 2000 partitions. I observed a max +5% regress in
the tps and there was no performance degradation.
*For example:*
I created a table with 2000 partitions and then I checked false sharing.
Slot Number Slot Freq. Slot Number Slot Freq. Slot Number Slot Freq.
156 13 973 11 446 10
627 13 52 10 488 10
782 12 103 10 501 10
812 12 113 10 701 10
192 11 175 10 737 10
221 11 235 10 754 10
367 11 254 10 781 10
546 11 314 10 790 10
814 11 419 10 833 10
917 11 424 10 888 10
From the above table, we can see that a total of 13 child tables fall in the
same bucket (slot 156), so I did bulk-loading only in those 13 child tables to
check tps under false sharing, but I noticed that there was no performance
degradation.
--
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com
On Mon, Feb 10, 2020 at 10:28 PM Mahendra Singh Thalor
<mahi6run@gmail.com> wrote:
On Sat, 8 Feb 2020 at 00:27, Mahendra Singh Thalor <mahi6run@gmail.com> wrote:
On Thu, 6 Feb 2020 at 09:44, Amit Kapila <amit.kapila16@gmail.com> wrote:
The numbers at the 56 and 74 client counts seem slightly suspicious. Can
you please repeat those tests? Basically, I am not able to come up
with a theory why at 56 clients the performance with the patch is a
bit lower and then at 74 it is higher.
Okay. I will repeat the test.
I re-tested on a different machine because on the previous machine, results were inconsistent.
Thanks for doing detailed tests.
My testing machine:
$ lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Thread(s) per core: 8
Core(s) per socket: 1
Socket(s): 24
NUMA node(s): 4
Model: IBM,8286-42A
L1d cache: 64K
L1i cache: 32K
L2 cache: 512K
L3 cache: 8192K
NUMA node0 CPU(s): 0-47
NUMA node1 CPU(s): 48-95
NUMA node2 CPU(s): 96-143
NUMA node3 CPU(s): 144-191
./pgbench -c $threads -j $threads -T 180 -f insert1.sql@1 -f insert2.sql@1 -f insert3.sql@1 -f insert4.sql@1 postgres
Clients HEAD(tps) With v14 patch(tps) %change (time: 180s)
1 41.491486 41.375532 -0.27%
32 335.138568 330.028739 -1.52%
56 353.783930 360.883710 +2.00%
60 341.741925 359.028041 +5.05%
64 338.521730 356.511423 +5.13%
66 339.838921 352.761766 +3.80%
70 339.305454 353.658425 +4.23%
74 332.016217 348.809042 +5.05%
From the above results, it seems that there is very little regression with the patch (+-5%), which can be run-to-run variation.
Hmm, I don't see 5% regression, rather it is a performance gain of ~5%
with the patch? When we use regression, that indicates with the patch
performance (TPS) is reduced, but I don't see that in the above
numbers. Kindly clarify.
I want to test the extension lock by blocking use of the FSM (use_fsm=false in the code). I think if we block use of the FSM, then more load will fall on the extension lock. Is this the correct way to test?
Hmm, I think instead of directly hacking the code, you might want to
use the operation (probably cluster or vacuum full) where we set
HEAP_INSERT_SKIP_FSM. I think along with this you can try with
unlogged tables because that might stress the extension lock.
Okay. I will test.
I tested with unlogged tables also. There also I was getting 3-6% gain in tps.
In the above test, you might want to test with a higher number of
partitions (say up to 100) as well. Also, see if you want to use the
Copy command.
Okay. I will test.
I tested with 500, 1000, and 2000 partitions. I observed a max +5% regress in the tps and there was no performance degradation.
Again, I am not sure if you see performance dip here. I think your
usage of the word 'regression' is not correct or at least confusing.
For example:
I created a table with 2000 partitions and then I checked false sharing.
Slot Number Slot Freq. Slot Number Slot Freq. Slot Number Slot Freq.
156 13 973 11 446 10
627 13 52 10 488 10
782 12 103 10 501 10
812 12 113 10 701 10
192 11 175 10 737 10
221 11 235 10 754 10
367 11 254 10 781 10
546 11 314 10 790 10
814 11 419 10 833 10
917 11 424 10 888 10
From above table, we can see that total 13 child tables are falling in same bucket (slot 156) so I did bulk-loading only in those 13 child tables to check tps in false sharing but I noticed that there was no performance degradation.
Okay. Is it possible to share these numbers and scripts?
Thanks for doing the detailed tests for this patch.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Feb 5, 2020 at 12:07 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Unfortunately the environment I used for performance verification is
no longer available.
I agree to run this test in a different environment. I've attached the
rebased version patch. I'm measuring the performance with/without
patch, so will share the results.
Did you get a chance to run these tests? Lately, Mahendra has done a
lot of performance testing of this patch and shared his results. I
don't see much downside with the patch, rather there is a performance
increase of 3-9% in various scenarios.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
I took a brief look through this patch. I agree with the fundamental
idea that we shouldn't need to use the heavyweight lock manager for
relation extension, since deadlock is not a concern and no backend
should ever need to hold more than one such lock at once. But it feels
to me like this particular solution is rather seriously overengineered.
I would like to suggest that we do something similar to Robert Haas'
excellent hack (daa7527af) for the !HAVE_SPINLOCK case in lmgr/spin.c,
that is,
* Create some predetermined number N of LWLocks for relation extension.
* When we want to extend some relation R, choose one of those locks
(say, R's relfilenode number mod N) and lock it.
1. As long as all backends agree on the relation-to-lock mapping, this
provides full security against concurrent extensions of the same
relation.
2. Occasionally a backend will be blocked when it doesn't need to be,
because of false sharing of a lock between two relations that need to
be extended at the same time. But as long as N is large enough (and
I doubt that it needs to be very large), that will be a negligible
penalty.
3. Aside from being a lot simpler than the proposed extension_lock.c,
this approach involves absolutely negligible overhead beyond the raw
LWLockAcquire and LWLockRelease calls. I suspect therefore that in
typical noncontended cases it will be faster. It also does not require
any new resource management overhead, thus eliminating this patch's
small but real penalty on transaction exit/cleanup.
We'd need to do a bit of performance testing to choose a good value
for N. I think that with N comparable to MaxBackends, the odds of
false sharing being a problem would be quite negligible ... but it
could be that we could get away with a lot less than that.
regards, tom lane
On Wed, 12 Feb 2020 at 00:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I took a brief look through this patch. I agree with the fundamental
idea that we shouldn't need to use the heavyweight lock manager for
relation extension, since deadlock is not a concern and no backend
should ever need to hold more than one such lock at once. But it feels
to me like this particular solution is rather seriously overengineered.
I would like to suggest that we do something similar to Robert Haas'
excellent hack (daa7527af) for the !HAVE_SPINLOCK case in lmgr/spin.c,
that is,
* Create some predetermined number N of LWLocks for relation extension.
My original proposal used LWLocks and hash tables for relation
extension but there was a discussion that using LWLocks is not good
because it's not interruptible[1]. Because of this reason, and because we
don't need to have two lock levels (shared, exclusive) for the relation
extension lock, we ended up implementing a dedicated lock manager
for extension locks. I think we will have that problem if we use LWLocks.
Regards,
[1]: /messages/by-id/CA+TgmoZnWYQvmeqeGyY+0j-Tfmx8cTzRadfxJQwK9A-nCQ7GkA@mail.gmail.com
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 2020-02-11 08:01:34 +0530, Amit Kapila wrote:
I don't see much downside with the patch, rather there is a
performance increase of 3-9% in various scenarios.
As I wrote in [1] I started to look at this patch. My problem with it is
that it just seems like the wrong direction architecturally to
me. There are two main aspects to this:
1) It basically builds another, more lightweight but less capable,
lock manager that can lock more objects than we can have distinct
locks for. It is faster because it uses *one* hashtable without
conflict handling, because it has fewer lock modes, and because it
doesn't support detecting deadlocks. And probably some other things.
2) A lot of the contention around file extension comes from us doing
multiple expensive things under one lock (determining current
relation size, searching victim buffer, extending file), and in tiny
increments (growing a 1TB table by 8kb). This patch doesn't address
that at all.
I've focused on 1) in the email referenced above ([1]). Here I'll focus
on 2).
To quantify my concerns I instrumented postgres to measure the time for
various operations that are part of extending a file (all per
process). The hardware is a pretty fast nvme, with unlogged tables, on a
20/40 core/threads machine. The workload is copying a scale 10
pgbench_accounts into an unindexed, unlogged table using pgbench.
Here are the instrumentations for various client counts, when just
measuring 20s:
1 client:
LOG: extension time: lock wait: 0.00 lock held: 3.19 filesystem: 1.29 buffersearch: 1.58
2 clients:
LOG: extension time: lock wait: 0.47 lock held: 2.99 filesystem: 1.24 buffersearch: 1.43
LOG: extension time: lock wait: 0.60 lock held: 3.05 filesystem: 1.23 buffersearch: 1.50
4 clients:
LOG: extension time: lock wait: 3.92 lock held: 2.69 filesystem: 1.10 buffersearch: 1.29
LOG: extension time: lock wait: 4.40 lock held: 2.02 filesystem: 0.81 buffersearch: 0.93
LOG: extension time: lock wait: 3.86 lock held: 2.59 filesystem: 1.06 buffersearch: 1.22
LOG: extension time: lock wait: 4.00 lock held: 2.65 filesystem: 1.08 buffersearch: 1.26
8 clients:
LOG: extension time: lock wait: 6.94 lock held: 1.74 filesystem: 0.70 buffersearch: 0.80
LOG: extension time: lock wait: 7.16 lock held: 1.81 filesystem: 0.73 buffersearch: 0.82
LOG: extension time: lock wait: 6.93 lock held: 1.95 filesystem: 0.80 buffersearch: 0.89
LOG: extension time: lock wait: 7.08 lock held: 1.87 filesystem: 0.76 buffersearch: 0.86
LOG: extension time: lock wait: 6.95 lock held: 1.95 filesystem: 0.80 buffersearch: 0.89
LOG: extension time: lock wait: 6.88 lock held: 2.01 filesystem: 0.83 buffersearch: 0.93
LOG: extension time: lock wait: 6.94 lock held: 2.02 filesystem: 0.82 buffersearch: 0.93
LOG: extension time: lock wait: 7.02 lock held: 1.95 filesystem: 0.80 buffersearch: 0.89
16 clients:
LOG: extension time: lock wait: 10.37 lock held: 0.88 filesystem: 0.36 buffersearch: 0.39
LOG: extension time: lock wait: 10.53 lock held: 0.90 filesystem: 0.37 buffersearch: 0.40
LOG: extension time: lock wait: 10.72 lock held: 1.01 filesystem: 0.42 buffersearch: 0.45
LOG: extension time: lock wait: 10.45 lock held: 1.25 filesystem: 0.52 buffersearch: 0.55
LOG: extension time: lock wait: 10.66 lock held: 0.94 filesystem: 0.38 buffersearch: 0.41
LOG: extension time: lock wait: 10.50 lock held: 1.27 filesystem: 0.53 buffersearch: 0.56
LOG: extension time: lock wait: 10.53 lock held: 1.19 filesystem: 0.49 buffersearch: 0.53
LOG: extension time: lock wait: 10.57 lock held: 1.22 filesystem: 0.50 buffersearch: 0.53
LOG: extension time: lock wait: 10.72 lock held: 1.17 filesystem: 0.48 buffersearch: 0.52
LOG: extension time: lock wait: 10.67 lock held: 1.32 filesystem: 0.55 buffersearch: 0.58
LOG: extension time: lock wait: 10.95 lock held: 0.92 filesystem: 0.38 buffersearch: 0.40
LOG: extension time: lock wait: 10.81 lock held: 1.24 filesystem: 0.51 buffersearch: 0.56
LOG: extension time: lock wait: 10.62 lock held: 1.27 filesystem: 0.53 buffersearch: 0.56
LOG: extension time: lock wait: 11.14 lock held: 0.94 filesystem: 0.38 buffersearch: 0.41
LOG: extension time: lock wait: 11.20 lock held: 0.96 filesystem: 0.39 buffersearch: 0.42
LOG: extension time: lock wait: 10.75 lock held: 1.41 filesystem: 0.58 buffersearch: 0.63
0.88 + 0.90 + 1.01 + 1.25 + 0.94 + 1.27 + 1.19 + 1.22 + 1.17 + 1.32 + 0.92 + 1.24 + 1.27 + 0.94 + 0.96 + 1.41
In *none* of these cases does the drive get even close to being
saturated. Like not even 1/3.
If you consider the total time with the lock held, and the total time of
the test, it becomes very quickly obvious that pretty quickly we spend
the majority of the total time with the lock held.
client count 1: 3.18/20 = 0.16
client count 2: 6.04/20 = 0.30
client count 4: 9.95/20 = 0.50
client count 8: 15.30/20 = 0.76
client count 16: 17.89/20 = 0.89
In other words, the reason that relation extension scales terribly
isn't, to a significant degree, because the locking is slow. It's
because we hold locks for the majority of the benchmark's time starting
even at just 4 clients. Focusing on making the locking faster is just
optimizing for the wrong thing. Amdahl's law will just restrict the
benefits to a pretty small amount.
Looking at a CPU time profile (i.e. it'll not include the time spent
waiting for a lock, once sleeping in the kernel) for time spent within
RelationGetBufferForTuple():
- 19.16% 0.29% postgres postgres [.] RelationGetBufferForTuple
- 18.88% RelationGetBufferForTuple
- 13.18% ReadBufferExtended
- ReadBuffer_common
+ 5.02% mdextend
+ 4.77% FlushBuffer.part.0
+ 0.61% BufTableLookup
0.52% __memset_avx2_erms
+ 1.65% PageInit
- 1.18% LockRelationForExtension
- 1.16% LockAcquireExtended
- 1.07% WaitOnLock
- 1.01% ProcSleep
- 0.88% WaitLatchOrSocket
0.52% WaitEventSetWait
0.65% RecordAndGetPageWithFreeSpace
the same workload using an assert enabled build, to get a simpler to
interpret profile:
- 13.28% 0.19% postgres postgres [.] RelationGetBufferForTuple
- 13.09% RelationGetBufferForTuple
- 8.35% RelationAddExtraBlocks
- 7.67% ReadBufferBI
- 7.54% ReadBufferExtended
- 7.52% ReadBuffer_common
- 3.64% BufferAlloc
+ 2.39% FlushBuffer
+ 0.27% BufTableLookup
+ 0.24% BufTableDelete
+ 0.15% LWLockAcquire
0.14% StrategyGetBuffer
+ 0.13% BufTableHashCode
- 2.96% smgrextend
+ mdextend
+ 0.52% __memset_avx2_erms
+ 0.14% smgrnblocks
0.11% __GI___clock_gettime (inlined)
+ 0.57% RecordPageWithFreeSpace
- 1.23% RecordAndGetPageWithFreeSpace
- 1.03% fsm_set_and_search
+ 0.50% fsm_readbuf
+ 0.20% LockBuffer
+ 0.18% UnlockReleaseBuffer
0.11% fsm_set_avail
0.19% fsm_search
- 0.86% ReadBufferBI
- 0.72% ReadBufferExtended
- ReadBuffer_common
- 0.58% BufferAlloc
+ 0.20% BufTableLookup
+ 0.10% LWLockAcquire
+ 0.81% PageInit
- 0.67% LockRelationForExtension
- 0.67% LockAcquire
- LockAcquireExtended
+ 0.60% WaitOnLock
Which, I think, pretty clearly shows a few things:
1) It's crucial to move acquiring a victim buffer to the outside of the
extension lock, as for copy acquiring the victim buffer will commonly
cause a buffer having to be written out, due to the ringbuffer. This
is even more crucial when using a logged table, as the writeout then
also will often also trigger a WAL flush.
While doing so will sometimes add a round of acquiring the buffer
mapping locks, having to do the FlushBuffer while holding the
extension lock is a huge problem.
This'd also move a good bit of the cost of finding (i.e. clock sweep
/ ringbuffer replacement) and invalidating the old buffer mapping out
of the lock.
2) We need to make the smgrwrite more efficient, it is costing a lot of
time. A small additional experiment shows the cost of doing 8kb
writes:
I wrote a small program that just iteratively writes a 32GB file:
pwrite using 8kb blocks:
0.24user 17.88system 0:18.16 elapsed 99%CPU
pwrite using 128kb blocks:
0.00user 16.71system 0:17.01 elapsed 98%CPU
pwrite using 256kb blocks:
0.00user 15.95system 0:16.03 elapsed 99%CPU
pwritev() using 16 8kb blocks to write 128kb at once:
0.02user 15.94system 0:16.09 elapsed 99%CPU
pwritev() using 32 8kb blocks to write 256kb at once:
0.01user 14.90system 0:14.93 elapsed 99%CPU
pwritev() using 128 8kb blocks to write 1MB at once:
0.00user 13.96system 0:13.96 elapsed 99%CPU
if I instead just use posix_fallocate() with 8kb blocks:
0.28user 23.49system 0:23.78elapsed 99%CPU (0avgtext+0avgdata 1212maxresident)k
0inputs+0outputs (0major+66minor)pagefaults 0swaps
if I instead just use posix_fallocate() with 32 8kb blocks:
0.01user 1.18system 0:01.19elapsed 99%CPU (0avgtext+0avgdata 1200maxresident)k
0inputs+0outputs (0major+67minor)pagefaults 0swaps
obviously fallocate doesn't quite have the same behaviour, and may incur
a bit higher overhead for a later write.
using a version that instead uses O_DIRECT + async IO, I get (but
only when also doing posix_fallocate in larger chunks):
0.05user 5.53system 0:12.53 elapsed 44%CPU
So we get considerably higher write throughput, at a considerably
lower CPU usage (because DMA replaces the CPU doing a memcpy()).
So it looks like extending the file with posix_fallocate() might be a
winner, but only if we actually can do so in larger chunks than 8kb
at once.
Alternatively it could be worthwhile to rejigger things so we don't
extend the files with zeroes once, just to then immediately overwrite
it with actual content. For some users it's probably possible to
pre-generate a page with contents when extending the file (would need
fiddling with block numbers etc).
3) We should move the PageInit() that's currently done with the
extension lock held, to the outside. Since we get the buffer with
RBM_ZERO_AND_LOCK these days, that should be safe. Also, we don't
need to zero the entire buffer both in RelationGetBufferForTuple()'s
PageInit(), and in ReadBuffer_common() before calling smgrextend().
Greetings,
Andres Freund
[1]: /messages/by-id/20200211042229.msv23badgqljrdg2@alap3.anarazel.de
On Tue, 11 Feb 2020 at 11:31, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Feb 5, 2020 at 12:07 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Unfortunately the environment I used for performance verification is
no longer available.
I agree to run this test in a different environment. I've attached the
rebased version patch. I'm measuring the performance with/without
patch, so will share the results.
Did you get a chance to run these tests? Lately, Mahendra has done a
lot of performance testing of this patch and shared his results. I
don't see much downside with the patch, rather there is a performance
increase of 3-9% in various scenarios.
I've done performance tests on my laptop while changing the number of
partitions. 4 clients concurrently insert 32 tuples to randomly
selected partitions in a transaction. Therefore, by changing the number
of partitions, the contention on the relation extension lock also
changes. All tables are unlogged tables and N_RELEXTLOCK_ENTS is 1024.
Here is my test results:
* HEAD
nchilds = 64 tps = 33135
nchilds = 128 tps = 31249
nchilds = 256 tps = 29356
* Patched
nchilds = 64 tps = 32057
nchilds = 128 tps = 32426
nchilds = 256 tps = 29483
The performance has been slightly improved by the patch in two cases.
I've also attached the shell script I used to test.
When I set N_RELEXTLOCK_ENTS to 1 so that all relation extension locks
conflict, the result is:
nchilds = 64 tps = 30887
nchilds = 128 tps = 30015
nchilds = 256 tps = 27837
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Feb 12, 2020 at 7:36 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
On Wed, 12 Feb 2020 at 00:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I took a brief look through this patch. I agree with the fundamental
idea that we shouldn't need to use the heavyweight lock manager for
relation extension, since deadlock is not a concern and no backend
should ever need to hold more than one such lock at once. But it feels
to me like this particular solution is rather seriously overengineered.
I would like to suggest that we do something similar to Robert Haas'
excellent hack (daa7527af) for the !HAVE_SPINLOCK case in lmgr/spin.c,
that is,
* Create some predetermined number N of LWLocks for relation extension.
My original proposal used LWLocks and hash tables for relation
extension, but there was a discussion that using LWLocks is not good
because they're not interruptible[1]. For that reason, and because we
don't need two lock levels (shared, exclusive) for the relation
extension lock, we ended up implementing a dedicated lock manager
for extension locks. I think we will have that problem again if we use LWLocks.
Hmm, but we use LWLocks for (a) WALWrite/Flush (see the usage of
WALWriteLock), (b) writing the shared buffer contents (see the
io_in_progress lock and its usage in FlushBuffer), and maybe a few
other similar things. Many times those take more time than extending a
block in a relation, especially when we combine the WAL write for
multiple commits. So, if this is a problem for the relation extension
lock, then the same thing holds true there also. Now, there are cases,
like when we extend the relation with multiple blocks, finding a victim
buffer under this lock, etc., where this can also be equally or more
costly, but I think we can improve some of those cases (some of this
is even pointed out by Andres in his email) if we agree on the fundamental
idea of using LWLocks as proposed by Tom. I am not saying that we should
implement Tom's idea without weighing its pros and cons, but it has an
appeal due to its simplicity.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Feb 12, 2020 at 10:24 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2020-02-11 08:01:34 +0530, Amit Kapila wrote:
I don't see much downside with the patch, rather there is a
performance increase of 3-9% in various scenarios.
As I wrote in [1] I started to look at this patch. My problem with it is
that it just seems like the wrong direction architecturally to
me. There's two main aspects to this:
1) It basically builds another, more lightweight but less capable,
lock manager that can lock more objects than we can have distinct
locks for. It is faster because it uses *one* hashtable without
conflict handling, because it has fewer lock modes, and because it
doesn't support detecting deadlocks. And probably some other things.
2) A lot of the contention around file extension comes from us doing
multiple expensive things under one lock (determining current
relation size, searching victim buffer, extending file), and in tiny
increments (growing a 1TB table by 8kb). This patch doesn't address
that at all.
It seems to me both points try to address the performance
angle of the patch, but here our actual intention was to make this
lock block among parallel workers so that we can implement/improve
some of the parallel writes operations (like parallelly vacuuming the
heap or index, parallel bulk load, etc.). Both independently are
worth accomplishing, but not w.r.t parallel writes. Here, we were
doing some benchmarking to see if we haven't regressed performance in
any cases.
I've focused on 1) in the email referenced above ([1]). Here I'll focus
on 2).
Which, I think, pretty clearly shows a few things:
I agree with all your below observations.
1) It's crucial to move acquiring a victim buffer to the outside of the
extension lock, as for copy acquiring the victim buffer will commonly
cause a buffer having to be written out, due to the ringbuffer. This
is even more crucial when using a logged table, as the writeout then
will often also trigger a WAL flush.
While doing so will sometimes add a round of acquiring the buffer
mapping locks, having to do the FlushBuffer while holding the
extension lock is a huge problem.
This'd also move a good bit of the cost of finding (i.e. clock sweep
/ ringbuffer replacement) and invalidating the old buffer mapping out
of the lock.
I think this is mostly because of the way the code is currently arranged to
extend a block via the ReadBuffer* API. IIUC, currently the main
operations under the relation extension lock are as follows:
a. get the block number for extension via smgrnblocks.
b. find victim buffer
c. associate buffer with the block no. found in step-a.
d. initialize the block with zeros
e. write the block
f. PageInit
I think if we can rearrange such that steps b and c can be done after
e or f, then we don't need to hold the extension lock to find the
victim buffer.
2) We need to make the smgrwrite more efficient, it is costing a lot of
time. A small additional experiment shows the cost of doing 8kb
writes:
I wrote a small program that just iteratively writes a 32GB file:
..
So it looks like extending the file with posix_fallocate() might be a
winner, but only if we actually can do so in larger chunks than 8kb
at once.
A good experiment and sounds like worth doing.
3) We should move the PageInit() that's currently done with the
extension lock held, to the outside. Since we get the buffer with
RBM_ZERO_AND_LOCK these days, that should be safe. Also, we don't
need to zero the entire buffer both in RelationGetBufferForTuple()'s
PageInit(), and in ReadBuffer_common() before calling smgrextend().
Agreed.
I feel all three are independent improvements and can be done separately.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Amit Kapila <amit.kapila16@gmail.com> writes:
On Wed, Feb 12, 2020 at 7:36 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:On Wed, 12 Feb 2020 at 00:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I would like to suggest that we do something similar to Robert Haas'
excellent hack (daa7527af) for the !HAVE_SPINLOCK case in lmgr/spin.c,
My original proposal used LWLocks and hash tables for relation
extension but there was a discussion that using LWLocks is not good
because it's not interruptible[1].
Hmm, but we use LWLocks for (a) WALWrite/Flush (see the usage of
WALWriteLock), (b) writing the shared buffer contents (see
io_in_progress lock and its usage in FlushBuffer) and might be for few
other similar stuff. Many times those take more time than extending a
block in relation especially when we combine the WAL write for
multiple commits. So, if this is a problem for relation extension
lock, then the same thing holds true there also.
Yeah. I would say a couple more things:
* I see no reason to think that a relation extension lock would ever
be held long enough for noninterruptibility to be a real issue. Our
expectations for query cancel response time are in the tens to
hundreds of msec anyway.
* There are other places where an LWLock can be held for a *long* time,
notably the CheckpointLock. If we do think this is an issue, we could
devise a way to not insist on noninterruptibility. The easiest fix
is just to do a matching RESUME_INTERRUPTS after getting the lock and
HOLD_INTERRUPTS again before releasing it; though maybe it'd be worth
offering some slightly cleaner way. Point here is that LWLockAcquire
only does that because it's useful to the majority of callers, not
because it's graven in stone that it must be like that.
In general, if we think there are issues with LWLock, it seems to me
we'd be better off to try to fix them, not to invent a whole new
single-purpose lock manager that we'll have to debug and maintain.
I do not see anything about this problem that suggests that that would
provide a major win. As Andres has noted, there are lots of other
aspects of it that are likely to be more useful to spend effort on.
regards, tom lane
On Wed, Feb 12, 2020 at 10:23 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Amit Kapila <amit.kapila16@gmail.com> writes:
On Wed, Feb 12, 2020 at 7:36 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
On Wed, 12 Feb 2020 at 00:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I would like to suggest that we do something similar to Robert Haas'
excellent hack (daa7527af) for the !HAVE_SPINLOCK case in lmgr/spin.c,
My original proposal used LWLocks and hash tables for relation
extension but there was a discussion that using LWLocks is not good
because it's not interruptible[1].
Hmm, but we use LWLocks for (a) WALWrite/Flush (see the usage of
WALWriteLock), (b) writing the shared buffer contents (see
io_in_progress lock and its usage in FlushBuffer) and might be for few
other similar stuff. Many times those take more time than extending a
block in relation especially when we combine the WAL write for
multiple commits. So, if this is a problem for relation extension
lock, then the same thing holds true there also.Yeah. I would say a couple more things:
* I see no reason to think that a relation extension lock would ever
be held long enough for noninterruptibility to be a real issue. Our
expectations for query cancel response time are in the tens to
hundreds of msec anyway.
* There are other places where an LWLock can be held for a *long* time,
notably the CheckpointLock. If we do think this is an issue, we could
devise a way to not insist on noninterruptibility. The easiest fix
is just to do a matching RESUME_INTERRUPTS after getting the lock and
HOLD_INTERRUPTS again before releasing it; though maybe it'd be worth
offering some slightly cleaner way.
Yeah, this sounds like the better answer for the noninterruptibility
aspect of this design. One idea that occurred to me was to pass a
parameter to the LWLock acquire/release APIs to indicate whether to
hold/resume interrupts, but I don't know if that is any better than
doing it at the required place. I am not sure if all places are
careful about whether they really want to hold interrupts, so if we provide
a new parameter, at least new users of the API will think about it.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Feb 11, 2020 at 9:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I took a brief look through this patch. I agree with the fundamental
idea that we shouldn't need to use the heavyweight lock manager for
relation extension, since deadlock is not a concern and no backend
should ever need to hold more than one such lock at once. But it feels
to me like this particular solution is rather seriously overengineered.
I would like to suggest that we do something similar to Robert Haas'
excellent hack (daa7527af) for the !HAVE_SPINLOCK case in lmgr/spin.c,
that is,
* Create some predetermined number N of LWLocks for relation extension.
* When we want to extend some relation R, choose one of those locks
(say, R's relfilenode number mod N) and lock it.
I am imagining something on the lines of BufferIOLWLockArray (here it
will be RelExtLWLockArray). The size (N) could be MaxBackends or some
percentage of it (depending on testing) and indexing into the array
could be as suggested (R's relfilenode number mod N). We need to
initialize this during shared memory initialization. Then, to extend
the relation with multiple blocks at a time (as we do in
RelationAddExtraBlocks), we can either use the already proven
technique of the group clear xid mechanism (see ProcArrayGroupClearXid) or
have an additional state in the RelExtLWLockArray which will keep the
count of waiters (as done in the latest patch of Sawada-san [1]). We
might want to experiment with both approaches and see which yields
better results.
[1]: /messages/by-id/CAD21AoADkWhkLEB_=kjLZeZ_ML9_hSQqNBWz+d821QHf=O9LJQ@mail.gmail.com
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Feb 13, 2020 at 9:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Feb 11, 2020 at 9:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I took a brief look through this patch. I agree with the fundamental
idea that we shouldn't need to use the heavyweight lock manager for
relation extension, since deadlock is not a concern and no backend
should ever need to hold more than one such lock at once. But it feels
to me like this particular solution is rather seriously overengineered.
I would like to suggest that we do something similar to Robert Haas'
excellent hack (daa7527af) for the !HAVE_SPINLOCK case in lmgr/spin.c,
that is,
* Create some predetermined number N of LWLocks for relation extension.
* When we want to extend some relation R, choose one of those locks
(say, R's relfilenode number mod N) and lock it.
I am imagining something on the lines of BufferIOLWLockArray (here it
will be RelExtLWLockArray). The size (N) could MaxBackends or some
percentage of it (depending on testing) and indexing into an array
could be as suggested (R's relfilenode number mod N). We need to
initialize this during shared memory initialization. Then, to extend
the relation with multiple blocks at-a-time (as we do in
RelationAddExtraBlocks), we can either use the already proven
technique of group clear xid mechanism (see ProcArrayGroupClearXid) or
have an additional state in the RelExtLWLockArray which will keep the
count of waiters (as done in latest patch of Sawada-san [1]). We
might want to experiment with both approaches and see which yields
better results.
IMHO, in this case, there is no point in using the "group clear" type
of mechanism, mainly for two reasons: 1) It will unnecessarily make
the PGPROC structure heavy.
2) For our case, we don't need any specific pieces of information from
other waiters, we just need the count.
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, 13 Feb 2020 at 09:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Feb 11, 2020 at 9:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I took a brief look through this patch. I agree with the fundamental
idea that we shouldn't need to use the heavyweight lock manager for
relation extension, since deadlock is not a concern and no backend
should ever need to hold more than one such lock at once. But it feels
to me like this particular solution is rather seriously overengineered.
I would like to suggest that we do something similar to Robert Haas'
excellent hack (daa7527af) for the !HAVE_SPINLOCK case in lmgr/spin.c,
that is,
* Create some predetermined number N of LWLocks for relation extension.
* When we want to extend some relation R, choose one of those locks
(say, R's relfilenode number mod N) and lock it.
I am imagining something on the lines of BufferIOLWLockArray (here it
will be RelExtLWLockArray). The size (N) could MaxBackends or some
percentage of it (depending on testing) and indexing into an array
could be as suggested (R's relfilenode number mod N). We need to
initialize this during shared memory initialization. Then, to extend
the relation with multiple blocks at-a-time (as we do in
RelationAddExtraBlocks), we can either use the already proven
technique of group clear xid mechanism (see ProcArrayGroupClearXid) or
have an additional state in the RelExtLWLockArray which will keep the
count of waiters (as done in latest patch of Sawada-san [1]). We
might want to experiment with both approaches and see which yields
better results.
Thanks all for the suggestions. I have started working on the
implementation based on the suggestion. I will post a patch for this
in a few days.
--
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com
On Thu, 13 Feb 2020 at 13:16, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Feb 11, 2020 at 9:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I took a brief look through this patch. I agree with the fundamental
idea that we shouldn't need to use the heavyweight lock manager for
relation extension, since deadlock is not a concern and no backend
should ever need to hold more than one such lock at once. But it feels
to me like this particular solution is rather seriously overengineered.
I would like to suggest that we do something similar to Robert Haas'
excellent hack (daa7527af) for the !HAVE_SPINLOCK case in lmgr/spin.c,
that is,
* Create some predetermined number N of LWLocks for relation extension.
* When we want to extend some relation R, choose one of those locks
(say, R's relfilenode number mod N) and lock it.
I am imagining something on the lines of BufferIOLWLockArray (here it
will be RelExtLWLockArray). The size (N) could MaxBackends or some
percentage of it (depending on testing) and indexing into an array
could be as suggested (R's relfilenode number mod N).
I'm not sure it's good that the contention on an LWLock slot depends on
MaxBackends, because it means that the larger MaxBackends is, the
less the LWLock slots conflict, even if the same number of backends
are actually connected. Normally we don't want to increase MaxBackends
unnecessarily, for security reasons. In the current patch we defined a
fixed-length array for extension locks, but I agree that we need to
determine which approach is best depending on testing.
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Feb 14, 2020 at 11:42 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
On Thu, 13 Feb 2020 at 13:16, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Feb 11, 2020 at 9:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I took a brief look through this patch. I agree with the fundamental
idea that we shouldn't need to use the heavyweight lock manager for
relation extension, since deadlock is not a concern and no backend
should ever need to hold more than one such lock at once. But it feels
to me like this particular solution is rather seriously overengineered.
I would like to suggest that we do something similar to Robert Haas'
excellent hack (daa7527af) for the !HAVE_SPINLOCK case in lmgr/spin.c,
that is,
* Create some predetermined number N of LWLocks for relation extension.
* When we want to extend some relation R, choose one of those locks
(say, R's relfilenode number mod N) and lock it.
I am imagining something on the lines of BufferIOLWLockArray (here it
will be RelExtLWLockArray). The size (N) could MaxBackends or some
percentage of it (depending on testing) and indexing into an array
could be as suggested (R's relfilenode number mod N).
I'm not sure it's good that the contention of LWLock slot depends on
MaxBackends. Because it means that the more MaxBackends is larger, the
less the LWLock slot conflicts, even if the same number of backends
actually connecting. Normally we don't want to increase unnecessarily
MaxBackends for security reasons. In the current patch we defined a
fixed length of array for extension lock but I agree that we need to
determine what approach is the best depending on testing.
I think MaxBackends will generally limit the number of different
relations that can simultaneously extend, but maybe tables with many
partitions might change the situation. You are right that some tests
might suggest a good number, let Mahendra write a patch and then we
can test it. Do you have any better idea?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Amit Kapila <amit.kapila16@gmail.com> writes:
I think MaxBackends will generally limit the number of different
relations that can simultaneously extend, but maybe tables with many
partitions might change the situation. You are right that some tests
might suggest a good number, let Mahendra write a patch and then we
can test it. Do you have any better idea?
In the first place, there certainly isn't more than one extension
happening at a time per backend, else the entire premise of this
thread is wrong. Handwaving about partitions won't change that.
In the second place, it's ludicrous to expect that the underlying
platform/filesystem can support an infinite number of concurrent
file-extension operations. At some level (e.g. where disk blocks
are handed out, or where a record of the operation is written to
a filesystem journal) it's quite likely that things are bottlenecked
down to *one* such operation at a time per filesystem. So I'm not
that concerned about occasional false-sharing limiting our ability
to issue concurrent requests. There are probably worse restrictions
at lower levels.
regards, tom lane
On Wed, Feb 12, 2020 at 11:53 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Yeah. I would say a couple more things:
* I see no reason to think that a relation extension lock would ever
be held long enough for noninterruptibility to be a real issue. Our
expectations for query cancel response time are in the tens to
hundreds of msec anyway.
I don't agree, because (1) the time to perform a relation extension on
a busy system can be far longer than that and (2) if the disk is
failing, then it can be *really* long, or indefinite.
* There are other places where an LWLock can be held for a *long* time,
notably the CheckpointLock. If we do think this is an issue, we could
devise a way to not insist on noninterruptibility. The easiest fix
is just to do a matching RESUME_INTERRUPTS after getting the lock and
HOLD_INTERRUPTS again before releasing it; though maybe it'd be worth
offering some slightly cleaner way. Point here is that LWLockAcquire
only does that because it's useful to the majority of callers, not
because it's graven in stone that it must be like that.
That's an interesting idea, but it doesn't make the lock acquisition
itself interruptible, which seems pretty important to me in this case.
I wonder if we could have an LWLockAcquireInterruptibly() or some such
that allows the lock acquisition itself to be interruptible. I think
that would require some rejiggering but it might be doable.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, Feb 12, 2020 at 11:53 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
* I see no reason to think that a relation extension lock would ever
be held long enough for noninterruptibility to be a real issue. Our
expectations for query cancel response time are in the tens to
hundreds of msec anyway.
I don't agree, because (1) the time to perform a relation extension on
a busy system can be far longer than that and (2) if the disk is
failing, then it can be *really* long, or indefinite.
I remain unconvinced ... wouldn't both of those claims apply to any disk
I/O request? Are we going to try to ensure that no I/O ever happens
while holding an LWLock, and if so how? (Again, CheckpointLock is a
counterexample, which has been that way for decades without reported
problems. But actually I think buffer I/O locks are an even more
direct counterexample.)
* There are other places where an LWLock can be held for a *long* time,
notably the CheckpointLock. If we do think this is an issue, we could
devise a way to not insist on noninterruptibility. The easiest fix
is just to do a matching RESUME_INTERRUPTS after getting the lock and
HOLD_INTERRUPTS again before releasing it; though maybe it'd be worth
offering some slightly cleaner way. Point here is that LWLockAcquire
only does that because it's useful to the majority of callers, not
because it's graven in stone that it must be like that.
That's an interesting idea, but it doesn't make the lock acquisition
itself interruptible, which seems pretty important to me in this case.
Good point: if you think the contained operation might run too long to
suit you, then you don't want other backends to be stuck behind it for
the same amount of time.
I wonder if we could have an LWLockAcquireInterruptibly() or some such
that allows the lock acquisition itself to be interruptible. I think
that would require some rejiggering but it might be doable.
Yeah, I had the impression from a brief look at LWLockAcquire that
it was itself depending on not throwing errors partway through.
But with careful and perhaps-a-shade-slower coding, we could probably
make a version that didn't require that.
regards, tom lane
On Fri, Feb 14, 2020 at 10:43 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I remain unconvinced ... wouldn't both of those claims apply to any disk
I/O request? Are we going to try to ensure that no I/O ever happens
while holding an LWLock, and if so how? (Again, CheckpointLock is a
counterexample, which has been that way for decades without reported
problems. But actually I think buffer I/O locks are an even more
direct counterexample.)
Yes, that's a problem. I proposed a patch a few years ago that
replaced the buffer I/O locks with condition variables, and I think
that's a good idea for lots of reasons, including this one. I never
quite got around to pushing that through to commit, but I think we
should do that. Aside from fixing this problem, it also prevents
certain scenarios where we can currently busy-loop.
I do realize that we're unlikely to ever solve this problem
completely, but I don't think that should discourage us from making
incremental progress. Just as debuggability is a sticking point for
you, what I'm going to call operate-ability is a sticking point for
me. My work here at EnterpriseDB exposes me on a fairly regular basis
to real broken systems, and I'm therefore really sensitive to the
concerns that people have when trying to recover after a system has
become, for one reason or another, really broken.
Interruptibility may not be the #1 concern in that area, but it's very
high on the list. EnterpriseDB customers, as a rule, *really* hate
being told to restart the database because one session is stuck. It
causes a lot of disruption for them and the person who does the
restart gets yelled at by their boss, and maybe their boss's boss and
the boss above that. It means that their whole application, which may
be mission-critical, is down until the database finishes restarting,
and that is not always a quick process, especially after an immediate
shutdown. I don't think we can ever make everything that can get stuck
interruptible, but the more we can do the better.
The work you and others have done over the years to add
CHECK_FOR_INTERRUPTS() to more places pays real dividends. Making
sessions that are blocked on disk I/O interruptible in at least some
of the more common cases would be a huge win. Other people may well
have different experiences, but my experience is that the disk
deciding to conk out for a while or just respond very very slowly is a
very common problem even (and sometimes especially) on very expensive
hardware. Obviously that's not great and you're in lots of trouble,
but being able to hit ^C and get control back significantly improves
your chances of being able to understand what has happened and recover
from it.
That's an interesting idea, but it doesn't make the lock acquisition
itself interruptible, which seems pretty important to me in this case.
Good point: if you think the contained operation might run too long to
suit you, then you don't want other backends to be stuck behind it for
the same amount of time.
Right.
I wonder if we could have an LWLockAcquireInterruptibly() or some such
that allows the lock acquisition itself to be interruptible. I think
that would require some rejiggering but it might be doable.
Yeah, I had the impression from a brief look at LWLockAcquire that
it was itself depending on not throwing errors partway through.
But with careful and perhaps-a-shade-slower coding, we could probably
make a version that didn't require that.
Yeah, that was my thought, too, but I didn't study it that carefully,
so somebody would need to do that.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2020-02-12 11:53:49 -0500, Tom Lane wrote:
In general, if we think there are issues with LWLock, it seems to me
we'd be better off to try to fix them, not to invent a whole new
single-purpose lock manager that we'll have to debug and maintain.
My impression is that what's being discussed here is doing exactly that,
except with s/lwlock/heavyweight locks/. We're basically replacing the
lock.c lock mapping table with an ad-hoc implementation, and now we're
also reinventing interruptability etc.
I still find the performance arguments pretty ludicrous, to be honest -
I think the numbers I posted about how much time we spend with the locks
held, back that up. I have a bit more understanding for the parallel
worker arguments, but only a bit:
I think if we develop a custom solution for the extension lock, we're
just going to end up having to develop another custom solution for a
bunch of other types of locks. It seems quite likely that we'll end up
also wanting TUPLE and also SPECULATIVE and PAGE type locks that we
don't want to share between leader & workers.
IMO the right thing here is to extend lock.c so we can better represent
whether certain types of lockmethods (& levels ?) are [not] to be
shared.
Greetings,
Andres Freund
Hi,
On 2020-02-14 09:42:40 -0500, Tom Lane wrote:
In the second place, it's ludicrous to expect that the underlying
platform/filesystem can support an infinite number of concurrent
file-extension operations. At some level (e.g. where disk blocks
are handed out, or where a record of the operation is written to
a filesystem journal) it's quite likely that things are bottlenecked
down to *one* such operation at a time per filesystem.
That's probably true to some degree from a theoretical POV, but I think
it's so far from where we are at, that it's effectively wrong. I can
concurrently extend a few files at close to 10GB/s on a set of fast
devices below a *single* filesystem. Whereas postgres bottlenecks far
far before this. Given that a lot of today's storage has latencies in
the 10-100s of microseconds, a journal flush doesn't necessarily cause
that much serialization - and OS journals do group commit like things
too.
Greetings,
Andres Freund
On Fri, Feb 14, 2020 at 11:40 AM Andres Freund <andres@anarazel.de> wrote:
IMO the right thing here is to extend lock.c so we can better represent
whether certain types of lockmethods (& levels ?) are [not] to be
shared.
The part that I find awkward about that is the whole thing with the
deadlock detector. The deadlock detection code is old, crufty,
complex, and very difficult to test (or at least I have found it so).
A bug that I introduced when inventing group locking took like 5 years
for somebody to find.
One way of looking at the requirement that we have here is that
certain kinds of locks need to be exempted from group locking.
Basically, this is because they are a lower-level concept: a lock on
a relation is more of a "logical" concept, and you hold the lock until
eoxact, whereas a lock on extending the relation is more of a
"physical" concept, and you give it up as soon as you are done. Page
locks are like relation extension locks in this regard. Unlike locks
on SQL-level objects, these should not be shared between members of a
lock group.
Now, if it weren't for the deadlock detector, that would be easy
enough. But figuring out what to do with the deadlock detector seems
really painful to me. I wonder if there's some way we can make an end
run around that problem. For instance, if we could make (and enforce)
a coding rule that you cannot acquire a heavyweight lock while holding
a relation extension or page lock, then maybe we could somehow teach
the deadlock detector to just ignore those kinds of locks, and teach
the lock acquisition machinery that they conflict between lock group
members.
On the other hand, I think you might also be understating the
differences between these kinds of locks and other heavyweight locks.
I suspect that the reason why we use lwlocks for buffers and
heavyweight locks here is because there are a conceptually infinite
number of relations, and lwlocks can't handle that. The only mechanism
we currently have that does handle that is the heavyweight lock
mechanism, and from my point of view, somebody just beat it with a
stick to make it fit this application. But the fact that it has been
made to fit does not mean that it is really fit for purpose. We use 2
of 9 lock levels, we don't need deadlock detection, we need different
behavior when group locking is in use, we release locks right away
rather than at eoxact. I don't think it's crazy to think that those
differences are significant enough to justify having a separate
mechanism, even if the one that is currently on the table is not
exactly what we want.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2020-02-14 12:08:45 -0500, Robert Haas wrote:
On Fri, Feb 14, 2020 at 11:40 AM Andres Freund <andres@anarazel.de> wrote:
IMO the right thing here is to extend lock.c so we can better represent
whether certain types of lockmethods (& levels ?) are [not] to be
shared.
The part that I find awkward about that is the whole thing with the
deadlock detector. The deadlock detection code is old, crufty,
complex, and very difficult to test (or at least I have found it so).
A bug that I introduced when inventing group locking took like 5 years
for somebody to find.
Oh, I agree, lock.c and surrounding code is pretty crufty. Doubtful that
just building up a largely parallel piece of infrastructure next to it
is a good answer though.
One way of looking at the requirement that we have here is that
certain kinds of locks need to be exempted from group locking.
Basically, this is because they are a lower-level concept: a lock on
a relation is more of a "logical" concept, and you hold the lock until
eoxact, whereas a lock on extending the relation is more of a
"physical" concept, and you give it up as soon as you are done. Page
locks are like relation extension locks in this regard. Unlike locks
on SQL-level objects, these should not be shared between members of a
lock group.
Now, if it weren't for the deadlock detector, that would be easy
enough. But figuring out what to do with the deadlock detector seems
really painful to me. I wonder if there's some way we can make an end
run around that problem. For instance, if we could make (and enforce)
a coding rule that you cannot acquire a heavyweight lock while holding
a relation extension or page lock, then maybe we could somehow teach
the deadlock detector to just ignore those kinds of locks, and teach
the lock acquisition machinery that they conflict between lock group
members.
Yea, that seems possible. I'm not really sure it's needed however? As
long as you're not teaching the locking mechanism new tricks that
influence the wait graph, why would the deadlock detector care? That's
quite different from the group locking case, where you explicitly needed
to teach it something fairly fundamental.
It might still be a good idea independently to add & enforce the rule
that acquiring heavyweight locks while holding certain classes of locks is
not allowed.
On the other hand, I think you might also be understating the
differences between these kinds of locks and other heavyweight locks.
I suspect that the reason why we use lwlocks for buffers and
heavyweight locks here is because there are a conceptually infinite
number of relations, and lwlocks can't handle that.
Right. For me that's *the* fundamental service that lock.c delivers. And
it's the fundamental bit this thread so far largely has been focusing
on.
The only mechanism we currently have that does handle that is the
heavyweight lock mechanism, and from my point of view, somebody just
beat it with a stick to make it fit this application. But the fact
that it has been made to fit does not mean that it is really fit for
purpose. We use 2 of 9 lock levels, we don't need deadlock detection,
we need different behavior when group locking is in use, we release
locks right away rather than at eoxact. I don't think it's crazy to
think that those differences are significant enough to justify having
a separate mechanism, even if the one that is currently on the table
is not exactly what we want.
Isn't that mostly true to varying degrees for the majority of lock types
in lock.c? Sure, perhaps historically that's a misuse of lock.c, but
it's been pretty ingrained by now. I just don't see where leaving out
any of these features is going to give us fundamental advantages
justifying a different locking infrastructure.
E.g. not needing to support a "conceptually infinite" number of relations
IMO does provide a fundamental advantage - no need for a mapping. I'm
not yet seeing anything equivalent for the extension vs. lock.c style
lock case.
Greetings,
Andres Freund
On Fri, Feb 14, 2020 at 1:07 PM Andres Freund <andres@anarazel.de> wrote:
Yea, that seems possible. I'm not really sure it's needed however? As
long as you're not teaching the locking mechanism new tricks that
influence the wait graph, why would the deadlock detector care? That's
quite different from the group locking case, where you explicitly needed
to teach it something fairly fundamental.
Well, you have to teach it that locks of certain types conflict even
if they are in the same group, and that bleeds over pretty quickly
into the whole area of deadlock detection, because lock waits are the
edges in the graph that the deadlock detector processes.
It might still be a good idea independently to add & enforce the rule
that acquiring heavyweight locks while holding certain classes of locks is
not allowed.
I think that's absolutely essential, if we're going to continue using
the main lock manager for this. I remain somewhat unconvinced that
doing so is the best way forward, but it is *a* way forward.
Right. For me that's *the* fundamental service that lock.c delivers. And
it's the fundamental bit this thread so far largely has been focusing
on.
For me, the deadlock detection is the far more complicated and problematic bit.
Isn't that mostly true to varying degrees for the majority of lock types
in lock.c? Sure, perhaps historically that's a misuse of lock.c, but
it's been pretty ingrained by now. I just don't see where leaving out
any of these features is going to give us fundamental advantages
justifying a different locking infrastructure.
I think the group locking + deadlock detection things are more
fundamental than you might be crediting, but I agree that having
parallel mechanisms has its own set of pitfalls.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Feb 14, 2020 at 8:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Amit Kapila <amit.kapila16@gmail.com> writes:
I think MaxBackends will generally limit the number of different
relations that can simultaneously extend, but maybe tables with many
partitions might change the situation. You are right that some tests
might suggest a good number, let Mahendra write a patch and then we
can test it. Do you have any better idea?
In the first place, there certainly isn't more than one extension
happening at a time per backend, else the entire premise of this
thread is wrong. Handwaving about partitions won't change that.
Having more partitions theoretically increases the chances
of false-sharing with the same number of concurrent sessions. For
example, two sessions operating on two plain relations vs. two sessions
working on two relations with 100 partitions each would increase the
chances of false-sharing. Sawada-san and Mahendra have done many tests
and some monitoring on different systems with the previous patch, showing
that with a decent number of fixed slots (1024), false-sharing was rare,
and even when it occurred the effect was close to nothing. So, in short,
the point to worry about is not false-sharing, but ensuring that we
don't create any significant regressions in this area.
In the second place, it's ludicrous to expect that the underlying
platform/filesystem can support an infinite number of concurrent
file-extension operations. At some level (e.g. where disk blocks
are handed out, or where a record of the operation is written to
a filesystem journal) it's quite likely that things are bottlenecked
down to *one* such operation at a time per filesystem. So I'm not
that concerned about occasional false-sharing limiting our ability
to issue concurrent requests. There are probably worse restrictions
at lower levels.
Agreed, and what we have observed during the tests matches what you
have said in this paragraph.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Feb 14, 2020 at 9:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, Feb 12, 2020 at 11:53 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
That's an interesting idea, but it doesn't make the lock acquisition
itself interruptible, which seems pretty important to me in this case.
Good point: if you think the contained operation might run too long to
suit you, then you don't want other backends to be stuck behind it for
the same amount of time.
It is not clear to me why we should add that as a requirement for this
patch when other places like WALWriteLock, etc. have similar coding
patterns and we haven't heard a ton of complaints about making them
interruptible; or if there have been any, I am not aware of them.
I wonder if we could have an LWLockAcquireInterruptibly() or some such
that allows the lock acquisition itself to be interruptible. I think
that would require some rejiggering but it might be doable.
Yeah, I had the impression from a brief look at LWLockAcquire that
it was itself depending on not throwing errors partway through.
But with careful and perhaps-a-shade-slower coding, we could probably
make a version that didn't require that.
If this becomes a requirement to move this patch forward, then surely we
can do that. BTW, what exactly do we need to ensure for it? Is it
something along the lines of ensuring that the error path clears the
state of the lock? Or are we worried that an interrupt handler might do
something that changes the state of the lock we are acquiring?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hi,
On 2020-02-14 13:34:03 -0500, Robert Haas wrote:
On Fri, Feb 14, 2020 at 1:07 PM Andres Freund <andres@anarazel.de> wrote:
Yea, that seems possible. I'm not really sure it's needed however? As
long as you're not teaching the locking mechanism new tricks that
influence the wait graph, why would the deadlock detector care? That's
quite different from the group locking case, where you explicitly needed
to teach it something fairly fundamental.
Well, you have to teach it that locks of certain types conflict even
if they are in the same group, and that bleeds over pretty quickly
into the whole area of deadlock detection, because lock waits are the
edges in the graph that the deadlock detector processes.
Shouldn't this *theoretically* be doable with changes mostly localized to
lock.c, by not using proc->lockGroupLeader but proc for lock types that
don't support group locking? I do see that deadlock.c largely looks at
->lockGroupLeader, but that kind of doesn't seem right to me.
It might still be a good idea independently to add & enforce the rule
that acquiring heavyweight locks while holding certain classes of locks is
not allowed.
I think that's absolutely essential, if we're going to continue using
the main lock manager for this. I remain somewhat unconvinced that
doing so is the best way forward, but it is *a* way forward.
Seems like we should build this part independently of the lock.c/new
infra piece.
Right. For me that's *the* fundamental service that lock.c delivers. And
it's the fundamental bit this thread so far largely has been focusing
on.
For me, the deadlock detection is the far more complicated and problematic bit.
Isn't that mostly true to varying degrees for the majority of lock types
in lock.c? Sure, perhaps historically that's a misuse of lock.c, but
it's been pretty ingrained by now. I just don't see where leaving out
any of these features is going to give us fundamental advantages
justifying a different locking infrastructure.
I think the group locking + deadlock detection things are more
fundamental than you might be crediting, but I agree that having
parallel mechanisms has its own set of pitfalls.
It's possible. But I'm also hesitant to believe that we'll not need
other lock types that conflict between leader/worker, but that still
need deadlock detection. The more work we want to parallelize, the more
likely that imo will become.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2020-02-14 13:34:03 -0500, Robert Haas wrote:
I think the group locking + deadlock detection things are more
fundamental than you might be crediting, but I agree that having
parallel mechanisms has its own set of pitfalls.
It's possible. But I'm also hesitant to believe that we'll not need
other lock types that conflict between leader/worker, but that still
need deadlock detection. The more work we want to parallelize, the more
likely that imo will become.
Yeah. The concept that leader and workers can't conflict seems to me
to be dependent, in a very fundamental way, on the assumption that
we only need to parallelize read-only workloads. I don't think that's
going to have a long half-life.
regards, tom lane
On Mon, Feb 17, 2020 at 2:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andres Freund <andres@anarazel.de> writes:
On 2020-02-14 13:34:03 -0500, Robert Haas wrote:
I think the group locking + deadlock detection things are more
fundamental than you might be crediting, but I agree that having
parallel mechanisms has its own set of pitfalls.
It's possible. But I'm also hesitant to believe that we'll not need
other lock types that conflict between leader/worker, but that still
need deadlock detection. The more work we want to parallelize, the more
likely that imo will become.
Yeah. The concept that leader and workers can't conflict seems to me
to be dependent, in a very fundamental way, on the assumption that
we only need to parallelize read-only workloads. I don't think that's
going to have a long half-life.
Surely, someday, we will need to solve that problem. But it is not clear
when, because the operations for which we want to solve the
relation extension lock problem don't require it. For example,
for parallel copy, or for further improving parallel vacuum to allow
multiple workers to scan and process the heap and individual indexes, we
don't need to change anything in group locking as far as I understand.
Now, for parallel deletes/updates, I think it will depend on how we
choose to parallelize those operations. I mean, if we decide that each
worker will work on an independent set of pages, like we do for a
sequential scan, we again might not need to change group locking,
unless I am missing something, which is possible.
I think till we know the real need for changing group locking, going
in the direction of what Tom suggested, to use an array of LWLocks [1]
to address the problems at hand, is a good idea. It is not very clear
to me: are we thinking to give up on Tom's idea [1] and change
group locking, even though nobody has yet proposed an idea/patch
which requires that? Or are we thinking that
we can do what Tom suggested for the relation extension lock and also
plan to change group locking for future parallel operations that might
require it?
[1]: /messages/by-id/19443.1581435793@sss.pgh.pa.us
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hi,
On 2020-02-19 11:12:18 +0530, Amit Kapila wrote:
I think till we know the real need for changing group locking, going
in the direction of what Tom suggested to use an array of LWLocks [1]
to address the problems in hand is a good idea.
-many
I think that building yet another locking subsystem is the entirely
wrong idea - especially when there's imo no convincing architectural
reasons to do so.
It is not very clear to me that are we thinking to give up on Tom's
idea [1] and change group locking even though it is not clear or at
least nobody has proposed an idea/patch which requires that? Or are
we thinking that we can do what Tom suggested for relation extension
lock and also plan to change group locking for future parallel
operations that might require it?
What I'm advocating is that extension locks should continue to go
through lock.c. And yes, that requires some changes to group locking,
but I still don't see why they'd be complicated. And if there's concerns
about the cost of lock.c, I outlined a pretty long list of improvements
that'll help everyone, and I showed that the locking itself isn't
actually a large fraction of the scalability issues that extension has.
Regards,
Andres
On Thu, Feb 20, 2020 at 8:06 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2020-02-19 11:12:18 +0530, Amit Kapila wrote:
I think till we know the real need for changing group locking, going
in the direction of what Tom suggested to use an array of LWLocks [1]
to address the problems in hand is a good idea.
-many
I think that building yet another locking subsystem is the entirely
wrong idea - especially when there's imo no convincing architectural
reasons to do so.
Hmm, AFAIU, it will be done by having an array of LWLocks, which we do
in other places as well (like BufferIO locks). I am not sure we
can call that a new locking subsystem, but if we decide to continue
using lock.c and change group locking then I think we can do that as
well; see my comments below regarding that.
It is not very clear to me that are we thinking to give up on Tom's
idea [1] and change group locking even though it is not clear or at
least nobody has proposed an idea/patch which requires that? Or are
we thinking that we can do what Tom suggested for relation extension
lock and also plan to change group locking for future parallel
operations that might require it?
What I'm advocating is that extension locks should continue to go
through lock.c. And yes, that requires some changes to group locking,
but I still don't see why they'd be complicated.
Fair position. As per my initial analysis, I think if we do the below
three things, it should work out without changing to a new way of
locking for relation extension or page type locks.
a. As per the discussion above, ensure in code we will never try to
acquire another heavy-weight lock after acquiring relation extension
or page type locks (probably by having Asserts in code or maybe some
other way).
b. Change lock.c so that group locking is not considered for these two
lock types. For ex. in LockCheckConflicts, along with the check (if
(proclock->groupLeader == MyProc && MyProc->lockGroupLeader == NULL)),
we also check lock->tag and call it a conflict for these two locks.
c. The deadlock detector can ignore checking these two types of locks
because point (a) ensures that those won't lead to deadlock. One idea
could be that FindLockCycleRecurseMember just ignores these two types
of locks by checking the lock tag.
It is possible that I might be missing something or we could achieve
this some other way as well.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, 24 Feb 2020 at 15:39, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Feb 20, 2020 at 8:06 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2020-02-19 11:12:18 +0530, Amit Kapila wrote:
I think till we know the real need for changing group locking, going
in the direction of what Tom suggested to use an array of LWLocks [1]
to address the problems in hand is a good idea.
-many
I think that building yet another locking subsystem is the entirely
wrong idea - especially when there's imo no convincing architectural
reasons to do so.
Hmm, AFAIU, it will be done by having an array of LWLocks which we do
at other places as well (like BufferIO locks). I am not sure if we
can call it as new locking subsystem, but if we decide to continue
using lock.c and change group locking then I think we can do that as
well, see my comments below regarding that.
It is not very clear to me that are we thinking to give up on Tom's
idea [1] and change group locking even though it is not clear or at
least nobody has proposed an idea/patch which requires that? Or are
we thinking that we can do what Tom suggested for relation extension
lock and also plan to change group locking for future parallel
operations that might require it?
What I'm advocating is that extension locks should continue to go
through lock.c. And yes, that requires some changes to group locking,
but I still don't see why they'd be complicated.
Fair position, as per initial analysis, I think if we do below three
things, it should work out without changing to a new way of locking
for relation extension or page type locks.
a. As per the discussion above, ensure in code we will never try to
acquire another heavy-weight lock after acquiring relation extension
or page type locks (probably by having Asserts in code or maybe some
other way).
b. Change lock.c so that group locking is not considered for these two
lock types. For ex. in LockCheckConflicts, along with the check (if
(proclock->groupLeader == MyProc && MyProc->lockGroupLeader == NULL)),
we also check lock->tag and call it a conflict for these two locks.
c. The deadlock detector can ignore checking these two types of locks
because point (a) ensures that those won't lead to deadlock. One idea
could be that FindLockCycleRecurseMember just ignores these two types
of locks by checking the lock tag.
Thanks, Amit, for the summary.
Based on the above 3 points, I am attaching 2 patches for review.
1. v01_0001-Conflict-EXTENTION-lock-in-group-member.patch (patch by Dilip Kumar)
Basically, this patch is for points b and c.
2. v01_0002-Added-assert-to-verify-that-we-never-try-to-take-any.patch
(patch by me)
This patch is for point a.
After applying both patches, make check-world passes.
We are testing both patches and will post results.
Thoughts?
--
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com
Attachments:
v01_0001-Conflict-EXTENTION-lock-in-group-member.patchapplication/octet-stream; name=v01_0001-Conflict-EXTENTION-lock-in-group-member.patchDownload
From 22f7bc63f25fbdaa218e64330fd3d13c865da654 Mon Sep 17 00:00:00 2001
From: Mahendra Singh Thalor <mahi6run@gmail.com>
Date: Tue, 3 Mar 2020 04:15:18 -0800
Subject: [PATCH 1/2] Conflict EXTENTION lock in group member
---
src/backend/storage/lmgr/deadlock.c | 9 +++++++++
src/backend/storage/lmgr/lock.c | 8 ++++++++
2 files changed, 17 insertions(+)
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index f8c5df08e6..8bff91b495 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -568,6 +568,15 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
proclock = (PROCLOCK *) SHMQueueNext(procLocks, procLocks,
offsetof(PROCLOCK, lockLink));
+ /*
+ * After acquiring relation extension lock we don't acquire any other
+ * heavyweight lock so relation extension lock never participate in actual
+ * deadlock cycle. So avoid the wait edge for this type of lock so that
+ * we can avoid any false cycle detection due to group locking.
+ */
+ if (lock->tag.locktag_type == LOCKTAG_RELATION_EXTEND)
+ return false;
+
while (proclock)
{
PGPROC *leader;
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 56dba09299..ef14655cf8 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1403,6 +1403,14 @@ LockCheckConflicts(LockMethod lockMethodTable,
return true;
}
+ /* If it's a relation extension lock. */
+ if (lock->tag.locktag_type == LOCKTAG_RELATION_EXTEND)
+ {
+ PROCLOCK_PRINT("LockCheckConflicts: conflicting (simple)",
+ proclock);
+ return true;
+ }
+
/*
* Locks held in conflicting modes by members of our own lock group are
* not real conflicts; we can subtract those out and see if we still have
--
2.17.1
v01_0002-Added-assert-to-verify-that-we-never-try-to-take-any.patchapplication/octet-stream; name=v01_0002-Added-assert-to-verify-that-we-never-try-to-take-any.patchDownload
From 39432cfb614cc1996979bbeb22d4d0df631aca90 Mon Sep 17 00:00:00 2001
From: Mahendra Singh Thalor <mahi6run@gmail.com>
Date: Tue, 3 Mar 2020 04:03:45 -0800
Subject: [PATCH 2/2] Added assert to verify that we never try to take any
heavy weight lock after acquiring relation Extension lock
In LockAcquireExtended, we will call AssertAnyExtentionLockHeadByMe
to check that our backend is not holding any extention lock.
---
src/backend/storage/lmgr/lock.c | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index ef14655cf8..b04235f3f2 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -339,6 +339,7 @@ PROCLOCK_PRINT(const char *where, const PROCLOCK *proclockP)
#endif /* not LOCK_DEBUG */
+static void AssertAnyExtentionLockHeadByMe(void);
static uint32 proclock_hash(const void *key, Size keysize);
static void RemoveLocalLock(LOCALLOCK *locallock);
static PROCLOCK *SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
@@ -587,6 +588,31 @@ LockHeldByMe(const LOCKTAG *locktag, LOCKMODE lockmode)
return (locallock && locallock->nLocks > 0);
}
+/*
+ * AssertAnyExtentionLockHeadByMe -- test whether any EXTENSION lock held by
+ * this backend. If any EXTENSION lock is hold by this backend, then assert
+ * will fail. To use this function, assert should be enabled.
+ */
+void AssertAnyExtentionLockHeadByMe()
+{
+#ifdef USE_ASSERT_CHECKING
+ HASH_SEQ_STATUS scan_status;
+ LOCALLOCK *locallock;
+
+ hash_seq_init(&scan_status, LockMethodLocalHash);
+ while ((locallock = (LOCALLOCK *) hash_seq_search(&scan_status)) != NULL)
+ {
+ /*
+ * Either lock is other than extension or we should not held extension
+ * lock. Because after acquiring extension lock, we should never try
+ * to acquire any lock.
+ */
+ Assert (locallock->tag.lock.locktag_type != LOCKTAG_RELATION_EXTEND ||
+ locallock->nLocks == 0);
+ }
+#endif
+}
+
/*
* LockHasWaiters -- look up 'locktag' and check if releasing this
* lock would wake up other processes waiting for it.
@@ -749,6 +775,12 @@ LockAcquireExtended(const LOCKTAG *locktag,
bool found_conflict;
bool log_lock = false;
+ /*
+ * This backend should not hold any relation extension lock while acquiring
+ * heavy weight lock.
+ */
+ AssertAnyExtentionLockHeadByMe();
+
if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
elog(ERROR, "unrecognized lock method: %d", lockmethodid);
lockMethodTable = LockMethods[lockmethodid];
--
2.17.1
On Wed, Mar 4, 2020 at 11:45 AM Mahendra Singh Thalor
<mahi6run@gmail.com> wrote:
On Mon, 24 Feb 2020 at 15:39, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Feb 20, 2020 at 8:06 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2020-02-19 11:12:18 +0530, Amit Kapila wrote:
I think till we know the real need for changing group locking, going
in the direction of what Tom suggested to use an array of LWLocks [1]
to address the problems in hand is a good idea.
-many
I think that building yet another locking subsystem is the entirely
wrong idea - especially when there's imo no convincing architectural
reasons to do so.
Hmm, AFAIU, it will be done by having an array of LWLocks which we do
at other places as well (like BufferIO locks). I am not sure if we
can call it as new locking subsystem, but if we decide to continue
using lock.c and change group locking then I think we can do that as
well, see my comments below regarding that.
It is not very clear to me that are we thinking to give up on Tom's
idea [1] and change group locking even though it is not clear or at
least nobody has proposed an idea/patch which requires that? Or are
we thinking that we can do what Tom suggested for relation extension
lock and also plan to change group locking for future parallel
operations that might require it?
What I'm advocating is that extension locks should continue to go
through lock.c. And yes, that requires some changes to group locking,
but I still don't see why they'd be complicated.
Fair position, as per initial analysis, I think if we do below three
things, it should work out without changing to a new way of locking
for relation extension or page type locks.
a. As per the discussion above, ensure in code we will never try to
acquire another heavy-weight lock after acquiring relation extension
or page type locks (probably by having Asserts in code or maybe some
other way).
b. Change lock.c so that group locking is not considered for these two
lock types. For ex. in LockCheckConflicts, along with the check (if
(proclock->groupLeader == MyProc && MyProc->lockGroupLeader == NULL)),
we also check lock->tag and call it a conflict for these two locks.
c. The deadlock detector can ignore checking these two types of locks
because point (a) ensures that those won't lead to deadlock. One idea
could be that FindLockCycleRecurseMember just ignores these two types
of locks by checking the lock tag.
Thanks Amit for summary.
Based on above 3 points, here attaching 2 patches for review.
1. v01_0001-Conflict-EXTENTION-lock-in-group-member.patch (Patch by Dilip Kumar)
Basically this patch is for point b and c.
2. v01_0002-Added-assert-to-verify-that-we-never-try-to-take-any.patch
(Patch by me)
This patch is for point a.
After applying both the patches, make check-world is passing.
We are testing both the patches and will post results.
Thoughts?
+static void AssertAnyExtentionLockHeadByMe(void);
+/*
+ * AssertAnyExtentionLockHeadByMe -- test whether any EXTENSION lock held by
+ * this backend. If any EXTENSION lock is hold by this backend, then assert
+ * will fail. To use this function, assert should be enabled.
+ */
+void AssertAnyExtentionLockHeadByMe()
+{
Some minor observations on 0002.
1. static is missing in a function definition.
2. Function name should start in new line after function return type
in function definition, as per pg guideline.
+void AssertAnyExtentionLockHeadByMe()
->
void
AssertAnyExtentionLockHeadByMe()
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, 4 Mar 2020 at 12:03, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Wed, Mar 4, 2020 at 11:45 AM Mahendra Singh Thalor
<mahi6run@gmail.com> wrote:
On Mon, 24 Feb 2020 at 15:39, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Feb 20, 2020 at 8:06 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2020-02-19 11:12:18 +0530, Amit Kapila wrote:
I think till we know the real need for changing group locking, going
in the direction of what Tom suggested to use an array of LWLocks [1]
to address the problems in hand is a good idea.

-many
I think that building yet another locking subsystem is the entirely
wrong idea - especially when there's imo no convincing architectural
reasons to do so.

Hmm, AFAIU, it will be done by having an array of LWLocks which we do
at other places as well (like BufferIO locks). I am not sure if we
can call it as new locking subsystem, but if we decide to continue
using lock.c and change group locking then I think we can do that as
well, see my comments below regarding that.

It is not very clear to me that are we thinking to give up on Tom's
idea [1] and change group locking even though it is not clear or at
least nobody has proposed an idea/patch which requires that? Or are
we thinking that we can do what Tom suggested for relation extension
lock and also plan to change group locking for future parallel
operations that might require it?

What I'm advocating is that extension locks should continue to go
through lock.c. And yes, that requires some changes to group locking,
but I still don't see why they'd be complicated.

Fair position, as per initial analysis, I think if we do below three
things, it should work out without changing to a new way of locking
for relation extension or page type locks.
a. As per the discussion above, ensure in code we will never try to
acquire another heavy-weight lock after acquiring relation extension
or page type locks (probably by having Asserts in code or maybe some
other way).
b. Change lock.c so that group locking is not considered for these two
lock types. For ex. in LockCheckConflicts, along with the check (if
(proclock->groupLeader == MyProc && MyProc->lockGroupLeader == NULL)),
we also check lock->tag and call it a conflict for these two locks.
c. The deadlock detector can ignore checking these two types of locks
because point (a) ensures that those won't lead to deadlock. One idea
could be that FindLockCycleRecurseMember just ignores these two types
of locks by checking the lock tag.

Thanks Amit for summary.
Based on above 3 points, here attaching 2 patches for review.
1. v01_0001-Conflict-EXTENTION-lock-in-group-member.patch (Patch by Dilip Kumar)
Basically this patch is for point b and c.

2. v01_0002-Added-assert-to-verify-that-we-never-try-to-take-any.patch
(Patch by me)
This patch is for point a.

After applying both the patches, make check-world is passing.
We are testing both the patches and will post results.
Hi all,
I am planing to test below 3 points on v1 patch set:
1. We will check that new added assert can be hit by hacking code
(while holding extension lock, try to take any heavyweight lock)
2. In FindLockCycleRecurseMember, for testing purposes, we can put
additional loop to check that for all relext holders, there must not
be any outer edge.
3. Test that group members are not granted the lock for the relation
extension lock (group members should conflict).
Please let me know your thoughts to test this patch.
--
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com
On Wed, Mar 4, 2020 at 11:45 AM Mahendra Singh Thalor
<mahi6run@gmail.com> wrote:
On Mon, 24 Feb 2020 at 15:39, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Feb 20, 2020 at 8:06 AM Andres Freund <andres@anarazel.de> wrote:
What I'm advocating is that extension locks should continue to go
through lock.c. And yes, that requires some changes to group locking,
but I still don't see why they'd be complicated.

Fair position, as per initial analysis, I think if we do below three
things, it should work out without changing to a new way of locking
for relation extension or page type locks.
a. As per the discussion above, ensure in code we will never try to
acquire another heavy-weight lock after acquiring relation extension
or page type locks (probably by having Asserts in code or maybe some
other way).
b. Change lock.c so that group locking is not considered for these two
lock types. For ex. in LockCheckConflicts, along with the check (if
(proclock->groupLeader == MyProc && MyProc->lockGroupLeader == NULL)),
we also check lock->tag and call it a conflict for these two locks.
c. The deadlock detector can ignore checking these two types of locks
because point (a) ensures that those won't lead to deadlock. One idea
could be that FindLockCycleRecurseMember just ignores these two types
of locks by checking the lock tag.

Thanks Amit for summary.
Based on above 3 points, here attaching 2 patches for review.
1. v01_0001-Conflict-EXTENTION-lock-in-group-member.patch (Patch by Dilip Kumar)
Basically this patch is for point b and c.

2. v01_0002-Added-assert-to-verify-that-we-never-try-to-take-any.patch
(Patch by me)
This patch is for point a.

After applying both the patches, make check-world is passing.
We are testing both the patches and will post results.
I think we need to do detailed code review in the places where we are
taking Relation Extension Lock and see whether we are acquiring
another heavy-weight lock after that. It seems to me that in
brin_getinsertbuffer, after acquiring Relation Extension Lock, we
might again try to acquire the same lock. See
brin_initialize_empty_new_buffer which is called after acquiring
Relation Extension Lock, in that function, we call
RecordPageWithFreeSpace and that can again try to acquire the same
lock if it needs to perform fsm_extend. I think there will be similar
instances in the code. I think it is fine if we again try to acquire
it, but the current assertion in your patch needs to be adjusted for
that.
Few other minor comments on
v01_0002-Added-assert-to-verify-that-we-never-try-to-take-any:
1. Ideally, this should be the first patch as we first need to ensure
that we don't take any heavy-weight locks after acquiring a relation
extension lock.
2. I think it is better to add an Assert after initial error checks
(after RecoveryInProgress().. check)
3.
+ Assert (locallock->tag.lock.locktag_type != LOCKTAG_RELATION_EXTEND ||
+ locallock->nLocks == 0);
I think it is not possible that we have an entry in
LockMethodLocalHash and its value is zero. Do you see any such
possibility, if not, then we might want to remove it?
4. We already have a macro for LOCALLOCK_LOCKMETHOD, can we write
another one tag type? This will make the check look a bit cleaner and
probably if we need to extend it in future for Page type locks, then
also it will be good.
5. I have also tried to think of another way to check if we already
hold lock type LOCKTAG_RELATION_EXTEND, but couldn't come up with a
cheaper way than this. Basically, I think if we traverse the
MyProc->myProcLocks queue, we will get this information, but that
doesn't seem much cheaper than this.
6. Another thing that could be possible is to make this a test and
elog so that it can hit in production scenarios, but I think the cost
of that will be high unless we have a very simple way to write this
test condition.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Mar 5, 2020 at 12:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Mar 4, 2020 at 11:45 AM Mahendra Singh Thalor
<mahi6run@gmail.com> wrote:
On Mon, 24 Feb 2020 at 15:39, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Feb 20, 2020 at 8:06 AM Andres Freund <andres@anarazel.de> wrote:
What I'm advocating is that extension locks should continue to go
through lock.c. And yes, that requires some changes to group locking,
but I still don't see why they'd be complicated.

Fair position, as per initial analysis, I think if we do below three
things, it should work out without changing to a new way of locking
for relation extension or page type locks.
a. As per the discussion above, ensure in code we will never try to
acquire another heavy-weight lock after acquiring relation extension
or page type locks (probably by having Asserts in code or maybe some
other way).
b. Change lock.c so that group locking is not considered for these two
lock types. For ex. in LockCheckConflicts, along with the check (if
(proclock->groupLeader == MyProc && MyProc->lockGroupLeader == NULL)),
we also check lock->tag and call it a conflict for these two locks.
c. The deadlock detector can ignore checking these two types of locks
because point (a) ensures that those won't lead to deadlock. One idea
could be that FindLockCycleRecurseMember just ignores these two types
of locks by checking the lock tag.

Thanks Amit for summary.
Based on above 3 points, here attaching 2 patches for review.
1. v01_0001-Conflict-EXTENTION-lock-in-group-member.patch (Patch by Dilip Kumar)
Basically this patch is for point b and c.

2. v01_0002-Added-assert-to-verify-that-we-never-try-to-take-any.patch
(Patch by me)
This patch is for point a.

After applying both the patches, make check-world is passing.
We are testing both the patches and will post results.
I think we need to do detailed code review in the places where we are
taking Relation Extension Lock and see whether we are acquiring
another heavy-weight lock after that. It seems to me that in
brin_getinsertbuffer, after acquiring Relation Extension Lock, we
might again try to acquire the same lock. See
brin_initialize_empty_new_buffer which is called after acquiring
Relation Extension Lock, in that function, we call
RecordPageWithFreeSpace and that can again try to acquire the same
lock if it needs to perform fsm_extend. I think there will be similar
instances in the code. I think it is fine if we again try to acquire
it, but the current assertion in your patch needs to be adjusted for
that.

Few other minor comments on
v01_0002-Added-assert-to-verify-that-we-never-try-to-take-any:
1. Ideally, this should be the first patch as we first need to ensure
that we don't take any heavy-weight locks after acquiring a relation
extension lock.

2. I think it is better to add an Assert after initial error checks
(after RecoveryInProgress().. check)

3.
+ Assert (locallock->tag.lock.locktag_type != LOCKTAG_RELATION_EXTEND ||
+ locallock->nLocks == 0);

I think it is not possible that we have an entry in
LockMethodLocalHash and its value is zero. Do you see any such
possibility, if not, then we might want to remove it?

4. We already have a macro for LOCALLOCK_LOCKMETHOD, can we write
another one tag type? This will make the check look a bit cleaner and
probably if we need to extend it in future for Page type locks, then
also it will be good.

5. I have also tried to think of another way to check if we already
hold lock type LOCKTAG_RELATION_EXTEND, but couldn't come up with a
cheaper way than this. Basically, I think if we traverse the
MyProc->myProcLocks queue, we will get this information, but that
doesn't seem much cheaper than this.
I think we can maintain a flag (rel_extlock_held). And, we can set
that true in LockRelationForExtension,
ConditionalLockRelationForExtension functions and we can reset it in
UnlockRelationForExtension or in the error path e.g. LockReleaseAll.
I think, this way we will be able to elog and this will be much
cheaper.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, 5 Mar 2020 at 13:54, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Thu, Mar 5, 2020 at 12:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Mar 4, 2020 at 11:45 AM Mahendra Singh Thalor
<mahi6run@gmail.com> wrote:
On Mon, 24 Feb 2020 at 15:39, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Feb 20, 2020 at 8:06 AM Andres Freund <andres@anarazel.de> wrote:
What I'm advocating is that extension locks should continue to go
through lock.c. And yes, that requires some changes to group locking,
but I still don't see why they'd be complicated.

Fair position, as per initial analysis, I think if we do below three
things, it should work out without changing to a new way of locking
for relation extension or page type locks.
a. As per the discussion above, ensure in code we will never try to
acquire another heavy-weight lock after acquiring relation extension
or page type locks (probably by having Asserts in code or maybe some
other way).
b. Change lock.c so that group locking is not considered for these two
lock types. For ex. in LockCheckConflicts, along with the check (if
(proclock->groupLeader == MyProc && MyProc->lockGroupLeader == NULL)),
we also check lock->tag and call it a conflict for these two locks.
c. The deadlock detector can ignore checking these two types of locks
because point (a) ensures that those won't lead to deadlock. One idea
could be that FindLockCycleRecurseMember just ignores these two types
of locks by checking the lock tag.

Thanks Amit for summary.
Based on above 3 points, here attaching 2 patches for review.
1. v01_0001-Conflict-EXTENTION-lock-in-group-member.patch (Patch by Dilip Kumar)
Basically this patch is for point b and c.

2. v01_0002-Added-assert-to-verify-that-we-never-try-to-take-any.patch
(Patch by me)
This patch is for point a.

After applying both the patches, make check-world is passing.
We are testing both the patches and will post results.
Thanks Amit and Dilip for reviewing the patches.
I think we need to do detailed code review in the places where we are
taking Relation Extension Lock and see whether we are acquiring
another heavy-weight lock after that. It seems to me that in
brin_getinsertbuffer, after acquiring Relation Extension Lock, we
might again try to acquire the same lock. See
brin_initialize_empty_new_buffer which is called after acquiring
Relation Extension Lock, in that function, we call
RecordPageWithFreeSpace and that can again try to acquire the same
lock if it needs to perform fsm_extend. I think there will be similar
instances in the code. I think it is fine if we again try to acquire
it, but the current assertion in your patch needs to be adjusted for
that.
I agree with you. Dilip is doing the code review and will post
results. As you mentioned, while holding a Relation Extension Lock,
we might again try to acquire the same Relation Extension Lock, so to
handle this in the assert I made some changes in the patch and am
attaching it for review. (I will test this scenario.)
Few other minor comments on
v01_0002-Added-assert-to-verify-that-we-never-try-to-take-any:
1. Ideally, this should be the first patch as we first need to ensure
that we don't take any heavy-weight locks after acquiring a relation
extension lock.
Fixed.
2. I think it is better to add an Assert after initial error checks
(after RecoveryInProgress().. check)
I am not getting your point. Can you explain which type of
assert you are suggesting?
3.
+ Assert (locallock->tag.lock.locktag_type != LOCKTAG_RELATION_EXTEND ||
+ locallock->nLocks == 0);

I think it is not possible that we have an entry in
LockMethodLocalHash and its value is zero. Do you see any such
possibility, if not, then we might want to remove it?
Yes, this condition is not needed. Fixed.
4. We already have a macro for LOCALLOCK_LOCKMETHOD, can we write
another one tag type? This will make the check look a bit cleaner and
probably if we need to extend it in future for Page type locks, then
also it will be good.
Good point. I added macros in this version.
Here, attaching new patch set for review.
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com
Attachments:
v02_0001-Added-assert-to-verify-that-we-never-try-to-take-any.patch
From 52c75faf73cfcef300a2901447b9d45213501a15 Mon Sep 17 00:00:00 2001
From: Mahendra Singh Thalor <mahi6run@gmail.com>
Date: Thu, 5 Mar 2020 10:55:17 -0800
Subject: [PATCH 1/2] Added assert to verify that we never try to take any
heavyweight lock after acquiring relation Extension lock.
In LockAcquireExtended, we will call AssertAnyExtentionLockHeadByMe
to check that our backend is not holding any extention lock.
---
src/backend/storage/lmgr/lock.c | 68 +++++++++++++++++++++++++++++++++
src/include/storage/lock.h | 3 ++
2 files changed, 71 insertions(+)
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 56dba09299..9ce0a213b0 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -339,6 +339,7 @@ PROCLOCK_PRINT(const char *where, const PROCLOCK *proclockP)
#endif /* not LOCK_DEBUG */
+static void AssertAnyExtentionLockHeadByMe(const LOCKTAG *locktag);
static uint32 proclock_hash(const void *key, Size keysize);
static void RemoveLocalLock(LOCALLOCK *locallock);
static PROCLOCK *SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
@@ -587,6 +588,66 @@ LockHeldByMe(const LOCKTAG *locktag, LOCKMODE lockmode)
return (locallock && locallock->nLocks > 0);
}
+/*
+ * AssertAnyExtentionLockHeadByMe -- test whether any EXTENSION lock held by
+ * this backend. If any EXTENSION lock is held by this backend, then it should
+ * be same as we are trying to acquire lock now, if not same, then assert will
+ * fail because we expect that after acquiring EXTENSION lock, we will not take
+ * any other heavyweight lock. To use this function, assert should be enabled.
+ */
+static void
+AssertAnyExtentionLockHeadByMe(const LOCKTAG *locktag)
+{
+#ifdef USE_ASSERT_CHECKING
+ HASH_SEQ_STATUS scan_status;
+ LOCALLOCK *locallock;
+ bool need_rel_lock;
+
+ /*
+ * If we are trying to take relation extension lock, then set need_rel_lock
+ * flag.
+ */
+ if (locktag->locktag_type == LOCKTAG_RELATION_EXTEND)
+ need_rel_lock = true;
+ else
+ need_rel_lock = false;
+
+ /* Init the sequence hash search. */
+ hash_seq_init(&scan_status, LockMethodLocalHash);
+
+ /* Do sequence hash search for all the locks. */
+ while ((locallock = (LOCALLOCK *) hash_seq_search(&scan_status)) != NULL)
+ {
+ /*
+ * If we are trying to acquire relation extension lock, then either this
+ * lock hash entry should not be relation extension lock or hask lock
+ * should be same as our required lock(means this backend is already
+ * holding relation extension lock for same relation) .
+ */
+ if (need_rel_lock)
+ {
+ /*
+ * Either this hash entry is not beloging to any relation extension
+ * lock or required relation extension lock is already hold by this
+ * backend.
+ */
+ Assert (LOCALLOCK_LOCKTYPE(locallock) != LOCKTAG_RELATION_EXTEND ||
+ (LOCALLOCK_LOCKDBOID(locallock) == locktag->locktag_field1 &&
+ LOCALLOCK_LOCKRELOID(locallock) == locktag->locktag_field2));
+ }
+ else
+ {
+ /* This hash entry should not belog to relation extension lock. */
+ Assert (LOCALLOCK_LOCKTYPE(locallock) != LOCKTAG_RELATION_EXTEND);
+ }
+ }
+
+ /* All locks are processed so locallock should be NULL. */
+ Assert (locallock == NULL);
+
+#endif
+}
+
/*
* LockHasWaiters -- look up 'locktag' and check if releasing this
* lock would wake up other processes waiting for it.
@@ -749,6 +810,13 @@ LockAcquireExtended(const LOCKTAG *locktag,
bool found_conflict;
bool log_lock = false;
+ /*
+ * Sanity check to verify that after acquiring relation extension lock, we
+ * never try to take any other heavyweight lock but whle holding relation
+ * extension lock, backend can ask for same relation extension lock again.
+ */
+ AssertAnyExtentionLockHeadByMe(locktag);
+
if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
elog(ERROR, "unrecognized lock method: %d", lockmethodid);
lockMethodTable = LockMethods[lockmethodid];
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index bb8e4e6e5b..4593346051 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -301,6 +301,9 @@ typedef struct LOCK
} LOCK;
#define LOCK_LOCKMETHOD(lock) ((LOCKMETHODID) (lock).tag.locktag_lockmethodid)
+#define LOCALLOCK_LOCKTYPE(locallock) ((locallock)->tag.lock.locktag_type)
+#define LOCALLOCK_LOCKDBOID(locallock) ((locallock)->tag.lock.locktag_field1)
+#define LOCALLOCK_LOCKRELOID(locallock) ((locallock)->tag.lock.locktag_field2)
/*
--
2.17.1
v02_0002-Conflict-EXTENTION-lock-in-group-member.patch
From 22f7bc63f25fbdaa218e64330fd3d13c865da654 Mon Sep 17 00:00:00 2001
From: Mahendra Singh Thalor <mahi6run@gmail.com>
Date: Tue, 3 Mar 2020 04:15:18 -0800
Subject: [PATCH 1/2] Conflict EXTENTION lock in group member
---
src/backend/storage/lmgr/deadlock.c | 9 +++++++++
src/backend/storage/lmgr/lock.c | 8 ++++++++
2 files changed, 17 insertions(+)
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index f8c5df08e6..8bff91b495 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -568,6 +568,15 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
proclock = (PROCLOCK *) SHMQueueNext(procLocks, procLocks,
offsetof(PROCLOCK, lockLink));
+ /*
+ * After acquiring relation extension lock we don't acquire any other
+ * heavyweight lock so relation extension lock never participate in actual
+ * deadlock cycle. So avoid the wait edge for this type of lock so that
+ * we can avoid any false cycle detection due to group locking.
+ */
+ if (lock->tag.locktag_type == LOCKTAG_RELATION_EXTEND)
+ return false;
+
while (proclock)
{
PGPROC *leader;
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 56dba09299..ef14655cf8 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1403,6 +1403,14 @@ LockCheckConflicts(LockMethod lockMethodTable,
return true;
}
+ /* If it's a relation extension lock. */
+ if (lock->tag.locktag_type == LOCKTAG_RELATION_EXTEND)
+ {
+ PROCLOCK_PRINT("LockCheckConflicts: conflicting (simple)",
+ proclock);
+ return true;
+ }
+
/*
* Locks held in conflicting modes by members of our own lock group are
* not real conflicts; we can subtract those out and see if we still have
--
2.17.1
On Wed, 4 Mar 2020 at 12:03, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Wed, Mar 4, 2020 at 11:45 AM Mahendra Singh Thalor
<mahi6run@gmail.com> wrote:
On Mon, 24 Feb 2020 at 15:39, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Feb 20, 2020 at 8:06 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2020-02-19 11:12:18 +0530, Amit Kapila wrote:
I think till we know the real need for changing group locking, going
in the direction of what Tom suggested to use an array of LWLocks [1]
to address the problems in hand is a good idea.

-many
I think that building yet another locking subsystem is the entirely
wrong idea - especially when there's imo no convincing architectural
reasons to do so.

Hmm, AFAIU, it will be done by having an array of LWLocks which we do
at other places as well (like BufferIO locks). I am not sure if we
can call it as new locking subsystem, but if we decide to continue
using lock.c and change group locking then I think we can do that as
well, see my comments below regarding that.

It is not very clear to me that are we thinking to give up on Tom's
idea [1] and change group locking even though it is not clear or at
least nobody has proposed an idea/patch which requires that? Or are
we thinking that we can do what Tom suggested for relation extension
lock and also plan to change group locking for future parallel
operations that might require it?

What I'm advocating is that extension locks should continue to go
through lock.c. And yes, that requires some changes to group locking,
but I still don't see why they'd be complicated.

Fair position, as per initial analysis, I think if we do below three
things, it should work out without changing to a new way of locking
for relation extension or page type locks.
a. As per the discussion above, ensure in code we will never try to
acquire another heavy-weight lock after acquiring relation extension
or page type locks (probably by having Asserts in code or maybe some
other way).
b. Change lock.c so that group locking is not considered for these two
lock types. For ex. in LockCheckConflicts, along with the check (if
(proclock->groupLeader == MyProc && MyProc->lockGroupLeader == NULL)),
we also check lock->tag and call it a conflict for these two locks.
c. The deadlock detector can ignore checking these two types of locks
because point (a) ensures that those won't lead to deadlock. One idea
could be that FindLockCycleRecurseMember just ignores these two types
of locks by checking the lock tag.

Thanks Amit for summary.
Based on above 3 points, here attaching 2 patches for review.
1. v01_0001-Conflict-EXTENTION-lock-in-group-member.patch (Patch by Dilip Kumar)
Basically this patch is for point b and c.

2. v01_0002-Added-assert-to-verify-that-we-never-try-to-take-any.patch
(Patch by me)
This patch is for point a.

After applying both the patches, make check-world is passing.
We are testing both the patches and will post results.
Thoughts?
+static void AssertAnyExtentionLockHeadByMe(void);
+/*
+ * AssertAnyExtentionLockHeadByMe -- test whether any EXTENSION lock held by
+ * this backend. If any EXTENSION lock is hold by this backend, then assert
+ * will fail. To use this function, assert should be enabled.
+ */
+void AssertAnyExtentionLockHeadByMe()
+{

Some minor observations on 0002.
1. static is missing in a function definition.
2. Function name should start in new line after function return type
in function definition, as per pg guideline.
+void AssertAnyExtentionLockHeadByMe()
->
void
AssertAnyExtentionLockHeadByMe()
Thanks Dilip for review.
I have fixed above 2 points in v2 patch set.
--
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com
On Thu, Mar 5, 2020 at 2:18 PM Mahendra Singh Thalor <mahi6run@gmail.com> wrote:
Here, attaching new patch set for review.
I was kind of assuming that the way this would work is that it would
set a flag or increment a counter or something when we acquire a
relation extension lock, and then reverse the process when we release
it. Then the Assert could just check the flag. Walking the whole
LOCALLOCK table is expensive.
Also, spelling counts. This patch features "extention" multiple times,
plus also "hask," "beloging," "belog," and "whle", which is an awful
lot of typos for a 70-line patch. If you are using macOS, try opening
the patch in TextEdit. If you are inventing a new function name, spell
the words you include the same way they are spelled elsewhere.
Even aside from the typo, AssertAnyExtentionLockHeadByMe() is not a
very good function name. It sounds like it's asserting that we hold an
extension lock, rather than that we don't, and also, that's not
exactly what it checks anyway, because there's this special case for
when we're acquiring a relation extension lock we already hold.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Mar 5, 2020 at 1:54 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Thu, Mar 5, 2020 at 12:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
5. I have also tried to think of another way to check if we already
hold lock type LOCKTAG_RELATION_EXTEND, but couldn't come up with a
cheaper way than this. Basically, I think if we traverse the
MyProc->myProcLocks queue, we will get this information, but that
doesn't seem much cheaper than this.

I think we can maintain a flag (rel_extlock_held). And, we can set
that true in LockRelationForExtension,
ConditionalLockRelationForExtension functions and we can reset it in
UnlockRelationForExtension or in the error path e.g. LockReleaseAll.
I think if we reset it in LockReleaseAll during the error path, then
we need to find a way to reset it during LockReleaseCurrentOwner as
that is called during Subtransaction Abort which can be tricky as we
don't know if it belongs to the current owner. How about resetting in
Abort(Sub)Transaction and CommitTransaction after we release locks via
ResourceOwnerRelease.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Mar 6, 2020 at 2:19 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Mar 5, 2020 at 2:18 PM Mahendra Singh Thalor <mahi6run@gmail.com> wrote:
Here, attaching new patch set for review.
I was kind of assuming that the way this would work is that it would
set a flag or increment a counter or something when we acquire a
relation extension lock, and then reverse the process when we release
it. Then the Assert could just check the flag. Walking the whole
LOCALLOCK table is expensive.
I think we can keep such a flag in TopTransactionState. We free such
locks after the work is done (except during error where we free them
at transaction abort) rather than at transaction commit, so one might
say it is better not to associate with transaction state, but not sure
if there is other better place. Do you have any suggestions?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Mar 5, 2020 at 11:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
I think we can keep such a flag in TopTransactionState. We free such
locks after the work is done (except during error where we free them
at transaction abort) rather than at transaction commit, so one might
say it is better not to associate with transaction state, but not sure
if there is other better place. Do you have any suggestions?
I assumed it would be a global variable in lock.c. lock.c has got to
know when any lock is required or released, so I don't know why we
need to involve xact.c in the bookkeeping.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Mar 6, 2020 at 9:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Mar 5, 2020 at 1:54 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Thu, Mar 5, 2020 at 12:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
5. I have also tried to think of another way to check if we already
hold lock type LOCKTAG_RELATION_EXTEND, but couldn't come up with a
cheaper way than this. Basically, I think if we traverse the
MyProc->myProcLocks queue, we will get this information, but that
doesn't seem much cheaper than this.

I think we can maintain a flag (rel_extlock_held). And, we can set
that true in LockRelationForExtension,
ConditionalLockRelationForExtension functions and we can reset it in
UnlockRelationForExtension or in the error path e.g. LockReleaseAll.

I think if we reset it in LockReleaseAll during the error path, then
we need to find a way to reset it during LockReleaseCurrentOwner as
that is called during Subtransaction Abort which can be tricky as we
don't know if it belongs to the current owner. How about resetting in
Abort(Sub)Transaction and CommitTransaction after we release locks via
ResourceOwnerRelease.
I think instead of the flag we need to keep the counter because we can
acquire the same relation extension lock multiple times. So
basically, every time we acquire the lock we can increment the counter
and while releasing we can decrement it. During an error path, I
think it is fine to set it to 0 in CommitTransaction/AbortTransaction.
But, I am not sure that we can set to 0 or decrement it in
AbortSubTransaction because we are not sure whether we have acquired
the lock under this subtransaction or not.
Having said that, I think there should not be any case that we are
starting the sub-transaction while holding the relation extension
lock.
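A minimal sketch of the per-backend counter bookkeeping being discussed; the function and variable names here are illustrative (they anticipate what a patch might call them), not a final API:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Per-backend count of relation extension lock holds.  Incremented on
 * acquire, decremented on normal release, and force-reset to zero at
 * transaction commit/abort, since the lock never survives past those points.
 */
static int RelationExtensionLockHeldCount = 0;

static void
IncrementRelExtLockHeldCount(void)
{
	RelationExtensionLockHeldCount++;
}

static void
DecrementRelExtLockHeldCount(void)
{
	assert(RelationExtensionLockHeldCount > 0);
	RelationExtensionLockHeldCount--;
}

/* Called from the CommitTransaction/AbortTransaction paths. */
static void
ResetRelExtLockHeldCount(void)
{
	RelationExtensionLockHeldCount = 0;
}

static bool
IsRelExtLockHeld(void)
{
	return RelationExtensionLockHeldCount > 0;
}
```

The reset-to-zero on commit/abort is what replaces per-owner tracking in the error paths.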
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Sat, Mar 7, 2020 at 9:57 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Mar 6, 2020 at 9:47 AM Amit Kapila <amit.kapila16@gmail.com>
wrote:On Thu, Mar 5, 2020 at 1:54 PM Dilip Kumar <dilipbalaut@gmail.com>
wrote:
On Thu, Mar 5, 2020 at 12:15 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:
5. I have also tried to think of another way to check if we already
hold lock type LOCKTAG_RELATION_EXTEND, but couldn't come up with a
cheaper way than this. Basically, I think if we traverse the
MyProc->myProcLocks queue, we will get this information, but that
doesn't seem much cheaper than this.
I think we can maintain a flag (rel_extlock_held). And, we can set
that true in LockRelationForExtension,
ConditionalLockRelationForExtension functions and we can reset it in
UnlockRelationForExtension or in the error path e.g. LockReleaseAll.
I think if we reset it in LockReleaseAll during the error path, then
we need to find a way to reset it during LockReleaseCurrentOwner as
that is called during Subtransaction Abort which can be tricky as we
don't know if it belongs to the current owner. How about resetting in
Abort(Sub)Transaction and CommitTransaction after we release locks via
ResourceOwnerRelease.
I think instead of the flag we need to keep the counter because we can
acquire the same relation extension lock multiple times. So
basically, every time we acquire the lock we can increment the counter
and while releasing we can decrement it. During an error path, I
think it is fine to set it to 0 in CommitTransaction/AbortTransaction.
But, I am not sure that we can set to 0 or decrement it in
AbortSubTransaction because we are not sure whether we have acquired
the lock under this subtransaction or not.
Having said that, I think there should not be any case that we are
starting the sub-transaction while holding the relation extension
lock.
Right, this is exactly the point. I think we can mention this in comments
to make it clear why setting it to zero is fine during subtransaction
abort.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Mar 7, 2020 at 11:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sat, Mar 7, 2020 at 9:57 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Mar 6, 2020 at 9:47 AM Amit Kapila <amit.kapila16@gmail.com>
wrote:On Thu, Mar 5, 2020 at 1:54 PM Dilip Kumar <dilipbalaut@gmail.com>
wrote:
On Thu, Mar 5, 2020 at 12:15 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:
5. I have also tried to think of another way to check if we already
hold lock type LOCKTAG_RELATION_EXTEND, but couldn't come up with a
cheaper way than this. Basically, I think if we traverse the
MyProc->myProcLocks queue, we will get this information, but that
doesn't seem much cheaper than this.
I think we can maintain a flag (rel_extlock_held). And, we can set
that true in LockRelationForExtension,
ConditionalLockRelationForExtension functions and we can reset it in
UnlockRelationForExtension or in the error path e.g. LockReleaseAll.
I think if we reset it in LockReleaseAll during the error path, then
we need to find a way to reset it during LockReleaseCurrentOwner as
that is called during Subtransaction Abort which can be tricky as we
don't know if it belongs to the current owner. How about resetting in
Abort(Sub)Transaction and CommitTransaction after we release locks via
ResourceOwnerRelease.
I think instead of the flag we need to keep the counter because we can
acquire the same relation extension lock multiple times. So
basically, every time we acquire the lock we can increment the counter
and while releasing we can decrement it. During an error path, I
think it is fine to set it to 0 in CommitTransaction/AbortTransaction.
But, I am not sure that we can set to 0 or decrement it in
AbortSubTransaction because we are not sure whether we have acquired
the lock under this subtransaction or not.
Having said that, I think there should not be any case that we are
starting the sub-transaction while holding the relation extension
lock.
Right, this is exactly the point. I think we can mention this in comments
to make it clear why setting it to zero is fine during subtransaction
abort.
Is there anything wrong with having an Assert during subtransaction start
to indicate that we don't have a relation extension lock?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Mar 7, 2020 at 3:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sat, Mar 7, 2020 at 11:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sat, Mar 7, 2020 at 9:57 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Mar 6, 2020 at 9:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Mar 5, 2020 at 1:54 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Thu, Mar 5, 2020 at 12:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
5. I have also tried to think of another way to check if we already
hold lock type LOCKTAG_RELATION_EXTEND, but couldn't come up with a
cheaper way than this. Basically, I think if we traverse the
MyProc->myProcLocks queue, we will get this information, but that
doesn't seem much cheaper than this.
I think we can maintain a flag (rel_extlock_held). And, we can set
that true in LockRelationForExtension,
ConditionalLockRelationForExtension functions and we can reset it in
UnlockRelationForExtension or in the error path e.g. LockReleaseAll.
I think if we reset it in LockReleaseAll during the error path, then
we need to find a way to reset it during LockReleaseCurrentOwner as
that is called during Subtransaction Abort which can be tricky as we
don't know if it belongs to the current owner. How about resetting in
Abort(Sub)Transaction and CommitTransaction after we release locks via
ResourceOwnerRelease.
I think instead of the flag we need to keep the counter because we can
acquire the same relation extension lock multiple times. So
basically, every time we acquire the lock we can increment the counter
and while releasing we can decrement it. During an error path, I
think it is fine to set it to 0 in CommitTransaction/AbortTransaction.
But, I am not sure that we can set to 0 or decrement it in
AbortSubTransaction because we are not sure whether we have acquired
the lock under this subtransaction or not.
Having said that, I think there should not be any case that we are
starting the sub-transaction while holding the relation extension
lock.
Right, this is exactly the point. I think we can mention this in comments to make it clear why setting it to zero is fine during subtransaction abort.
Is there anything wrong with having an Assert during subtransaction start to indicate that we don't have a relation extension lock?
Yes, I was planning to do that.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Dilip Kumar <dilipbalaut@gmail.com> writes:
I think instead of the flag we need to keep the counter because we can
acquire the same relation extension lock multiple times.
Uh ... what? How would that not be broken usage on its face?
I continue to think that we'd be better off getting all of this
out of the heavyweight lock manager. There is no reason why we
should need deadlock detection, or multiple holds of the same
lock, or pretty much anything that LWLocks don't give you.
regards, tom lane
On Sat, Mar 7, 2020 at 8:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Dilip Kumar <dilipbalaut@gmail.com> writes:
I think instead of the flag we need to keep the counter because we can
acquire the same relation extension lock multiple times.
Uh ... what? How would that not be broken usage on its face?
Basically, if we can ensure that while holding the relation extension
lock we will not wait for any other lock then we can ignore in the
deadlock detection path so that we don't detect the false deadlock due
to the group locking mechanism. So if we are already holding the
relation extension lock and trying to acquire the same lock in the same
mode then it can never wait, so this is safe.
I continue to think that we'd be better off getting all of this
out of the heavyweight lock manager. There is no reason why we
should need deadlock detection, or multiple holds of the same
lock, or pretty much anything that LWLocks don't give you.
Right, we never need deadlock detection for this lock. But, I think
there are quite a few cases where we have multiple holds at the same
time. e.g, during RelationAddExtraBlocks, while holding the relation
extension lock we try to update the block in FSM and FSM might need to
add extra FSM block which will again try to acquire the same lock.
But, I think the main reason for not converting it to an LWLock is
because Andres has a concern about inventing a new lock mechanism, as
discussed upthread [1].
[1]: /messages/by-id/20200220023612.c44ggploywxtlvmx@alap3.anarazel.de
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
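The nested-hold case described above (RelationAddExtraBlocks takes the extension lock, then FSM growth re-acquires it on the same relation before the first release) can be modeled with a toy sketch; the function names are made up, and only the counter behavior matters:

```c
#include <assert.h>

/*
 * Per-backend hold depth.  A plain boolean flag would be clobbered by the
 * inner release, which is why a counter is needed for correct bookkeeping.
 */
static int ext_lock_depth = 0;

static void
acquire_ext_lock(void)
{
	ext_lock_depth++;
}

static void
release_ext_lock(void)
{
	assert(ext_lock_depth > 0);
	ext_lock_depth--;
}

/* Models FSM growth: re-acquires the extension lock on the same relation. */
static void
extend_fsm(void)
{
	acquire_ext_lock();			/* second, nested hold */
	/* ... add an FSM block here ... */
	release_ext_lock();
}

/* Models RelationAddExtraBlocks(): the FSM update happens under the lock. */
static void
relation_add_extra_blocks(void)
{
	acquire_ext_lock();			/* first hold */
	extend_fsm();				/* nested acquisition while still held */
	release_ext_lock();
}
```

With a flag instead of a counter, the inner release would incorrectly mark the lock as no longer held while the outer hold is still active.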
On Mon, 24 Feb 2020 at 19:08, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Feb 20, 2020 at 8:06 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2020-02-19 11:12:18 +0530, Amit Kapila wrote:
I think till we know the real need for changing group locking, going
in the direction of what Tom suggested to use an array of LWLocks [1]
to address the problems in hand is a good idea.
-many
I think that building yet another locking subsystem is the entirely
wrong idea - especially when there's imo no convincing architectural
reasons to do so.
Hmm, AFAIU, it will be done by having an array of LWLocks which we do
at other places as well (like BufferIO locks). I am not sure if we
can call it as new locking subsystem, but if we decide to continue
using lock.c and change group locking then I think we can do that as
well, see my comments below regarding that.
It is not very clear to me whether we are thinking to give up on Tom's
idea [1] and change group locking even though it is not clear or at
least nobody has proposed an idea/patch which requires that? Or are
we thinking that we can do what Tom suggested for relation extension
lock and also plan to change group locking for future parallel
operations that might require it?
What I'm advocating is that extension locks should continue to go
through lock.c. And yes, that requires some changes to group locking,
but I still don't see why they'd be complicated.
Fair position, as per initial analysis, I think if we do below three
things, it should work out without changing to a new way of locking
for relation extension or page type locks.
a. As per the discussion above, ensure in code we will never try to
acquire another heavy-weight lock after acquiring relation extension
or page type locks (probably by having Asserts in code or maybe some
other way).
The current patch
(v02_0001-Added-assert-to-verify-that-we-never-try-to-take-any.patch)
doesn't check for acquiring a heavy-weight lock after a page type lock,
is that right? There is a path doing that: ginInsertCleanup() holds
a page lock and inserts the pending list items, which might acquire a
relation extension lock.
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Mar 7, 2020 at 9:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Sat, Mar 7, 2020 at 8:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Dilip Kumar <dilipbalaut@gmail.com> writes:
I think instead of the flag we need to keep the counter because we can
acquire the same relation extension lock multiple times.
Uh ... what? How would that not be broken usage on its face?
Basically, if we can ensure that while holding the relation extension
lock we will not wait for any other lock then we can ignore in the
deadlock detection path so that we don't detect the false deadlock due
to the group locking mechanism. So if we are already holding the
relation extension lock and trying to acquire the same lock in the same
mode then it can never wait, so this is safe.
I continue to think that we'd be better off getting all of this
out of the heavyweight lock manager. There is no reason why we
should need deadlock detection, or multiple holds of the same
lock, or pretty much anything that LWLocks don't give you.
Right, we never need deadlock detection for this lock. But, I think
there are quite a few cases where we have multiple holds at the same
time. e.g, during RelationAddExtraBlocks, while holding the relation
extension lock we try to update the block in FSM and FSM might need to
add extra FSM block which will again try to acquire the same lock.
But, I think the main reason for not converting it to an LWLock is
because Andres has a concern about inventing a new lock mechanism, as
discussed upthread [1].
Right, that is one point and another is that if we go via the route of
converting it to LWLocks, then we also need to think of some solution for
page locks that are used in ginInsertCleanup. However, if we go with the
approach being pursued [1] then the page locks will be handled in a similar
way as relation extension locks.
[1]: /messages/by-id/CAA4eK1+Njo+pnqSNi2ScKf0BcVBWWf37BrW-pykVSG0B0C5Qig@mail.gmail.com
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sun, Mar 8, 2020 at 7:58 AM Masahiko Sawada <
masahiko.sawada@2ndquadrant.com> wrote:
On Mon, 24 Feb 2020 at 19:08, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Feb 20, 2020 at 8:06 AM Andres Freund <andres@anarazel.de>
wrote:
Hi,
On 2020-02-19 11:12:18 +0530, Amit Kapila wrote:
I think till we know the real need for changing group locking, going
in the direction of what Tom suggested to use an array of LWLocks [1]
to address the problems in hand is a good idea.
-many
I think that building yet another locking subsystem is the entirely
wrong idea - especially when there's imo no convincing architectural
reasons to do so.
Hmm, AFAIU, it will be done by having an array of LWLocks which we do
at other places as well (like BufferIO locks). I am not sure if we
can call it as new locking subsystem, but if we decide to continue
using lock.c and change group locking then I think we can do that as
well, see my comments below regarding that.
It is not very clear to me whether we are thinking to give up on Tom's
idea [1] and change group locking even though it is not clear or at
least nobody has proposed an idea/patch which requires that? Or are
we thinking that we can do what Tom suggested for relation extension
lock and also plan to change group locking for future parallel
operations that might require it?
What I'm advocating is that extension locks should continue to go
through lock.c. And yes, that requires some changes to group locking,
but I still don't see why they'd be complicated.
Fair position, as per initial analysis, I think if we do below three
things, it should work out without changing to a new way of locking
for relation extension or page type locks.
a. As per the discussion above, ensure in code we will never try to
acquire another heavy-weight lock after acquiring relation extension
or page type locks (probably by having Asserts in code or maybe some
other way).
The current patch
(v02_0001-Added-assert-to-verify-that-we-never-try-to-take-any.patch)
doesn't check that acquiring a heavy-weight lock after page type lock,
is that right?
No, it should do that.
There is the path doing that: ginInsertCleanup() holds
a page lock and insert the pending list items, which might hold a
relation extension lock.
Right, I could also see that, but do you see any problem with that? I
agree that Assert should cover this case, but I don't see any fundamental
problem with that.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, 9 Mar 2020 at 14:16, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sun, Mar 8, 2020 at 7:58 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
On Mon, 24 Feb 2020 at 19:08, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Feb 20, 2020 at 8:06 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2020-02-19 11:12:18 +0530, Amit Kapila wrote:
I think till we know the real need for changing group locking, going
in the direction of what Tom suggested to use an array of LWLocks [1]
to address the problems in hand is a good idea.
-many
I think that building yet another locking subsystem is the entirely
wrong idea - especially when there's imo no convincing architectural
reasons to do so.
Hmm, AFAIU, it will be done by having an array of LWLocks which we do
at other places as well (like BufferIO locks). I am not sure if we
can call it as new locking subsystem, but if we decide to continue
using lock.c and change group locking then I think we can do that as
well, see my comments below regarding that.
It is not very clear to me whether we are thinking to give up on Tom's
idea [1] and change group locking even though it is not clear or at
least nobody has proposed an idea/patch which requires that? Or are
we thinking that we can do what Tom suggested for relation extension
lock and also plan to change group locking for future parallel
operations that might require it?
What I'm advocating is that extension locks should continue to go
through lock.c. And yes, that requires some changes to group locking,
but I still don't see why they'd be complicated.
Fair position, as per initial analysis, I think if we do below three
things, it should work out without changing to a new way of locking
for relation extension or page type locks.
a. As per the discussion above, ensure in code we will never try to
acquire another heavy-weight lock after acquiring relation extension
or page type locks (probably by having Asserts in code or maybe some
other way).
The current patch
(v02_0001-Added-assert-to-verify-that-we-never-try-to-take-any.patch)
doesn't check that acquiring a heavy-weight lock after page type lock,
is that right?
No, it should do that.
There is the path doing that: ginInsertCleanup() holds
a page lock and insert the pending list items, which might hold a
relation extension lock.
Right, I could also see that, but do you see any problem with that? I agree that Assert should cover this case, but I don't see any fundamental problem with that.
I think that could be a problem if we change the group locking so that
it doesn't consider page lock type.
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Mar 9, 2020 at 11:38 AM Masahiko Sawada <
masahiko.sawada@2ndquadrant.com> wrote:
On Mon, 9 Mar 2020 at 14:16, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sun, Mar 8, 2020 at 7:58 AM Masahiko Sawada <
masahiko.sawada@2ndquadrant.com> wrote:
Fair position, as per initial analysis, I think if we do below three
things, it should work out without changing to a new way of locking
for relation extension or page type locks.
a. As per the discussion above, ensure in code we will never try to
acquire another heavy-weight lock after acquiring relation extension
or page type locks (probably by having Asserts in code or maybe some
other way).
The current patch
(v02_0001-Added-assert-to-verify-that-we-never-try-to-take-any.patch)
doesn't check that acquiring a heavy-weight lock after page type lock,
is that right?
No, it should do that.
There is the path doing that: ginInsertCleanup() holds
a page lock and insert the pending list items, which might hold a
relation extension lock.
Right, I could also see that, but do you see any problem with that? I
agree that Assert should cover this case, but I don't see any fundamental
problem with that.
I think that could be a problem if we change the group locking so that
it doesn't consider page lock type.
I might be missing something, but won't that be a problem only when if
there is a case where we acquire page lock after acquiring a relation
extension lock? Can you please explain the scenario you have in mind which
can create a problem?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, 9 Mar 2020 at 15:50, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Mar 9, 2020 at 11:38 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
On Mon, 9 Mar 2020 at 14:16, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sun, Mar 8, 2020 at 7:58 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
Fair position, as per initial analysis, I think if we do below three
things, it should work out without changing to a new way of locking
for relation extension or page type locks.
a. As per the discussion above, ensure in code we will never try to
acquire another heavy-weight lock after acquiring relation extension
or page type locks (probably by having Asserts in code or maybe some
other way).
The current patch
(v02_0001-Added-assert-to-verify-that-we-never-try-to-take-any.patch)
doesn't check that acquiring a heavy-weight lock after page type lock,
is that right?
No, it should do that.
There is the path doing that: ginInsertCleanup() holds
a page lock and insert the pending list items, which might hold a
relation extension lock.
Right, I could also see that, but do you see any problem with that? I agree that Assert should cover this case, but I don't see any fundamental problem with that.
I think that could be a problem if we change the group locking so that
it doesn't consider page lock type.
I might be missing something, but won't that be a problem only if there is a case where we acquire a page lock after acquiring a relation extension lock?
Yes, you're right.
Well I meant that the reason why we need to make the Assert cover the
page lock case is the same as the reason for the extension lock type
case. If we change the group locking so that it doesn't consider the
extension lock, and change deadlock detection so that it doesn't make
a wait edge for it, we need to ensure that the same backend doesn't
acquire a heavy-weight lock after holding a relation extension lock.
These are already done in the current patch. Similarly, if we made a
similar change for the page lock in the group locking and deadlock
detection, we need to ensure the same things for the page lock. But
ISTM it doesn't necessarily need to support page lock for now because
currently we use it only for cleaning up the pending list of a gin index.
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Mar 9, 2020 at 2:09 PM Masahiko Sawada <
masahiko.sawada@2ndquadrant.com> wrote:
On Mon, 9 Mar 2020 at 15:50, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Mar 9, 2020 at 11:38 AM Masahiko Sawada <
masahiko.sawada@2ndquadrant.com> wrote:
On Mon, 9 Mar 2020 at 14:16, Amit Kapila <amit.kapila16@gmail.com>
wrote:
On Sun, Mar 8, 2020 at 7:58 AM Masahiko Sawada <
masahiko.sawada@2ndquadrant.com> wrote:
Fair position, as per initial analysis, I think if we do below
three
things, it should work out without changing to a new way of locking
for relation extension or page type locks.
a. As per the discussion above, ensure in code we will never try to
acquire another heavy-weight lock after acquiring relation extension
or page type locks (probably by having Asserts in code or maybe
some
other way).
The current patch
(v02_0001-Added-assert-to-verify-that-we-never-try-to-take-any.patch)
doesn't check that acquiring a heavy-weight lock after page type lock,
is that right?
No, it should do that.
There is the path doing that: ginInsertCleanup() holds
a page lock and insert the pending list items, which might hold a
relation extension lock.
Right, I could also see that, but do you see any problem with that?
I agree that Assert should cover this case, but I don't see any fundamental
problem with that.
I think that could be a problem if we change the group locking so that
it doesn't consider page lock type.
I might be missing something, but won't that be a problem only if
there is a case where we acquire page lock after acquiring a relation
extension lock?
Yes, you're right.
Well I meant that the reason why we need to make Assert should cover
page locks case is the same as the reason for extension lock type
case. If we change the group locking so that it doesn't consider
extension lock and change deadlock so that it doesn't make a wait edge
for it, we need to ensure that the same backend doesn't acquire
heavy-weight lock after holding relation extension lock. These are
already done in the current patch. Similarly, if we did the similar
change for page lock in the group locking and deadlock, we need to
ensure the same things for page lock.
Agreed.
But ISTM it doesn't necessarily
need to support page lock for now because currently we use it only for
cleanup pending list of gin index.
I agree, but I think it is better to have a patch for the same even if we
want to review/commit that separately. That will help us to look at how
the complete solution looks.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Mar 9, 2020 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Mar 9, 2020 at 2:09 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
On Mon, 9 Mar 2020 at 15:50, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Mar 9, 2020 at 11:38 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
On Mon, 9 Mar 2020 at 14:16, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sun, Mar 8, 2020 at 7:58 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
Fair position, as per initial analysis, I think if we do below three
things, it should work out without changing to a new way of locking
for relation extension or page type locks.
a. As per the discussion above, ensure in code we will never try to
acquire another heavy-weight lock after acquiring relation extension
or page type locks (probably by having Asserts in code or maybe some
other way).
The current patch
(v02_0001-Added-assert-to-verify-that-we-never-try-to-take-any.patch)
doesn't check that acquiring a heavy-weight lock after page type lock,
is that right?
No, it should do that.
There is the path doing that: ginInsertCleanup() holds
a page lock and insert the pending list items, which might hold a
relation extension lock.
Right, I could also see that, but do you see any problem with that? I agree that Assert should cover this case, but I don't see any fundamental problem with that.
I think that could be a problem if we change the group locking so that
it doesn't consider page lock type.
I might be missing something, but won't that be a problem only if there is a case where we acquire a page lock after acquiring a relation extension lock?
Yes, you're right.
Well I meant that the reason why we need to make Assert should cover
page locks case is the same as the reason for extension lock type
case. If we change the group locking so that it doesn't consider
extension lock and change deadlock so that it doesn't make a wait edge
for it, we need to ensure that the same backend doesn't acquire
heavy-weight lock after holding relation extension lock. These are
already done in the current patch. Similarly, if we did the similar
change for page lock in the group locking and deadlock, we need to
ensure the same things for page lock.
Agreed.
But ISTM it doesn't necessarily
need to support page lock for now because currently we use it only for
cleanup pending list of gin index.
I agree, but I think it is better to have a patch for the same even if we want to review/commit that separately. That will help us to look at how the complete solution looks.
Please find the updated patch (summary of the changes)
- Instead of searching the lock hash table for assert, it maintains a counter.
- Also, handled the case where we can acquire the relation extension
lock while holding the relation extension lock on the same relation.
- Handled the error case.
In addition to that, I prepared a WIP patch for handling the PageLock.
First, I thought that we can use the same counter for the PageLock and
the RelationExtensionLock because in assert we just need to check
whether we are trying to acquire any other heavyweight lock while
holding any of these locks. But, the exceptional case where we are
allowed to acquire a relation extension lock while holding any of
these locks is a bit different. Because, if we are holding a relation
extension lock then we are allowed to acquire the relation extension lock
on the same relation, but it can not be any other relation, otherwise it
can create a cycle. But, the same is not true with the PageLock,
i.e. while holding the PageLock you can acquire the relation extension
lock on any relation, and that will be safe because the relation
extension lock guarantees that it will never create a cycle.
However, I agree that we don't have any such cases where we want to
acquire a relation extension lock on the different relations while
holding the PageLock.
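The two exceptions described above could be captured in a single predicate, something like the sketch below; the enum and function names are illustrative, not taken from the patch:

```c
#include <stdbool.h>

typedef enum
{
	LOCK_RELATION,				/* ordinary heavyweight lock */
	LOCK_RELATION_EXTEND,		/* relation extension lock */
	LOCK_PAGE					/* page lock (gin pending-list cleanup) */
} LockKind;

/*
 * While a relation extension lock is held, only the extension lock on the
 * same relation may be re-acquired.  While a page lock is held, an extension
 * lock on any relation is still safe, because extension locks never create
 * a cycle.  Any other heavyweight lock must be acquired before these locks.
 */
static bool
heavyweight_acquire_allowed(int relext_held_count, bool page_lock_held,
							LockKind new_lock, bool same_relation)
{
	if (relext_held_count > 0)
		return new_lock == LOCK_RELATION_EXTEND && same_relation;
	if (page_lock_held)
		return new_lock == LOCK_RELATION_EXTEND;
	return true;
}
```

This is the condition an Assert in the lock acquisition path would check, with the two counters supplying the first two arguments.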
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachments:
v3-0001-Add-assert-to-check-that-we-should-not-acquire-an.patch (application/octet-stream)
From bbd62ffc517ff0791c8ee215e5ec11e44f703f3e Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Sat, 7 Mar 2020 09:24:33 +0530
Subject: [PATCH v3 1/3] Add assert to check that we should not acquire any
other lock if we are already holding the relation extension lock. Only
exception is that if we are trying to acquire the relation extension lock
then we can hold the same lock.
---
src/backend/access/transam/xact.c | 15 +++++++++
src/backend/storage/lmgr/lmgr.c | 17 ++++++++++-
src/backend/storage/lmgr/lock.c | 64 +++++++++++++++++++++++++++++++++++++++
src/include/storage/lock.h | 4 +++
4 files changed, 99 insertions(+), 1 deletion(-)
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e3c60f2..ca64712 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2267,6 +2267,9 @@ CommitTransaction(void)
XactTopFullTransactionId = InvalidFullTransactionId;
nParallelCurrentXids = 0;
+ /* Reset the relation extension lock held count. */
+ ResetRelExtLockHeldCount();
+
/*
* done with commit processing, set current transaction state back to
* default
@@ -2735,6 +2738,9 @@ AbortTransaction(void)
pgstat_report_xact_timestamp(0);
}
+ /* Reset the relation extension lock held count. */
+ ResetRelExtLockHeldCount();
+
/*
* State remains TRANS_ABORT until CleanupTransaction().
*/
@@ -5006,6 +5012,9 @@ AbortSubTransaction(void)
AtEOSubXact_ApplyLauncher(false, s->nestingLevel);
}
+ /* Reset the relation extension lock held count. */
+ ResetRelExtLockHeldCount();
+
/*
* Restore the upper transaction's read-only state, too. This should be
* redundant with GUC's cleanup but we may as well do it for consistency
@@ -5062,6 +5071,12 @@ PushTransaction(void)
TransactionState s;
/*
+ * Relation extension lock must not be held while starting a new
+ * sub-transaction.
+ */
+ Assert(!IsRelExtLockHeld());
+
+ /*
* We keep subtransaction state nodes in TopTransactionContext.
*/
s = (TransactionState)
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 2010320..26760f8 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -408,6 +408,9 @@ LockRelationForExtension(Relation relation, LOCKMODE lockmode)
relation->rd_lockInfo.lockRelId.relId);
(void) LockAcquire(&tag, lockmode, false, false);
+
+ /* Increment the lock hold count. */
+ IncrementRelExtLockHeldCount();
}
/*
@@ -420,12 +423,21 @@ bool
ConditionalLockRelationForExtension(Relation relation, LOCKMODE lockmode)
{
LOCKTAG tag;
+ LockAcquireResult result;
SET_LOCKTAG_RELATION_EXTEND(tag,
relation->rd_lockInfo.lockRelId.dbId,
relation->rd_lockInfo.lockRelId.relId);
+ result = LockAcquire(&tag, lockmode, false, true);
- return (LockAcquire(&tag, lockmode, false, true) != LOCKACQUIRE_NOT_AVAIL);
+ /* Increment the lock hold count if we got the lock. */
+ if (result != LOCKACQUIRE_NOT_AVAIL)
+ {
+ IncrementRelExtLockHeldCount();
+ return true;
+ }
+
+ return false;
}
/*
@@ -458,6 +470,9 @@ UnlockRelationForExtension(Relation relation, LOCKMODE lockmode)
relation->rd_lockInfo.lockRelId.relId);
LockRelease(&tag, lockmode, false);
+
+ /* Decrement the lock hold count. */
+ DecrementRelExtLockHeldCount();
}
/*
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 56dba09..59c9ca3 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -170,6 +170,15 @@ typedef struct TwoPhaseLockRecord
*/
static int FastPathLocalUseCount = 0;
+/*
+ * Count of number of relation extension lock currently held by this backend.
+ * We need this counter so that we can ensure that while holding the relation
+ * extension lock we are not trying to acquire any other heavy weight lock.
+ * Basically, that will ensuring that the proc holding relation extension lock
+ * can not wait for any another lock.
+ */
+static int RelationExtensionLockHeldCount = 0;
+
/* Macros for manipulating proc->fpLockBits */
#define FAST_PATH_BITS_PER_SLOT 3
#define FAST_PATH_LOCKNUMBER_OFFSET 1
@@ -841,6 +850,15 @@ LockAcquireExtended(const LOCKTAG *locktag,
}
/*
+ * We should not acquire any other lock if we are already holding the
+ * relation extension lock. Only exception is that if we are trying to
+ * acquire the relation extension lock then we can hold the relation
+ * extension on the same relation.
+ */
+ Assert(!IsRelExtLockHeld() ||
+ ((locktag->locktag_type == LOCKTAG_RELATION_EXTEND) && found));
+
+ /*
* Prepare to emit a WAL record if acquisition of this lock needs to be
* replayed in a standby server.
*
@@ -4492,3 +4510,49 @@ LockWaiterCount(const LOCKTAG *locktag)
return waiters;
}
+
+/*
+ * IsRelExtLockHeld
+ *
+ * Return true if a relation extension lock is held by this backend.
+ */
+bool
+IsRelExtLockHeld()
+{
+ return RelationExtensionLockHeldCount > 0;
+}
+
+/*
+ * IncrementRelExtLockHeldCount
+ *
+ * Increment the relation extension lock held count.
+ */
+void
+IncrementRelExtLockHeldCount()
+{
+ RelationExtensionLockHeldCount++;
+}
+
+/*
+ * DecrementRelExtLockHeldCount
+ *
+ * Decrement the relation extension lock held count.
+ */
+void
+DecrementRelExtLockHeldCount()
+{
+ /* We must hold the relation extension lock. */
+ Assert(RelationExtensionLockHeldCount > 0);
+ RelationExtensionLockHeldCount--;
+}
+
+/*
+ * ResetRelExtLockHeldCount
+ *
+ * Reset the relation extension lock hold count;
+ */
+void
+ResetRelExtLockHeldCount()
+{
+ RelationExtensionLockHeldCount = 0;
+}
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index bb8e4e6..c31a5f3 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -582,6 +582,10 @@ extern void RememberSimpleDeadLock(PGPROC *proc1,
extern void InitDeadLockChecking(void);
extern int LockWaiterCount(const LOCKTAG *locktag);
+extern bool IsRelExtLockHeld(void);
+extern void IncrementRelExtLockHeldCount(void);
+extern void DecrementRelExtLockHeldCount(void);
+extern void ResetRelExtLockHeldCount(void);
#ifdef LOCK_DEBUG
extern void DumpLocks(PGPROC *proc);
--
1.8.3.1
Attachment: v3-0003-Conflict-Extension-Page-lock-in-group-member.patch
From d6686759d821c9f66473d7dc34d8459e6d1e6faf Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 9 Mar 2020 17:40:45 +0530
Subject: [PATCH v3 3/3] Conflict Extension/Page lock in group member
---
src/backend/storage/lmgr/deadlock.c | 9 +++++++++
src/backend/storage/lmgr/lock.c | 12 ++++++++++++
2 files changed, 21 insertions(+)
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index f8c5df0..49a5998 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -568,6 +568,15 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
proclock = (PROCLOCK *) SHMQueueNext(procLocks, procLocks,
offsetof(PROCLOCK, lockLink));
+ /*
+ * Relation extension/page locks never participate in an actual deadlock
+ * cycle, so skip the wait edge for these lock types to avoid any false
+ * cycle detection due to group locking.
+ */
+ if ((lock->tag.locktag_type == LOCKTAG_RELATION_EXTEND) ||
+ (lock->tag.locktag_type == LOCKTAG_PAGE))
+ return false;
+
while (proclock)
{
PGPROC *leader;
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 1fcff29..012ba96 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1433,6 +1433,18 @@ LockCheckConflicts(LockMethod lockMethodTable,
}
/*
+ * If it's a relation extension/page lock then it conflicts even between
+ * members of the same lock group.
+ */
+ if ((lock->tag.locktag_type == LOCKTAG_RELATION_EXTEND) ||
+ (lock->tag.locktag_type == LOCKTAG_PAGE))
+ {
+ PROCLOCK_PRINT("LockCheckConflicts: conflicting (simple)",
+ proclock);
+ return true;
+ }
+
+ /*
* Locks held in conflicting modes by members of our own lock group are
* not real conflicts; we can subtract those out and see if we still have
* a conflict. This is O(N) in the number of processes holding or
--
1.8.3.1
Attachment: v3-0002-WIP-Extend-the-patch-for-handling-PageLock.patch
From 3393076461ff724968de44c98688f0784c5d492b Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 9 Mar 2020 17:17:53 +0530
Subject: [PATCH v3 2/3] WIP-Extend the patch for handling PageLock
---
src/backend/access/transam/xact.c | 14 ++++----
src/backend/storage/lmgr/lmgr.c | 18 ++++++++++-
src/backend/storage/lmgr/lock.c | 67 +++++++++++++++++++++++++++++++++------
src/include/storage/lock.h | 5 ++-
4 files changed, 85 insertions(+), 19 deletions(-)
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index ca64712..ec7b7f8 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2267,8 +2267,8 @@ CommitTransaction(void)
XactTopFullTransactionId = InvalidFullTransactionId;
nParallelCurrentXids = 0;
- /* Reset the relation extension lock held count. */
- ResetRelExtLockHeldCount();
+ /* Reset the relation extension/page lock held count. */
+ ResetRelExtPageLockHeldCount();
/*
* done with commit processing, set current transaction state back to
@@ -2738,8 +2738,8 @@ AbortTransaction(void)
pgstat_report_xact_timestamp(0);
}
- /* Reset the relation extension lock held count. */
- ResetRelExtLockHeldCount();
+ /* Reset the relation extension/page lock held count. */
+ ResetRelExtPageLockHeldCount();
/*
* State remains TRANS_ABORT until CleanupTransaction().
@@ -5012,8 +5012,8 @@ AbortSubTransaction(void)
AtEOSubXact_ApplyLauncher(false, s->nestingLevel);
}
- /* Reset the relation extension lock held count. */
- ResetRelExtLockHeldCount();
+ /* Reset the relation extension/Page lock held count. */
+ ResetRelExtPageLockHeldCount();
/*
* Restore the upper transaction's read-only state, too. This should be
@@ -5074,7 +5074,7 @@ PushTransaction(void)
* Relation extension lock must not be held while starting a new
* sub-transaction.
*/
- Assert(!IsRelExtLockHeld());
+ Assert(!(IsRelExtLockHeld() || IsPageLockHeld()));
/*
* We keep subtransaction state nodes in TopTransactionContext.
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 26760f8..b0df063 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -492,6 +492,9 @@ LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode)
blkno);
(void) LockAcquire(&tag, lockmode, false, false);
+
+ /* Increment the lock held count. */
+ IncrementPageLockHeldCount();
}
/*
@@ -504,13 +507,22 @@ bool
ConditionalLockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode)
{
LOCKTAG tag;
+ LockAcquireResult result;
SET_LOCKTAG_PAGE(tag,
relation->rd_lockInfo.lockRelId.dbId,
relation->rd_lockInfo.lockRelId.relId,
blkno);
- return (LockAcquire(&tag, lockmode, false, true) != LOCKACQUIRE_NOT_AVAIL);
+ result = LockAcquire(&tag, lockmode, false, true);
+ if (result != LOCKACQUIRE_NOT_AVAIL)
+ {
+ /* Increment the lock held count. */
+ IncrementPageLockHeldCount();
+ return true;
+ }
+
+ return false;
}
/*
@@ -527,6 +539,10 @@ UnlockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode)
blkno);
LockRelease(&tag, lockmode, false);
+
+ /* Decrement the lock held count. */
+ DecrementPageLockHeldCount();
+
}
/*
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 59c9ca3..1fcff29 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -171,13 +171,14 @@ typedef struct TwoPhaseLockRecord
static int FastPathLocalUseCount = 0;
/*
- * Count of number of relation extension lock currently held by this backend.
- * We need this counter so that we can ensure that while holding the relation
- * extension lock we are not trying to acquire any other heavy weight lock.
- * Basically, that will ensuring that the proc holding relation extension lock
- * can not wait for any another lock.
+ * Count of the number of relation extension/page locks currently held by
+ * this backend. We need this counter to ensure that while holding a
+ * relation extension/page lock we do not try to acquire any other
+ * heavyweight lock that could cause a deadlock, i.e. that a proc holding
+ * a relation extension/page lock cannot wait for any other lock.
*/
static int RelationExtensionLockHeldCount = 0;
+static int PageLockHeldCount = 0;
/* Macros for manipulating proc->fpLockBits */
#define FAST_PATH_BITS_PER_SLOT 3
@@ -851,14 +852,24 @@ LockAcquireExtended(const LOCKTAG *locktag,
/*
* We should not acquire any other lock if we are already holding the
- * relation extension lock. Only exception is that if we are trying to
+ * relation extension/page lock. Only exception is that if we are trying to
* acquire the relation extension lock then we can hold the relation
- * extension on the same relation.
+ * extension/page lock.
*/
Assert(!IsRelExtLockHeld() ||
((locktag->locktag_type == LOCKTAG_RELATION_EXTEND) && found));
/*
+ * XXX While holding the page lock we don't need to check whether we are
+ * trying to acquire the relation extension lock on the same relation or
+ * any other relation, because the above assert ensures that after
+ * acquiring the relation extension lock we are not going to wait for any
+ * other process.
+ */
+ Assert(!IsPageLockHeld() ||
+ (locktag->locktag_type == LOCKTAG_RELATION_EXTEND));
+
+ /*
* Prepare to emit a WAL record if acquisition of this lock needs to be
* replayed in a standby server.
*
@@ -4547,12 +4558,48 @@ DecrementRelExtLockHeldCount()
}
/*
- * ResetRelExtLockHeldCount
+ * ResetRelExtPageLockHeldCount
*
- * Reset the relation extension lock hold count;
+ * Reset the relation extension/page lock held count.
*/
void
-ResetRelExtLockHeldCount()
+ResetRelExtPageLockHeldCount()
{
RelationExtensionLockHeldCount = 0;
+ PageLockHeldCount = 0;
+}
+
+/*
+ * IsPageLockHeld
+ *
+ * Return true if a page lock is held by this backend.
+ */
+bool
+IsPageLockHeld()
+{
+ return PageLockHeldCount > 0;
+}
+
+/*
+ * IncrementPageLockHeldCount
+ *
+ * Increment the page lock hold count.
+ */
+void
+IncrementPageLockHeldCount()
+{
+ PageLockHeldCount++;
+}
+
+/*
+ * DecrementPageLockHeldCount
+ *
+ * Decrement the page lock held count.
+ */
+void
+DecrementPageLockHeldCount()
+{
+ /* We must hold the page lock. */
+ Assert(PageLockHeldCount > 0);
+ PageLockHeldCount--;
}
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index c31a5f3..ed8fbdc 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -585,7 +585,10 @@ extern int LockWaiterCount(const LOCKTAG *locktag);
extern bool IsRelExtLockHeld(void);
extern void IncrementRelExtLockHeldCount(void);
extern void DecrementRelExtLockHeldCount(void);
-extern void ResetRelExtLockHeldCount(void);
+extern void ResetRelExtPageLockHeldCount(void);
+extern bool IsPageLockHeld(void);
+extern void IncrementPageLockHeldCount(void);
+extern void DecrementPageLockHeldCount(void);
#ifdef LOCK_DEBUG
extern void DumpLocks(PGPROC *proc);
--
1.8.3.1
On Mon, Feb 24, 2020 at 3:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Feb 20, 2020 at 8:06 AM Andres Freund <andres@anarazel.de> wrote:
What I'm advocating is that extension locks should continue to go
through lock.c. And yes, that requires some changes to group locking,
but I still don't see why they'd be complicated.

Fair position, as per initial analysis, I think if we do below three
things, it should work out without changing to a new way of locking
for relation extension or page type locks.
a. As per the discussion above, ensure in code we will never try to
acquire another heavy-weight lock after acquiring relation extension
or page type locks (probably by having Asserts in code or maybe some
other way).
I have done an analysis of the relation extension lock (which can be
acquired via LockRelationForExtension or
ConditionalLockRelationForExtension) and found that we don't acquire
any other heavyweight lock after acquiring it. However, we do
sometimes try to acquire it again in the places where we update FSM
after extension, see points (e) and (f) described below. The usage of
this lock can be broadly divided into six categories and each one is
explained as follows:
a. Where after taking the relation extension lock we call ReadBuffer
(or its variant) and then LockBuffer. The LockBuffer internally calls
either LWLock to acquire or release neither of which acquire another
heavy-weight lock. It is quite obvious as well that while taking some
lightweight lock, there is no reason to acquire another heavyweight
lock on any object. The specs/comments of ReadBufferExtended (which
gets called from variants of ReadBuffer) API says that if the blknum
requested is P_NEW, only one backend can call it at-a-time which
indicates that we don't need to acquire any heavy-weight lock inside
this API. Otherwise, also, this API won't need a heavy-weight lock to
read the existing block into shared buffer as two different backends
are allowed to read the same block. I have also gone through all the
functions called/used in this path to ensure that we don't use
heavy-weight locks inside it.
The usage by APIs BloomNewBuffer, GinNewBuffer, gistNewBuffer,
_bt_getbuf, and SpGistNewBuffer falls in this category. Another API
that falls under this category is revmap_physical_extend which uses
ReadBuffer, LocakBuffer and ReleaseBuffer. The ReleaseBuffer API
unpins aka decrement the reference count for buffer and disassociates
a buffer from the resource owner. None of that requires heavy-weight
lock. T
b. After taking relation extension lock, we call
RelationGetNumberOfBlocks which primarily calls file-level functions
to determine the size of the file. This doesn't acquire any other
heavy-weight lock after relation extension lock.
The usage by APIs ginvacuumcleanup, gistvacuumscan, btvacuumscan, and
spgvacuumscan falls in this category.
c. There is a usage in API brin_page_cleanup() where we just acquire
and release the relation extension lock to avoid reinitializing the
page. As there is no call in-between acquire and release, so there is
no chance of another heavy-weight lock acquire after having relation
extension lock.
d. In fsm_extend() and vm_extend(), after acquiring relation extension
lock, we perform various file-level operations like RelationOpenSmgr,
smgrexists, smgrcreate, smgrnblocks, smgrextend. First, from theory,
we don't have any heavy-weight lock other than relation extension lock
which can cover such operations and then I have verified it by going
through these APIs that these don't acquire any other heavy-weight
lock. Then these APIs also call PageSetChecksumInplace, which computes a
checksum of the page and sets it in the page header; this is quite
straightforward and doesn't acquire any heavy-weight lock.
In vm_extend, we additionally call CacheInvalidateSmgr to send a
shared-inval message to force other backends to close any smgr
references they may have for the relation whose visibility map we are
extending; this has no reason to acquire any heavy-weight lock.
I have checked the code path as well and I didn't find any
heavy-weight lock call in that.
e. In brin_getinsertbuffer, we call ReadBuffer() and LockBuffer(), the
usage of which is the same as what is mentioned in (a). In addition
to that it calls brin_initialize_empty_new_buffer() which further
calls RecordPageWithFreeSpace which can again acquire relation
extension lock for same relation. This usage is safe because we have
a mechanism in heavy-weight lock manager that if we already hold a
lock and a request came for the same lock and in same mode, the lock
will be granted.
f. In RelationGetBufferForTuple(), there are multiple APIs that get
called and like (e), it can try to reacquire the relation extension
lock in one of those APIs. The main APIs it calls after acquiring
relation extension lock are described as follows:
- GetPageWithFreeSpace: This tries to find a page in the given
relation with at least the specified amount of free space. This
mainly checks the FSM pages and in one of the paths might call
fsm_extend which can again try to acquire the relation extension lock
on the same relation.
- RelationAddExtraBlocks: This adds multiple pages in a relation if
there is contention around relation extension lock. This calls
RelationExtensionLockWaiterCount which is mainly to check how many
lockers are waiting for the same lock, then call ReadBufferBI which as
explained above won't require heavy-weight locks and FSM APIs which
can acquire Relation extension lock on the same relation, but that is
safe as discussed previously.
The Page locks can be acquired via LockPage and ConditionalLockPage.
This is acquired from one place in the code during Gin index cleanup
(ginInsertCleanup). The basic idea is that it will scan the pending
list and move entries into the main index. While moving entries to
the main page, it might need to add a new page that will require us to
take a relation extension lock. Now, unlike relation extension lock,
after acquiring page lock, we do acquire another heavy-weight lock
(relation extension lock), but as we never acquire it in reverse
order, this is safe.
So, as per this analysis, we can add Asserts for relation extension
and page locks which will indicate that they won't participate in
deadlocks. It would be good if someone else can also do independent
analysis and verify my findings.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Mar 6, 2020 at 11:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
I think instead of the flag we need to keep the counter because we can
acquire the same relation extension lock multiple times. So
basically, every time we acquire the lock we can increment the counter
and while releasing we can decrement it. During an error path, I
think it is fine to set it to 0 in CommitTransaction/AbortTransaction.
But, I am not sure that we can set to 0 or decrement it in
AbortSubTransaction because we are not sure whether we have acquired
the lock under this subtransaction or not.
I think that CommitTransaction, AbortTransaction, and friends have
*zero* business touching this. I think the counter - or flag - should
track whether we've got a PROCLOCK entry for a relation extension
lock. We either do, or we do not, and that does not change because of
anything having to do with the transaction state. It changes because
somebody calls LockRelease() or LockReleaseAll().
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Mar 7, 2020 at 10:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I continue to think that we'd be better off getting all of this
out of the heavyweight lock manager. There is no reason why we
should need deadlock detection, or multiple holds of the same
lock, or pretty much anything that LWLocks don't give you.
Well, that was my initial inclination too, but Andres didn't like it.
I don't know whether it's better to take his advice or yours.
The one facility that we need here which the heavyweight lock facility
does provide and the lightweight lock facility does not is the ability
to take locks on an effectively unlimited number of distinct objects.
That is, we can't have a separate LWLock for every relation, because
there are ~2^32 relation OIDs per database, and ~2^32 database OIDs, and a
patch that tried to allocate a tranche of 2^64 LWLocks would probably
get shot down.
The patch I wrote for this tried to work around this by having an
array of LWLocks and hashing <DBOID, RELOID> pairs onto array slots.
This produces some false sharing, though, which Andres didn't like
(and I can understand his concern). We could work around that problem
with a more complex design, where the LWLocks in the array do not
themselves represent the right to extend the relation, but only
protect the list of lockers. But at that point it starts to look like
you are reinventing the whole LOCK/PROCLOCK division.
So from my point of view we've got three possible approaches here, all
imperfect:
- Hash <DB, REL> pairs onto an array of LWLocks that represent the
right to extend the relation. Problem: false sharing for the whole
time the lock is held.
- Hash <DB, REL> pairs onto an array of LWLocks that protect a list of
lockers. Problem: looks like reinventing LOCK/PROCLOCK mechanism,
which is a fair amount of complexity to be duplicating.
- Adapt the heavyweight lock manager. Problem: Code is old, complex,
grotty, and doesn't need more weird special cases.
Whatever we choose, I think we ought to try to get Page locks and
Relation Extension locks into the same system. They're conceptually
the same kind of thing: you're not locking an SQL object, you
basically want an LWLock, but you can't use an LWLock because you want
to lock an OID not a piece of shared memory, so you can't have enough
LWLocks to use them in the regular way.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Mar 10, 2020 at 6:48 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Mar 7, 2020 at 10:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I continue to think that we'd be better off getting all of this
out of the heavyweight lock manager. There is no reason why we
should need deadlock detection, or multiple holds of the same
lock, or pretty much anything that LWLocks don't give you.

Well, that was my initial inclination too, but Andres didn't like it.
I don't know whether it's better to take his advice or yours.

The one facility that we need here which the heavyweight lock facility
does provide and the lightweight lock facility does not is the ability
to take locks on an effectively unlimited number of distinct objects.
That is, we can't have a separate LWLock for every relation, because
there ~2^32 relation OIDs per database, and ~2^32 database OIDs, and a
patch that tried to allocate a tranche of 2^64 LWLocks would probably
get shot down.
I think if we have to follow any LWLock based design, then we also
need to think about the case where, if the lock is already acquired by
the backend (say in X mode), it should be granted if the same backend
tries to acquire it in the same mode (or a mode compatible with the
mode in which it is already acquired). As per my analysis above [1],
we do this at multiple places for relation extension lock.
[1]: /messages/by-id/CAA4eK1+E8Vu=9PYZBZvMrga0Ynz_m6jmT3G_vJv-3L1PWv9Krg@mail.gmail.com
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Mar 10, 2020 at 8:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
Please find the updated patch (summary of the changes)
- Instead of searching the lock hash table for assert, it maintains a counter.
- Also, handled the case where we can acquire the relation extension
lock while holding the relation extension lock on the same relation.
- Handled the error case.

In addition to that prepared a WIP patch for handling the PageLock.
First, I thought that we can use the same counter for the PageLock and
the RelationExtensionLock because in assert we just need to check
whether we are trying to acquire any other heavyweight lock while
holding any of these locks. But, the exceptional case where we
allowed to acquire a relation extension lock while holding any of
these locks is a bit different. If we are holding a relation
extension lock then we are allowed to acquire the relation extension
lock on the same relation, but not on any other relation, as otherwise
it can create a cycle. But the same is not true with the PageLock,
i.e. while holding the PageLock you can acquire the relation extension
lock on any relation and that will be safe, because the relation
extension lock guarantees that it will never create a cycle.
However, I agree that we don't have any such cases where we want to
acquire a relation extension lock on the different relations while
holding the PageLock.
Right, today, we don't have such cases where after acquiring relation
extension or page lock for a particular relation, we need to acquire
any of those for other relation and I am not able to offhand think of
many cases where we might have such a need in the future. The one
theoretical possibility is to include fork_num in the lock tag while
acquiring extension lock for fsm/vm, but that will also have the same
relation. Similarly one might say it is valid to acquire extension
lock in share mode after we have acquired it exclusive mode. I am not
sure how much futuristic we want to make these Asserts.
I feel we should cover the current possible cases (which I think will
make the asserts stricter than required) and if there is a need to
relax them in the future for any particular use case, then we will
consider those. In general, if we consider the way Mahendra has
written a patch which is to find the entry via the local hash table to
check for an Assert condition, then it will be a bit easier to extend
the checks if required in future as that way we have more information
about the particular lock. However, it will make the check more
expensive which might be okay considering that it is only for Assert
enabled builds.
One minor comment:
/*
+ * We should not acquire any other lock if we are already holding the
+ * relation extension lock. Only exception is that if we are trying to
+ * acquire the relation extension lock then we can hold the relation
+ * extension on the same relation.
+ */
+ Assert(!IsRelExtLockHeld() ||
+ ((locktag->locktag_type == LOCKTAG_RELATION_EXTEND) && found));
I think you don't need the second part of the check because if we have
found the lock in the local lock table, we would return before this
check. I think it will catch the case where if we have an extension
lock on one relation, then it won't allow us to acquire it on another
relation. OTOH, it will also not allow cases where backend has
relation extension lock in Exclusive mode and it tries to acquire it
in Shared mode. So, not sure if it is a good idea.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Mar 10, 2020 at 6:39 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Mar 6, 2020 at 11:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
I think instead of the flag we need to keep the counter because we can
acquire the same relation extension lock multiple times. So
basically, every time we acquire the lock we can increment the counter
and while releasing we can decrement it. During an error path, I
think it is fine to set it to 0 in CommitTransaction/AbortTransaction.
But, I am not sure that we can set to 0 or decrement it in
AbortSubTransaction because we are not sure whether we have acquired
the lock under this subtransaction or not.

I think that CommitTransaction, AbortTransaction, and friends have
*zero* business touching this. I think the counter - or flag - should
track whether we've got a PROCLOCK entry for a relation extension
lock. We either do, or we do not, and that does not change because of
anything have to do with the transaction state. It changes because
somebody calls LockRelease() or LockReleaseAll().
Do we want to have a special check in the LockRelease() to identify
whether we are releasing relation extension lock? If not, then how we
will identify that relation extension is released and we can reset it
during subtransaction abort due to error? During success paths, we
know when we have released RelationExtension or Page Lock (via
UnlockRelationForExtension or UnlockPage). During the top-level
transaction end, we know when we have released all the locks, so that
will imply that RelationExtension and or Page locks must have been
released by that time.
If we have no other choice, then I see a few downsides of adding a
special check in the LockRelease() call:
1. Instead of resetting/decrement the variable from specific APIs like
UnlockRelationForExtension or UnlockPage, we need to have it in
LockRelease. It will also look odd, if set variable in
LockRelationForExtension, but don't reset in the
UnlockRelationForExtension variant. Now, maybe we can allow to reset
it at both places if it is a flag, but not if it is a counter
variable.
2. One can argue that adding extra instructions in a generic path
(like LockRelease) is not a good idea, especially if those are for an
Assert. I understand this won't add anything which we can measure by
standard benchmarks.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Mar 11, 2020 at 2:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Mar 10, 2020 at 8:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
Please find the updated patch (summary of the changes)
- Instead of searching the lock hash table for assert, it maintains a counter.
- Also, handled the case where we can acquire the relation extension
lock while holding the relation extension lock on the same relation.
- Handled the error case.

In addition to that prepared a WIP patch for handling the PageLock.
First, I thought that we can use the same counter for the PageLock and
the RelationExtensionLock because in assert we just need to check
whether we are trying to acquire any other heavyweight lock while
holding any of these locks. But, the exceptional case where we
allowed to acquire a relation extension lock while holding any of
these locks is a bit different. Because, if we are holding a relation
extension lock then we allowed to acquire the relation extension lock
on the same relation but it can not be any other relation otherwise it
can create a cycle. But, the same is not true with the PageLock,
i.e. while holding the PageLock you can acquire the relation extension
lock on any relation and that will be safe because the relation
extension lock guarantee that, it will never create the cycle.
However, I agree that we don't have any such cases where we want to
acquire a relation extension lock on the different relations while
holding the PageLock.

Right, today, we don't have such cases where after acquiring relation
extension or page lock for a particular relation, we need to acquire
any of those for another relation, and I am not able to think offhand of
many cases where we might have such a need in the future. The one
theoretical possibility is to include fork_num in the lock tag while
acquiring the extension lock for the fsm/vm, but that will also have the
same relation. Similarly, one might say it is valid to acquire the
extension lock in share mode after we have acquired it in exclusive mode.
I am not sure how futuristic we want to make these Asserts.

I feel we should cover the current possible cases (which I think will
make the asserts more strict than required) and if there is a need to
relax them in the future for any particular use case, then we will
consider those. In general, if we consider the way Mahendra has
written the patch, which is to find the entry via the local hash table to
check the Assert condition, then it will be a bit easier to extend
the checks if required in the future, as that way we have more information
about the particular lock. However, it will make the check more
expensive, which might be okay considering that it is only for
Assert-enabled builds.

One minor comment:

 /*
+ * We should not acquire any other lock if we are already holding the
+ * relation extension lock. Only exception is that if we are trying to
+ * acquire the relation extension lock then we can hold the relation
+ * extension on the same relation.
+ */
+ Assert(!IsRelExtLockHeld() ||
+ ((locktag->locktag_type == LOCKTAG_RELATION_EXTEND) && found));

I think you don't need the second part of the check because if we have
found the lock in the local lock table, we would return before this
check.
Right.
I think it will catch the case where, if we have an extension
lock on one relation, it won't allow us to acquire it on another
relation.

But those will be caught even if we remove the second part, right?
Basically, if we have Assert(!IsRelExtLockHeld()), that means by this
time you should not hold any relation extension lock. The exceptional
case where we allow relation extension on the same relation will
anyway not reach here. I think the second part of the Assert is just
useless.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Tue, Mar 10, 2020 at 4:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Feb 24, 2020 at 3:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Feb 20, 2020 at 8:06 AM Andres Freund <andres@anarazel.de> wrote:
What I'm advocating is that extension locks should continue to go
through lock.c. And yes, that requires some changes to group locking,
but I still don't see why they'd be complicated.

Fair position, as per initial analysis, I think if we do below three
things, it should work out without changing to a new way of locking
for relation extension or page type locks.
a. As per the discussion above, ensure in code we will never try to
acquire another heavy-weight lock after acquiring relation extension
or page type locks (probably by having Asserts in code or maybe some
other way).

I have done an analysis of the relation extension lock (which can be
acquired via LockRelationForExtension or
ConditionalLockRelationForExtension) and found that we don't acquire
any other heavyweight lock after acquiring it. However, we do
sometimes try to acquire it again in the places where we update FSM
after extension, see points (e) and (f) described below. The usage of
this lock can be broadly divided into six categories and each one is
explained as follows:

a. Where after taking the relation extension lock we call ReadBuffer
(or one of its variants) and then LockBuffer. LockBuffer internally
calls LWLockAcquire or LWLockRelease, neither of which acquires another
heavy-weight lock. It is quite obvious as well that while taking some
lightweight lock, there is no reason to acquire another heavyweight
lock on any object. The specs/comments of ReadBufferExtended (which
gets called from the variants of ReadBuffer) say that if the blknum
requested is P_NEW, only one backend can call it at a time, which
indicates that we don't need to acquire any heavy-weight lock inside
this API. Otherwise, too, this API won't need a heavy-weight lock to
read an existing block into a shared buffer, as two different backends
are allowed to read the same block. I have also gone through all the
functions called/used in this path to ensure that we don't use
heavy-weight locks inside it.

The usage by the APIs BloomNewBuffer, GinNewBuffer, gistNewBuffer,
_bt_getbuf, and SpGistNewBuffer falls in this category. Another API
that falls under this category is revmap_physical_extend which uses
ReadBuffer, LockBuffer, and ReleaseBuffer. The ReleaseBuffer API
unpins (i.e., decrements the reference count for) the buffer and
disassociates it from the resource owner. None of that requires a
heavy-weight lock.

b. After taking the relation extension lock, we call
RelationGetNumberOfBlocks which primarily calls file-level functions
to determine the size of the file. This doesn't acquire any other
heavy-weight lock after relation extension lock.

The usage by APIs ginvacuumcleanup, gistvacuumscan, btvacuumscan, and
spgvacuumscan falls in this category.

c. There is a usage in API brin_page_cleanup() where we just acquire
and release the relation extension lock to avoid reinitializing the
page. As there is no call in between acquire and release, there is
no chance of acquiring another heavy-weight lock while holding the
relation extension lock.

d. In fsm_extend() and vm_extend(), after acquiring the relation extension
lock, we perform various file-level operations like RelationOpenSmgr,
smgrexists, smgrcreate, smgrnblocks, and smgrextend. First, in theory,
we don't have any heavy-weight lock other than the relation extension
lock which could cover such operations, and I have verified by going
through these APIs that they don't acquire any other heavy-weight
lock. These APIs also call PageSetChecksumInplace, which computes a
checksum of the page and sets it in the page header; that is quite
straightforward and doesn't acquire any heavy-weight lock.

In vm_extend, we additionally call CacheInvalidateSmgr to send a
shared-inval message to force other backends to close any smgr
references they may have for the relation whose visibility map we are
extending; this has no reason to acquire any heavy-weight lock. I have
checked the code path as well and didn't find any heavy-weight lock
call in it.

e. In brin_getinsertbuffer, we call ReadBuffer() and LockBuffer(), the
usage of which is the same as what is mentioned in (a). In addition
to that it calls brin_initialize_empty_new_buffer() which further
calls RecordPageWithFreeSpace which can again acquire relation
extension lock for the same relation. This usage is safe because the
heavy-weight lock manager has a mechanism whereby, if we already hold a
lock and a request comes for the same lock in the same mode, the lock
will be granted.

f. In RelationGetBufferForTuple(), there are multiple APIs that get
called and like (e), it can try to reacquire the relation extension
lock in one of those APIs. The main APIs it calls after acquiring
relation extension lock are described as follows:
- GetPageWithFreeSpace: This tries to find a page in the given
relation with at least the specified amount of free space. This
mainly checks the FSM pages and in one of the paths might call
fsm_extend which can again try to acquire the relation extension lock
on the same relation.
- RelationAddExtraBlocks: This adds multiple pages in a relation if
there is contention on the relation extension lock. This calls
RelationExtensionLockWaiterCount, which mainly checks how many
lockers are waiting for the same lock; it then calls ReadBufferBI,
which as explained above won't require heavy-weight locks, and FSM
APIs, which can acquire the relation extension lock on the same
relation, but that is safe as discussed previously.

Page locks can be acquired via LockPage and ConditionalLockPage.
This is acquired from one place in the code during Gin index cleanup
(ginInsertCleanup). The basic idea is that it will scan the pending
list and move entries into the main index. While moving entries to
the main page, it might need to add a new page that will require us to
take a relation extension lock. Now, unlike the relation extension lock,
after acquiring a page lock we do acquire another heavy-weight lock
(the relation extension lock), but as we never acquire them in reverse
order, this is safe.

So, as per this analysis, we can add Asserts for relation extension
and page locks which will indicate that they won't participate in
deadlocks. It would be good if someone else can also do independent
analysis and verify my findings.
I have also analyzed the usage of the RelationExtensionLock and the
PageLock. And, my findings are on the same lines.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Mar 11, 2020 at 2:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Mar 10, 2020 at 8:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
Please find the updated patch (summary of the changes)
- Instead of searching the lock hash table for assert, it maintains a counter.
- Also, handled the case where we can acquire the relation extension
lock while holding the relation extension lock on the same relation.
- Handled the error case.

In addition to that, I prepared a WIP patch for handling the PageLock.
First, I thought that we could use the same counter for the PageLock and
the RelationExtensionLock, because in the assert we just need to check
whether we are trying to acquire any other heavyweight lock while
holding any of these locks. But the exceptional case where we are
allowed to acquire a relation extension lock while holding any of
these locks is a bit different. If we are holding a relation
extension lock, then we are allowed to acquire the relation extension
lock on the same relation, but not on any other relation, as that
could create a cycle. But the same is not true with the PageLock,
i.e. while holding the PageLock you can acquire the relation extension
lock on any relation, and that will be safe because the relation
extension lock guarantees that it will never create a cycle.
However, I agree that we don't have any such cases where we want to
acquire a relation extension lock on a different relation while
holding the PageLock.

Right, today we don't have such cases where, after acquiring a relation
extension or page lock for a particular relation, we need to acquire
any of those for another relation, and I am not able to think offhand of
many cases where we might have such a need in the future. The one
theoretical possibility is to include fork_num in the lock tag while
acquiring the extension lock for the fsm/vm, but that will also have the
same relation. Similarly, one might say it is valid to acquire the
extension lock in share mode after we have acquired it in exclusive mode.
I am not sure how futuristic we want to make these Asserts.

I feel we should cover the current possible cases (which I think will
make the asserts more strict than required) and if there is a need to
relax them in the future for any particular use case, then we will
consider those. In general, if we consider the way Mahendra has
written the patch, which is to find the entry via the local hash table to
check the Assert condition, then it will be a bit easier to extend
the checks if required in the future, as that way we have more information
about the particular lock. However, it will make the check more
expensive, which might be okay considering that it is only for
Assert-enabled builds.

One minor comment:

 /*
+ * We should not acquire any other lock if we are already holding the
+ * relation extension lock. Only exception is that if we are trying to
+ * acquire the relation extension lock then we can hold the relation
+ * extension on the same relation.
+ */
+ Assert(!IsRelExtLockHeld() ||
+ ((locktag->locktag_type == LOCKTAG_RELATION_EXTEND) && found));
I have fixed this in the attached patch set.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachments:
v4-0001-Add-assert-to-check-that-we-should-not-acquire-an.patchapplication/octet-stream; name=v4-0001-Add-assert-to-check-that-we-should-not-acquire-an.patchDownload
From 10571bb69268d8e9e739122eda4507cd65c5bbce Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Sat, 7 Mar 2020 09:24:33 +0530
Subject: [PATCH v4 1/3] Add assert to check that we should not acquire any
other lock if we are already holding the relation extension lock. Only
exception is that if we are trying to acquire the relation extension lock
then we can hold the same lock.
---
src/backend/access/transam/xact.c | 15 ++++++++++
src/backend/storage/lmgr/lmgr.c | 17 ++++++++++-
src/backend/storage/lmgr/lock.c | 63 +++++++++++++++++++++++++++++++++++++++
src/include/storage/lock.h | 4 +++
4 files changed, 98 insertions(+), 1 deletion(-)
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e3c60f2..ca64712 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2267,6 +2267,9 @@ CommitTransaction(void)
XactTopFullTransactionId = InvalidFullTransactionId;
nParallelCurrentXids = 0;
+ /* Reset the relation extension lock held count. */
+ ResetRelExtLockHeldCount();
+
/*
* done with commit processing, set current transaction state back to
* default
@@ -2735,6 +2738,9 @@ AbortTransaction(void)
pgstat_report_xact_timestamp(0);
}
+ /* Reset the relation extension lock held count. */
+ ResetRelExtLockHeldCount();
+
/*
* State remains TRANS_ABORT until CleanupTransaction().
*/
@@ -5006,6 +5012,9 @@ AbortSubTransaction(void)
AtEOSubXact_ApplyLauncher(false, s->nestingLevel);
}
+ /* Reset the relation extension lock held count. */
+ ResetRelExtLockHeldCount();
+
/*
* Restore the upper transaction's read-only state, too. This should be
* redundant with GUC's cleanup but we may as well do it for consistency
@@ -5062,6 +5071,12 @@ PushTransaction(void)
TransactionState s;
/*
+ * Relation extension lock must not be held while starting a new
+ * sub-transaction.
+ */
+ Assert(!IsRelExtLockHeld());
+
+ /*
* We keep subtransaction state nodes in TopTransactionContext.
*/
s = (TransactionState)
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 2010320..26760f8 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -408,6 +408,9 @@ LockRelationForExtension(Relation relation, LOCKMODE lockmode)
relation->rd_lockInfo.lockRelId.relId);
(void) LockAcquire(&tag, lockmode, false, false);
+
+ /* Increment the lock hold count. */
+ IncrementRelExtLockHeldCount();
}
/*
@@ -420,12 +423,21 @@ bool
ConditionalLockRelationForExtension(Relation relation, LOCKMODE lockmode)
{
LOCKTAG tag;
+ LockAcquireResult result;
SET_LOCKTAG_RELATION_EXTEND(tag,
relation->rd_lockInfo.lockRelId.dbId,
relation->rd_lockInfo.lockRelId.relId);
+ result = LockAcquire(&tag, lockmode, false, true);
- return (LockAcquire(&tag, lockmode, false, true) != LOCKACQUIRE_NOT_AVAIL);
+ /* Increment the lock hold count if we got the lock. */
+ if (result != LOCKACQUIRE_NOT_AVAIL)
+ {
+ IncrementRelExtLockHeldCount();
+ return true;
+ }
+
+ return false;
}
/*
@@ -458,6 +470,9 @@ UnlockRelationForExtension(Relation relation, LOCKMODE lockmode)
relation->rd_lockInfo.lockRelId.relId);
LockRelease(&tag, lockmode, false);
+
+ /* Decrement the lock hold count. */
+ DecrementRelExtLockHeldCount();
}
/*
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 56dba09..6e65d8b 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -170,6 +170,15 @@ typedef struct TwoPhaseLockRecord
*/
static int FastPathLocalUseCount = 0;
+/*
+ * Count of number of relation extension lock currently held by this backend.
+ * We need this counter so that we can ensure that while holding the relation
+ * extension lock we are not trying to acquire any other heavy weight lock.
+ * Basically, that will ensuring that the proc holding relation extension lock
+ * can not wait for any another lock.
+ */
+static int RelationExtensionLockHeldCount = 0;
+
/* Macros for manipulating proc->fpLockBits */
#define FAST_PATH_BITS_PER_SLOT 3
#define FAST_PATH_LOCKNUMBER_OFFSET 1
@@ -841,6 +850,14 @@ LockAcquireExtended(const LOCKTAG *locktag,
}
/*
+ * We should not try to acquire any other heavyweight lock if we are already
+ * holding the relation extension lock. If we are trying to hold the same
+ * relation extension lock then it should have been already granted so we
+ * will not come here.
+ */
+ Assert(!IsRelExtLockHeld());
+
+ /*
* Prepare to emit a WAL record if acquisition of this lock needs to be
* replayed in a standby server.
*
@@ -4492,3 +4509,49 @@ LockWaiterCount(const LOCKTAG *locktag)
return waiters;
}
+
+/*
+ * IsRelExtLockHeld
+ *
+ * Is the relation extension lock held by this backend?
+ */
+bool
+IsRelExtLockHeld()
+{
+ return RelationExtensionLockHeldCount > 0;
+}
+
+/*
+ * IncrementRelExtLockHeldCount
+ *
+ * Increment the relation extension lock held count.
+ */
+void
+IncrementRelExtLockHeldCount()
+{
+ RelationExtensionLockHeldCount++;
+}
+
+/*
+ * DecrementRelExtLockHeldCount
+ *
+ * Decrement the relation extension lock held count;
+ */
+void
+DecrementRelExtLockHeldCount()
+{
+ /* We must hold the relation extension lock. */
+ Assert(RelationExtensionLockHeldCount > 0);
+ RelationExtensionLockHeldCount--;
+}
+
+/*
+ * ResetRelExtLockHeldCount
+ *
+ * Reset the relation extension lock hold count;
+ */
+void
+ResetRelExtLockHeldCount()
+{
+ RelationExtensionLockHeldCount = 0;
+}
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index bb8e4e6..c31a5f3 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -582,6 +582,10 @@ extern void RememberSimpleDeadLock(PGPROC *proc1,
extern void InitDeadLockChecking(void);
extern int LockWaiterCount(const LOCKTAG *locktag);
+extern bool IsRelExtLockHeld(void);
+extern void IncrementRelExtLockHeldCount(void);
+extern void DecrementRelExtLockHeldCount(void);
+extern void ResetRelExtLockHeldCount(void);
#ifdef LOCK_DEBUG
extern void DumpLocks(PGPROC *proc);
--
1.8.3.1
v4-0003-Conflict-Extension-Page-lock-in-group-member.patchapplication/octet-stream; name=v4-0003-Conflict-Extension-Page-lock-in-group-member.patchDownload
From a6cb8aaf8f513eab0bf888c3f43ac949d60cecb1 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 9 Mar 2020 17:40:45 +0530
Subject: [PATCH v4 3/3] Conflict Extension/Page lock in group member
---
src/backend/storage/lmgr/deadlock.c | 9 +++++++++
src/backend/storage/lmgr/lock.c | 12 ++++++++++++
2 files changed, 21 insertions(+)
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index f8c5df0..49a5998 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -568,6 +568,15 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
proclock = (PROCLOCK *) SHMQueueNext(procLocks, procLocks,
offsetof(PROCLOCK, lockLink));
+ /*
+ * Relation extension/page lock never participate in actual deadlock cycle.
+ * So avoid the wait edge for these type of lock so that we can avoid any
+ * false cycle detection due to group locking.
+ */
+ if ((lock->tag.locktag_type == LOCKTAG_RELATION_EXTEND) ||
+ (lock->tag.locktag_type == LOCKTAG_PAGE))
+ return false;
+
while (proclock)
{
PGPROC *leader;
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index b89e82a..c1f5e3f 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1432,6 +1432,18 @@ LockCheckConflicts(LockMethod lockMethodTable,
}
/*
+ * If it's a relation extension/page lock then it will conflict even between
+ * the lock group member.
+ */
+ if ((lock->tag.locktag_type == LOCKTAG_RELATION_EXTEND) ||
+ (lock->tag.locktag_type == LOCKTAG_PAGE))
+ {
+ PROCLOCK_PRINT("LockCheckConflicts: conflicting (simple)",
+ proclock);
+ return true;
+ }
+
+ /*
* Locks held in conflicting modes by members of our own lock group are
* not real conflicts; we can subtract those out and see if we still have
* a conflict. This is O(N) in the number of processes holding or
--
1.8.3.1
v4-0002-WIP-Extend-the-patch-for-handling-PageLock.patchapplication/octet-stream; name=v4-0002-WIP-Extend-the-patch-for-handling-PageLock.patchDownload
From 0c43326a33ac4d1cbeebf0aa5f9cf0197f553126 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 12 Mar 2020 09:19:13 +0530
Subject: [PATCH v4 2/3] WIP-Extend the patch for handling PageLock
---
src/backend/access/transam/xact.c | 14 ++++----
src/backend/storage/lmgr/lmgr.c | 18 +++++++++-
src/backend/storage/lmgr/lock.c | 69 ++++++++++++++++++++++++++++++++-------
src/include/storage/lock.h | 5 ++-
4 files changed, 86 insertions(+), 20 deletions(-)
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index ca64712..ec7b7f8 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2267,8 +2267,8 @@ CommitTransaction(void)
XactTopFullTransactionId = InvalidFullTransactionId;
nParallelCurrentXids = 0;
- /* Reset the relation extension lock held count. */
- ResetRelExtLockHeldCount();
+ /* Reset the relation extension/page lock held count. */
+ ResetRelExtPageLockHeldCount();
/*
* done with commit processing, set current transaction state back to
@@ -2738,8 +2738,8 @@ AbortTransaction(void)
pgstat_report_xact_timestamp(0);
}
- /* Reset the relation extension lock held count. */
- ResetRelExtLockHeldCount();
+ /* Reset the relation extension/page lock held count. */
+ ResetRelExtPageLockHeldCount();
/*
* State remains TRANS_ABORT until CleanupTransaction().
@@ -5012,8 +5012,8 @@ AbortSubTransaction(void)
AtEOSubXact_ApplyLauncher(false, s->nestingLevel);
}
- /* Reset the relation extension lock held count. */
- ResetRelExtLockHeldCount();
+ /* Reset the relation extension/Page lock held count. */
+ ResetRelExtPageLockHeldCount();
/*
* Restore the upper transaction's read-only state, too. This should be
@@ -5074,7 +5074,7 @@ PushTransaction(void)
* Relation extension lock must not be held while starting a new
* sub-transaction.
*/
- Assert(!IsRelExtLockHeld());
+ Assert(!(IsRelExtLockHeld() || IsPageLockHeld()));
/*
* We keep subtransaction state nodes in TopTransactionContext.
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 26760f8..b0df063 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -492,6 +492,9 @@ LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode)
blkno);
(void) LockAcquire(&tag, lockmode, false, false);
+
+ /* Increment the lock held count. */
+ IncrementPageLockHeldCount();
}
/*
@@ -504,13 +507,22 @@ bool
ConditionalLockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode)
{
LOCKTAG tag;
+ LockAcquireResult result;
SET_LOCKTAG_PAGE(tag,
relation->rd_lockInfo.lockRelId.dbId,
relation->rd_lockInfo.lockRelId.relId,
blkno);
- return (LockAcquire(&tag, lockmode, false, true) != LOCKACQUIRE_NOT_AVAIL);
+ result = LockAcquire(&tag, lockmode, false, true);
+ if (result != LOCKACQUIRE_NOT_AVAIL)
+ {
+ /* Increment the lock held count. */
+ IncrementPageLockHeldCount();
+ return true;
+ }
+
+ return false;
}
/*
@@ -527,6 +539,10 @@ UnlockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode)
blkno);
LockRelease(&tag, lockmode, false);
+
+ /* Decrement the lock held count. */
+ DecrementPageLockHeldCount();
+
}
/*
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 6e65d8b..b89e82a 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -171,13 +171,14 @@ typedef struct TwoPhaseLockRecord
static int FastPathLocalUseCount = 0;
/*
- * Count of number of relation extension lock currently held by this backend.
- * We need this counter so that we can ensure that while holding the relation
- * extension lock we are not trying to acquire any other heavy weight lock.
- * Basically, that will ensuring that the proc holding relation extension lock
- * can not wait for any another lock.
+ * Count of number of relation extension/page lock currently held by this
+ * backend. We need this counter so that we can ensure that while holding the
+ * relation extension/page lock we are not trying to acquire any other heavy
+ * weight lock which can cause deadlock. Basically, that will ensure that the
+ * proc holding relation extension/page lock can not wait for any another lock.
*/
static int RelationExtensionLockHeldCount = 0;
+static int PageLockHeldCount = 0;
/* Macros for manipulating proc->fpLockBits */
#define FAST_PATH_BITS_PER_SLOT 3
@@ -851,13 +852,23 @@ LockAcquireExtended(const LOCKTAG *locktag,
/*
* We should not try to acquire any other heavyweight lock if we are already
- * holding the relation extension lock. If we are trying to hold the same
- * relation extension lock then it should have been already granted so we
- * will not come here.
+ * holding the relation extension/page lock. If we are trying to hold the
+ * same relation extension lock then it should have been already granted so
+ * we will not come here.
*/
Assert(!IsRelExtLockHeld());
/*
+ * XXX While holding the page lock we don't need to ensure that whether we
+ * are trying to acquire the relation extension lock on the same relation
+ * or any other relation. Because the above assert is ensuring that after
+ * holding the relation extension lock we are not going to wait for any
+ * other process.
+ */
+ Assert(!IsPageLockHeld() ||
+ (locktag->locktag_type == LOCKTAG_RELATION_EXTEND));
+
+ /*
* Prepare to emit a WAL record if acquisition of this lock needs to be
* replayed in a standby server.
*
@@ -4546,12 +4557,48 @@ DecrementRelExtLockHeldCount()
}
/*
- * ResetRelExtLockHeldCount
+ * ResetRelExtPageLockHeldCount
*
- * Reset the relation extension lock hold count;
+ * Reset the relation extension/page lock hold count;
*/
void
-ResetRelExtLockHeldCount()
+ResetRelExtPageLockHeldCount()
{
RelationExtensionLockHeldCount = 0;
+ PageLockHeldCount = 0;
+}
+
+/*
+ * IsPageLockHeld
+ *
+ * Is the page lock held by this backend?
+ */
+bool
+IsPageLockHeld()
+{
+ return PageLockHeldCount > 0;
+}
+
+/*
+ * IncrementPageLockHeldCount
+ *
+ * Increment the page lock hold count.
+ */
+void
+IncrementPageLockHeldCount()
+{
+ PageLockHeldCount++;
+}
+
+/*
+ * DecrementPageLockHeldCount
+ *
+ * Decrement the page lock hold count;
+ */
+void
+DecrementPageLockHeldCount()
+{
+ /* We must hold the page lock. */
+ Assert(PageLockHeldCount > 0);
+ PageLockHeldCount--;
}
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index c31a5f3..ed8fbdc 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -585,7 +585,10 @@ extern int LockWaiterCount(const LOCKTAG *locktag);
extern bool IsRelExtLockHeld(void);
extern void IncrementRelExtLockHeldCount(void);
extern void DecrementRelExtLockHeldCount(void);
-extern void ResetRelExtLockHeldCount(void);
+extern void ResetRelExtPageLockHeldCount(void);
+extern bool IsPageLockHeld(void);
+extern void IncrementPageLockHeldCount(void);
+extern void DecrementPageLockHeldCount(void);
#ifdef LOCK_DEBUG
extern void DumpLocks(PGPROC *proc);
--
1.8.3.1
On Wed, Mar 11, 2020 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Mar 10, 2020 at 6:39 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Mar 6, 2020 at 11:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
I think instead of the flag we need to keep the counter because we can
acquire the same relation extension lock multiple times. So
basically, every time we acquire the lock we can increment the counter
and while releasing we can decrement it. During an error path, I
think it is fine to set it to 0 in CommitTransaction/AbortTransaction.
But I am not sure that we can set it to 0 or decrement it in
AbortSubTransaction, because we are not sure whether we have acquired
the lock under this subtransaction or not.

I think that CommitTransaction, AbortTransaction, and friends have
*zero* business touching this. I think the counter - or flag - should
track whether we've got a PROCLOCK entry for a relation extension
lock. We either do, or we do not, and that does not change because of
anything having to do with the transaction state. It changes because
somebody calls LockRelease() or LockReleaseAll().

Do we want to have a special check in LockRelease() to identify
whether we are releasing a relation extension lock? If not, then how
will we identify that the relation extension lock has been released so
that we can reset it during a subtransaction abort due to an error?
During success paths, we know when we have released a RelationExtension
or Page lock (via UnlockRelationForExtension or UnlockPage). At the
top-level transaction end, we know when we have released all the locks,
which implies that the RelationExtension and/or Page locks must have
been released by that time.

If we have no other choice, then I see a few downsides of adding a
special check in the LockRelease() call:

1. Instead of resetting/decrementing the variable from specific APIs like
UnlockRelationForExtension or UnlockPage, we need to have it in
LockRelease. It will also look odd if we set the variable in
LockRelationForExtension but don't reset it in the
UnlockRelationForExtension variant. Now, maybe we could allow resetting
it in both places if it is a flag, but not if it is a counter
variable.

2. One can argue that adding extra instructions in a generic path
(like LockRelease) is not a good idea, especially if those are for an
Assert. I understand this won't add anything which we can measure by
standard benchmarks.
I have just written a WIP patch for the relation extension lock where,
instead of incrementing and decrementing the counter in
LockRelationForExtension and UnlockRelationForExtension respectively,
we just set and reset a flag in LockAcquireExtended and
LockRelease. So this patch appears simpler to me, as we are not
involving the transaction APIs to set and reset the flag. However, we
need to add an extra check, as you have already mentioned. I think we
could measure the performance and see whether it has any impact or
not.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachments:
v5-0001-WIP-Add-assert-check-for-relation-extension-lock.patchapplication/x-patch; name=v5-0001-WIP-Add-assert-check-for-relation-extension-lock.patchDownload
From e49e483646f14a2e626190d5ef98f628668d025c Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 12 Mar 2020 13:16:45 +0530
Subject: [PATCH v5] WIP-Add assert check for relation extension lock
Add assert to check that we should not acquire any other lock if we are
already holding the relation extension lock. Only exception is that if
we are trying to acquire the relation extension lock then we can hold the
same lock.
---
src/backend/storage/lmgr/lock.c | 43 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 43 insertions(+)
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 56dba09..f572dab 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -170,6 +170,15 @@ typedef struct TwoPhaseLockRecord
*/
static int FastPathLocalUseCount = 0;
+/*
+ * Flag is set if the relation extension lock is currently held by this backend.
+ * We need this flag so that we can ensure that while holding the relation
+ * extension lock we are not trying to acquire any other heavy weight lock.
+ * Basically, that will ensuring that the proc holding relation extension lock
+ * can not wait for any another lock which can lead to a deadlock.
+ */
+static bool IsRelationExtensionLockHeld = false;
+
/* Macros for manipulating proc->fpLockBits */
#define FAST_PATH_BITS_PER_SLOT 3
#define FAST_PATH_LOCKNUMBER_OFFSET 1
@@ -841,6 +850,14 @@ LockAcquireExtended(const LOCKTAG *locktag,
}
/*
+ * We should not try to acquire any other heavyweight lock if we are already
+ * holding the relation extension lock. If we are trying to hold the same
+ * relation extension lock then it should have been already granted so we
+ * will not come here.
+ */
+ Assert(!IsRelationExtensionLockHeld);
+
+ /*
* Prepare to emit a WAL record if acquisition of this lock needs to be
* replayed in a standby server.
*
@@ -900,6 +917,11 @@ LockAcquireExtended(const LOCKTAG *locktag,
locallock->lock = NULL;
locallock->proclock = NULL;
GrantLockLocal(locallock, owner);
+
+ /* Set the flag that we acquired the relation extension lock. */
+ if (locktag->locktag_type == LOCKTAG_RELATION_EXTEND)
+ IsRelationExtensionLockHeld = true;
+
return LOCKACQUIRE_OK;
}
}
@@ -1100,6 +1122,10 @@ LockAcquireExtended(const LOCKTAG *locktag,
locktag->locktag_field2);
}
+ /* Set the flag that we acquired the relation extension lock. */
+ if (locktag->locktag_type == LOCKTAG_RELATION_EXTEND)
+ IsRelationExtensionLockHeld = true;
+
return LOCKACQUIRE_OK;
}
@@ -1999,6 +2025,13 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
if (released)
{
RemoveLocalLock(locallock);
+
+ /*
+ * Reset the flag if we have released the relation extension lock.
+ */
+ if (locktag->locktag_type == LOCKTAG_RELATION_EXTEND)
+ IsRelationExtensionLockHeld = false;
+
return true;
}
}
@@ -2072,6 +2105,10 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
LWLockRelease(partitionLock);
RemoveLocalLock(locallock);
+
+ /* Reset the flag if we have released the relation extension lock. */
+ if (locktag->locktag_type == LOCKTAG_RELATION_EXTEND)
+ IsRelationExtensionLockHeld = false;
return true;
}
@@ -2347,6 +2384,12 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
LWLockRelease(partitionLock);
} /* loop over partitions */
+ /*
+ * We have released all the locks, so reset the relation extension lock
+ * held flag.
+ */
+ IsRelationExtensionLockHeld = false;
+
#ifdef LOCK_DEBUG
if (*(lockMethodTable->trace_flag))
elog(LOG, "LockReleaseAll done");
--
1.8.3.1
On Thu, Mar 12, 2020 at 11:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
I have fixed this in the attached patch set.
I have modified your
v4-0003-Conflict-Extension-Page-lock-in-group-member patch. The
modifications are (a) Change src/backend/storage/lmgr/README to
reflect new behaviour, (b) Introduce a new macro LOCK_LOCKTAG which
slightly simplifies the code, (c) moved the deadlock.c check a few
lines up and (d) changed a few comments.
It might be better if we can move the checks related to extension and
page lock in a separate API or macro. What do you think?
I have also used an extension to test this patch. This is the same
extension that I have used to test the group locking patch. It will
allow backends to form a group as we do for parallel workers. The
extension is attached to this email.
Test without patch:
Session-1
Create table t1(c1 int, c2 char(500));
Select become_lock_group_leader();
Insert into t1 values(generate_series(1,100),'aaa'); -- stop this
after acquiring relation extension lock via GDB.
Session-2
Select become_lock_group_member();
Insert into t1 values(generate_series(101,200),'aaa');
- Debug LockAcquire and found that it doesn't generate conflict for
Relation Extension lock.
The above experiment has shown that without patch group members can
acquire relation extension lock if the group leader has that lock.
After patch the second session waits for the first session to release
the relation extension lock. I know this is not a perfect way to test,
but it is better than nothing. I think we need to do some more
testing either using this extension or some other way for extension
and page locks.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
0001-Test-group-dead-locks.patch
From b928401a2b2e472ad76fe859bca51a09ae2b587c Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Thu, 12 Mar 2020 10:32:29 +0530
Subject: [PATCH 1/2] Test group dead locks.
---
contrib/Makefile | 1 +
contrib/test_group_deadlocks/Makefile | 19 ++++++++
.../test_group_deadlocks--1.0.sql | 15 ++++++
.../test_group_deadlocks/test_group_deadlocks.c | 57 ++++++++++++++++++++++
.../test_group_deadlocks.control | 5 ++
5 files changed, 97 insertions(+)
create mode 100644 contrib/test_group_deadlocks/Makefile
create mode 100644 contrib/test_group_deadlocks/test_group_deadlocks--1.0.sql
create mode 100644 contrib/test_group_deadlocks/test_group_deadlocks.c
create mode 100644 contrib/test_group_deadlocks/test_group_deadlocks.control
diff --git a/contrib/Makefile b/contrib/Makefile
index 1846d41..d04721d 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -45,6 +45,7 @@ SUBDIRS = \
tablefunc \
tcn \
test_decoding \
+ test_group_deadlocks \
tsm_system_rows \
tsm_system_time \
unaccent \
diff --git a/contrib/test_group_deadlocks/Makefile b/contrib/test_group_deadlocks/Makefile
new file mode 100644
index 0000000..057448c
--- /dev/null
+++ b/contrib/test_group_deadlocks/Makefile
@@ -0,0 +1,19 @@
+# contrib/test_group_deadlocks/Makefile
+
+MODULE_big = test_group_deadlocks
+OBJS = test_group_deadlocks.o $(WIN32RES)
+
+EXTENSION = test_group_deadlocks
+DATA = test_group_deadlocks--1.0.sql
+PGFILEDESC = "test_group_deadlocks - participate in group locking"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/test_group_deadlocks
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/test_group_deadlocks/test_group_deadlocks--1.0.sql b/contrib/test_group_deadlocks/test_group_deadlocks--1.0.sql
new file mode 100644
index 0000000..377c363
--- /dev/null
+++ b/contrib/test_group_deadlocks/test_group_deadlocks--1.0.sql
@@ -0,0 +1,15 @@
+/* contrib/test_group_deadlocks/test_group_deadlocks--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_group_deadlocks" to load this file. \quit
+
+-- Register the function.
+CREATE FUNCTION become_lock_group_leader()
+RETURNS pg_catalog.void
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+CREATE FUNCTION become_lock_group_member(pid pg_catalog.int4)
+RETURNS pg_catalog.bool
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
diff --git a/contrib/test_group_deadlocks/test_group_deadlocks.c b/contrib/test_group_deadlocks/test_group_deadlocks.c
new file mode 100644
index 0000000..f3d980a
--- /dev/null
+++ b/contrib/test_group_deadlocks/test_group_deadlocks.c
@@ -0,0 +1,57 @@
+/*-------------------------------------------------------------------------
+ *
+ * test_group_deadlocks.c
+ * group locking utilities
+ *
+ * Copyright (c) 2010-2014, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/test_group_deadlocks/test_group_deadlocks.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(become_lock_group_leader);
+PG_FUNCTION_INFO_V1(become_lock_group_member);
+
+
+/*
+ * become_lock_group_leader
+ *
+ * This function makes the current backend process a lock group
+ * leader.
+ */
+Datum
+become_lock_group_leader(PG_FUNCTION_ARGS)
+{
+ BecomeLockGroupLeader();
+
+ PG_RETURN_VOID();
+}
+
+/*
+ * become_lock_group_member
+ *
+ * This function makes the current backend process a member of the
+ * lock group owned by the process whose pid is passed as the first
+ * argument.
+ */
+Datum
+become_lock_group_member(PG_FUNCTION_ARGS)
+{
+ bool member;
+ PGPROC *procleader;
+ int32 pid = PG_GETARG_INT32(0);
+
+ procleader = BackendPidGetProc(pid);
+ member = BecomeLockGroupMember(procleader, pid);
+
+ PG_RETURN_BOOL(member);
+}
diff --git a/contrib/test_group_deadlocks/test_group_deadlocks.control b/contrib/test_group_deadlocks/test_group_deadlocks.control
new file mode 100644
index 0000000..e2dcc71
--- /dev/null
+++ b/contrib/test_group_deadlocks/test_group_deadlocks.control
@@ -0,0 +1,5 @@
+# test_group_deadlocks extension
+comment = 'become part of group'
+default_version = '1.0'
+module_pathname = '$libdir/test_group_deadlocks'
+relocatable = true
--
1.8.3.1
0002-Allow-relation-extension-and-page-locks-to-conflict-.patch
From fa8494c222439ca66ff0912c5a5303ad8b0622e9 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Thu, 12 Mar 2020 17:00:05 +0530
Subject: [PATCH 2/2] Allow relation extension and page locks to conflict among
parallel group members.
---
src/backend/storage/lmgr/README | 58 ++++++++++++++++++++-----------------
src/backend/storage/lmgr/deadlock.c | 9 ++++++
src/backend/storage/lmgr/lock.c | 12 ++++++++
src/include/storage/lock.h | 1 +
4 files changed, 53 insertions(+), 27 deletions(-)
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 56b0a12..9724930 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -597,21 +597,23 @@ deadlock detection algorithm very much, but it makes the bookkeeping more
complicated.
We choose to regard locks held by processes in the same parallel group as
-non-conflicting. This means that two processes in a parallel group can hold a
-self-exclusive lock on the same relation at the same time, or one process can
-acquire an AccessShareLock while the other already holds AccessExclusiveLock.
-This might seem dangerous and could be in some cases (more on that below), but
-if we didn't do this then parallel query would be extremely prone to
-self-deadlock. For example, a parallel query against a relation on which the
-leader already had AccessExclusiveLock would hang, because the workers would
-try to lock the same relation and be blocked by the leader; yet the leader
-can't finish until it receives completion indications from all workers. An
-undetected deadlock results. This is far from the only scenario where such a
-problem happens. The same thing will occur if the leader holds only
-AccessShareLock, the worker seeks AccessShareLock, but between the time the
-leader attempts to acquire the lock and the time the worker attempts to
-acquire it, some other process queues up waiting for an AccessExclusiveLock.
-In this case, too, an indefinite hang results.
+non-conflicting with the exception of relation extension and page locks. This
+means that two processes in a parallel group can hold a self-exclusive lock on
+the same relation at the same time, or one process can acquire an AccessShareLock
+while the other already holds AccessExclusiveLock. This might seem dangerous and
+could be in some cases (more on that below), but if we didn't do this then
+parallel query would be extremely prone to self-deadlock. For example, a
+parallel query against a relation on which the leader already had
+AccessExclusiveLock would hang, because the workers would try to lock the same
+relation and be blocked by the leader; yet the leader can't finish until it
+receives completion indications from all workers. An undetected deadlock
+results. This is far from the only scenario where such a problem happens. The
+same thing will occur if the leader holds only AccessShareLock, the worker
+seeks AccessShareLock, but between the time the leader attempts to acquire the
+lock and the time the worker attempts to acquire it, some other process queues
+up waiting for an AccessExclusiveLock. In this case, too, an indefinite hang
+results. The relation extension and page locks don't participate in group
+locking which means such locks can conflict among the same group members.
It might seem that we could predict which locks the workers will attempt to
acquire and ensure before going parallel that those locks would be acquired
@@ -637,18 +639,20 @@ the other is safe enough. Problems would occur if the leader initiated
parallelism from a point in the code at which it had some backend-private
state that made table access from another process unsafe, for example after
calling SetReindexProcessing and before calling ResetReindexProcessing,
-catastrophe could ensue, because the worker won't have that state. Similarly,
-problems could occur with certain kinds of non-relation locks, such as
-relation extension locks. It's no safer for two related processes to extend
-the same relation at the time than for unrelated processes to do the same.
-However, since parallel mode is strictly read-only at present, neither this
-nor most of the similar cases can arise at present. To allow parallel writes,
-we'll either need to (1) further enhance the deadlock detector to handle those
-types of locks in a different way than other types; or (2) have parallel
-workers use some other mutual exclusion method for such cases; or (3) revise
-those cases so that they no longer use heavyweight locking in the first place
-(which is not a crazy idea, given that such lock acquisitions are not expected
-to deadlock and that heavyweight lock acquisition is fairly slow anyway).
+catastrophe could ensue, because the worker won't have that state.
+
+To allow parallel inserts and parallel copy, we have ensured that relation
+extension and page locks don't participate in group locking which means such
+locks can conflict among the same group members. We don't acquire a heavyweight
+lock on any other object after relation extension lock which means such a lock
+can never participate in the deadlock cycle. After acquiring page locks, we can
+acquire relation extension lock but the reverse never happens, so those will also
+not participate in deadlock. To allow for other parallel writes like parallel
+update or parallel delete, we'll either need to (1) further enhance the
+deadlock detector to handle those tuple locks in a different way than
+other types; or (2) have parallel workers use some other mutual exclusion
+method for such cases. Currently, the parallel mode is strictly read-only,
+but now we have the infrastructure to allow parallel inserts and parallel copy.
Group locking adds three new members to each PGPROC: lockGroupLeader,
lockGroupMembers, and lockGroupLink. A PGPROC's lockGroupLeader is NULL for
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index f8c5df0..80ec88b 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -555,6 +555,15 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
int numLockModes,
lm;
+ /*
+ * The relation extension or page lock can never participate in an actual
+ * deadlock cycle. See Asserts in LockAcquireExtended. So, there is
+ * no advantage in checking wait edges from it.
+ */
+ if ((LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND) ||
+ (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
+ return false;
+
lockMethodTable = GetLocksMethodTable(lock);
numLockModes = lockMethodTable->numLockModes;
conflictMask = lockMethodTable->conflictTab[checkProc->waitLockMode];
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 56dba09..6fdfeba 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1404,6 +1404,18 @@ LockCheckConflicts(LockMethod lockMethodTable,
}
/*
+ * Relation extension and page locks conflict even between members of
+ * the same group.
+ */
+ if ((LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND) ||
+ (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
+ {
+ PROCLOCK_PRINT("LockCheckConflicts: conflicting (group)",
+ proclock);
+ return true;
+ }
+
+ /*
* Locks held in conflicting modes by members of our own lock group are
* not real conflicts; we can subtract those out and see if we still have
* a conflict. This is O(N) in the number of processes holding or
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index bb8e4e6..fac979d 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -301,6 +301,7 @@ typedef struct LOCK
} LOCK;
#define LOCK_LOCKMETHOD(lock) ((LOCKMETHODID) (lock).tag.locktag_lockmethodid)
+#define LOCK_LOCKTAG(lock) ((LockTagType) (lock).tag.locktag_type)
/*
--
1.8.3.1
On Thu, Mar 12, 2020 at 5:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Mar 12, 2020 at 11:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
I have fixed this in the attached patch set.
I have modified your
v4-0003-Conflict-Extension-Page-lock-in-group-member patch. The
modifications are (a) Change src/backend/storage/lmgr/README to
reflect new behaviour, (b) Introduce a new macro LOCK_LOCKTAG which
slightly simplifies the code, (c) moved the deadlock.c check a few
lines up and (d) changed a few comments.
It might be better if we can move the checks related to extension and
page lock in a separate API or macro. What do you think?
I think moving them inside a macro is a good idea. Also, I think we
should move all the Assert related code inside some debugging macro
similar to this:
#ifdef LOCK_DEBUG
....
#endif
+ /*
+ * The relation extension or page lock can never participate in actual
+ * deadlock cycle. See Asserts in LockAcquireExtended. So, there is
+ * no advantage in checking wait edges from it.
+ */
+ if ((LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND) ||
+ (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
+ return false;
+
Since this is true, we can also avoid these kind of locks in
ExpandConstraints, right? It'll certainly reduce some complexity in
topological sort.
/*
+ * The relation extension or page lock conflict even between the group
+ * members.
+ */
+ if ((LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND) ||
+ (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
+ {
+ PROCLOCK_PRINT("LockCheckConflicts: conflicting (group)",
+ proclock);
+ return true;
+ }
This check includes the heavyweight locks that conflict even under
same parallel group. It also has another property that they can never
participate in deadlock cycles. And, the number of locks under this
category is likely to increase in future with new parallel features.
Hence, it could be used in multiple places. Should we move the
condition inside a macro and just call it from here?
--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
On Thu, Mar 12, 2020 at 7:50 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
On Thu, Mar 12, 2020 at 5:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Mar 12, 2020 at 11:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
I have fixed this in the attached patch set.
I have modified your
v4-0003-Conflict-Extension-Page-lock-in-group-member patch. The
modifications are (a) Change src/backend/storage/lmgr/README to
reflect new behaviour, (b) Introduce a new macro LOCK_LOCKTAG which
slightly simplifies the code, (c) moved the deadlock.c check a few
lines up and (d) changed a few comments.
It might be better if we can move the checks related to extension and
page lock in a separate API or macro. What do you think?
I think moving them inside a macro is a good idea. Also, I think we
should move all the Assert related code inside some debugging macro
similar to this:
#ifdef LOCK_DEBUG
....
#endif
If we move it under some macro, then those Asserts will only be
enabled when that macro is defined. I think we want those Asserts to
be enabled always in an assert-enabled build; these will be like any
other Asserts in the code. What is the advantage of putting those
under a macro?
+ /*
+ * The relation extension or page lock can never participate in actual
+ * deadlock cycle. See Asserts in LockAcquireExtended. So, there is
+ * no advantage in checking wait edges from it.
+ */
+ if ((LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND) ||
+ (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
+ return false;
+
Since this is true, we can also avoid these kind of locks in
ExpandConstraints, right?
Yes, I had also thought about it but left it to avoid sprinkling such
checks at more places than absolutely required.
It'll certainly reduce some complexity in
topological sort.
I think you mean to say TopoSort will have to look at fewer members in
the wait queue, otherwise, there is nothing from the perspective of
code which we can remove/change there. I think there will be hardly
any chance that such locks will participate here because we take those
for some work and release them (basically, they are unlike other
heavyweight locks which can be released at the end). Having said
that, I am not against putting those checks at the place you are
suggesting, it is just that I thought that it won't be of much use.
/*
+ * The relation extension or page lock conflict even between the group
+ * members.
+ */
+ if ((LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND) ||
+ (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
+ {
+ PROCLOCK_PRINT("LockCheckConflicts: conflicting (group)",
+ proclock);
+ return true;
+ }
This check includes the heavyweight locks that conflict even under
same parallel group. It also has another property that they can never
participate in deadlock cycles. And, the number of locks under this
category is likely to increase in future with new parallel features.
Hence, it could be used in multiple places. Should we move the
condition inside a macro and just call it from here?
Right, this is what I have suggested upthread. Do you have any
suggestions for naming such a macro or function? I could think of
something like LocksConflictAmongGroupMembers or
LocksNotParticipateInDeadlock. The first one suits more for its usage
in LockCheckConflicts and the second in the deadlock.c code. So none
of those sound perfect to me.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Mar 12, 2020 at 5:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Mar 12, 2020 at 11:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
I have fixed this in the attached patch set.
I have modified your
v4-0003-Conflict-Extension-Page-lock-in-group-member patch. The
modifications are (a) Change src/backend/storage/lmgr/README to
reflect new behaviour, (b) Introduce a new macro LOCK_LOCKTAG which
slightly simplifies the code, (c) moved the deadlock.c check a few
lines up and (d) changed a few comments.
Changes look fine to me.
It might be better if we can move the checks related to extension and
page lock in a separate API or macro. What do you think?
I feel it looks cleaner this way as well. But, if we plan to move it
to a common function/macro then we should use some common name such that
it can be used in FindLockCycleRecurseMember as well as in
LockCheckConflicts.
I have also used an extension to test this patch. This is the same
extension that I have used to test the group locking patch. It will
allow backends to form a group as we do for parallel workers. The
extension is attached to this email.Test without patch:
Session-1
Create table t1(c1 int, c2 char(500));
Select become_lock_group_leader();
Insert into t1 values(generate_series(1,100),'aaa'); -- stop this
after acquiring relation extension lock via GDB.
Session-2
Select become_lock_group_member();
Insert into t1 values(generate_series(101,200),'aaa');
- Debug LockAcquire and found that it doesn't generate conflict for
Relation Extension lock.
The above experiment has shown that without patch group members can
acquire relation extension lock if the group leader has that lock.
After patch the second session waits for the first session to release
the relation extension lock. I know this is not a perfect way to test,
but it is better than nothing. I think we need to do some more
testing either using this extension or some other way for extension
and page locks.
I have also tested the same and verified it.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, Mar 13, 2020 at 8:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Mar 12, 2020 at 7:50 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
On Thu, Mar 12, 2020 at 5:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Mar 12, 2020 at 11:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
I have fixed this in the attached patch set.
I have modified your
v4-0003-Conflict-Extension-Page-lock-in-group-member patch. The
modifications are (a) Change src/backend/storage/lmgr/README to
reflect new behaviour, (b) Introduce a new macro LOCK_LOCKTAG which
slightly simplifies the code, (c) moved the deadlock.c check a few
lines up and (d) changed a few comments.
It might be better if we can move the checks related to extension and
page lock in a separate API or macro. What do you think?
I think moving them inside a macro is a good idea. Also, I think we
should move all the Assert related code inside some debugging macro
similar to this:
#ifdef LOCK_DEBUG
....
#endif
If we move it under some macro, then those Asserts will be only
enabled when that macro is defined. I think we want there Asserts to
be enabled always in assert enabled build, these will be like any
other Asserts in the code. What is the advantage of doing those under
macro?
+ /*
+ * The relation extension or page lock can never participate in actual
+ * deadlock cycle. See Asserts in LockAcquireExtended. So, there is
+ * no advantage in checking wait edges from it.
+ */
+ if ((LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND) ||
+ (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
+ return false;
+
Since this is true, we can also avoid these kind of locks in
ExpandConstraints, right?
Yes, I had also thought about it but left it to avoid sprinkling such
checks at more places than absolutely required.
It'll certainly reduce some complexity in
topological sort.
I think you mean to say TopoSort will have to look at fewer members in
the wait queue, otherwise, there is nothing from the perspective of
code which we can remove/change there. I think there will be hardly
any chance that such locks will participate here because we take those
for some work and release them (basically, they are unlike other
heavyweight locks which can be released at the end). Having said
that, I am not against putting those checks at the place you are
suggesting, it is just that I thought that it won't be of much use.
I am not sure I understand this part. The topological sort will
work on the soft edges we have created when we found the cycle, but
since for relation extension/page locks we completely ignore hard/soft
edges, they will never participate in the topological sort either. Am I
missing something?
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Mar 12, 2020 at 3:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Wed, Mar 11, 2020 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
If we have no other choice, then I see a few downsides of adding a
special check in the LockRelease() call:
1. Instead of resetting/decrement the variable from specific APIs like
UnlockRelationForExtension or UnlockPage, we need to have it in
LockRelease. It will also look odd, if set variable in
LockRelationForExtension, but don't reset in the
UnlockRelationForExtension variant. Now, maybe we can allow to reset
it at both places if it is a flag, but not if it is a counter
variable.
2. One can argue that adding extra instructions in a generic path
(like LockRelease) is not a good idea, especially if those are for an
Assert. I understand this won't add anything which we can measure by
standard benchmarks.
I have just written a WIP patch for relation extension lock where
instead of incrementing and decrementing the counter in
LockRelationForExtension and UnlockRelationForExtension respectively.
We can just set and reset the flag in LockAcquireExtended and
LockRelease. So this patch appears simple to me as we are not
involving the transaction APIs to set and reset the flag. However, we
need to add an extra check as you have already mentioned. I think we
could measure the performance and see whether it has any impact or
not?
LockAcquireExtended()
{
..
+ if (locktag->locktag_type == LOCKTAG_RELATION_EXTEND)
+ IsRelationExtensionLockHeld = true;
..
}
Can we move this check inside a function (CheckAndSetLockHeld or
something like that) as we need to add a similar thing for page lock?
Also, how about moving the set and reset of these flags to
GrantLockLocal and RemoveLocalLock as that will further reduce the
number of places where we need to add such a check. Another thing is
to see if it makes sense to have a macro like LOCALLOCK_LOCKMETHOD to
get the lock tag.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Mar 13, 2020 at 11:08 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Mar 12, 2020 at 3:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Wed, Mar 11, 2020 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
If we have no other choice, then I see a few downsides of adding a
special check in the LockRelease() call:
1. Instead of resetting/decrement the variable from specific APIs like
UnlockRelationForExtension or UnlockPage, we need to have it in
LockRelease. It will also look odd, if set variable in
LockRelationForExtension, but don't reset in the
UnlockRelationForExtension variant. Now, maybe we can allow to reset
it at both places if it is a flag, but not if it is a counter
variable.
2. One can argue that adding extra instructions in a generic path
(like LockRelease) is not a good idea, especially if those are for an
Assert. I understand this won't add anything which we can measure by
standard benchmarks.
I have just written a WIP patch for relation extension lock where
instead of incrementing and decrementing the counter in
LockRelationForExtension and UnlockRelationForExtension respectively.
We can just set and reset the flag in LockAcquireExtended and
LockRelease. So this patch appears simple to me as we are not
involving the transaction APIs to set and reset the flag. However, we
need to add an extra check as you have already mentioned. I think we
could measure the performance and see whether it has any impact or
not?
LockAcquireExtended()
{
..
+ if (locktag->locktag_type == LOCKTAG_RELATION_EXTEND)
+ IsRelationExtensionLockHeld = true;
..
}
Can we move this check inside a function (CheckAndSetLockHeld or
something like that) as we need to add a similar thing for page lock?
ok
Also, how about moving the set and reset of these flags to
GrantLockLocal and RemoveLocalLock as that will further reduce the
number of places where we need to add such a check.
Makes sense to me.
Another thing is
to see if it makes sense to have a macro like LOCALLOCK_LOCKMETHOD to
get the lock tag.
ok
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, Mar 13, 2020 at 8:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Mar 12, 2020 at 7:50 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
I think moving them inside a macro is a good idea. Also, I think we
should move all the Assert related code inside some debugging macro
similar to this:
#ifdef LOCK_DEBUG
....
#endif
If we move it under some macro, then those Asserts will be only
enabled when that macro is defined. I think we want there Asserts to
be enabled always in assert enabled build, these will be like any
other Asserts in the code. What is the advantage of doing those under
macro?
My concern is related to performance regression. We're using two
static variables in hot paths only for checking a few asserts. So, I'm
not sure whether we should enable the same by default, especially when
asserts are themselves disabled.
-ResetRelExtLockHeldCount()
+ResetRelExtPageLockHeldCount()
{
RelationExtensionLockHeldCount = 0;
+ PageLockHeldCount = 0;
+}
Also, we're calling this method from frequently used functions like
Commit/AbortTransaction. So, it's better if these two static variables
share the same cache line so that we can reinitialize them with a
single instruction.
/*
+ * The relation extension or page lock conflict even between the group
+ * members.
+ */
+ if ((LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND) ||
+ (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
+ {
+ PROCLOCK_PRINT("LockCheckConflicts: conflicting (group)",
+ proclock);
+ return true;
+ }
This check includes the heavyweight locks that conflict even under
same parallel group. It also has another property that they can never
participate in deadlock cycles. And, the number of locks under this
category is likely to increase in future with new parallel features.
Hence, it could be used in multiple places. Should we move the
condition inside a macro and just call it from here?
Right, this is what I have suggested upthread. Do you have any
suggestions for naming such a macro or function? I could think of
something like LocksConflictAmongGroupMembers or
LocksNotParticipateInDeadlock. The first one suits more for its usage
in LockCheckConflicts and the second in the deadlock.c code. So none
of those sound perfect to me.
Actually, I'm not able to come up with a good suggestion. I'm trying
to think of a generic name similar to strong or weak locks but with
the following properties:
a. Locks that don't participate in deadlock detection
b. Locks that conflict within the same parallel group
--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
On Fri, Mar 13, 2020 at 8:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Thu, Mar 12, 2020 at 7:50 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
+ /*
+  * The relation extension or page lock can never participate in actual
+  * deadlock cycle. See Asserts in LockAcquireExtended. So, there is
+  * no advantage in checking wait edges from it.
+  */
+ if ((LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND) ||
+     (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
+     return false;
+
Since this is true, we can also avoid these kinds of locks in
ExpandConstraints, right?
I am not sure I understand this part. Because the topological sort will
work on the soft edges we have created when we found the cycle, but
since for relation extension/page locks we completely ignore both hard
and soft edges, they will never participate in the topological sort
either. Am I missing something?
No, I think you're right. We only add constraints if we've detected a
cycle in the graph. Hence, you don't need the check here.
--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
On Fri, Mar 13, 2020 at 8:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Thu, Mar 12, 2020 at 5:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Mar 12, 2020 at 11:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
I have fixed this in the attached patch set.
I have modified your
v4-0003-Conflict-Extension-Page-lock-in-group-member patch. The
modifications are (a) Change src/backend/storage/lmgr/README to
reflect new behaviour, (b) Introduce a new macro LOCK_LOCKTAG which
slightly simplifies the code, (c) moved the deadlock.c check a few
lines up and (d) changed a few comments.
Changes look fine to me.
Today, while looking at this patch again, I realized that there is a
case where we sometimes allow group members to jump the wait queue. This
is primarily to avoid creating deadlocks (see ProcSleep). Now,
ideally, we don't need this for relation extension or page locks as
those can never lead to deadlocks. However, the current code will
give group members more priority to acquire relation extension or page
locks if any one of the members has held those locks. Now, if we want
we can prevent giving group members priority for these locks, but I am
not sure how important is that case. So, I have left that as it is by
adding a few comments. What do you think?
Additionally, I have changed/added a few more sentences in README.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
0002-Allow-relation-extension-and-page-locks-to-conflict-.v2.patch
From bfd42993d2bf9ba88ffd26815565a321eec12440 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Thu, 12 Mar 2020 17:00:05 +0530
Subject: [PATCH] Allow relation extension and page locks to conflict among
parallel group members.
---
src/backend/storage/lmgr/README | 60 ++++++++++++++++++++-----------------
src/backend/storage/lmgr/deadlock.c | 9 ++++++
src/backend/storage/lmgr/lock.c | 12 ++++++++
src/backend/storage/lmgr/proc.c | 8 ++++-
src/include/storage/lock.h | 1 +
5 files changed, 62 insertions(+), 28 deletions(-)
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 56b0a12..13eb1cc 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -597,21 +597,22 @@ deadlock detection algorithm very much, but it makes the bookkeeping more
complicated.
We choose to regard locks held by processes in the same parallel group as
-non-conflicting. This means that two processes in a parallel group can hold a
-self-exclusive lock on the same relation at the same time, or one process can
-acquire an AccessShareLock while the other already holds AccessExclusiveLock.
-This might seem dangerous and could be in some cases (more on that below), but
-if we didn't do this then parallel query would be extremely prone to
-self-deadlock. For example, a parallel query against a relation on which the
-leader already had AccessExclusiveLock would hang, because the workers would
-try to lock the same relation and be blocked by the leader; yet the leader
-can't finish until it receives completion indications from all workers. An
-undetected deadlock results. This is far from the only scenario where such a
-problem happens. The same thing will occur if the leader holds only
-AccessShareLock, the worker seeks AccessShareLock, but between the time the
-leader attempts to acquire the lock and the time the worker attempts to
-acquire it, some other process queues up waiting for an AccessExclusiveLock.
-In this case, too, an indefinite hang results.
+non-conflicting with the exception of relation extension and page locks. This
+means that two processes in a parallel group can hold a self-exclusive lock on
+the same relation at the same time, or one process can acquire an AccessShareLock
+while the other already holds AccessExclusiveLock. This might seem dangerous and
+could be in some cases (more on that below), but if we didn't do this then
+parallel query would be extremely prone to self-deadlock. For example, a
+parallel query against a relation on which the leader already had
+AccessExclusiveLock would hang, because the workers would try to lock the same
+relation and be blocked by the leader; yet the leader can't finish until it
+receives completion indications from all workers. An undetected deadlock
+results. This is far from the only scenario where such a problem happens. The
+same thing will occur if the leader holds only AccessShareLock, the worker
+seeks AccessShareLock, but between the time the leader attempts to acquire the
+lock and the time the worker attempts to acquire it, some other process queues
+up waiting for an AccessExclusiveLock. In this case, too, an indefinite hang
+results.
It might seem that we could predict which locks the workers will attempt to
acquire and ensure before going parallel that those locks would be acquired
@@ -637,18 +638,23 @@ the other is safe enough. Problems would occur if the leader initiated
parallelism from a point in the code at which it had some backend-private
state that made table access from another process unsafe, for example after
calling SetReindexProcessing and before calling ResetReindexProcessing,
-catastrophe could ensue, because the worker won't have that state. Similarly,
-problems could occur with certain kinds of non-relation locks, such as
-relation extension locks. It's no safer for two related processes to extend
-the same relation at the time than for unrelated processes to do the same.
-However, since parallel mode is strictly read-only at present, neither this
-nor most of the similar cases can arise at present. To allow parallel writes,
-we'll either need to (1) further enhance the deadlock detector to handle those
-types of locks in a different way than other types; or (2) have parallel
-workers use some other mutual exclusion method for such cases; or (3) revise
-those cases so that they no longer use heavyweight locking in the first place
-(which is not a crazy idea, given that such lock acquisitions are not expected
-to deadlock and that heavyweight lock acquisition is fairly slow anyway).
+catastrophe could ensue, because the worker won't have that state.
+
+To allow parallel inserts and parallel copy, we have ensured that relation
+extension and page locks don't participate in group locking which means such
+locks can conflict among the same group members. This is required as it is no
+safer for two related processes to extend the same relation or perform clean up
+in gin indexes at a time than for unrelated processes to do the same. We don't
+acquire a heavyweight lock on any other object after relation extension lock
+which means such a lock can never participate in the deadlock cycle. After
+acquiring page locks, we can acquire relation extension lock but reverse never
+happens, so those will also not participate in deadlock. To allow for other
+parallel writes like parallel update or parallel delete, we'll either need to
+(1) further enhance the deadlock detector to handle those tuple locks in a
+different way than other types; or (2) have parallel workers use some other
+mutual exclusion method for such cases. Currently, the parallel mode is
+strictly read-only, but now we have the infrastructure to allow parallel
+inserts and parallel copy.
Group locking adds three new members to each PGPROC: lockGroupLeader,
lockGroupMembers, and lockGroupLink. A PGPROC's lockGroupLeader is NULL for
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index f8c5df0..80ec88b 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -555,6 +555,15 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
int numLockModes,
lm;
+ /*
+ * The relation extension or page lock can never participate in actual
+ * deadlock cycle. See Asserts in LockAcquireExtended. So, there is
+ * no advantage in checking wait edges from it.
+ */
+ if ((LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND) ||
+ (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
+ return false;
+
lockMethodTable = GetLocksMethodTable(lock);
numLockModes = lockMethodTable->numLockModes;
conflictMask = lockMethodTable->conflictTab[checkProc->waitLockMode];
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 56dba09..6fdfeba 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1404,6 +1404,18 @@ LockCheckConflicts(LockMethod lockMethodTable,
}
/*
+ * The relation extension or page lock conflict even between the group
+ * members.
+ */
+ if ((LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND) ||
+ (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
+ {
+ PROCLOCK_PRINT("LockCheckConflicts: conflicting (group)",
+ proclock);
+ return true;
+ }
+
+ /*
* Locks held in conflicting modes by members of our own lock group are
* not real conflicts; we can subtract those out and see if we still have
* a conflict. This is O(N) in the number of processes holding or
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index eb321f7..b18f61b 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1077,7 +1077,13 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
/*
* If group locking is in use, locks held by members of my locking group
- * need to be included in myHeldLocks.
+ * need to be included in myHeldLocks. This is not required for
+ * relation extension or page locks which conflict among group members.
+ * However, including them in myHeldLocks will give group members the
+ * priority to get those locks as compared to other backends which are
+ * also trying to acquire those locks. OTOH, we can avoid giving
+ * priority to group members for that kind of locks, but there
+ * doesn't appear to be a clear advantage of the same.
*/
if (leader != NULL)
{
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index bb8e4e6..fac979d 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -301,6 +301,7 @@ typedef struct LOCK
} LOCK;
#define LOCK_LOCKMETHOD(lock) ((LOCKMETHODID) (lock).tag.locktag_lockmethodid)
+#define LOCK_LOCKTAG(lock) ((LockTagType) (lock).tag.locktag_type)
/*
--
1.8.3.1
On Fri, Mar 13, 2020 at 2:32 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
On Fri, Mar 13, 2020 at 8:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Mar 12, 2020 at 7:50 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
I think moving them inside a macro is a good idea. Also, I think we
should move all the Assert related code inside some debugging macro
similar to this:
#ifdef LOCK_DEBUG
....
#endif
If we move it under some macro, then those Asserts will only be
enabled when that macro is defined. I think we want these Asserts to
be enabled always in assert-enabled builds; these will be like any
other Asserts in the code. What is the advantage of doing those under
a macro?
My concern is related to performance regression. We're using two
static variables in hot paths only for checking a few asserts. So, I'm
not sure whether we should enable the same by default, especially when
asserts themselves are disabled.
-ResetRelExtLockHeldCount()
+ResetRelExtPageLockHeldCount()
{
RelationExtensionLockHeldCount = 0;
+ PageLockHeldCount = 0;
+}
Also, we're calling this method from frequently used functions like
Commit/AbortTransaction. So, it's better if these two static variables
share the same cache line so that we can reinitialize them with a
single instruction.
In the recent version of the patch, we have used a flag instead of a
counter. So I think now we can keep a single variable and just reset
the bits with a single instruction.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, Mar 13, 2020 at 11:16 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Mar 13, 2020 at 11:08 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Mar 12, 2020 at 3:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Wed, Mar 11, 2020 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
If we have no other choice, then I see a few downsides of adding a
special check in the LockRelease() call:
1. Instead of resetting/decrementing the variable from specific APIs like
UnlockRelationForExtension or UnlockPage, we need to have it in
LockRelease. It will also look odd if we set the variable in
LockRelationForExtension but don't reset it in the
UnlockRelationForExtension variant. Now, maybe we can allow resetting
it at both places if it is a flag, but not if it is a counter
variable.
2. One can argue that adding extra instructions in a generic path
(like LockRelease) is not a good idea, especially if those are for an
Assert. I understand this won't add anything which we can measure by
standard benchmarks.
I have just written a WIP patch for relation extension lock where,
instead of incrementing and decrementing the counter in
LockRelationForExtension and UnlockRelationForExtension respectively,
we just set and reset the flag in LockAcquireExtended and
LockRelease. So this patch appears simpler to me as we are not
involving the transaction APIs to set and reset the flag. However, we
need to add an extra check as you have already mentioned. I think we
could measure the performance and see whether it has any impact or
not?
LockAcquireExtended()
{
..
+ if (locktag->locktag_type == LOCKTAG_RELATION_EXTEND)
+     IsRelationExtensionLockHeld = true;
..
}
Can we move this check inside a function (CheckAndSetLockHeld or
something like that) as we need to add a similar thing for page lock?
ok
Done
Also, how about moving the set and reset of these flags to
GrantLockLocal and RemoveLocalLock as that will further reduce the
number of places where we need to add such a check.
Makes sense to me.
Done
Another thing is
to see if it makes sense to have a macro like LOCALLOCK_LOCKMETHOD to
get the lock tag.
ok
Done
Apart from that, I have also extended the solution for the page lock.
And, I have also broken down the 3rd patch in two parts for relation
extension and for the page lock.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachments:
v6-0001-Add-assert-check-for-relation-extension-lock.patch
From 131d42809335a51f0ba602a4a45139a6f2d73776 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 12 Mar 2020 13:16:45 +0530
Subject: [PATCH v6 1/4] Add assert check for relation extension lock
Add an assert to check that we do not acquire any other lock if we are
already holding the relation extension lock. The only exception is that
while trying to acquire the relation extension lock, we may already hold
the same lock.
---
src/backend/storage/lmgr/lock.c | 48 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 48 insertions(+)
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 56dba09..24ca900 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -170,6 +170,15 @@ typedef struct TwoPhaseLockRecord
*/
static int FastPathLocalUseCount = 0;
+/*
+ * Flag is set if the relation extension lock is currently held by this backend.
+ * We need this flag so that we can ensure that while holding the relation
+ * extension lock we are not trying to acquire any other heavy weight lock.
+ * Basically, that will ensure that the proc holding the relation extension
+ * lock cannot wait for any other lock, which could lead to a deadlock.
+ */
+static bool IsRelationExtensionLockHeld = false;
+
/* Macros for manipulating proc->fpLockBits */
#define FAST_PATH_BITS_PER_SLOT 3
#define FAST_PATH_LOCKNUMBER_OFFSET 1
@@ -208,6 +217,9 @@ static int FastPathLocalUseCount = 0;
(locktag)->locktag_field1 != InvalidOid && \
(mode) > ShareUpdateExclusiveLock)
+/* Get the lock tag of the local lock. */
+#define LOCALLOCK_LOCKTAG(locallock) ((LockTagType) (locallock).tag.lock.locktag_type)
+
static bool FastPathGrantRelationLock(Oid relid, LOCKMODE lockmode);
static bool FastPathUnGrantRelationLock(Oid relid, LOCKMODE lockmode);
static bool FastPathTransferRelationLocks(LockMethod lockMethodTable,
@@ -841,6 +853,14 @@ LockAcquireExtended(const LOCKTAG *locktag,
}
/*
+ * We should not try to acquire any other heavyweight lock if we are already
+ * holding the relation extension lock. If we are trying to hold the same
+ * relation extension lock then it should have been already granted so we
+ * will not come here.
+ */
+ Assert(!IsRelationExtensionLockHeld);
+
+ /*
* Prepare to emit a WAL record if acquisition of this lock needs to be
* replayed in a standby server.
*
@@ -1288,6 +1308,28 @@ SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
}
/*
+ * CheckAndSetLockHeld -- check and set the flag that we hold relation extension
+ * lock.
+ */
+static inline void
+CheckAndSetLockHeld(LOCALLOCK *locallock)
+{
+ if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_RELATION_EXTEND)
+ IsRelationExtensionLockHeld = true;
+}
+
+/*
+ * CheckAndReSetLockHeld -- check and reset the flag if we have released the
+ * relation extension lock.
+ */
+static inline void
+CheckAndReSetLockHeld(LOCALLOCK *locallock)
+{
+ if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_RELATION_EXTEND)
+ IsRelationExtensionLockHeld = false;
+}
+
+/*
* Subroutine to free a locallock entry
*/
static void
@@ -1322,6 +1364,9 @@ RemoveLocalLock(LOCALLOCK *locallock)
(void *) &(locallock->tag),
HASH_REMOVE, NULL))
elog(WARNING, "locallock table corrupted");
+
+ /* Check and reset the lock held flag. */
+ CheckAndReSetLockHeld(locallock);
}
/*
@@ -1618,6 +1663,9 @@ GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner)
locallock->numLockOwners++;
if (owner != NULL)
ResourceOwnerRememberLock(owner, locallock);
+
+ /* Set the flag if we have acquired relation extension lock. */
+ CheckAndSetLockHeld(locallock);
}
/*
--
1.8.3.1
v6-0004-Page-lock-to-conflict-among-parallel-group-member.patch
From 689b842b9378f4d0410269272261d386c2729651 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 13 Mar 2020 16:13:10 +0530
Subject: [PATCH v6 4/4] Page lock to conflict among parallel group members
---
src/backend/storage/lmgr/README | 60 ++++++++++++++++++++-----------------
src/backend/storage/lmgr/deadlock.c | 9 +++---
src/backend/storage/lmgr/lock.c | 8 +++--
src/backend/storage/lmgr/proc.c | 12 ++++----
4 files changed, 50 insertions(+), 39 deletions(-)
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 56b0a12..13eb1cc 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -597,21 +597,22 @@ deadlock detection algorithm very much, but it makes the bookkeeping more
complicated.
We choose to regard locks held by processes in the same parallel group as
-non-conflicting. This means that two processes in a parallel group can hold a
-self-exclusive lock on the same relation at the same time, or one process can
-acquire an AccessShareLock while the other already holds AccessExclusiveLock.
-This might seem dangerous and could be in some cases (more on that below), but
-if we didn't do this then parallel query would be extremely prone to
-self-deadlock. For example, a parallel query against a relation on which the
-leader already had AccessExclusiveLock would hang, because the workers would
-try to lock the same relation and be blocked by the leader; yet the leader
-can't finish until it receives completion indications from all workers. An
-undetected deadlock results. This is far from the only scenario where such a
-problem happens. The same thing will occur if the leader holds only
-AccessShareLock, the worker seeks AccessShareLock, but between the time the
-leader attempts to acquire the lock and the time the worker attempts to
-acquire it, some other process queues up waiting for an AccessExclusiveLock.
-In this case, too, an indefinite hang results.
+non-conflicting with the exception of relation extension and page locks. This
+means that two processes in a parallel group can hold a self-exclusive lock on
+the same relation at the same time, or one process can acquire an AccessShareLock
+while the other already holds AccessExclusiveLock. This might seem dangerous and
+could be in some cases (more on that below), but if we didn't do this then
+parallel query would be extremely prone to self-deadlock. For example, a
+parallel query against a relation on which the leader already had
+AccessExclusiveLock would hang, because the workers would try to lock the same
+relation and be blocked by the leader; yet the leader can't finish until it
+receives completion indications from all workers. An undetected deadlock
+results. This is far from the only scenario where such a problem happens. The
+same thing will occur if the leader holds only AccessShareLock, the worker
+seeks AccessShareLock, but between the time the leader attempts to acquire the
+lock and the time the worker attempts to acquire it, some other process queues
+up waiting for an AccessExclusiveLock. In this case, too, an indefinite hang
+results.
It might seem that we could predict which locks the workers will attempt to
acquire and ensure before going parallel that those locks would be acquired
@@ -637,18 +638,23 @@ the other is safe enough. Problems would occur if the leader initiated
parallelism from a point in the code at which it had some backend-private
state that made table access from another process unsafe, for example after
calling SetReindexProcessing and before calling ResetReindexProcessing,
-catastrophe could ensue, because the worker won't have that state. Similarly,
-problems could occur with certain kinds of non-relation locks, such as
-relation extension locks. It's no safer for two related processes to extend
-the same relation at the time than for unrelated processes to do the same.
-However, since parallel mode is strictly read-only at present, neither this
-nor most of the similar cases can arise at present. To allow parallel writes,
-we'll either need to (1) further enhance the deadlock detector to handle those
-types of locks in a different way than other types; or (2) have parallel
-workers use some other mutual exclusion method for such cases; or (3) revise
-those cases so that they no longer use heavyweight locking in the first place
-(which is not a crazy idea, given that such lock acquisitions are not expected
-to deadlock and that heavyweight lock acquisition is fairly slow anyway).
+catastrophe could ensue, because the worker won't have that state.
+
+To allow parallel inserts and parallel copy, we have ensured that relation
+extension and page locks don't participate in group locking which means such
+locks can conflict among the same group members. This is required as it is no
+safer for two related processes to extend the same relation or perform clean up
+in gin indexes at a time than for unrelated processes to do the same. We don't
+acquire a heavyweight lock on any other object after relation extension lock
+which means such a lock can never participate in the deadlock cycle. After
+acquiring page locks, we can acquire relation extension lock but reverse never
+happens, so those will also not participate in deadlock. To allow for other
+parallel writes like parallel update or parallel delete, we'll either need to
+(1) further enhance the deadlock detector to handle those tuple locks in a
+different way than other types; or (2) have parallel workers use some other
+mutual exclusion method for such cases. Currently, the parallel mode is
+strictly read-only, but now we have the infrastructure to allow parallel
+inserts and parallel copy.
Group locking adds three new members to each PGPROC: lockGroupLeader,
lockGroupMembers, and lockGroupLink. A PGPROC's lockGroupLeader is NULL for
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index 6106c2d..f4a49d8 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -556,11 +556,12 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
lm;
/*
- * The relation extension lock can never participate in actual deadlock
- * cycle. See Asserts in LockAcquireExtended. So, there is no advantage in
- * checking wait edges from it.
+ * The relation extension or page lock can never participate in actual
+ * deadlock cycle. See Asserts in LockAcquireExtended. So, there is
+ * no advantage in checking wait edges from it.
*/
- if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND)
+ if ((LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND) ||
+ (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
return false;
lockMethodTable = GetLocksMethodTable(lock);
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 02d7758..8b37251 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1461,8 +1461,12 @@ LockCheckConflicts(LockMethod lockMethodTable,
return true;
}
- /* The relation extension lock conflict even between the group members. */
- if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND)
+ /*
+ * The relation extension or page lock conflict even between the group
+ * members.
+ */
+ if ((LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND) ||
+ (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
{
PROCLOCK_PRINT("LockCheckConflicts: conflicting (group)",
proclock);
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 1127168..b18f61b 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1078,12 +1078,12 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
/*
* If group locking is in use, locks held by members of my locking group
* need to be included in myHeldLocks. This is not required for
- * relation extension lock which conflict among group members. However,
- * including them in myHeldLocks will give group members the priority to get
- * those locks as compared to other backends which are also trying to
- * acquire those locks. OTOH, we can avoid giving priority to group members
- * for that kind of locks, but there doesn't appear to be a clear advantage
- * of the same.
+ * relation extension or page locks which conflict among group members.
+ * However, including them in myHeldLocks will give group members the
+ * priority to get those locks as compared to other backends which are
+ * also trying to acquire those locks. OTOH, we can avoid giving
+ * priority to group members for that kind of locks, but there
+ * doesn't appear to be a clear advantage of the same.
*/
if (leader != NULL)
{
--
1.8.3.1
v6-0002-Extend-the-assert-for-the-page-lock.patch
From 98dd50a2f70600a799d6d5827d48ddb4fb09fdb7 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 13 Mar 2020 14:02:06 +0530
Subject: [PATCH v6 2/4] Extend the assert for the page lock
---
src/backend/storage/lmgr/lock.c | 41 +++++++++++++++++++++++++++--------------
1 file changed, 27 insertions(+), 14 deletions(-)
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 24ca900..e182ec7 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -171,13 +171,16 @@ typedef struct TwoPhaseLockRecord
static int FastPathLocalUseCount = 0;
/*
- * Flag is set if the relation extension lock is currently held by this backend.
- * We need this flag so that we can ensure that while holding the relation
- * extension lock we are not trying to acquire any other heavy weight lock.
- * Basically, that will ensuring that the proc holding relation extension lock
- * can not wait for any another lock which can lead to a deadlock.
+ * Flag is set if the relation extension/page lock is currently held by this
+ * backend. We need this flag so that we can ensure that while holding the
+ * relation extension/page lock we are not trying to acquire any other heavy
+ * weight lock. Basically, that will ensure that the proc holding the
+ * relation extension lock cannot wait for any other lock, which could lead to a
+ * deadlock. However, for page lock the exception is that while holding the
+ * page lock it can wait on the relation extension lock.
*/
static bool IsRelationExtensionLockHeld = false;
+static bool IsPageLockHeld = false;
/* Macros for manipulating proc->fpLockBits */
#define FAST_PATH_BITS_PER_SLOT 3
@@ -854,11 +857,17 @@ LockAcquireExtended(const LOCKTAG *locktag,
/*
* We should not try to acquire any other heavyweight lock if we are already
- * holding the relation extension lock. If we are trying to hold the same
- * relation extension lock then it should have been already granted so we
- * will not come here.
+ * holding the relation extension/page lock. If we are trying to hold the
+ * same relation extension lock then it should have been already granted so
+ * we will not come here. However, while holding the page lock we don't
+ * need to check whether we are trying to acquire the relation
+ * extension lock on the same relation or any other relation because we are
+ * already ensuring that after holding the relation extension lock we are
+ * not going to wait for any other lock.
*/
- Assert(!IsRelationExtensionLockHeld);
+ Assert(!IsRelationExtensionLockHeld &&
+ (!IsPageLockHeld ||
+ (locktag->locktag_type == LOCKTAG_RELATION_EXTEND)));
/*
* Prepare to emit a WAL record if acquisition of this lock needs to be
@@ -1308,25 +1317,29 @@ SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
}
/*
- * CheckAndSetLockHeld -- check and set the flag that we hold relation extension
- * lock.
+ * CheckAndSetLockHeld -- check and set the flag that we hold relation
+ * extension/page lock.
*/
static inline void
CheckAndSetLockHeld(LOCALLOCK *locallock)
{
if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_RELATION_EXTEND)
IsRelationExtensionLockHeld = true;
+ else if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_PAGE)
+ IsPageLockHeld = true;
}
/*
* CheckAndReSetLockHeld -- check and reset the flag if we have released the
- * relation extension lock.
+ * relation extension/page lock.
*/
static inline void
CheckAndReSetLockHeld(LOCALLOCK *locallock)
{
if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_RELATION_EXTEND)
IsRelationExtensionLockHeld = false;
+ else if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_PAGE)
+ IsPageLockHeld = false;
}
/*
@@ -1365,7 +1378,7 @@ RemoveLocalLock(LOCALLOCK *locallock)
HASH_REMOVE, NULL))
elog(WARNING, "locallock table corrupted");
- /* Check and reset the lock held flag. */
+ /* Check and reset the lock held flags. */
CheckAndReSetLockHeld(locallock);
}
@@ -1664,7 +1677,7 @@ GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner)
if (owner != NULL)
ResourceOwnerRememberLock(owner, locallock);
- /* Set the flag if we have acquired relation extension lock. */
+ /* Check and set the lock held flags. */
CheckAndSetLockHeld(locallock);
}
--
1.8.3.1
Attachment: v6-0003-Relation-extension-lock-to-conflict-among-paralle.patch
From e57d7909735bb06d3cdb231043a97669262c3b8c Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 13 Mar 2020 18:55:04 +0530
Subject: [PATCH v6 3/4] Relation extension lock to conflict among parallel
group members.
Make the relation extension lock conflict among parallel group members,
so that multiple workers will not acquire the lock at the same time and
create a race condition. Also, remove this lock from participating in
deadlock detection.
---
src/backend/storage/lmgr/deadlock.c | 8 ++++++++
src/backend/storage/lmgr/lock.c | 8 ++++++++
src/backend/storage/lmgr/proc.c | 8 +++++++-
src/include/storage/lock.h | 1 +
4 files changed, 24 insertions(+), 1 deletion(-)
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index f8c5df0..6106c2d 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -555,6 +555,14 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
int numLockModes,
lm;
+ /*
+ * The relation extension lock can never participate in actual deadlock
+ * cycle. See Asserts in LockAcquireExtended. So, there is no advantage in
+ * checking wait edges from it.
+ */
+ if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND)
+ return false;
+
lockMethodTable = GetLocksMethodTable(lock);
numLockModes = lockMethodTable->numLockModes;
conflictMask = lockMethodTable->conflictTab[checkProc->waitLockMode];
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index e182ec7..02d7758 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1461,6 +1461,14 @@ LockCheckConflicts(LockMethod lockMethodTable,
return true;
}
+ /* The relation extension lock conflict even between the group members. */
+ if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND)
+ {
+ PROCLOCK_PRINT("LockCheckConflicts: conflicting (group)",
+ proclock);
+ return true;
+ }
+
/*
* Locks held in conflicting modes by members of our own lock group are
* not real conflicts; we can subtract those out and see if we still have
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index eb321f7..1127168 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1077,7 +1077,13 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
/*
* If group locking is in use, locks held by members of my locking group
- * need to be included in myHeldLocks.
+ * need to be included in myHeldLocks. This is not required for
+ * relation extension lock which conflict among group members. However,
+ * including them in myHeldLocks will give group members the priority to get
+ * those locks as compared to other backends which are also trying to
+ * acquire those locks. OTOH, we can avoid giving priority to group members
+ * for that kind of locks, but there doesn't appear to be a clear advantage
+ * of the same.
*/
if (leader != NULL)
{
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index bb8e4e6..fac979d 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -301,6 +301,7 @@ typedef struct LOCK
} LOCK;
#define LOCK_LOCKMETHOD(lock) ((LOCKMETHODID) (lock).tag.locktag_lockmethodid)
+#define LOCK_LOCKTAG(lock) ((LockTagType) (lock).tag.locktag_type)
/*
--
1.8.3.1
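The deadlock.c change in the patch above skips wait edges that go through relation extension locks, since the Asserts guarantee such a lock can never close a cycle. As a toy illustration (all types and names below are invented for this sketch, not PostgreSQL's actual structures), a simplified cycle search with that skip behaves as follows:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative lock tag types; not PostgreSQL's LockTagType. */
typedef enum { TAG_RELATION, TAG_RELATION_EXTEND, TAG_PAGE } TagType;

/* A toy wait-for edge: 'waiter' waits on a lock of type 'tag' held by 'holder'. */
typedef struct { int waiter; int holder; TagType tag; } Edge;

/*
 * Depth-first search for a cycle back to 'start', skipping edges through
 * relation extension locks, loosely as FindLockCycleRecurseMember now does.
 * (No visited set, for brevity; fine for the small acyclic-ish examples here.)
 */
static bool has_cycle(const Edge *edges, int n, int start, int cur)
{
    for (int i = 0; i < n; i++)
    {
        if (edges[i].waiter != cur)
            continue;
        if (edges[i].tag == TAG_RELATION_EXTEND)
            continue;           /* can never be part of a real deadlock */
        if (edges[i].holder == start)
            return true;
        if (has_cycle(edges, n, start, edges[i].holder))
            return true;
    }
    return false;
}
```

A two-process mutual wait is reported as a cycle only if neither edge is a relation extension wait; if one of the waits is for relation extension, the search ignores it.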
On Fri, Mar 13, 2020 at 3:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Mar 13, 2020 at 8:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Thu, Mar 12, 2020 at 5:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Mar 12, 2020 at 11:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
I have fixed this in the attached patch set.
I have modified your
v4-0003-Conflict-Extension-Page-lock-in-group-member patch. The
modifications are (a) Change src/backend/storage/lmgr/README to
reflect new behaviour, (b) Introduce a new macro LOCK_LOCKTAG which
slightly simplifies the code, (c) moved the deadlock.c check a few
lines up and (d) changed a few comments.Changes look fine to me.
Today, while looking at this patch again, I realized that there is a
case where we sometimes allow group members to jump the wait queue. This
is primarily to avoid creating deadlocks (see ProcSleep). Now,
ideally, we don't need this for relation extension or page locks as
those can never lead to deadlocks. However, the current code will
give group members more priority to acquire relation extension or page
locks if any one of the members has held those locks. Now, if we want
we can prevent giving group members priority for these locks, but I am
not sure how important is that case. So, I have left that as it is by
adding a few comments. What do you think?Additionally, I have changed/added a few more sentences in README.
I have included all your changes in the latest patch set.
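In miniature, the invariant these patches assert (no other heavyweight lock after relation extension; only relation extension after a page lock) could be modeled like this. The types and function names below are invented for illustration and are not PostgreSQL's:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative lock tag types; not PostgreSQL's LockTagType. */
typedef enum { TAG_RELATION, TAG_RELATION_EXTEND, TAG_PAGE } TagType;

/* Per-backend flags, mirroring IsRelationExtensionLockHeld/IsPageLockHeld. */
static bool rel_extension_held = false;
static bool page_lock_held = false;

/*
 * Mirror of the Assert in LockAcquireExtended: nothing may be acquired
 * while the relation extension lock is held, and while the page lock is
 * held only a relation extension lock may be requested.
 */
static bool acquire_allowed(TagType tag)
{
    return !rel_extension_held &&
           (!page_lock_held || tag == TAG_RELATION_EXTEND);
}

static void grant(TagType tag)
{
    assert(acquire_allowed(tag));
    if (tag == TAG_RELATION_EXTEND)
        rel_extension_held = true;
    else if (tag == TAG_PAGE)
        page_lock_held = true;
}

static void release(TagType tag)
{
    if (tag == TAG_RELATION_EXTEND)
        rel_extension_held = false;
    else if (tag == TAG_PAGE)
        page_lock_held = false;
}
```

Because acquisition never proceeds when the flag check fails, a backend holding one of these locks can never enter a wait that would let the lock appear in a deadlock cycle.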
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, Mar 13, 2020 at 7:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
Apart from that, I have also extended the solution for the page lock.
And, I have also broken down the 3rd patch in two parts for relation
extension and for the page lock.
Thanks, I have made a number of cosmetic changes and written
appropriate commit messages for all patches. See the attached patch
series and let me know your opinion? BTW, did you get a chance to test
page locks by using the extension which I have posted above or by some
other way? I think it is important to test page-lock related patches
now.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
Attachment: v7-0001-Assert-that-we-don-t-acquire-a-heavyweight-lock-on-a.patch
From a5e4594ecc1f363798fea4691c74c5559d56777c Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 14 Mar 2020 12:43:35 +0530
Subject: [PATCH 1/4] Assert that we don't acquire a heavyweight lock on
another object after relation extension lock.
The only exception to the rule is that we can try to acquire the same
relation extension lock more than once. This is allowed as we are not
creating any new lock for this case. This restriction implies that the
relation extension lock won't ever participate in the deadlock cycle
because we can never wait for any other heavyweight lock after acquiring
this lock.
Such a restriction is okay for relation extension locks as unlike other
heavyweight locks these are not held till the transaction end. These are
taken for a short duration to extend a particular relation and then
released.
Author: Dilip Kumar, with few changes by Amit Kapila
Reviewed-by: Amit Kapila and Kuntal Ghosh
Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com
---
src/backend/storage/lmgr/lock.c | 57 +++++++++++++++++++++++++++++++++++++++++
src/include/storage/lock.h | 1 +
2 files changed, 58 insertions(+)
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 1df7b8e..2ff7c31 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -170,6 +170,21 @@ typedef struct TwoPhaseLockRecord
*/
static int FastPathLocalUseCount = 0;
+/*
+ * Flag to indicate if the relation extension lock is held by this backend.
+ * This flag is used to ensure that while holding the relation extension lock
+ * we don't try to acquire a heavyweight lock on any other object. This
+ * restriction implies that the relation extension lock won't ever participate
+ * in the deadlock cycle because we can never wait for any other heavyweight
+ * lock after acquiring this lock.
+ *
+ * Such a restriction is okay for relation extension locks as unlike other
+ * heavyweight locks these are not held till the transaction end. These are
+ * taken for a short duration to extend a particular relation and then
+ * released.
+ */
+static bool IsRelationExtensionLockHeld = false;
+
/* Macros for manipulating proc->fpLockBits */
#define FAST_PATH_BITS_PER_SLOT 3
#define FAST_PATH_LOCKNUMBER_OFFSET 1
@@ -841,6 +856,13 @@ LockAcquireExtended(const LOCKTAG *locktag,
}
/*
+ * We don't acquire any other heavyweight lock while holding the relation
+ * extension lock. We do allow to acquire the same relation extension
+ * lock more than once but that case won't reach here.
+ */
+ Assert(!IsRelationExtensionLockHeld);
+
+ /*
* Prepare to emit a WAL record if acquisition of this lock needs to be
* replayed in a standby server.
*
@@ -1288,6 +1310,33 @@ SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
}
/*
+ * Check and set the flag that we hold the relation extension lock.
+ *
+ * It is callers responsibility that this function is called after acquiring
+ * the relation extension lock.
+ */
+static inline void
+CheckAndSetLockHeld(LOCALLOCK *locallock)
+{
+ if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_RELATION_EXTEND)
+ IsRelationExtensionLockHeld = true;
+}
+
+/*
+ * Check and reset the flag to indicate that we have released the relation
+ * extension lock.
+ *
+ * It is callers responsibility to ensure that this function is called after
+ * releasing the relation extension lock.
+ */
+static inline void
+CheckAndResetLockHeld(LOCALLOCK *locallock)
+{
+ if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_RELATION_EXTEND)
+ IsRelationExtensionLockHeld = false;
+}
+
+/*
* Subroutine to free a locallock entry
*/
static void
@@ -1322,6 +1371,11 @@ RemoveLocalLock(LOCALLOCK *locallock)
(void *) &(locallock->tag),
HASH_REMOVE, NULL))
elog(WARNING, "locallock table corrupted");
+
+ /*
+ * Indicate that the lock is released for a particular type of locks.
+ */
+ CheckAndResetLockHeld(locallock);
}
/*
@@ -1618,6 +1672,9 @@ GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner)
locallock->numLockOwners++;
if (owner != NULL)
ResourceOwnerRememberLock(owner, locallock);
+
+ /* Indicate that the lock is acquired for a certain type of locks. */
+ CheckAndSetLockHeld(locallock);
}
/*
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index bb8e4e6..fc0a712 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -419,6 +419,7 @@ typedef struct LOCALLOCK
} LOCALLOCK;
#define LOCALLOCK_LOCKMETHOD(llock) ((llock).tag.lock.locktag_lockmethodid)
+#define LOCALLOCK_LOCKTAG(llock) ((LockTagType) (llock).tag.lock.locktag_type)
/*
--
1.8.3.1
Attachment: v7-0002-Add-assert-to-ensure-that-page-locks-don-t-participa.patch
From 2e5005cf2a1fb4f6d21b786327e733a5722c0d77 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 14 Mar 2020 17:09:23 +0530
Subject: [PATCH 2/4] Add assert to ensure that page locks don't participate in
deadlock cycle.
Assert that we don't acquire any other heavyweight lock while holding the
page lock except for relation extension. However, these locks are never
taken in reverse order which implies that page locks will never
participate in the deadlock cycle.
Similar to relation extension, page locks are also held for a short
duration, so imposing such a restriction won't hurt.
Author: Dilip Kumar, with few changes by Amit Kapila
Reviewed-by: Amit Kapila and Kuntal Ghosh
Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com
---
src/backend/storage/lmgr/lock.c | 31 +++++++++++++++++++++++++++----
1 file changed, 27 insertions(+), 4 deletions(-)
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 2ff7c31..f0c913f0 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -185,6 +185,18 @@ static int FastPathLocalUseCount = 0;
*/
static bool IsRelationExtensionLockHeld = false;
+/*
+ * Flag to indicate if the page lock is held by this backend. We don't
+ * acquire any other heavyweight lock while holding the page lock except for
+ * relation extension. However, these locks are never taken in reverse order
+ * which implies that page locks will also never participate in the deadlock
+ * cycle.
+ *
+ * Similar to relation extension, page locks are also held for a short
+ * duration, so imposing such a restriction won't hurt.
+ */
+static bool IsPageLockHeld = false;
+
/* Macros for manipulating proc->fpLockBits */
#define FAST_PATH_BITS_PER_SLOT 3
#define FAST_PATH_LOCKNUMBER_OFFSET 1
@@ -863,6 +875,13 @@ LockAcquireExtended(const LOCKTAG *locktag,
Assert(!IsRelationExtensionLockHeld);
/*
+ * We don't acquire any other heavyweight lock while holding the page lock
+ * except for relation extension lock.
+ */
+ Assert(!IsPageLockHeld ||
+ (locktag->locktag_type == LOCKTAG_RELATION_EXTEND));
+
+ /*
* Prepare to emit a WAL record if acquisition of this lock needs to be
* replayed in a standby server.
*
@@ -1310,30 +1329,34 @@ SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
}
/*
- * Check and set the flag that we hold the relation extension lock.
+ * Check and set the flag that we hold the relation extension/page lock.
*
* It is callers responsibility that this function is called after acquiring
- * the relation extension lock.
+ * the relation extension/page lock.
*/
static inline void
CheckAndSetLockHeld(LOCALLOCK *locallock)
{
if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_RELATION_EXTEND)
IsRelationExtensionLockHeld = true;
+ else if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_PAGE)
+ IsPageLockHeld = true;
}
/*
* Check and reset the flag to indicate that we have released the relation
- * extension lock.
+ * extension/page lock.
*
* It is callers responsibility to ensure that this function is called after
- * releasing the relation extension lock.
+ * releasing the relation extension/page lock.
*/
static inline void
CheckAndResetLockHeld(LOCALLOCK *locallock)
{
if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_RELATION_EXTEND)
IsRelationExtensionLockHeld = false;
+ else if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_PAGE)
+ IsPageLockHeld = false;
}
/*
--
1.8.3.1
Attachment: v7-0003-Allow-relation-extension-lock-to-conflict-among-para.patch
From fd6a29f95c9554fe1426dc6bc1084698ad19c5a7 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 14 Mar 2020 17:48:17 +0530
Subject: [PATCH 3/4] Allow relation extension lock to conflict among parallel
group members.
This is required as it is no safer for two related processes to extend the
same relation at a time than for unrelated processes to do the same. We
don't acquire a heavyweight lock on any other object after relation
extension lock which means such a lock can never participate in the
deadlock cycle. So, avoid checking wait edges from this lock.
This provides an infrastructure to allow parallel operations like insert,
copy, etc. which were earlier not possible as parallel group members won't
conflict for relation extension lock.
Author: Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila and Kuntal Ghosh
Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com
---
src/backend/storage/lmgr/deadlock.c | 8 ++++++++
src/backend/storage/lmgr/lock.c | 10 ++++++++++
src/backend/storage/lmgr/proc.c | 8 +++++++-
src/include/storage/lock.h | 1 +
4 files changed, 26 insertions(+), 1 deletion(-)
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index f8c5df0..59060b6 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -555,6 +555,14 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
int numLockModes,
lm;
+ /*
+ * The relation extension lock can never participate in actual deadlock
+ * cycle. See Assert in LockAcquireExtended. So, there is no advantage
+ * in checking wait edges from it.
+ */
+ if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND)
+ return false;
+
lockMethodTable = GetLocksMethodTable(lock);
numLockModes = lockMethodTable->numLockModes;
conflictMask = lockMethodTable->conflictTab[checkProc->waitLockMode];
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index f0c913f0..36f98a7 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1481,6 +1481,16 @@ LockCheckConflicts(LockMethod lockMethodTable,
}
/*
+ * The relation extension lock conflict even between the group members.
+ */
+ if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND)
+ {
+ PROCLOCK_PRINT("LockCheckConflicts: conflicting (group)",
+ proclock);
+ return true;
+ }
+
+ /*
* Locks held in conflicting modes by members of our own lock group are
* not real conflicts; we can subtract those out and see if we still have
* a conflict. This is O(N) in the number of processes holding or
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index eb321f7..fa07ddf 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1077,7 +1077,13 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
/*
* If group locking is in use, locks held by members of my locking group
- * need to be included in myHeldLocks.
+ * need to be included in myHeldLocks. This is not required for relation
+ * extension lock which conflict among group members. However, including
+ * them in myHeldLocks will give group members the priority to get those
+ * locks as compared to other backends which are also trying to acquire
+ * those locks. OTOH, we can avoid giving priority to group members for
+ * that kind of locks, but there doesn't appear to be a clear advantage of
+ * the same.
*/
if (leader != NULL)
{
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index fc0a712..a89e54d 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -301,6 +301,7 @@ typedef struct LOCK
} LOCK;
#define LOCK_LOCKMETHOD(lock) ((LOCKMETHODID) (lock).tag.locktag_lockmethodid)
+#define LOCK_LOCKTAG(lock) ((LockTagType) (lock).tag.locktag_type)
/*
--
1.8.3.1
Attachment: v7-0004-Allow-page-lock-to-conflict-among-parallel-group-mem.patch
From cf4189bf8673c445c76f3b3aeb7d0e481e2faad0 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 14 Mar 2020 19:16:09 +0530
Subject: [PATCH 4/4] Allow page lock to conflict among parallel group members.
This is required as it is no safer for two related processes to perform
clean up in gin indexes at a time than for unrelated processes to do the
same. After acquiring page locks, we can acquire relation extension lock
but reverse never happens which means these will also not participate in
deadlock. So, avoid checking wait edges from this lock.
Currently, the parallel mode is strictly read-only, but after this patch
we have the infrastructure to allow parallel inserts and parallel copy.
Author: Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila and Kuntal Ghosh
Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com
---
src/backend/storage/lmgr/README | 60 ++++++++++++++++++++-----------------
src/backend/storage/lmgr/deadlock.c | 9 +++---
src/backend/storage/lmgr/lock.c | 6 ++--
src/backend/storage/lmgr/proc.c | 12 ++++----
4 files changed, 48 insertions(+), 39 deletions(-)
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 56b0a12..13eb1cc 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -597,21 +597,22 @@ deadlock detection algorithm very much, but it makes the bookkeeping more
complicated.
We choose to regard locks held by processes in the same parallel group as
-non-conflicting. This means that two processes in a parallel group can hold a
-self-exclusive lock on the same relation at the same time, or one process can
-acquire an AccessShareLock while the other already holds AccessExclusiveLock.
-This might seem dangerous and could be in some cases (more on that below), but
-if we didn't do this then parallel query would be extremely prone to
-self-deadlock. For example, a parallel query against a relation on which the
-leader already had AccessExclusiveLock would hang, because the workers would
-try to lock the same relation and be blocked by the leader; yet the leader
-can't finish until it receives completion indications from all workers. An
-undetected deadlock results. This is far from the only scenario where such a
-problem happens. The same thing will occur if the leader holds only
-AccessShareLock, the worker seeks AccessShareLock, but between the time the
-leader attempts to acquire the lock and the time the worker attempts to
-acquire it, some other process queues up waiting for an AccessExclusiveLock.
-In this case, too, an indefinite hang results.
+non-conflicting with the exception of relation extension and page locks. This
+means that two processes in a parallel group can hold a self-exclusive lock on
+the same relation at the same time, or one process can acquire an AccessShareLock
+while the other already holds AccessExclusiveLock. This might seem dangerous and
+could be in some cases (more on that below), but if we didn't do this then
+parallel query would be extremely prone to self-deadlock. For example, a
+parallel query against a relation on which the leader already had
+AccessExclusiveLock would hang, because the workers would try to lock the same
+relation and be blocked by the leader; yet the leader can't finish until it
+receives completion indications from all workers. An undetected deadlock
+results. This is far from the only scenario where such a problem happens. The
+same thing will occur if the leader holds only AccessShareLock, the worker
+seeks AccessShareLock, but between the time the leader attempts to acquire the
+lock and the time the worker attempts to acquire it, some other process queues
+up waiting for an AccessExclusiveLock. In this case, too, an indefinite hang
+results.
It might seem that we could predict which locks the workers will attempt to
acquire and ensure before going parallel that those locks would be acquired
@@ -637,18 +638,23 @@ the other is safe enough. Problems would occur if the leader initiated
parallelism from a point in the code at which it had some backend-private
state that made table access from another process unsafe, for example after
calling SetReindexProcessing and before calling ResetReindexProcessing,
-catastrophe could ensue, because the worker won't have that state. Similarly,
-problems could occur with certain kinds of non-relation locks, such as
-relation extension locks. It's no safer for two related processes to extend
-the same relation at the time than for unrelated processes to do the same.
-However, since parallel mode is strictly read-only at present, neither this
-nor most of the similar cases can arise at present. To allow parallel writes,
-we'll either need to (1) further enhance the deadlock detector to handle those
-types of locks in a different way than other types; or (2) have parallel
-workers use some other mutual exclusion method for such cases; or (3) revise
-those cases so that they no longer use heavyweight locking in the first place
-(which is not a crazy idea, given that such lock acquisitions are not expected
-to deadlock and that heavyweight lock acquisition is fairly slow anyway).
+catastrophe could ensue, because the worker won't have that state.
+
+To allow parallel inserts and parallel copy, we have ensured that relation
+extension and page locks don't participate in group locking which means such
+locks can conflict among the same group members. This is required as it is no
+safer for two related processes to extend the same relation or perform clean up
+in gin indexes at a time than for unrelated processes to do the same. We don't
+acquire a heavyweight lock on any other object after relation extension lock
+which means such a lock can never participate in the deadlock cycle. After
+acquiring page locks, we can acquire relation extension lock but reverse never
+happens, so those will also not participate in deadlock. To allow for other
+parallel writes like parallel update or parallel delete, we'll either need to
+(1) further enhance the deadlock detector to handle those tuple locks in a
+different way than other types; or (2) have parallel workers use some other
+mutual exclusion method for such cases. Currently, the parallel mode is
+strictly read-only, but now we have the infrastructure to allow parallel
+inserts and parallel copy.
Group locking adds three new members to each PGPROC: lockGroupLeader,
lockGroupMembers, and lockGroupLink. A PGPROC's lockGroupLeader is NULL for
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index 59060b6..beedc79 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -556,11 +556,12 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
lm;
/*
- * The relation extension lock can never participate in actual deadlock
- * cycle. See Assert in LockAcquireExtended. So, there is no advantage
- * in checking wait edges from it.
+ * The relation extension or page lock can never participate in actual
+ * deadlock cycle. See Asserts in LockAcquireExtended. So, there is no
+ * advantage in checking wait edges from them.
*/
- if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND)
+ if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND ||
+ (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
return false;
lockMethodTable = GetLocksMethodTable(lock);
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 36f98a7..915ad69 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1481,9 +1481,11 @@ LockCheckConflicts(LockMethod lockMethodTable,
}
/*
- * The relation extension lock conflict even between the group members.
+ * The relation extension or page lock conflict even between the group
+ * members.
*/
- if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND)
+ if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND ||
+ (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
{
PROCLOCK_PRINT("LockCheckConflicts: conflicting (group)",
proclock);
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index fa07ddf..9938cdd 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1078,12 +1078,12 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
/*
* If group locking is in use, locks held by members of my locking group
* need to be included in myHeldLocks. This is not required for relation
- * extension lock which conflict among group members. However, including
- * them in myHeldLocks will give group members the priority to get those
- * locks as compared to other backends which are also trying to acquire
- * those locks. OTOH, we can avoid giving priority to group members for
- * that kind of locks, but there doesn't appear to be a clear advantage of
- * the same.
+ * extension or page locks which conflict among group members. However,
+ * including them in myHeldLocks will give group members the priority to
+ * get those locks as compared to other backends which are also trying to
+ * acquire those locks. OTOH, we can avoid giving priority to group
+ * members for that kind of locks, but there doesn't appear to be a clear
+ * advantage of the same.
*/
if (leader != NULL)
{
--
1.8.3.1
On Sat, Mar 14, 2020 at 7:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Mar 13, 2020 at 7:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
Apart from that, I have also extended the solution for the page lock.
And, I have also broken down the 3rd patch in two parts for relation
extension and for the page lock.
Thanks, I have made a number of cosmetic changes and written
appropriate commit messages for all patches. See the attached patch
series and let me know your opinion? BTW, did you get a chance to test
page locks by using the extension which I have posted above or by some
other way? I think it is important to test page-lock related patches
now.
I have reviewed the updated patches and they look fine to me. Apart from
this, I have done testing for the page lock using the group locking
extension.
--Setup
create table gin_test_tbl(i int4[]) with (autovacuum_enabled = off);
create index gin_test_idx on gin_test_tbl using gin (i);
create table gin_test_tbl1(i int4[]) with (autovacuum_enabled = off);
create index gin_test_idx1 on gin_test_tbl1 using gin (i);
--session1:
select become_lock_group_leader();
select gin_clean_pending_list('gin_test_idx');
--session2:
select become_lock_group_member(session1_pid);
select gin_clean_pending_list('gin_test_idx1');
--session3:
select become_lock_group_leader();
select gin_clean_pending_list('gin_test_idx1');
--session4:
select become_lock_group_member(session3_pid);
select gin_clean_pending_list('gin_test_idx');
ERROR: deadlock detected
DETAIL: Process 61953 waits for ExclusiveLock on page 0 of relation
16399 of database 13577; blocked by process 62197.
Process 62197 waits for ExclusiveLock on page 0 of relation 16400 of
database 13577; blocked by process 61953.
HINT: See server log for query details.
Session1 and Session3 acquire the page lock on two different indexes'
meta-pages and are blocked in gdb; meanwhile, their group members try to
acquire the page locks as shown in the above example, and the deadlock
is detected. This problem is solved after applying the patch.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Sun, Mar 15, 2020 at 1:15 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Sat, Mar 14, 2020 at 7:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Mar 13, 2020 at 7:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
Apart from that, I have also extended the solution for the page lock.
And, I have also broken down the 3rd patch in two parts for relation
extension and for the page lock.Thanks, I have made a number of cosmetic changes and written
appropriate commit messages for all patches. See the attached patch
series and let me know your opinion? BTW, did you get a chance to test
page locks by using the extension which I have posted above or by some
other way? I think it is important to test page-lock related patches
now.I have reviewed the updated patches and looks fine to me. Apart from
this I have done testing for the Page Lock using group locking
extension.--Setup
create table gin_test_tbl(i int4[]) with (autovacuum_enabled = off);
create index gin_test_idx on gin_test_tbl using gin (i);
create table gin_test_tbl1(i int4[]) with (autovacuum_enabled = off);
create index gin_test_idx1 on gin_test_tbl1 using gin (i);

-- session1:
select become_lock_group_leader();
select gin_clean_pending_list('gin_test_idx');

-- session2:
select become_lock_group_member(session1_pid);
select gin_clean_pending_list('gin_test_idx1');

-- session3:
select become_lock_group_leader();
select gin_clean_pending_list('gin_test_idx1');

-- session4:
select become_lock_group_member(session3_pid);
select gin_clean_pending_list('gin_test_idx');

ERROR: deadlock detected
DETAIL: Process 61953 waits for ExclusiveLock on page 0 of relation
16399 of database 13577; blocked by process 62197.
Process 62197 waits for ExclusiveLock on page 0 of relation 16400 of
database 13577; blocked by process 61953.
HINT: See server log for query details.

Session1 and Session3 acquire the PageLock on the meta-pages of two
different indexes and are blocked in gdb; meanwhile, their group members
try to acquire the page locks as shown in the above example, and the
deadlock is detected. The deadlock no longer occurs after applying the
patch.
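The effect of the fix on the scenario above can be illustrated with a toy wait-for-graph check in C: edges that go through relation extension or page locks are simply skipped, mirroring the early return the patches add to FindLockCycleRecurseMember. The graph representation below is purely illustrative and is not PostgreSQL's actual deadlock-detector data structure.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
	TAG_RELATION,
	TAG_RELATION_EXTEND,
	TAG_PAGE
} LockTag;

#define NPROCS 4

/*
 * Toy wait-for graph: proc i waits for a lock of type tag[i] held by
 * proc holder[i] (holder[i] == -1 means proc i is not waiting).  Edges
 * through relation extension or page locks never participate in a
 * deadlock cycle, so they are skipped outright.
 */
static bool
has_cycle(const int holder[], const LockTag tag[], int start)
{
	int			cur = start;

	for (int step = 0; step < NPROCS; step++)
	{
		/* The early-return the patches add: skip these edge types. */
		if (tag[cur] == TAG_RELATION_EXTEND || tag[cur] == TAG_PAGE)
			return false;
		cur = holder[cur];
		if (cur < 0)
			return false;		/* chain ends: nobody is waiting */
		if (cur == start)
			return true;		/* back where we started: a cycle */
	}
	return false;
}
```

With the two mutually waiting processes from the example, the same wait-for shape is reported as a deadlock for ordinary relation locks but not for page locks, which is exactly the observed behavior change.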
I have modified 0001 and 0002 slightly. Basically, instead of the two
functions CheckAndSetLockHeld and CheckAndReSetLockHeld, I have created
one function.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachments:
v8-0001-Assert-that-we-don-t-acquire-a-heavyweight-lock-o.patch
From 9a690242e080cbb8182519895efcc575fd020959 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 14 Mar 2020 12:43:35 +0530
Subject: [PATCH v8 1/4] Assert that we don't acquire a heavyweight lock on
another object after relation extension lock.
The only exception to the rule is that we can try to acquire the same
relation extension lock more than once. This is allowed as we are not
creating any new lock for this case. This restriction implies that the
relation extension lock won't ever participate in the deadlock cycle
because we can never wait for any other heavyweight lock after acquiring
this lock.
Such a restriction is okay for relation extension locks as unlike other
heavyweight locks these are not held till the transaction end. These are
taken for a short duration to extend a particular relation and then
released.
Author: Dilip Kumar, with few changes by Amit Kapila
Reviewed-by: Amit Kapila and Kuntal Ghosh
Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com
---
src/backend/storage/lmgr/lock.c | 43 +++++++++++++++++++++++++++++++++++++++++
src/include/storage/lock.h | 1 +
2 files changed, 44 insertions(+)
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 1df7b8e..bad4e3a 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -170,6 +170,21 @@ typedef struct TwoPhaseLockRecord
*/
static int FastPathLocalUseCount = 0;
+/*
+ * Flag to indicate if the relation extension lock is held by this backend.
+ * This flag is used to ensure that while holding the relation extension lock
+ * we don't try to acquire a heavyweight lock on any other object. This
+ * restriction implies that the relation extension lock won't ever participate
+ * in the deadlock cycle because we can never wait for any other heavyweight
+ * lock after acquiring this lock.
+ *
+ * Such a restriction is okay for relation extension locks as unlike other
+ * heavyweight locks these are not held till the transaction end. These are
+ * taken for a short duration to extend a particular relation and then
+ * released.
+ */
+static bool IsRelationExtensionLockHeld = false;
+
/* Macros for manipulating proc->fpLockBits */
#define FAST_PATH_BITS_PER_SLOT 3
#define FAST_PATH_LOCKNUMBER_OFFSET 1
@@ -841,6 +856,13 @@ LockAcquireExtended(const LOCKTAG *locktag,
}
/*
+ * We don't acquire any other heavyweight lock while holding the relation
+ * extension lock. We do allow to acquire the same relation extension
+ * lock more than once but that case won't reach here.
+ */
+ Assert(!IsRelationExtensionLockHeld);
+
+ /*
* Prepare to emit a WAL record if acquisition of this lock needs to be
* replayed in a standby server.
*
@@ -1288,6 +1310,19 @@ SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
}
/*
+ * Check and set/reset the flag that we hold the relation extension lock.
+ *
+ * It is callers responsibility that this function is called after
+ * acquiring/releasing the relation extension lock.
+ */
+static inline void
+CheckAndSetLockHeld(LOCALLOCK *locallock, bool value)
+{
+ if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_RELATION_EXTEND)
+ IsRelationExtensionLockHeld = value;
+}
+
+/*
* Subroutine to free a locallock entry
*/
static void
@@ -1322,6 +1357,11 @@ RemoveLocalLock(LOCALLOCK *locallock)
(void *) &(locallock->tag),
HASH_REMOVE, NULL))
elog(WARNING, "locallock table corrupted");
+
+ /*
+ * Indicate that the lock is released for a particular type of locks.
+ */
+ CheckAndSetLockHeld(locallock, false);
}
/*
@@ -1618,6 +1658,9 @@ GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner)
locallock->numLockOwners++;
if (owner != NULL)
ResourceOwnerRememberLock(owner, locallock);
+
+ /* Indicate that the lock is acquired for a certain type of locks. */
+ CheckAndSetLockHeld(locallock, true);
}
/*
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index bb8e4e6..fc0a712 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -419,6 +419,7 @@ typedef struct LOCALLOCK
} LOCALLOCK;
#define LOCALLOCK_LOCKMETHOD(llock) ((llock).tag.lock.locktag_lockmethodid)
+#define LOCALLOCK_LOCKTAG(llock) ((LockTagType) (llock).tag.lock.locktag_type)
/*
--
1.8.3.1
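The mechanism of patch 0001 — a backend-local flag plus an Assert in LockAcquireExtended — can be sketched as a standalone C toy. The names below are illustrative stand-ins, not the real PostgreSQL symbols (those are IsRelationExtensionLockHeld and CheckAndSetLockHeld), and the re-acquisition fast path that bypasses the Assert is not modeled.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
	TAG_RELATION,
	TAG_RELATION_EXTEND,
	TAG_PAGE
} LockTag;

/* Backend-local flag: is the relation extension lock currently held? */
static bool extension_lock_held = false;

static void
acquire(LockTag tag)
{
	/*
	 * No other heavyweight lock may be taken while holding the relation
	 * extension lock; this is what keeps it out of deadlock cycles.
	 */
	assert(!extension_lock_held);
	if (tag == TAG_RELATION_EXTEND)
		extension_lock_held = true;
}

static void
release(LockTag tag)
{
	/* Only the extension lock is tracked in this sketch. */
	if (tag == TAG_RELATION_EXTEND)
		extension_lock_held = false;
}
```

The point of the design is that the invariant is enforced purely with backend-local state: no shared memory traffic is needed to check it on every lock acquisition.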
v8-0003-Allow-relation-extension-lock-to-conflict-among-p.patch
From 46afc78e3fa3bba3609c6c52647672606a250d39 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 14 Mar 2020 17:48:17 +0530
Subject: [PATCH v8 3/4] Allow relation extension lock to conflict among
parallel group members.
This is required as it is no safer for two related processes to extend the
same relation at a time than for unrelated processes to do the same. We
don't acquire a heavyweight lock on any other object after relation
extension lock which means such a lock can never participate in the
deadlock cycle. So, avoid checking wait edges from this lock.
This provides an infrastructure to allow parallel operations like insert,
copy, etc. which were earlier not possible as parallel group members won't
conflict for relation extension lock.
Author: Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila and Kuntal Ghosh
Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com
---
src/backend/storage/lmgr/deadlock.c | 8 ++++++++
src/backend/storage/lmgr/lock.c | 10 ++++++++++
src/backend/storage/lmgr/proc.c | 8 +++++++-
src/include/storage/lock.h | 1 +
4 files changed, 26 insertions(+), 1 deletion(-)
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index f8c5df0..59060b6 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -555,6 +555,14 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
int numLockModes,
lm;
+ /*
+ * The relation extension lock can never participate in actual deadlock
+ * cycle. See Assert in LockAcquireExtended. So, there is no advantage
+ * in checking wait edges from it.
+ */
+ if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND)
+ return false;
+
lockMethodTable = GetLocksMethodTable(lock);
numLockModes = lockMethodTable->numLockModes;
conflictMask = lockMethodTable->conflictTab[checkProc->waitLockMode];
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 7e65943..0aae51e 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1465,6 +1465,16 @@ LockCheckConflicts(LockMethod lockMethodTable,
}
/*
+ * The relation extension lock conflict even between the group members.
+ */
+ if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND)
+ {
+ PROCLOCK_PRINT("LockCheckConflicts: conflicting (group)",
+ proclock);
+ return true;
+ }
+
+ /*
* Locks held in conflicting modes by members of our own lock group are
* not real conflicts; we can subtract those out and see if we still have
* a conflict. This is O(N) in the number of processes holding or
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index eb321f7..fa07ddf 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1077,7 +1077,13 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
/*
* If group locking is in use, locks held by members of my locking group
- * need to be included in myHeldLocks.
+ * need to be included in myHeldLocks. This is not required for relation
+ * extension lock which conflict among group members. However, including
+ * them in myHeldLocks will give group members the priority to get those
+ * locks as compared to other backends which are also trying to acquire
+ * those locks. OTOH, we can avoid giving priority to group members for
+ * that kind of locks, but there doesn't appear to be a clear advantage of
+ * the same.
*/
if (leader != NULL)
{
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index fc0a712..a89e54d 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -301,6 +301,7 @@ typedef struct LOCK
} LOCK;
#define LOCK_LOCKMETHOD(lock) ((LOCKMETHODID) (lock).tag.locktag_lockmethodid)
+#define LOCK_LOCKTAG(lock) ((LockTagType) (lock).tag.locktag_type)
/*
--
1.8.3.1
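The rule 0003 adds to LockCheckConflicts can be sketched in isolation: relation extension locks conflict even between members of the same parallel group, while every other lock type keeps the group-locking exemption. This is a simplified boolean model, not the actual conflict-mask arithmetic in lock.c.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
	TAG_RELATION,
	TAG_RELATION_EXTEND
} LockTag;

/*
 * Does a requested lock conflict with an existing holder?  modes_conflict
 * says whether the two lock modes conflict per the conflict table;
 * holder_in_my_group says whether the holder is in our parallel group.
 */
static bool
conflicts(LockTag tag, bool modes_conflict, bool holder_in_my_group)
{
	if (!modes_conflict)
		return false;
	/* Extension locks never get the group-locking exemption. */
	if (tag == TAG_RELATION_EXTEND)
		return true;
	/* Other locks: same-group holders are not real conflicts. */
	return !holder_in_my_group;
}
```

This is the piece that makes parallel inserts safe to build later: two workers of the same query can no longer extend the same relation concurrently just because they share a lock group.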
v8-0004-Allow-page-lock-to-conflict-among-parallel-group-.patch
From ecab22b049ddeb42af899c8403b6962b371d5876 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 14 Mar 2020 19:16:09 +0530
Subject: [PATCH v8 4/4] Allow page lock to conflict among parallel group
members.
This is required as it is no safer for two related processes to perform
clean up in gin indexes at a time than for unrelated processes to do the
same. After acquiring page locks, we can acquire relation extension lock
but reverse never happens which means these will also not participate in
deadlock. So, avoid checking wait edges from this lock.
Currently, the parallel mode is strictly read-only, but after this patch
we have the infrastructure to allow parallel inserts and parallel copy.
Author: Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila and Kuntal Ghosh
Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com
---
src/backend/storage/lmgr/README | 60 ++++++++++++++++++++-----------------
src/backend/storage/lmgr/deadlock.c | 9 +++---
src/backend/storage/lmgr/lock.c | 6 ++--
src/backend/storage/lmgr/proc.c | 12 ++++----
4 files changed, 48 insertions(+), 39 deletions(-)
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 56b0a12..13eb1cc 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -597,21 +597,22 @@ deadlock detection algorithm very much, but it makes the bookkeeping more
complicated.
We choose to regard locks held by processes in the same parallel group as
-non-conflicting. This means that two processes in a parallel group can hold a
-self-exclusive lock on the same relation at the same time, or one process can
-acquire an AccessShareLock while the other already holds AccessExclusiveLock.
-This might seem dangerous and could be in some cases (more on that below), but
-if we didn't do this then parallel query would be extremely prone to
-self-deadlock. For example, a parallel query against a relation on which the
-leader already had AccessExclusiveLock would hang, because the workers would
-try to lock the same relation and be blocked by the leader; yet the leader
-can't finish until it receives completion indications from all workers. An
-undetected deadlock results. This is far from the only scenario where such a
-problem happens. The same thing will occur if the leader holds only
-AccessShareLock, the worker seeks AccessShareLock, but between the time the
-leader attempts to acquire the lock and the time the worker attempts to
-acquire it, some other process queues up waiting for an AccessExclusiveLock.
-In this case, too, an indefinite hang results.
+non-conflicting with the exception of relation extension and page locks. This
+means that two processes in a parallel group can hold a self-exclusive lock on
+the same relation at the same time, or one process can acquire an AccessShareLock
+while the other already holds AccessExclusiveLock. This might seem dangerous and
+could be in some cases (more on that below), but if we didn't do this then
+parallel query would be extremely prone to self-deadlock. For example, a
+parallel query against a relation on which the leader already had
+AccessExclusiveLock would hang, because the workers would try to lock the same
+relation and be blocked by the leader; yet the leader can't finish until it
+receives completion indications from all workers. An undetected deadlock
+results. This is far from the only scenario where such a problem happens. The
+same thing will occur if the leader holds only AccessShareLock, the worker
+seeks AccessShareLock, but between the time the leader attempts to acquire the
+lock and the time the worker attempts to acquire it, some other process queues
+up waiting for an AccessExclusiveLock. In this case, too, an indefinite hang
+results.
It might seem that we could predict which locks the workers will attempt to
acquire and ensure before going parallel that those locks would be acquired
@@ -637,18 +638,23 @@ the other is safe enough. Problems would occur if the leader initiated
parallelism from a point in the code at which it had some backend-private
state that made table access from another process unsafe, for example after
calling SetReindexProcessing and before calling ResetReindexProcessing,
-catastrophe could ensue, because the worker won't have that state. Similarly,
-problems could occur with certain kinds of non-relation locks, such as
-relation extension locks. It's no safer for two related processes to extend
-the same relation at the time than for unrelated processes to do the same.
-However, since parallel mode is strictly read-only at present, neither this
-nor most of the similar cases can arise at present. To allow parallel writes,
-we'll either need to (1) further enhance the deadlock detector to handle those
-types of locks in a different way than other types; or (2) have parallel
-workers use some other mutual exclusion method for such cases; or (3) revise
-those cases so that they no longer use heavyweight locking in the first place
-(which is not a crazy idea, given that such lock acquisitions are not expected
-to deadlock and that heavyweight lock acquisition is fairly slow anyway).
+catastrophe could ensue, because the worker won't have that state.
+
+To allow parallel inserts and parallel copy, we have ensured that relation
+extension and page locks don't participate in group locking which means such
+locks can conflict among the same group members. This is required as it is no
+safer for two related processes to extend the same relation or perform clean up
+in gin indexes at a time than for unrelated processes to do the same. We don't
+acquire a heavyweight lock on any other object after relation extension lock
+which means such a lock can never participate in the deadlock cycle. After
+acquiring page locks, we can acquire relation extension lock but reverse never
+happens, so those will also not participate in deadlock. To allow for other
+parallel writes like parallel update or parallel delete, we'll either need to
+(1) further enhance the deadlock detector to handle those tuple locks in a
+different way than other types; or (2) have parallel workers use some other
+mutual exclusion method for such cases. Currently, the parallel mode is
+strictly read-only, but now we have the infrastructure to allow parallel
+inserts and parallel copy.
Group locking adds three new members to each PGPROC: lockGroupLeader,
lockGroupMembers, and lockGroupLink. A PGPROC's lockGroupLeader is NULL for
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index 59060b6..beedc79 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -556,11 +556,12 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
lm;
/*
- * The relation extension lock can never participate in actual deadlock
- * cycle. See Assert in LockAcquireExtended. So, there is no advantage
- * in checking wait edges from it.
+ * The relation extension or page lock can never participate in actual
+ * deadlock cycle. See Asserts in LockAcquireExtended. So, there is no
+ * advantage in checking wait edges from them.
*/
- if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND)
+ if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND ||
+ (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
return false;
lockMethodTable = GetLocksMethodTable(lock);
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 0aae51e..7db514b 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1465,9 +1465,11 @@ LockCheckConflicts(LockMethod lockMethodTable,
}
/*
- * The relation extension lock conflict even between the group members.
+ * The relation extension or page lock conflict even between the group
+ * members.
*/
- if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND)
+ if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND ||
+ (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
{
PROCLOCK_PRINT("LockCheckConflicts: conflicting (group)",
proclock);
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index fa07ddf..9938cdd 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1078,12 +1078,12 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
/*
* If group locking is in use, locks held by members of my locking group
* need to be included in myHeldLocks. This is not required for relation
- * extension lock which conflict among group members. However, including
- * them in myHeldLocks will give group members the priority to get those
- * locks as compared to other backends which are also trying to acquire
- * those locks. OTOH, we can avoid giving priority to group members for
- * that kind of locks, but there doesn't appear to be a clear advantage of
- * the same.
+ * extension or page locks which conflict among group members. However,
+ * including them in myHeldLocks will give group members the priority to
+ * get those locks as compared to other backends which are also trying to
+ * acquire those locks. OTOH, we can avoid giving priority to group
+ * members for that kind of locks, but there doesn't appear to be a clear
+ * advantage of the same.
*/
if (leader != NULL)
{
--
1.8.3.1
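Patches 0002 and 0004 together establish an ordering invariant: while holding a page lock, only the relation extension lock may still be acquired (as happens in the ginInsertCleanup path), and nothing may follow a relation extension lock. A minimal sketch of that combined invariant, again with illustrative names rather than the real PostgreSQL symbols:

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
	TAG_RELATION,
	TAG_RELATION_EXTEND,
	TAG_PAGE
} LockTag;

static bool page_lock_held = false;
static bool extension_lock_held = false;

static void
acquire(LockTag tag)
{
	/* Nothing may be acquired after the relation extension lock. */
	assert(!extension_lock_held);

	/*
	 * While holding a page lock, only the relation extension lock may
	 * still be taken (e.g. GinNewBuffer during pending-list cleanup).
	 * The reverse order never happens, so neither lock type can close
	 * a deadlock cycle.
	 */
	assert(!page_lock_held || tag == TAG_RELATION_EXTEND);

	if (tag == TAG_PAGE)
		page_lock_held = true;
	if (tag == TAG_RELATION_EXTEND)
		extension_lock_held = true;
}

static void
release(LockTag tag)
{
	if (tag == TAG_PAGE)
		page_lock_held = false;
	if (tag == TAG_RELATION_EXTEND)
		extension_lock_held = false;
}
```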
v8-0002-Add-assert-to-ensure-that-page-locks-don-t-partic.patch
From 4c021fa0343964268c03d1af627f6ece05b23fc9 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Sun, 15 Mar 2020 16:24:02 +0530
Subject: [PATCH v8 2/4] Add assert to ensure that page locks don't participate
in deadlock cycle.
Assert that we don't acquire any other heavyweight lock while holding the
page lock except for relation extension. However, these locks are never
taken in reverse order which implies that page locks will never
participate in the deadlock cycle.
Similar to relation extension, page locks are also held for a short
duration, so imposing such a restriction won't hurt.
Author: Dilip Kumar, with few changes by Amit Kapila
Reviewed-by: Amit Kapila and Kuntal Ghosh
Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com
---
src/backend/storage/lmgr/lock.c | 25 +++++++++++++++++++++++--
1 file changed, 23 insertions(+), 2 deletions(-)
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index bad4e3a..7e65943 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -185,6 +185,18 @@ static int FastPathLocalUseCount = 0;
*/
static bool IsRelationExtensionLockHeld = false;
+/*
+ * Flag to indicate if the page lock is held by this backend. We don't
+ * acquire any other heavyweight lock while holding the page lock except for
+ * relation extension. However, these locks are never taken in reverse order
+ * which implies that page locks will also never participate in the deadlock
+ * cycle.
+ *
+ * Similar to relation extension, page locks are also held for a short
+ * duration, so imposing such a restriction won't hurt.
+ */
+static bool IsPageLockHeld = false;
+
/* Macros for manipulating proc->fpLockBits */
#define FAST_PATH_BITS_PER_SLOT 3
#define FAST_PATH_LOCKNUMBER_OFFSET 1
@@ -863,6 +875,13 @@ LockAcquireExtended(const LOCKTAG *locktag,
Assert(!IsRelationExtensionLockHeld);
/*
+ * We don't acquire any other heavyweight lock while holding the page lock
+ * except for relation extension lock.
+ */
+ Assert(!IsPageLockHeld ||
+ (locktag->locktag_type == LOCKTAG_RELATION_EXTEND));
+
+ /*
* Prepare to emit a WAL record if acquisition of this lock needs to be
* replayed in a standby server.
*
@@ -1310,16 +1329,18 @@ SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
}
/*
- * Check and set/reset the flag that we hold the relation extension lock.
+ * Check and set/reset the flag that we hold the relation extension/page lock.
*
* It is callers responsibility that this function is called after
- * acquiring/releasing the relation extension lock.
+ * acquiring/releasing the relation extension/page lock.
*/
static inline void
CheckAndSetLockHeld(LOCALLOCK *locallock, bool value)
{
if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_RELATION_EXTEND)
IsRelationExtensionLockHeld = value;
+ else if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_PAGE)
+ IsPageLockHeld = value;
}
/*
--
1.8.3.1
On Sun, Mar 15, 2020 at 1:15 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Sat, Mar 14, 2020 at 7:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Mar 13, 2020 at 7:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
Apart from that, I have also extended the solution for the page lock.
And I have also broken down the 3rd patch into two parts, one for
relation extension and one for the page lock.

Thanks, I have made a number of cosmetic changes and written
appropriate commit messages for all patches. See the attached patch
series and let me know your opinion. BTW, did you get a chance to test
page locks by using the extension which I have posted above, or by some
other way? I think it is important to test page-lock related patches
now.

I have reviewed the updated patches and they look fine to me. Apart from
this, I have done testing for the page lock using the group locking
extension.

-- Setup
create table gin_test_tbl(i int4[]) with (autovacuum_enabled = off);
create index gin_test_idx on gin_test_tbl using gin (i);
create table gin_test_tbl1(i int4[]) with (autovacuum_enabled = off);
create index gin_test_idx1 on gin_test_tbl1 using gin (i);

-- session1:
select become_lock_group_leader();
select gin_clean_pending_list('gin_test_idx');

-- session2:
select become_lock_group_member(session1_pid);
select gin_clean_pending_list('gin_test_idx1');

-- session3:
select become_lock_group_leader();
select gin_clean_pending_list('gin_test_idx1');

-- session4:
select become_lock_group_member(session3_pid);
select gin_clean_pending_list('gin_test_idx');

ERROR: deadlock detected
DETAIL: Process 61953 waits for ExclusiveLock on page 0 of relation
16399 of database 13577; blocked by process 62197.
Process 62197 waits for ExclusiveLock on page 0 of relation 16400 of
database 13577; blocked by process 61953.
HINT: See server log for query details.

Session1 and Session3 acquire the PageLock on the meta-pages of two
different indexes and are blocked in gdb; meanwhile, their group members
try to acquire the page locks as shown in the above example, and the
deadlock is detected. The deadlock no longer occurs after applying the
patch.
So, in this test, you have first performed the actions from Session-1
and Session-3 (blocked them via GDB after acquiring page lock) and
then performed the actions from Session-2 and Session-4, right?
Though this is not a very realistic case, it proves the point that
page locks don't participate in the deadlock cycle after the patch. I
think we can do a few more tests that test other aspects of the patch.
1. Group members wait for page locks. If you test that the leader
acquires the page lock and then member also tries to acquire the same
lock on the same index, it wouldn't block before the patch, but after
the patch, the member should wait for the leader to release the lock.
2. Try to hit Assert in LockAcquireExtended (a) by trying to
re-acquire the page lock via the debugger, (b) try to acquire the
relation extension lock after page lock and it should be allowed
(after acquiring page lock, we take relation extension lock in
following code path:
ginInsertCleanup->ginEntryInsert->ginFindLeafPage->ginPlaceToPage->GinNewBuffer).
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sun, Mar 15, 2020 at 4:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
I have modified 0001 and 0002 slightly. Basically, instead of the two
functions CheckAndSetLockHeld and CheckAndReSetLockHeld, I have created
one function.

+CheckAndSetLockHeld(LOCALLOCK *locallock, bool value)

Can we rename the parameter as lock_held, acquired, or something like
that so that it indicates what it intends to do, and probably add a
comment for that variable atop the function?
There is some work left related to testing some parts of the patch and
I can do some more review, but it started to look good to me, so I am
planning to push this in the coming week (say by Wednesday or so)
unless there are some major comments. There are primarily two parts
of the patch-series (a) Assert that we don't acquire a heavyweight
lock on another object after relation extension lock. (b) Allow
relation extension lock to conflict among the parallel group members.
On similar lines there are two patches for page locks.
I think we have discussed the LWLock approach in detail, and it seems
that it might be trickier than we initially thought, especially with
some of the latest findings where we have noticed that there are
multiple cases where we can try to re-acquire the relation extension
lock, and other things which we have discussed. Also, not all of us
agree with that idea.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sun, Mar 15, 2020 at 5:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sun, Mar 15, 2020 at 1:15 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Sat, Mar 14, 2020 at 7:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Mar 13, 2020 at 7:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
Apart from that, I have also extended the solution for the page lock.
And I have also broken down the 3rd patch into two parts, one for
relation extension and one for the page lock.

Thanks, I have made a number of cosmetic changes and written
appropriate commit messages for all patches. See the attached patch
series and let me know your opinion. BTW, did you get a chance to test
page locks by using the extension which I have posted above, or by some
other way? I think it is important to test page-lock related patches
now.

I have reviewed the updated patches and they look fine to me. Apart from
this, I have done testing for the page lock using the group locking
extension.

-- Setup
create table gin_test_tbl(i int4[]) with (autovacuum_enabled = off);
create index gin_test_idx on gin_test_tbl using gin (i);
create table gin_test_tbl1(i int4[]) with (autovacuum_enabled = off);
create index gin_test_idx1 on gin_test_tbl1 using gin (i);

-- session1:
select become_lock_group_leader();
select gin_clean_pending_list('gin_test_idx');

-- session2:
select become_lock_group_member(session1_pid);
select gin_clean_pending_list('gin_test_idx1');

-- session3:
select become_lock_group_leader();
select gin_clean_pending_list('gin_test_idx1');

-- session4:
select become_lock_group_member(session3_pid);
select gin_clean_pending_list('gin_test_idx');

ERROR: deadlock detected
DETAIL: Process 61953 waits for ExclusiveLock on page 0 of relation
16399 of database 13577; blocked by process 62197.
Process 62197 waits for ExclusiveLock on page 0 of relation 16400 of
database 13577; blocked by process 61953.
HINT: See server log for query details.

Session1 and Session3 acquire the PageLock on the meta-pages of two
different indexes and are blocked in gdb; meanwhile, their group members
try to acquire the page locks as shown in the above example, and the
deadlock is detected. The deadlock no longer occurs after applying the
patch.

So, in this test, you have first performed the actions from Session-1
and Session-3 (blocked them via GDB after acquiring the page lock) and
then performed the actions from Session-2 and Session-4, right?
Yes
Though this is not a very realistic case, it proves the point that
page locks don't participate in the deadlock cycle after the patch. I
think we can do a few more tests that test other aspects of the patch.

1. Group members wait for page locks. If you test that the leader
acquires the page lock and then a member also tries to acquire the same
lock on the same index, it wouldn't block before the patch, but after
the patch, the member should wait for the leader to release the lock.
Okay, I will test this part.
2. Try to hit Assert in LockAcquireExtended (a) by trying to
re-acquire the page lock via the debugger,
I am not sure whether that is true, because if we are already holding
the page lock and we try to take the same page lock, the lock will be
granted without reaching that code path. However, I agree that this is
not intended; rather, it is a side effect of allowing the relation
extension lock to be taken while holding the same relation extension
lock. So basically, the situation now is that if the lock is directly
granted because we are holding the same lock, then it will not reach
the assert code. IMHO, we don't need to add extra code to make it
behave differently. Please let me know your opinion on this.
(b) try to acquire the
relation extension lock after page lock and it should be allowed
(after acquiring page lock, we take relation extension lock in
following code path:
ginInsertCleanup->ginEntryInsert->ginFindLeafPage->ginPlaceToPage->GinNewBuffer).
ok
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Sun, Mar 15, 2020 at 6:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sun, Mar 15, 2020 at 4:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
I have modified 0001 and 0002 slightly. Basically, instead of the two
functions CheckAndSetLockHeld and CheckAndReSetLockHeld, I have created
one function.

+CheckAndSetLockHeld(LOCALLOCK *locallock, bool value)

Can we rename the parameter as lock_held, acquired, or something like
that so that it indicates what it intends to do, and probably add a
comment for that variable atop the function?
Done
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachments:
v9-0002-Add-assert-to-ensure-that-page-locks-don-t-partic.patch
From 74874bad13296ce298f66f334ac7c9cf41af0204 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Sun, 15 Mar 2020 21:21:07 +0530
Subject: [PATCH v9 2/4] Add assert to ensure that page locks don't participate
in deadlock cycle.
Assert that we don't acquire any other heavyweight lock while holding the
page lock except for relation extension. However, these locks are never
taken in reverse order which implies that page locks will never
participate in the deadlock cycle.
Similar to relation extension, page locks are also held for a short
duration, so imposing such a restriction won't hurt.
Author: Dilip Kumar, with few changes by Amit Kapila
Reviewed-by: Amit Kapila and Kuntal Ghosh
Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com
---
src/backend/storage/lmgr/lock.c | 25 +++++++++++++++++++++++--
1 file changed, 23 insertions(+), 2 deletions(-)
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 076f436..57db6cc 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -185,6 +185,18 @@ static int FastPathLocalUseCount = 0;
*/
static bool IsRelationExtensionLockHeld = false;
+/*
+ * Flag to indicate if the page lock is held by this backend. We don't
+ * acquire any other heavyweight lock while holding the page lock except for
+ * relation extension. However, these locks are never taken in reverse order
+ * which implies that page locks will also never participate in the deadlock
+ * cycle.
+ *
+ * Similar to relation extension, page locks are also held for a short
+ * duration, so imposing such a restriction won't hurt.
+ */
+static bool IsPageLockHeld = false;
+
/* Macros for manipulating proc->fpLockBits */
#define FAST_PATH_BITS_PER_SLOT 3
#define FAST_PATH_LOCKNUMBER_OFFSET 1
@@ -863,6 +875,13 @@ LockAcquireExtended(const LOCKTAG *locktag,
Assert(!IsRelationExtensionLockHeld);
/*
+ * We don't acquire any other heavyweight lock while holding the page lock
+ * except for relation extension lock.
+ */
+ Assert(!IsPageLockHeld ||
+ (locktag->locktag_type == LOCKTAG_RELATION_EXTEND));
+
+ /*
* Prepare to emit a WAL record if acquisition of this lock needs to be
* replayed in a standby server.
*
@@ -1310,10 +1329,10 @@ SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
}
/*
- * Check and set/reset the flag that we hold the relation extension lock.
+ * Check and set/reset the flag that we hold the relation extension/page lock.
*
* It is callers responsibility that this function is called after
- * acquiring/releasing the relation extension lock.
+ * acquiring/releasing the relation extension/page lock.
*
* Pass acquired = true if lock is acquired, false otherwise.
*/
@@ -1322,6 +1341,8 @@ CheckAndSetLockHeld(LOCALLOCK *locallock, bool acquired)
{
if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_RELATION_EXTEND)
IsRelationExtensionLockHeld = acquired;
+ else if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_PAGE)
+ IsPageLockHeld = acquired;
}
/*
--
1.8.3.1
v9-0003-Allow-relation-extension-lock-to-conflict-among-p.patch
From 580ca51d64dd8f57c4862bdcf4150898a3f9cf1a Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 14 Mar 2020 17:48:17 +0530
Subject: [PATCH v9 3/4] Allow relation extension lock to conflict among
parallel group members.
This is required as it is no safer for two related processes to extend the
same relation at a time than for unrelated processes to do the same. We
don't acquire a heavyweight lock on any other object after relation
extension lock which means such a lock can never participate in the
deadlock cycle. So, avoid checking wait edges from this lock.
This provides an infrastructure to allow parallel operations like insert,
copy, etc. which were earlier not possible as parallel group members won't
conflict for relation extension lock.
Author: Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila and Kuntal Ghosh
Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com
---
src/backend/storage/lmgr/deadlock.c | 8 ++++++++
src/backend/storage/lmgr/lock.c | 10 ++++++++++
src/backend/storage/lmgr/proc.c | 8 +++++++-
src/include/storage/lock.h | 1 +
4 files changed, 26 insertions(+), 1 deletion(-)
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index f8c5df0..59060b6 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -555,6 +555,14 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
int numLockModes,
lm;
+ /*
+ * The relation extension lock can never participate in actual deadlock
+ * cycle. See Assert in LockAcquireExtended. So, there is no advantage
+ * in checking wait edges from it.
+ */
+ if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND)
+ return false;
+
lockMethodTable = GetLocksMethodTable(lock);
numLockModes = lockMethodTable->numLockModes;
conflictMask = lockMethodTable->conflictTab[checkProc->waitLockMode];
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 57db6cc..5403261 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1467,6 +1467,16 @@ LockCheckConflicts(LockMethod lockMethodTable,
}
/*
+ * The relation extension lock conflict even between the group members.
+ */
+ if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND)
+ {
+ PROCLOCK_PRINT("LockCheckConflicts: conflicting (group)",
+ proclock);
+ return true;
+ }
+
+ /*
* Locks held in conflicting modes by members of our own lock group are
* not real conflicts; we can subtract those out and see if we still have
* a conflict. This is O(N) in the number of processes holding or
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index eb321f7..fa07ddf 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1077,7 +1077,13 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
/*
* If group locking is in use, locks held by members of my locking group
- * need to be included in myHeldLocks.
+ * need to be included in myHeldLocks. This is not required for relation
+ * extension lock which conflict among group members. However, including
+ * them in myHeldLocks will give group members the priority to get those
+ * locks as compared to other backends which are also trying to acquire
+ * those locks. OTOH, we can avoid giving priority to group members for
+ * that kind of locks, but there doesn't appear to be a clear advantage of
+ * the same.
*/
if (leader != NULL)
{
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index fc0a712..a89e54d 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -301,6 +301,7 @@ typedef struct LOCK
} LOCK;
#define LOCK_LOCKMETHOD(lock) ((LOCKMETHODID) (lock).tag.locktag_lockmethodid)
+#define LOCK_LOCKTAG(lock) ((LockTagType) (lock).tag.locktag_type)
/*
--
1.8.3.1
v9-0001-Assert-that-we-don-t-acquire-a-heavyweight-lock-o.patch
From b7623c3e76898f94431236786438364a440268e3 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 14 Mar 2020 12:43:35 +0530
Subject: [PATCH v9 1/4] Assert that we don't acquire a heavyweight lock on
another object after relation extension lock.
The only exception to the rule is that we can try to acquire the same
relation extension lock more than once. This is allowed as we are not
creating any new lock for this case. This restriction implies that the
relation extension lock won't ever participate in the deadlock cycle
because we can never wait for any other heavyweight lock after acquiring
this lock.
Such a restriction is okay for relation extension locks as unlike other
heavyweight locks these are not held till the transaction end. These are
taken for a short duration to extend a particular relation and then
released.
Author: Dilip Kumar, with few changes by Amit Kapila
Reviewed-by: Amit Kapila and Kuntal Ghosh
Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com
---
src/backend/storage/lmgr/lock.c | 45 +++++++++++++++++++++++++++++++++++++++++
src/include/storage/lock.h | 1 +
2 files changed, 46 insertions(+)
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 1df7b8e..076f436 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -170,6 +170,21 @@ typedef struct TwoPhaseLockRecord
*/
static int FastPathLocalUseCount = 0;
+/*
+ * Flag to indicate if the relation extension lock is held by this backend.
+ * This flag is used to ensure that while holding the relation extension lock
+ * we don't try to acquire a heavyweight lock on any other object. This
+ * restriction implies that the relation extension lock won't ever participate
+ * in the deadlock cycle because we can never wait for any other heavyweight
+ * lock after acquiring this lock.
+ *
+ * Such a restriction is okay for relation extension locks as unlike other
+ * heavyweight locks these are not held till the transaction end. These are
+ * taken for a short duration to extend a particular relation and then
+ * released.
+ */
+static bool IsRelationExtensionLockHeld = false;
+
/* Macros for manipulating proc->fpLockBits */
#define FAST_PATH_BITS_PER_SLOT 3
#define FAST_PATH_LOCKNUMBER_OFFSET 1
@@ -841,6 +856,13 @@ LockAcquireExtended(const LOCKTAG *locktag,
}
/*
+ * We don't acquire any other heavyweight lock while holding the relation
+ * extension lock. We do allow to acquire the same relation extension
+ * lock more than once but that case won't reach here.
+ */
+ Assert(!IsRelationExtensionLockHeld);
+
+ /*
* Prepare to emit a WAL record if acquisition of this lock needs to be
* replayed in a standby server.
*
@@ -1288,6 +1310,21 @@ SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
}
/*
+ * Check and set/reset the flag that we hold the relation extension lock.
+ *
+ * It is callers responsibility that this function is called after
+ * acquiring/releasing the relation extension lock.
+ *
+ * Pass acquired = true if lock is acquired, false otherwise.
+ */
+static inline void
+CheckAndSetLockHeld(LOCALLOCK *locallock, bool acquired)
+{
+ if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_RELATION_EXTEND)
+ IsRelationExtensionLockHeld = acquired;
+}
+
+/*
* Subroutine to free a locallock entry
*/
static void
@@ -1322,6 +1359,11 @@ RemoveLocalLock(LOCALLOCK *locallock)
(void *) &(locallock->tag),
HASH_REMOVE, NULL))
elog(WARNING, "locallock table corrupted");
+
+ /*
+ * Indicate that the lock is released for a particular type of locks.
+ */
+ CheckAndSetLockHeld(locallock, false);
}
/*
@@ -1618,6 +1660,9 @@ GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner)
locallock->numLockOwners++;
if (owner != NULL)
ResourceOwnerRememberLock(owner, locallock);
+
+ /* Indicate that the lock is acquired for a certain type of locks. */
+ CheckAndSetLockHeld(locallock, true);
}
/*
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index bb8e4e6..fc0a712 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -419,6 +419,7 @@ typedef struct LOCALLOCK
} LOCALLOCK;
#define LOCALLOCK_LOCKMETHOD(llock) ((llock).tag.lock.locktag_lockmethodid)
+#define LOCALLOCK_LOCKTAG(llock) ((LockTagType) (llock).tag.lock.locktag_type)
/*
--
1.8.3.1
v9-0004-Allow-page-lock-to-conflict-among-parallel-group-.patch
From 841f044d78bab6cf61527f0f2832026f70e4d6df Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 14 Mar 2020 19:16:09 +0530
Subject: [PATCH v9 4/4] Allow page lock to conflict among parallel group
members.
This is required as it is no safer for two related processes to perform
clean up in gin indexes at a time than for unrelated processes to do the
same. After acquiring page locks, we can acquire relation extension lock
but reverse never happens which means these will also not participate in
deadlock. So, avoid checking wait edges from this lock.
Currently, the parallel mode is strictly read-only, but after this patch
we have the infrastructure to allow parallel inserts and parallel copy.
Author: Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila and Kuntal Ghosh
Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com
---
src/backend/storage/lmgr/README | 60 ++++++++++++++++++++-----------------
src/backend/storage/lmgr/deadlock.c | 9 +++---
src/backend/storage/lmgr/lock.c | 6 ++--
src/backend/storage/lmgr/proc.c | 12 ++++----
4 files changed, 48 insertions(+), 39 deletions(-)
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 56b0a12..13eb1cc 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -597,21 +597,22 @@ deadlock detection algorithm very much, but it makes the bookkeeping more
complicated.
We choose to regard locks held by processes in the same parallel group as
-non-conflicting. This means that two processes in a parallel group can hold a
-self-exclusive lock on the same relation at the same time, or one process can
-acquire an AccessShareLock while the other already holds AccessExclusiveLock.
-This might seem dangerous and could be in some cases (more on that below), but
-if we didn't do this then parallel query would be extremely prone to
-self-deadlock. For example, a parallel query against a relation on which the
-leader already had AccessExclusiveLock would hang, because the workers would
-try to lock the same relation and be blocked by the leader; yet the leader
-can't finish until it receives completion indications from all workers. An
-undetected deadlock results. This is far from the only scenario where such a
-problem happens. The same thing will occur if the leader holds only
-AccessShareLock, the worker seeks AccessShareLock, but between the time the
-leader attempts to acquire the lock and the time the worker attempts to
-acquire it, some other process queues up waiting for an AccessExclusiveLock.
-In this case, too, an indefinite hang results.
+non-conflicting with the exception of relation extension and page locks. This
+means that two processes in a parallel group can hold a self-exclusive lock on
+the same relation at the same time, or one process can acquire an AccessShareLock
+while the other already holds AccessExclusiveLock. This might seem dangerous and
+could be in some cases (more on that below), but if we didn't do this then
+parallel query would be extremely prone to self-deadlock. For example, a
+parallel query against a relation on which the leader already had
+AccessExclusiveLock would hang, because the workers would try to lock the same
+relation and be blocked by the leader; yet the leader can't finish until it
+receives completion indications from all workers. An undetected deadlock
+results. This is far from the only scenario where such a problem happens. The
+same thing will occur if the leader holds only AccessShareLock, the worker
+seeks AccessShareLock, but between the time the leader attempts to acquire the
+lock and the time the worker attempts to acquire it, some other process queues
+up waiting for an AccessExclusiveLock. In this case, too, an indefinite hang
+results.
It might seem that we could predict which locks the workers will attempt to
acquire and ensure before going parallel that those locks would be acquired
@@ -637,18 +638,23 @@ the other is safe enough. Problems would occur if the leader initiated
parallelism from a point in the code at which it had some backend-private
state that made table access from another process unsafe, for example after
calling SetReindexProcessing and before calling ResetReindexProcessing,
-catastrophe could ensue, because the worker won't have that state. Similarly,
-problems could occur with certain kinds of non-relation locks, such as
-relation extension locks. It's no safer for two related processes to extend
-the same relation at the time than for unrelated processes to do the same.
-However, since parallel mode is strictly read-only at present, neither this
-nor most of the similar cases can arise at present. To allow parallel writes,
-we'll either need to (1) further enhance the deadlock detector to handle those
-types of locks in a different way than other types; or (2) have parallel
-workers use some other mutual exclusion method for such cases; or (3) revise
-those cases so that they no longer use heavyweight locking in the first place
-(which is not a crazy idea, given that such lock acquisitions are not expected
-to deadlock and that heavyweight lock acquisition is fairly slow anyway).
+catastrophe could ensue, because the worker won't have that state.
+
+To allow parallel inserts and parallel copy, we have ensured that relation
+extension and page locks don't participate in group locking which means such
+locks can conflict among the same group members. This is required as it is no
+safer for two related processes to extend the same relation or perform clean up
+in gin indexes at a time than for unrelated processes to do the same. We don't
+acquire a heavyweight lock on any other object after relation extension lock
+which means such a lock can never participate in the deadlock cycle. After
+acquiring page locks, we can acquire relation extension lock but reverse never
+happens, so those will also not participate in deadlock. To allow for other
+parallel writes like parallel update or parallel delete, we'll either need to
+(1) further enhance the deadlock detector to handle those tuple locks in a
+different way than other types; or (2) have parallel workers use some other
+mutual exclusion method for such cases. Currently, the parallel mode is
+strictly read-only, but now we have the infrastructure to allow parallel
+inserts and parallel copy.
Group locking adds three new members to each PGPROC: lockGroupLeader,
lockGroupMembers, and lockGroupLink. A PGPROC's lockGroupLeader is NULL for
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index 59060b6..beedc79 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -556,11 +556,12 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
lm;
/*
- * The relation extension lock can never participate in actual deadlock
- * cycle. See Assert in LockAcquireExtended. So, there is no advantage
- * in checking wait edges from it.
+ * The relation extension or page lock can never participate in actual
+ * deadlock cycle. See Asserts in LockAcquireExtended. So, there is no
+ * advantage in checking wait edges from them.
*/
- if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND)
+ if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND ||
+ (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
return false;
lockMethodTable = GetLocksMethodTable(lock);
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 5403261..10fd15e 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1467,9 +1467,11 @@ LockCheckConflicts(LockMethod lockMethodTable,
}
/*
- * The relation extension lock conflict even between the group members.
+ * The relation extension or page lock conflict even between the group
+ * members.
*/
- if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND)
+ if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND ||
+ (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
{
PROCLOCK_PRINT("LockCheckConflicts: conflicting (group)",
proclock);
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index fa07ddf..9938cdd 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1078,12 +1078,12 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
/*
* If group locking is in use, locks held by members of my locking group
* need to be included in myHeldLocks. This is not required for relation
- * extension lock which conflict among group members. However, including
- * them in myHeldLocks will give group members the priority to get those
- * locks as compared to other backends which are also trying to acquire
- * those locks. OTOH, we can avoid giving priority to group members for
- * that kind of locks, but there doesn't appear to be a clear advantage of
- * the same.
+ * extension or page locks which conflict among group members. However,
+ * including them in myHeldLocks will give group members the priority to
+ * get those locks as compared to other backends which are also trying to
+ * acquire those locks. OTOH, we can avoid giving priority to group
+ * members for that kind of locks, but there doesn't appear to be a clear
+ * advantage of the same.
*/
if (leader != NULL)
{
--
1.8.3.1
On Sun, Mar 15, 2020 at 9:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Sun, Mar 15, 2020 at 5:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
1. Group members wait for page locks. If you test that the leader
acquires the page lock and then a member also tries to acquire the same
lock on the same index, it wouldn't block before the patch, but after
the patch, the member should wait for the leader to release the lock.

Okay, I will test this part.
2. Try to hit Assert in LockAcquireExtended (a) by trying to
re-acquire the page lock via the debugger,

I am not sure that is possible, because if we are already holding
the page lock and we try to take the same page lock again, the lock
will be granted without ever reaching that code path. However, I agree
that this is not intended; rather, it is a side effect of allowing a
relation extension lock to be taken while holding the same relation
extension lock. So basically, the situation now is that if the lock is
granted directly because we already hold the same lock, execution
never reaches the assert code. IMHO, we don't need to add extra code
to make it behave differently. Please let me know your opinion on this.
I also don't think there is any reason to add code to prevent that.
Actually, what I wanted to test was to somehow hit the Assert for the
cases where it will actually hit if someone tomorrow tries to acquire
any other type of lock. Can we mimic such a situation by hacking the
code (say, by trying to acquire some other type of heavyweight lock),
or in some other way, so as to hit the newly added Assert?
(b) try to acquire the
relation extension lock after page lock and it should be allowed
(after acquiring page lock, we take relation extension lock in
following code path:
ginInsertCleanup->ginEntryInsert->ginFindLeafPage->ginPlaceToPage->GinNewBuffer).

ok
Thanks.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, 16 Mar 2020 at 00:54, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Sun, Mar 15, 2020 at 6:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sun, Mar 15, 2020 at 4:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
I have modified 0001 and 0002 slightly. Basically, instead of the two
functions CheckAndSetLockHeld and CheckAndReSetLockHeld, I have created
a single function:
+CheckAndSetLockHeld(LOCALLOCK *locallock, bool value)
Can we rename the parameter as lock_held, acquired, or something like
that so that it indicates what it intends to do, and probably add a
comment for that variable atop the function?

Done
I've looked at the patches and ISTM these work as expected.
IsRelationExtensionLockHeld and IsPageLockHeld are used only when
assertions are enabled. So how about making CheckAndSetLockHeld work
only when USE_ASSERT_CHECKING is defined, to avoid the overhead?
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Mar 16, 2020 at 8:57 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
On Mon, 16 Mar 2020 at 00:54, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Sun, Mar 15, 2020 at 6:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sun, Mar 15, 2020 at 4:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
I have modified 0001 and 0002 slightly. Basically, instead of the two
functions CheckAndSetLockHeld and CheckAndReSetLockHeld, I have created
a single function:
+CheckAndSetLockHeld(LOCALLOCK *locallock, bool value)
Can we rename the parameter as lock_held, acquired, or something like
that so that it indicates what it intends to do, and probably add a
comment for that variable atop the function?

Done
I've looked at the patches and ISTM these work as expected.
Thanks for verifying.
IsRelationExtensionLockHeld and IsPageLockHeld are used only when
assertion is enabled. So how about making CheckAndSetLockHeld work
only if USE_ASSERT_CHECKING to avoid overheads?
That makes sense to me, so I have updated the patch.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachments:
v10-0001-Assert-that-we-don-t-acquire-a-heavyweight-lock-.patch
From 649f3bc8adfe2ba4ad8723740d457c445ef7cb9a Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 14 Mar 2020 12:43:35 +0530
Subject: [PATCH v10 1/4] Assert that we don't acquire a heavyweight lock on
another object after relation extension lock.
The only exception to the rule is that we can try to acquire the same
relation extension lock more than once. This is allowed as we are not
creating any new lock for this case. This restriction implies that the
relation extension lock won't ever participate in the deadlock cycle
because we can never wait for any other heavyweight lock after acquiring
this lock.
Such a restriction is okay for relation extension locks as unlike other
heavyweight locks these are not held till the transaction end. These are
taken for a short duration to extend a particular relation and then
released.
Author: Dilip Kumar, with few changes by Amit Kapila
Reviewed-by: Amit Kapila and Kuntal Ghosh
Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com
---
src/backend/storage/lmgr/lock.c | 53 +++++++++++++++++++++++++++++++++++++++++
src/include/storage/lock.h | 1 +
2 files changed, 54 insertions(+)
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 1df7b8e..34a7ed9 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -170,6 +170,23 @@ typedef struct TwoPhaseLockRecord
*/
static int FastPathLocalUseCount = 0;
+/*
+ * Flag to indicate if the relation extension lock is held by this backend.
+ * This flag is used to ensure that while holding the relation extension lock
+ * we don't try to acquire a heavyweight lock on any other object. This
+ * restriction implies that the relation extension lock won't ever participate
+ * in the deadlock cycle because we can never wait for any other heavyweight
+ * lock after acquiring this lock.
+ *
+ * Such a restriction is okay for relation extension locks as unlike other
+ * heavyweight locks these are not held till the transaction end. These are
+ * taken for a short duration to extend a particular relation and then
+ * released.
+ */
+#ifdef USE_ASSERT_CHECKING
+static bool IsRelationExtensionLockHeld = false;
+#endif
+
/* Macros for manipulating proc->fpLockBits */
#define FAST_PATH_BITS_PER_SLOT 3
#define FAST_PATH_LOCKNUMBER_OFFSET 1
@@ -841,6 +858,13 @@ LockAcquireExtended(const LOCKTAG *locktag,
}
/*
+ * We don't acquire any other heavyweight lock while holding the relation
+ * extension lock. We do allow to acquire the same relation extension
+ * lock more than once but that case won't reach here.
+ */
+ Assert(!IsRelationExtensionLockHeld);
+
+ /*
* Prepare to emit a WAL record if acquisition of this lock needs to be
* replayed in a standby server.
*
@@ -1288,6 +1312,23 @@ SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
}
/*
+ * Check and set/reset the flag that we hold the relation extension lock.
+ *
+ * It is callers responsibility that this function is called after
+ * acquiring/releasing the relation extension lock.
+ *
+ * Pass acquired = true if lock is acquired, false otherwise.
+ */
+#ifdef USE_ASSERT_CHECKING
+static inline void
+CheckAndSetLockHeld(LOCALLOCK *locallock, bool acquired)
+{
+ if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_RELATION_EXTEND)
+ IsRelationExtensionLockHeld = acquired;
+}
+#endif
+
+/*
* Subroutine to free a locallock entry
*/
static void
@@ -1322,6 +1363,13 @@ RemoveLocalLock(LOCALLOCK *locallock)
(void *) &(locallock->tag),
HASH_REMOVE, NULL))
elog(WARNING, "locallock table corrupted");
+
+ /*
+ * Indicate that the lock is released for a particular type of locks.
+ */
+#ifdef USE_ASSERT_CHECKING
+ CheckAndSetLockHeld(locallock, false);
+#endif
}
/*
@@ -1618,6 +1666,11 @@ GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner)
locallock->numLockOwners++;
if (owner != NULL)
ResourceOwnerRememberLock(owner, locallock);
+
+ /* Indicate that the lock is acquired for a certain type of locks. */
+#ifdef USE_ASSERT_CHECKING
+ CheckAndSetLockHeld(locallock, true);
+#endif
}
/*
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index bb8e4e6..fc0a712 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -419,6 +419,7 @@ typedef struct LOCALLOCK
} LOCALLOCK;
#define LOCALLOCK_LOCKMETHOD(llock) ((llock).tag.lock.locktag_lockmethodid)
+#define LOCALLOCK_LOCKTAG(llock) ((LockTagType) (llock).tag.lock.locktag_type)
/*
--
1.8.3.1
v10-0003-Allow-relation-extension-lock-to-conflict-among-.patch
From 2163a6c79e233f285674a8832f420078922a444d Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 14 Mar 2020 17:48:17 +0530
Subject: [PATCH v10 3/4] Allow relation extension lock to conflict among
parallel group members.
This is required as it is no safer for two related processes to extend the
same relation at a time than for unrelated processes to do the same. We
don't acquire a heavyweight lock on any other object after relation
extension lock which means such a lock can never participate in the
deadlock cycle. So, avoid checking wait edges from this lock.
This provides an infrastructure to allow parallel operations like insert,
copy, etc. which were earlier not possible as parallel group members won't
conflict for relation extension lock.
Author: Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila and Kuntal Ghosh
Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com
---
src/backend/storage/lmgr/deadlock.c | 8 ++++++++
src/backend/storage/lmgr/lock.c | 10 ++++++++++
src/backend/storage/lmgr/proc.c | 8 +++++++-
src/include/storage/lock.h | 1 +
4 files changed, 26 insertions(+), 1 deletion(-)
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index f8c5df0..59060b6 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -555,6 +555,14 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
int numLockModes,
lm;
+ /*
+ * The relation extension lock can never participate in actual deadlock
+ * cycle. See Assert in LockAcquireExtended. So, there is no advantage
+ * in checking wait edges from it.
+ */
+ if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND)
+ return false;
+
lockMethodTable = GetLocksMethodTable(lock);
numLockModes = lockMethodTable->numLockModes;
conflictMask = lockMethodTable->conflictTab[checkProc->waitLockMode];
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 6a4bda57..658bcf1 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1475,6 +1475,16 @@ LockCheckConflicts(LockMethod lockMethodTable,
}
/*
+ * The relation extension lock conflict even between the group members.
+ */
+ if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND)
+ {
+ PROCLOCK_PRINT("LockCheckConflicts: conflicting (group)",
+ proclock);
+ return true;
+ }
+
+ /*
* Locks held in conflicting modes by members of our own lock group are
* not real conflicts; we can subtract those out and see if we still have
* a conflict. This is O(N) in the number of processes holding or
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index eb321f7..fa07ddf 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1077,7 +1077,13 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
/*
* If group locking is in use, locks held by members of my locking group
- * need to be included in myHeldLocks.
+ * need to be included in myHeldLocks. This is not required for relation
+ * extension lock which conflict among group members. However, including
+ * them in myHeldLocks will give group members the priority to get those
+ * locks as compared to other backends which are also trying to acquire
+ * those locks. OTOH, we can avoid giving priority to group members for
+ * that kind of locks, but there doesn't appear to be a clear advantage of
+ * the same.
*/
if (leader != NULL)
{
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index fc0a712..a89e54d 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -301,6 +301,7 @@ typedef struct LOCK
} LOCK;
#define LOCK_LOCKMETHOD(lock) ((LOCKMETHODID) (lock).tag.locktag_lockmethodid)
+#define LOCK_LOCKTAG(lock) ((LockTagType) (lock).tag.locktag_type)
/*
--
1.8.3.1
Attachment: v10-0004-Allow-page-lock-to-conflict-among-parallel-group.patch
From 296d94391195b61188168aa0133b34c186c54dcc Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 14 Mar 2020 19:16:09 +0530
Subject: [PATCH v10 4/4] Allow page lock to conflict among parallel group
members.
This is required as it is no safer for two related processes to perform
clean up in gin indexes at a time than for unrelated processes to do the
same. After acquiring page locks, we can acquire relation extension lock
but reverse never happens which means these will also not participate in
deadlock. So, avoid checking wait edges from this lock.
Currently, the parallel mode is strictly read-only, but after this patch
we have the infrastructure to allow parallel inserts and parallel copy.
Author: Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila and Kuntal Ghosh
Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com
---
src/backend/storage/lmgr/README | 60 ++++++++++++++++++++-----------------
src/backend/storage/lmgr/deadlock.c | 9 +++---
src/backend/storage/lmgr/lock.c | 6 ++--
src/backend/storage/lmgr/proc.c | 12 ++++----
4 files changed, 48 insertions(+), 39 deletions(-)
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 56b0a12..13eb1cc 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -597,21 +597,22 @@ deadlock detection algorithm very much, but it makes the bookkeeping more
complicated.
We choose to regard locks held by processes in the same parallel group as
-non-conflicting. This means that two processes in a parallel group can hold a
-self-exclusive lock on the same relation at the same time, or one process can
-acquire an AccessShareLock while the other already holds AccessExclusiveLock.
-This might seem dangerous and could be in some cases (more on that below), but
-if we didn't do this then parallel query would be extremely prone to
-self-deadlock. For example, a parallel query against a relation on which the
-leader already had AccessExclusiveLock would hang, because the workers would
-try to lock the same relation and be blocked by the leader; yet the leader
-can't finish until it receives completion indications from all workers. An
-undetected deadlock results. This is far from the only scenario where such a
-problem happens. The same thing will occur if the leader holds only
-AccessShareLock, the worker seeks AccessShareLock, but between the time the
-leader attempts to acquire the lock and the time the worker attempts to
-acquire it, some other process queues up waiting for an AccessExclusiveLock.
-In this case, too, an indefinite hang results.
+non-conflicting with the exception of relation extension and page locks. This
+means that two processes in a parallel group can hold a self-exclusive lock on
+the same relation at the same time, or one process can acquire an AccessShareLock
+while the other already holds AccessExclusiveLock. This might seem dangerous and
+could be in some cases (more on that below), but if we didn't do this then
+parallel query would be extremely prone to self-deadlock. For example, a
+parallel query against a relation on which the leader already had
+AccessExclusiveLock would hang, because the workers would try to lock the same
+relation and be blocked by the leader; yet the leader can't finish until it
+receives completion indications from all workers. An undetected deadlock
+results. This is far from the only scenario where such a problem happens. The
+same thing will occur if the leader holds only AccessShareLock, the worker
+seeks AccessShareLock, but between the time the leader attempts to acquire the
+lock and the time the worker attempts to acquire it, some other process queues
+up waiting for an AccessExclusiveLock. In this case, too, an indefinite hang
+results.
It might seem that we could predict which locks the workers will attempt to
acquire and ensure before going parallel that those locks would be acquired
@@ -637,18 +638,23 @@ the other is safe enough. Problems would occur if the leader initiated
parallelism from a point in the code at which it had some backend-private
state that made table access from another process unsafe, for example after
calling SetReindexProcessing and before calling ResetReindexProcessing,
-catastrophe could ensue, because the worker won't have that state. Similarly,
-problems could occur with certain kinds of non-relation locks, such as
-relation extension locks. It's no safer for two related processes to extend
-the same relation at the time than for unrelated processes to do the same.
-However, since parallel mode is strictly read-only at present, neither this
-nor most of the similar cases can arise at present. To allow parallel writes,
-we'll either need to (1) further enhance the deadlock detector to handle those
-types of locks in a different way than other types; or (2) have parallel
-workers use some other mutual exclusion method for such cases; or (3) revise
-those cases so that they no longer use heavyweight locking in the first place
-(which is not a crazy idea, given that such lock acquisitions are not expected
-to deadlock and that heavyweight lock acquisition is fairly slow anyway).
+catastrophe could ensue, because the worker won't have that state.
+
+To allow parallel inserts and parallel copy, we have ensured that relation
+extension and page locks don't participate in group locking which means such
+locks can conflict among the same group members. This is required as it is no
+safer for two related processes to extend the same relation or perform clean up
+in gin indexes at a time than for unrelated processes to do the same. We don't
+acquire a heavyweight lock on any other object after relation extension lock
+which means such a lock can never participate in the deadlock cycle. After
+acquiring page locks, we can acquire relation extension lock but reverse never
+happens, so those will also not participate in deadlock. To allow for other
+parallel writes like parallel update or parallel delete, we'll either need to
+(1) further enhance the deadlock detector to handle those tuple locks in a
+different way than other types; or (2) have parallel workers use some other
+mutual exclusion method for such cases. Currently, the parallel mode is
+strictly read-only, but now we have the infrastructure to allow parallel
+inserts and parallel copy.
Group locking adds three new members to each PGPROC: lockGroupLeader,
lockGroupMembers, and lockGroupLink. A PGPROC's lockGroupLeader is NULL for
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index 59060b6..beedc79 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -556,11 +556,12 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
lm;
/*
- * The relation extension lock can never participate in actual deadlock
- * cycle. See Assert in LockAcquireExtended. So, there is no advantage
- * in checking wait edges from it.
+ * The relation extension or page lock can never participate in actual
+ * deadlock cycle. See Asserts in LockAcquireExtended. So, there is no
+ * advantage in checking wait edges from them.
*/
- if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND)
+ if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND ||
+ (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
return false;
lockMethodTable = GetLocksMethodTable(lock);
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 658bcf1..d423b0c 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1475,9 +1475,11 @@ LockCheckConflicts(LockMethod lockMethodTable,
}
/*
- * The relation extension lock conflict even between the group members.
+ * The relation extension or page lock conflict even between the group
+ * members.
*/
- if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND)
+ if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND ||
+ (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
{
PROCLOCK_PRINT("LockCheckConflicts: conflicting (group)",
proclock);
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index fa07ddf..9938cdd 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1078,12 +1078,12 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
/*
* If group locking is in use, locks held by members of my locking group
* need to be included in myHeldLocks. This is not required for relation
- * extension lock which conflict among group members. However, including
- * them in myHeldLocks will give group members the priority to get those
- * locks as compared to other backends which are also trying to acquire
- * those locks. OTOH, we can avoid giving priority to group members for
- * that kind of locks, but there doesn't appear to be a clear advantage of
- * the same.
+ * extension or page locks which conflict among group members. However,
+ * including them in myHeldLocks will give group members the priority to
+ * get those locks as compared to other backends which are also trying to
+ * acquire those locks. OTOH, we can avoid giving priority to group
+ * members for that kind of locks, but there doesn't appear to be a clear
+ * advantage of the same.
*/
if (leader != NULL)
{
--
1.8.3.1
Attachment: v10-0002-Add-assert-to-ensure-that-page-locks-don-t-parti.patch
From e5986a551ec0a8a8289ca1273c383a8ec4834e23 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 16 Mar 2020 09:22:58 +0530
Subject: [PATCH v10 2/4] Add assert to ensure that page locks don't
participate in deadlock cycle.
Assert that we don't acquire any other heavyweight lock while holding the
page lock except for relation extension. However, these locks are never
taken in reverse order which implies that page locks will never
participate in the deadlock cycle.
Similar to relation extension, page locks are also held for a short
duration, so imposing such a restriction won't hurt.
Author: Dilip Kumar, with few changes by Amit Kapila
Reviewed-by: Amit Kapila and Kuntal Ghosh
Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com
---
src/backend/storage/lmgr/lock.c | 27 +++++++++++++++++++++++++--
1 file changed, 25 insertions(+), 2 deletions(-)
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 34a7ed9..6a4bda57 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -187,6 +187,20 @@ static int FastPathLocalUseCount = 0;
static bool IsRelationExtensionLockHeld = false;
#endif
+/*
+ * Flag to indicate if the page lock is held by this backend. We don't
+ * acquire any other heavyweight lock while holding the page lock except for
+ * relation extension. However, these locks are never taken in reverse order
+ * which implies that page locks will also never participate in the deadlock
+ * cycle.
+ *
+ * Similar to relation extension, page locks are also held for a short
+ * duration, so imposing such a restriction won't hurt.
+ */
+#ifdef USE_ASSERT_CHECKING
+static bool IsPageLockHeld = false;
+#endif
+
/* Macros for manipulating proc->fpLockBits */
#define FAST_PATH_BITS_PER_SLOT 3
#define FAST_PATH_LOCKNUMBER_OFFSET 1
@@ -865,6 +879,13 @@ LockAcquireExtended(const LOCKTAG *locktag,
Assert(!IsRelationExtensionLockHeld);
/*
+ * We don't acquire any other heavyweight lock while holding the page lock
+ * except for relation extension lock.
+ */
+ Assert(!IsPageLockHeld ||
+ (locktag->locktag_type == LOCKTAG_RELATION_EXTEND));
+
+ /*
* Prepare to emit a WAL record if acquisition of this lock needs to be
* replayed in a standby server.
*
@@ -1312,10 +1333,10 @@ SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
}
/*
- * Check and set/reset the flag that we hold the relation extension lock.
+ * Check and set/reset the flag that we hold the relation extension/page lock.
*
* It is callers responsibility that this function is called after
- * acquiring/releasing the relation extension lock.
+ * acquiring/releasing the relation extension/page lock.
*
* Pass acquired = true if lock is acquired, false otherwise.
*/
@@ -1325,6 +1346,8 @@ CheckAndSetLockHeld(LOCALLOCK *locallock, bool acquired)
{
if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_RELATION_EXTEND)
IsRelationExtensionLockHeld = acquired;
+ else if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_PAGE)
+ IsPageLockHeld = acquired;
}
#endif
--
1.8.3.1
On Mon, Mar 16, 2020 at 8:15 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sun, Mar 15, 2020 at 9:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Sun, Mar 15, 2020 at 5:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
1. Group members wait for page locks. If you test that the leader
acquires the page lock and then member also tries to acquire the same
lock on the same index, it wouldn't block before the patch, but after
the patch, the member should wait for the leader to release the lock.
Okay, I will test this part.
2. Try to hit Assert in LockAcquireExtended (a) by trying to
re-acquire the page lock via the debugger,
I am not sure whether that is true, because if we are holding
the page lock and we try to take the same page lock, the lock will be
granted without reaching that code path. However, I agree that this is
not intended; instead, it is a side effect of allowing the relation
extension lock while holding the same relation extension lock. So
basically, now the situation is that if the lock is directly granted
because we are holding the same lock then it will not go to the assert
code. IMHO, we don't need to add extra code to make it behave
differently. Please let me know your opinion on this.
I also don't think there is any reason to add code to prevent that.
Actually, what I wanted to test was to somehow hit the Assert for the
cases where it would actually fire if someone later tries to acquire
any other type of lock. Can we mimic such a situation by hacking the code
(say, by trying to acquire some other type of heavyweight lock), or in
some other way, to hit the newly added Assert?
I have hacked the code to acquire another heavyweight lock, and the
assert is hit.
(b) try to acquire the
relation extension lock after page lock and it should be allowed
(after acquiring page lock, we take relation extension lock in
following code path:
ginInsertCleanup->ginEntryInsert->ginFindLeafPage->ginPlaceToPage->GinNewBuffer).
I have tested this part and it works as expected, i.e., the assert is not hit.
--test case
create table gin_test_tbl(i int4[]) with (autovacuum_enabled = off);
create index gin_test_idx on gin_test_tbl using gin (i);
insert into gin_test_tbl select array[1, 2, g] from generate_series(1, 20000) g;
select gin_clean_pending_list('gin_test_idx');
BTW, this test is already covered by the existing gin.sql file so we
don't need to add any new test.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Mon, Mar 16, 2020 at 9:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Mon, Mar 16, 2020 at 8:57 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
IsRelationExtensionLockHeld and IsPageLockHeld are used only when
assertion checking is enabled. So how about making CheckAndSetLockHeld
work only if USE_ASSERT_CHECKING, to avoid the overhead?
That makes sense to me, so I updated the patch.
+1
In v10-0001-Assert-that-we-don-t-acquire-a-heavyweight-lock-.patch,
+ * Indicate that the lock is released for a particular type of locks.
s/lock is/locks are
+ /* Indicate that the lock is acquired for a certain type of locks. */
s/lock is/locks are
In v10-0002-*.patch,
+ * Flag to indicate if the page lock is held by this backend. We don't
+ * acquire any other heavyweight lock while holding the page lock except for
+ * relation extension. However, these locks are never taken in reverse order
+ * which implies that page locks will also never participate in the deadlock
+ * cycle.
s/while holding the page lock except for relation extension/while
holding the page lock except for relation extension and page lock
+ * We don't acquire any other heavyweight lock while holding the page lock
+ * except for relation extension lock.
Same as above
Other than that, the patches look good to me. I've also done some
testing after applying the Test-group-deadlock patch provided by Amit
earlier in the thread. It works as expected.
--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
On Mon, Mar 16, 2020 at 11:56 AM Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:
On Mon, Mar 16, 2020 at 9:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Mon, Mar 16, 2020 at 8:57 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
IsRelationExtensionLockHeld and IsPageLockHeld are used only when
assertion checking is enabled. So how about making CheckAndSetLockHeld
work only if USE_ASSERT_CHECKING, to avoid the overhead?
That makes sense to me, so I updated the patch.
+1
In v10-0001-Assert-that-we-don-t-acquire-a-heavyweight-lock-.patch,
+ * Indicate that the lock is released for a particular type of locks.
s/lock is/locks are
Done
+ /* Indicate that the lock is acquired for a certain type of locks. */
s/lock is/locks are
Done
In v10-0002-*.patch,
+ * Flag to indicate if the page lock is held by this backend. We don't
+ * acquire any other heavyweight lock while holding the page lock except for
+ * relation extension. However, these locks are never taken in reverse order
+ * which implies that page locks will also never participate in the deadlock
+ * cycle.
s/while holding the page lock except for relation extension/while
holding the page lock except for relation extension and page lock
Done
+ * We don't acquire any other heavyweight lock while holding the page lock
+ * except for relation extension lock.
Same as above
Done
Other than that, the patches look good to me. I've also done some
testing after applying the Test-group-deadlock patch provided by Amit
earlier in the thread. It works as expected.
Thanks for testing.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachments:
Attachment: v11-0001-Assert-that-we-don-t-acquire-a-heavyweight-lock-.patch
From e6bcd61945c95f1e05743c97d693fe7655af409b Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 14 Mar 2020 12:43:35 +0530
Subject: [PATCH v11 1/4] Assert that we don't acquire a heavyweight lock on
another object after relation extension lock.
The only exception to the rule is that we can try to acquire the same
relation extension lock more than once. This is allowed as we are not
creating any new lock for this case. This restriction implies that the
relation extension lock won't ever participate in the deadlock cycle
because we can never wait for any other heavyweight lock after acquiring
this lock.
Such a restriction is okay for relation extension locks as unlike other
heavyweight locks these are not held till the transaction end. These are
taken for a short duration to extend a particular relation and then
released.
Author: Dilip Kumar, with few changes by Amit Kapila
Reviewed-by: Amit Kapila and Kuntal Ghosh
Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com
---
src/backend/storage/lmgr/lock.c | 53 +++++++++++++++++++++++++++++++++++++++++
src/include/storage/lock.h | 1 +
2 files changed, 54 insertions(+)
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 1df7b8e..54b55fb 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -170,6 +170,23 @@ typedef struct TwoPhaseLockRecord
*/
static int FastPathLocalUseCount = 0;
+/*
+ * Flag to indicate if the relation extension lock is held by this backend.
+ * This flag is used to ensure that while holding the relation extension lock
+ * we don't try to acquire a heavyweight lock on any other object. This
+ * restriction implies that the relation extension lock won't ever participate
+ * in the deadlock cycle because we can never wait for any other heavyweight
+ * lock after acquiring this lock.
+ *
+ * Such a restriction is okay for relation extension locks as unlike other
+ * heavyweight locks these are not held till the transaction end. These are
+ * taken for a short duration to extend a particular relation and then
+ * released.
+ */
+#ifdef USE_ASSERT_CHECKING
+static bool IsRelationExtensionLockHeld = false;
+#endif
+
/* Macros for manipulating proc->fpLockBits */
#define FAST_PATH_BITS_PER_SLOT 3
#define FAST_PATH_LOCKNUMBER_OFFSET 1
@@ -841,6 +858,13 @@ LockAcquireExtended(const LOCKTAG *locktag,
}
/*
+ * We don't acquire any other heavyweight lock while holding the relation
+ * extension lock. We do allow to acquire the same relation extension
+ * lock more than once but that case won't reach here.
+ */
+ Assert(!IsRelationExtensionLockHeld);
+
+ /*
* Prepare to emit a WAL record if acquisition of this lock needs to be
* replayed in a standby server.
*
@@ -1288,6 +1312,23 @@ SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
}
/*
+ * Check and set/reset the flag that we hold the relation extension lock.
+ *
+ * It is callers responsibility that this function is called after
+ * acquiring/releasing the relation extension lock.
+ *
+ * Pass acquired = true if lock is acquired, false otherwise.
+ */
+#ifdef USE_ASSERT_CHECKING
+static inline void
+CheckAndSetLockHeld(LOCALLOCK *locallock, bool acquired)
+{
+ if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_RELATION_EXTEND)
+ IsRelationExtensionLockHeld = acquired;
+}
+#endif
+
+/*
* Subroutine to free a locallock entry
*/
static void
@@ -1322,6 +1363,13 @@ RemoveLocalLock(LOCALLOCK *locallock)
(void *) &(locallock->tag),
HASH_REMOVE, NULL))
elog(WARNING, "locallock table corrupted");
+
+ /*
+ * Indicate that the lock is released for certain types of locks
+ */
+#ifdef USE_ASSERT_CHECKING
+ CheckAndSetLockHeld(locallock, false);
+#endif
}
/*
@@ -1618,6 +1666,11 @@ GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner)
locallock->numLockOwners++;
if (owner != NULL)
ResourceOwnerRememberLock(owner, locallock);
+
+ /* Indicate that the lock is acquired for certain types of locks. */
+#ifdef USE_ASSERT_CHECKING
+ CheckAndSetLockHeld(locallock, true);
+#endif
}
/*
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index bb8e4e6..fc0a712 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -419,6 +419,7 @@ typedef struct LOCALLOCK
} LOCALLOCK;
#define LOCALLOCK_LOCKMETHOD(llock) ((llock).tag.lock.locktag_lockmethodid)
+#define LOCALLOCK_LOCKTAG(llock) ((LockTagType) (llock).tag.lock.locktag_type)
/*
--
1.8.3.1
Attachment: v11-0004-Allow-page-lock-to-conflict-among-parallel-group.patch
From d95a40f9451f95bd68505544793a300741c18394 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 14 Mar 2020 19:16:09 +0530
Subject: [PATCH v11 4/4] Allow page lock to conflict among parallel group
members.
This is required as it is no safer for two related processes to perform
clean up in gin indexes at a time than for unrelated processes to do the
same. After acquiring page locks, we can acquire relation extension lock
but reverse never happens which means these will also not participate in
deadlock. So, avoid checking wait edges from this lock.
Currently, the parallel mode is strictly read-only, but after this patch
we have the infrastructure to allow parallel inserts and parallel copy.
Author: Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila and Kuntal Ghosh
Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com
---
src/backend/storage/lmgr/README | 60 ++++++++++++++++++++-----------------
src/backend/storage/lmgr/deadlock.c | 9 +++---
src/backend/storage/lmgr/lock.c | 6 ++--
src/backend/storage/lmgr/proc.c | 12 ++++----
4 files changed, 48 insertions(+), 39 deletions(-)
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 56b0a12..13eb1cc 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -597,21 +597,22 @@ deadlock detection algorithm very much, but it makes the bookkeeping more
complicated.
We choose to regard locks held by processes in the same parallel group as
-non-conflicting. This means that two processes in a parallel group can hold a
-self-exclusive lock on the same relation at the same time, or one process can
-acquire an AccessShareLock while the other already holds AccessExclusiveLock.
-This might seem dangerous and could be in some cases (more on that below), but
-if we didn't do this then parallel query would be extremely prone to
-self-deadlock. For example, a parallel query against a relation on which the
-leader already had AccessExclusiveLock would hang, because the workers would
-try to lock the same relation and be blocked by the leader; yet the leader
-can't finish until it receives completion indications from all workers. An
-undetected deadlock results. This is far from the only scenario where such a
-problem happens. The same thing will occur if the leader holds only
-AccessShareLock, the worker seeks AccessShareLock, but between the time the
-leader attempts to acquire the lock and the time the worker attempts to
-acquire it, some other process queues up waiting for an AccessExclusiveLock.
-In this case, too, an indefinite hang results.
+non-conflicting with the exception of relation extension and page locks. This
+means that two processes in a parallel group can hold a self-exclusive lock on
+the same relation at the same time, or one process can acquire an AccessShareLock
+while the other already holds AccessExclusiveLock. This might seem dangerous and
+could be in some cases (more on that below), but if we didn't do this then
+parallel query would be extremely prone to self-deadlock. For example, a
+parallel query against a relation on which the leader already had
+AccessExclusiveLock would hang, because the workers would try to lock the same
+relation and be blocked by the leader; yet the leader can't finish until it
+receives completion indications from all workers. An undetected deadlock
+results. This is far from the only scenario where such a problem happens. The
+same thing will occur if the leader holds only AccessShareLock, the worker
+seeks AccessShareLock, but between the time the leader attempts to acquire the
+lock and the time the worker attempts to acquire it, some other process queues
+up waiting for an AccessExclusiveLock. In this case, too, an indefinite hang
+results.
It might seem that we could predict which locks the workers will attempt to
acquire and ensure before going parallel that those locks would be acquired
@@ -637,18 +638,23 @@ the other is safe enough. Problems would occur if the leader initiated
parallelism from a point in the code at which it had some backend-private
state that made table access from another process unsafe, for example after
calling SetReindexProcessing and before calling ResetReindexProcessing,
-catastrophe could ensue, because the worker won't have that state. Similarly,
-problems could occur with certain kinds of non-relation locks, such as
-relation extension locks. It's no safer for two related processes to extend
-the same relation at the time than for unrelated processes to do the same.
-However, since parallel mode is strictly read-only at present, neither this
-nor most of the similar cases can arise at present. To allow parallel writes,
-we'll either need to (1) further enhance the deadlock detector to handle those
-types of locks in a different way than other types; or (2) have parallel
-workers use some other mutual exclusion method for such cases; or (3) revise
-those cases so that they no longer use heavyweight locking in the first place
-(which is not a crazy idea, given that such lock acquisitions are not expected
-to deadlock and that heavyweight lock acquisition is fairly slow anyway).
+catastrophe could ensue, because the worker won't have that state.
+
+To allow parallel inserts and parallel copy, we have ensured that relation
+extension and page locks don't participate in group locking which means such
+locks can conflict among the same group members. This is required as it is no
+safer for two related processes to extend the same relation or perform clean up
+in gin indexes at a time than for unrelated processes to do the same. We don't
+acquire a heavyweight lock on any other object after relation extension lock
+which means such a lock can never participate in the deadlock cycle. After
+acquiring page locks, we can acquire relation extension lock but reverse never
+happens, so those will also not participate in deadlock. To allow for other
+parallel writes like parallel update or parallel delete, we'll either need to
+(1) further enhance the deadlock detector to handle those tuple locks in a
+different way than other types; or (2) have parallel workers use some other
+mutual exclusion method for such cases. Currently, the parallel mode is
+strictly read-only, but now we have the infrastructure to allow parallel
+inserts and parallel copy.
Group locking adds three new members to each PGPROC: lockGroupLeader,
lockGroupMembers, and lockGroupLink. A PGPROC's lockGroupLeader is NULL for
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index 59060b6..beedc79 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -556,11 +556,12 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
lm;
/*
- * The relation extension lock can never participate in actual deadlock
- * cycle. See Assert in LockAcquireExtended. So, there is no advantage
- * in checking wait edges from it.
+ * The relation extension or page lock can never participate in actual
+ * deadlock cycle. See Asserts in LockAcquireExtended. So, there is no
+ * advantage in checking wait edges from them.
*/
- if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND)
+ if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND ||
+ (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
return false;
lockMethodTable = GetLocksMethodTable(lock);
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index e894855..c44484a 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1475,9 +1475,11 @@ LockCheckConflicts(LockMethod lockMethodTable,
}
/*
- * The relation extension lock conflict even between the group members.
+ * The relation extension or page lock conflict even between the group
+ * members.
*/
- if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND)
+ if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND ||
+ (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
{
PROCLOCK_PRINT("LockCheckConflicts: conflicting (group)",
proclock);
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index fa07ddf..9938cdd 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1078,12 +1078,12 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
/*
* If group locking is in use, locks held by members of my locking group
* need to be included in myHeldLocks. This is not required for relation
- * extension lock which conflict among group members. However, including
- * them in myHeldLocks will give group members the priority to get those
- * locks as compared to other backends which are also trying to acquire
- * those locks. OTOH, we can avoid giving priority to group members for
- * that kind of locks, but there doesn't appear to be a clear advantage of
- * the same.
+ * extension or page locks which conflict among group members. However,
+ * including them in myHeldLocks will give group members the priority to
+ * get those locks as compared to other backends which are also trying to
+ * acquire those locks. OTOH, we can avoid giving priority to group
+ * members for that kind of locks, but there doesn't appear to be a clear
+ * advantage of the same.
*/
if (leader != NULL)
{
--
1.8.3.1
Attachment: v11-0003-Allow-relation-extension-lock-to-conflict-among-.patch
From 6c89d0634bf1a4072f5489a611d7b3555ff51447 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 14 Mar 2020 17:48:17 +0530
Subject: [PATCH v11 3/4] Allow relation extension lock to conflict among
parallel group members.
This is required as it is no safer for two related processes to extend the
same relation at a time than for unrelated processes to do the same. We
don't acquire a heavyweight lock on any other object after relation
extension lock which means such a lock can never participate in the
deadlock cycle. So, avoid checking wait edges from this lock.
This provides an infrastructure to allow parallel operations like insert,
copy, etc. which were earlier not possible as parallel group members won't
conflict for relation extension lock.
Author: Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila and Kuntal Ghosh
Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com
---
src/backend/storage/lmgr/deadlock.c | 8 ++++++++
src/backend/storage/lmgr/lock.c | 10 ++++++++++
src/backend/storage/lmgr/proc.c | 8 +++++++-
src/include/storage/lock.h | 1 +
4 files changed, 26 insertions(+), 1 deletion(-)
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index f8c5df0..59060b6 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -555,6 +555,14 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
int numLockModes,
lm;
+ /*
+ * The relation extension lock can never participate in actual deadlock
+ * cycle. See Assert in LockAcquireExtended. So, there is no advantage
+ * in checking wait edges from it.
+ */
+ if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND)
+ return false;
+
lockMethodTable = GetLocksMethodTable(lock);
numLockModes = lockMethodTable->numLockModes;
conflictMask = lockMethodTable->conflictTab[checkProc->waitLockMode];
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 1e3eb77..e894855 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1475,6 +1475,16 @@ LockCheckConflicts(LockMethod lockMethodTable,
}
/*
+ * The relation extension lock conflict even between the group members.
+ */
+ if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND)
+ {
+ PROCLOCK_PRINT("LockCheckConflicts: conflicting (group)",
+ proclock);
+ return true;
+ }
+
+ /*
* Locks held in conflicting modes by members of our own lock group are
* not real conflicts; we can subtract those out and see if we still have
* a conflict. This is O(N) in the number of processes holding or
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index eb321f7..fa07ddf 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1077,7 +1077,13 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
/*
* If group locking is in use, locks held by members of my locking group
- * need to be included in myHeldLocks.
+ * need to be included in myHeldLocks. This is not required for relation
+ * extension lock which conflict among group members. However, including
+ * them in myHeldLocks will give group members the priority to get those
+ * locks as compared to other backends which are also trying to acquire
+ * those locks. OTOH, we can avoid giving priority to group members for
+ * that kind of locks, but there doesn't appear to be a clear advantage of
+ * the same.
*/
if (leader != NULL)
{
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index fc0a712..a89e54d 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -301,6 +301,7 @@ typedef struct LOCK
} LOCK;
#define LOCK_LOCKMETHOD(lock) ((LOCKMETHODID) (lock).tag.locktag_lockmethodid)
+#define LOCK_LOCKTAG(lock) ((LockTagType) (lock).tag.locktag_type)
/*
--
1.8.3.1
Attachment: v11-0002-Add-assert-to-ensure-that-page-locks-don-t-parti.patch
From 799fec1da14fcd7aecda0f5ef1362dfbe84652de Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 16 Mar 2020 09:22:58 +0530
Subject: [PATCH v11 2/4] Add assert to ensure that page locks don't
participate in deadlock cycle.
Assert that we don't acquire any other heavyweight lock while holding the
page lock except for relation extension. However, these locks are never
taken in reverse order which implies that page locks will never
participate in the deadlock cycle.
Similar to relation extension, page locks are also held for a short
duration, so imposing such a restriction won't hurt.
Author: Dilip Kumar, with few changes by Amit Kapila
Reviewed-by: Amit Kapila and Kuntal Ghosh
Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com
---
src/backend/storage/lmgr/lock.c | 27 +++++++++++++++++++++++++--
1 file changed, 25 insertions(+), 2 deletions(-)
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 54b55fb..1e3eb77 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -187,6 +187,20 @@ static int FastPathLocalUseCount = 0;
static bool IsRelationExtensionLockHeld = false;
#endif
+/*
+ * Flag to indicate if the page lock is held by this backend. We don't
+ * acquire any other heavyweight lock while holding the page lock except for
+ * relation extension and page lock. However, these locks are never taken in
+ * reverse order which implies that page locks will also never participate in
+ * the deadlock cycle.
+ *
+ * Similar to relation extension, page locks are also held for a short
+ * duration, so imposing such a restriction won't hurt.
+ */
+#ifdef USE_ASSERT_CHECKING
+static bool IsPageLockHeld = false;
+#endif
+
/* Macros for manipulating proc->fpLockBits */
#define FAST_PATH_BITS_PER_SLOT 3
#define FAST_PATH_LOCKNUMBER_OFFSET 1
@@ -865,6 +879,13 @@ LockAcquireExtended(const LOCKTAG *locktag,
Assert(!IsRelationExtensionLockHeld);
/*
+ * We don't acquire any other heavyweight lock while holding the page lock
+ * except for relation extension and page lock.
+ */
+ Assert(!IsPageLockHeld ||
+ (locktag->locktag_type == LOCKTAG_RELATION_EXTEND));
+
+ /*
* Prepare to emit a WAL record if acquisition of this lock needs to be
* replayed in a standby server.
*
@@ -1312,10 +1333,10 @@ SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
}
/*
- * Check and set/reset the flag that we hold the relation extension lock.
+ * Check and set/reset the flag that we hold the relation extension/page lock.
*
* It is callers responsibility that this function is called after
- * acquiring/releasing the relation extension lock.
+ * acquiring/releasing the relation extension/page lock.
*
* Pass acquired = true if lock is acquired, false otherwise.
*/
@@ -1325,6 +1346,8 @@ CheckAndSetLockHeld(LOCALLOCK *locallock, bool acquired)
{
if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_RELATION_EXTEND)
IsRelationExtensionLockHeld = acquired;
+ else if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_PAGE)
+ IsPageLockHeld = acquired;
}
#endif
--
1.8.3.1
On Mon, Mar 16, 2020 at 3:24 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
+
+ /*
+ * Indicate that the lock is released for certain types of locks
+ */
+#ifdef USE_ASSERT_CHECKING
+ CheckAndSetLockHeld(locallock, false);
+#endif
}
/*
@@ -1618,6 +1666,11 @@ GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner)
locallock->numLockOwners++;
if (owner != NULL)
ResourceOwnerRememberLock(owner, locallock);
+
+ /* Indicate that the lock is acquired for certain types of locks. */
+#ifdef USE_ASSERT_CHECKING
+ CheckAndSetLockHeld(locallock, true);
+#endif
}
There is no need to sprinkle USE_ASSERT_CHECKING at so many places;
having it inside the new function is sufficient. I have changed that,
added a few more comments, and made minor changes. See what you think
about the attached.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
v12-0001-Assert-that-we-don-t-acquire-a-heavyweight-lock-on-a.patch
From e61f45a7153c9374cc057223c8416afc8d325a34 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 14 Mar 2020 12:43:35 +0530
Subject: [PATCH 1/4] Assert that we don't acquire a heavyweight lock on
another object after relation extension lock.
The only exception to the rule is that we can try to acquire the same
relation extension lock more than once. This is allowed as we are not
creating any new lock for this case. This restriction implies that the
relation extension lock won't ever participate in the deadlock cycle
because we can never wait for any other heavyweight lock after acquiring
this lock.
Such a restriction is okay for relation extension locks as unlike other
heavyweight locks these are not held till the transaction end. These are
taken for a short duration to extend a particular relation and then
released.
Author: Dilip Kumar, with few changes by Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh and Sawada Masahiko
Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com
---
src/backend/storage/lmgr/lock.c | 47 +++++++++++++++++++++++++++++++++++++++++
src/include/storage/lock.h | 1 +
2 files changed, 48 insertions(+)
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 1df7b8e..9f55132 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -170,6 +170,21 @@ typedef struct TwoPhaseLockRecord
*/
static int FastPathLocalUseCount = 0;
+/*
+ * Flag to indicate if the relation extension lock is held by this backend.
+ * This flag is used to ensure that while holding the relation extension lock
+ * we don't try to acquire a heavyweight lock on any other object. This
+ * restriction implies that the relation extension lock won't ever participate
+ * in the deadlock cycle because we can never wait for any other heavyweight
+ * lock after acquiring this lock.
+ *
+ * Such a restriction is okay for relation extension locks as unlike other
+ * heavyweight locks these are not held till the transaction end. These are
+ * taken for a short duration to extend a particular relation and then
+ * released.
+ */
+static bool IsRelationExtensionLockHeld PG_USED_FOR_ASSERTS_ONLY = false;
+
/* Macros for manipulating proc->fpLockBits */
#define FAST_PATH_BITS_PER_SLOT 3
#define FAST_PATH_LOCKNUMBER_OFFSET 1
@@ -841,6 +856,13 @@ LockAcquireExtended(const LOCKTAG *locktag,
}
/*
+ * We don't acquire any other heavyweight lock while holding the relation
+ * extension lock. We do allow to acquire the same relation extension
+ * lock more than once but that case won't reach here.
+ */
+ Assert(!IsRelationExtensionLockHeld);
+
+ /*
* Prepare to emit a WAL record if acquisition of this lock needs to be
* replayed in a standby server.
*
@@ -1288,6 +1310,23 @@ SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
}
/*
+ * Check and set/reset the flag that we hold the relation extension lock.
+ *
+ * It is callers responsibility that this function is called after
+ * acquiring/releasing the relation extension lock.
+ *
+ * Pass acquired as true if lock is acquired, false otherwise.
+ */
+static inline void
+CheckAndSetLockHeld(LOCALLOCK *locallock, bool acquired)
+{
+#ifdef USE_ASSERT_CHECKING
+ if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_RELATION_EXTEND)
+ IsRelationExtensionLockHeld = acquired;
+#endif
+}
+
+/*
* Subroutine to free a locallock entry
*/
static void
@@ -1322,6 +1361,11 @@ RemoveLocalLock(LOCALLOCK *locallock)
(void *) &(locallock->tag),
HASH_REMOVE, NULL))
elog(WARNING, "locallock table corrupted");
+
+ /*
+ * Indicate that the lock is released for certain types of locks
+ */
+ CheckAndSetLockHeld(locallock, false);
}
/*
@@ -1618,6 +1662,9 @@ GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner)
locallock->numLockOwners++;
if (owner != NULL)
ResourceOwnerRememberLock(owner, locallock);
+
+ /* Indicate that the lock is acquired for certain types of locks. */
+ CheckAndSetLockHeld(locallock, true);
}
/*
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index bb8e4e6..fc0a712 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -419,6 +419,7 @@ typedef struct LOCALLOCK
} LOCALLOCK;
#define LOCALLOCK_LOCKMETHOD(llock) ((llock).tag.lock.locktag_lockmethodid)
+#define LOCALLOCK_LOCKTAG(llock) ((LockTagType) (llock).tag.lock.locktag_type)
/*
--
1.8.3.1
v12-0002-Add-assert-to-ensure-that-page-locks-don-t-participa.patch
From 4c95ff706bbcc0694c974f2234b7a6a796db50e8 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 17 Mar 2020 15:58:45 +0530
Subject: [PATCH 2/4] Add assert to ensure that page locks don't participate in
deadlock cycle.
Assert that we don't acquire any other heavyweight lock while holding the
page lock except for relation extension. However, these locks are never
taken in reverse order which implies that page locks will never
participate in the deadlock cycle.
Similar to relation extension, page locks are also held for a short
duration, so imposing such a restriction won't hurt.
Author: Dilip Kumar, with few changes by Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh and Sawada Masahiko
Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com
---
src/backend/storage/lmgr/lock.c | 26 ++++++++++++++++++++++++--
1 file changed, 24 insertions(+), 2 deletions(-)
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 9f55132..be44913 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -185,6 +185,18 @@ static int FastPathLocalUseCount = 0;
*/
static bool IsRelationExtensionLockHeld PG_USED_FOR_ASSERTS_ONLY = false;
+/*
+ * Flag to indicate if the page lock is held by this backend. We don't
+ * acquire any other heavyweight lock while holding the page lock except for
+ * relation extension. However, these locks are never taken in reverse order
+ * which implies that page locks will also never participate in the deadlock
+ * cycle.
+ *
+ * Similar to relation extension, page locks are also held for a short
+ * duration, so imposing such a restriction won't hurt.
+ */
+static bool IsPageLockHeld PG_USED_FOR_ASSERTS_ONLY = false;
+
/* Macros for manipulating proc->fpLockBits */
#define FAST_PATH_BITS_PER_SLOT 3
#define FAST_PATH_LOCKNUMBER_OFFSET 1
@@ -863,6 +875,13 @@ LockAcquireExtended(const LOCKTAG *locktag,
Assert(!IsRelationExtensionLockHeld);
/*
+ * We don't acquire any other heavyweight lock while holding the page lock
+ * except for relation extension.
+ */
+ Assert(!IsPageLockHeld ||
+ (locktag->locktag_type == LOCKTAG_RELATION_EXTEND));
+
+ /*
* Prepare to emit a WAL record if acquisition of this lock needs to be
* replayed in a standby server.
*
@@ -1310,10 +1329,10 @@ SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
}
/*
- * Check and set/reset the flag that we hold the relation extension lock.
+ * Check and set/reset the flag that we hold the relation extension/page lock.
*
* It is callers responsibility that this function is called after
- * acquiring/releasing the relation extension lock.
+ * acquiring/releasing the relation extension/page lock.
*
* Pass acquired as true if lock is acquired, false otherwise.
*/
@@ -1323,6 +1342,9 @@ CheckAndSetLockHeld(LOCALLOCK *locallock, bool acquired)
#ifdef USE_ASSERT_CHECKING
if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_RELATION_EXTEND)
IsRelationExtensionLockHeld = acquired;
+ else if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_PAGE)
+ IsPageLockHeld = acquired;
+
#endif
}
--
1.8.3.1
v12-0003-Allow-relation-extension-lock-to-conflict-among-para.patch
From c0969460cdcbe89927d454606893d309c1881e4e Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 14 Mar 2020 17:48:17 +0530
Subject: [PATCH 3/4] Allow relation extension lock to conflict among parallel
group members.
This is required as it is no safer for two related processes to extend the
same relation at a time than for unrelated processes to do the same. We
don't acquire a heavyweight lock on any other object after relation
extension lock which means such a lock can never participate in the
deadlock cycle. So, avoid checking wait edges from this lock.
This provides an infrastructure to allow parallel operations like insert,
copy, etc. which were earlier not possible as parallel group members won't
conflict for relation extension lock.
Author: Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh and Sawada Masahiko
Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com
---
src/backend/storage/lmgr/deadlock.c | 8 ++++++++
src/backend/storage/lmgr/lock.c | 10 ++++++++++
src/backend/storage/lmgr/proc.c | 8 +++++++-
src/include/storage/lock.h | 1 +
4 files changed, 26 insertions(+), 1 deletion(-)
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index f8c5df0..59060b6 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -555,6 +555,14 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
int numLockModes,
lm;
+ /*
+ * The relation extension lock can never participate in actual deadlock
+ * cycle. See Assert in LockAcquireExtended. So, there is no advantage
+ * in checking wait edges from it.
+ */
+ if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND)
+ return false;
+
lockMethodTable = GetLocksMethodTable(lock);
numLockModes = lockMethodTable->numLockModes;
conflictMask = lockMethodTable->conflictTab[checkProc->waitLockMode];
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index be44913..5f32617 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1470,6 +1470,16 @@ LockCheckConflicts(LockMethod lockMethodTable,
}
/*
+ * The relation extension lock conflict even between the group members.
+ */
+ if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND)
+ {
+ PROCLOCK_PRINT("LockCheckConflicts: conflicting (group)",
+ proclock);
+ return true;
+ }
+
+ /*
* Locks held in conflicting modes by members of our own lock group are
* not real conflicts; we can subtract those out and see if we still have
* a conflict. This is O(N) in the number of processes holding or
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index eb321f7..fa07ddf 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1077,7 +1077,13 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
/*
* If group locking is in use, locks held by members of my locking group
- * need to be included in myHeldLocks.
+ * need to be included in myHeldLocks. This is not required for relation
+ * extension lock which conflict among group members. However, including
+ * them in myHeldLocks will give group members the priority to get those
+ * locks as compared to other backends which are also trying to acquire
+ * those locks. OTOH, we can avoid giving priority to group members for
+ * that kind of locks, but there doesn't appear to be a clear advantage of
+ * the same.
*/
if (leader != NULL)
{
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index fc0a712..a89e54d 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -301,6 +301,7 @@ typedef struct LOCK
} LOCK;
#define LOCK_LOCKMETHOD(lock) ((LOCKMETHODID) (lock).tag.locktag_lockmethodid)
+#define LOCK_LOCKTAG(lock) ((LockTagType) (lock).tag.locktag_type)
/*
--
1.8.3.1
v12-0004-Allow-page-lock-to-conflict-among-parallel-group-mem.patch
From d679698da0599187e357d5fd145f151063b5d36a Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 14 Mar 2020 19:16:09 +0530
Subject: [PATCH 4/4] Allow page lock to conflict among parallel group members.
This is required as it is no safer for two related processes to perform
clean up in gin indexes at a time than for unrelated processes to do the
same. After acquiring page locks, we can acquire relation extension lock
but reverse never happens which means these will also not participate in
deadlock. So, avoid checking wait edges from this lock.
Currently, the parallel mode is strictly read-only, but after this patch
we have the infrastructure to allow parallel inserts and parallel copy.
Author: Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh and Sawada Masahiko
Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com
---
src/backend/optimizer/plan/planner.c | 13 +++-----
src/backend/storage/lmgr/README | 60 ++++++++++++++++++++----------------
src/backend/storage/lmgr/deadlock.c | 9 +++---
src/backend/storage/lmgr/lock.c | 6 ++--
src/backend/storage/lmgr/proc.c | 12 ++++----
5 files changed, 53 insertions(+), 47 deletions(-)
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b44efd6..6d9fdad 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -322,14 +322,11 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
* functions are present in the query tree.
*
* (Note that we do allow CREATE TABLE AS, SELECT INTO, and CREATE
- * MATERIALIZED VIEW to use parallel plans, but this is safe only because
- * the command is writing into a completely new table which workers won't
- * be able to see. If the workers could see the table, the fact that
- * group locking would cause them to ignore the leader's heavyweight
- * relation extension lock and GIN page locks would make this unsafe.
- * We'll have to fix that somehow if we want to allow parallel inserts in
- * general; updates and deletes have additional problems especially around
- * combo CIDs.)
+ * MATERIALIZED VIEW to use parallel plans, but as of now, only the leader
+ * backend writes into a completely new table. In the future, we can
+ * extend it to allow workers to write into the table. However, to allow
+ * parallel updates and deletes, we have to solve other problems,
+ * especially around combo CIDs.)
*
* For now, we don't try to use parallel mode if we're running inside a
* parallel worker. We might eventually be able to relax this
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 56b0a12..13eb1cc 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -597,21 +597,22 @@ deadlock detection algorithm very much, but it makes the bookkeeping more
complicated.
We choose to regard locks held by processes in the same parallel group as
-non-conflicting. This means that two processes in a parallel group can hold a
-self-exclusive lock on the same relation at the same time, or one process can
-acquire an AccessShareLock while the other already holds AccessExclusiveLock.
-This might seem dangerous and could be in some cases (more on that below), but
-if we didn't do this then parallel query would be extremely prone to
-self-deadlock. For example, a parallel query against a relation on which the
-leader already had AccessExclusiveLock would hang, because the workers would
-try to lock the same relation and be blocked by the leader; yet the leader
-can't finish until it receives completion indications from all workers. An
-undetected deadlock results. This is far from the only scenario where such a
-problem happens. The same thing will occur if the leader holds only
-AccessShareLock, the worker seeks AccessShareLock, but between the time the
-leader attempts to acquire the lock and the time the worker attempts to
-acquire it, some other process queues up waiting for an AccessExclusiveLock.
-In this case, too, an indefinite hang results.
+non-conflicting with the exception of relation extension and page locks. This
+means that two processes in a parallel group can hold a self-exclusive lock on
+the same relation at the same time, or one process can acquire an AccessShareLock
+while the other already holds AccessExclusiveLock. This might seem dangerous and
+could be in some cases (more on that below), but if we didn't do this then
+parallel query would be extremely prone to self-deadlock. For example, a
+parallel query against a relation on which the leader already had
+AccessExclusiveLock would hang, because the workers would try to lock the same
+relation and be blocked by the leader; yet the leader can't finish until it
+receives completion indications from all workers. An undetected deadlock
+results. This is far from the only scenario where such a problem happens. The
+same thing will occur if the leader holds only AccessShareLock, the worker
+seeks AccessShareLock, but between the time the leader attempts to acquire the
+lock and the time the worker attempts to acquire it, some other process queues
+up waiting for an AccessExclusiveLock. In this case, too, an indefinite hang
+results.
It might seem that we could predict which locks the workers will attempt to
acquire and ensure before going parallel that those locks would be acquired
@@ -637,18 +638,23 @@ the other is safe enough. Problems would occur if the leader initiated
parallelism from a point in the code at which it had some backend-private
state that made table access from another process unsafe, for example after
calling SetReindexProcessing and before calling ResetReindexProcessing,
-catastrophe could ensue, because the worker won't have that state. Similarly,
-problems could occur with certain kinds of non-relation locks, such as
-relation extension locks. It's no safer for two related processes to extend
-the same relation at the time than for unrelated processes to do the same.
-However, since parallel mode is strictly read-only at present, neither this
-nor most of the similar cases can arise at present. To allow parallel writes,
-we'll either need to (1) further enhance the deadlock detector to handle those
-types of locks in a different way than other types; or (2) have parallel
-workers use some other mutual exclusion method for such cases; or (3) revise
-those cases so that they no longer use heavyweight locking in the first place
-(which is not a crazy idea, given that such lock acquisitions are not expected
-to deadlock and that heavyweight lock acquisition is fairly slow anyway).
+catastrophe could ensue, because the worker won't have that state.
+
+To allow parallel inserts and parallel copy, we have ensured that relation
+extension and page locks don't participate in group locking which means such
+locks can conflict among the same group members. This is required as it is no
+safer for two related processes to extend the same relation or perform clean up
+in gin indexes at a time than for unrelated processes to do the same. We don't
+acquire a heavyweight lock on any other object after relation extension lock
+which means such a lock can never participate in the deadlock cycle. After
+acquiring page locks, we can acquire relation extension lock but reverse never
+happens, so those will also not participate in deadlock. To allow for other
+parallel writes like parallel update or parallel delete, we'll either need to
+(1) further enhance the deadlock detector to handle those tuple locks in a
+different way than other types; or (2) have parallel workers use some other
+mutual exclusion method for such cases. Currently, the parallel mode is
+strictly read-only, but now we have the infrastructure to allow parallel
+inserts and parallel copy.
Group locking adds three new members to each PGPROC: lockGroupLeader,
lockGroupMembers, and lockGroupLink. A PGPROC's lockGroupLeader is NULL for
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index 59060b6..beedc79 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -556,11 +556,12 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
lm;
/*
- * The relation extension lock can never participate in actual deadlock
- * cycle. See Assert in LockAcquireExtended. So, there is no advantage
- * in checking wait edges from it.
+ * The relation extension or page lock can never participate in actual
+ * deadlock cycle. See Asserts in LockAcquireExtended. So, there is no
+ * advantage in checking wait edges from them.
*/
- if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND)
+ if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND ||
+ (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
return false;
lockMethodTable = GetLocksMethodTable(lock);
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 5f32617..3013ef6 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1470,9 +1470,11 @@ LockCheckConflicts(LockMethod lockMethodTable,
}
/*
- * The relation extension lock conflict even between the group members.
+ * The relation extension or page lock conflict even between the group
+ * members.
*/
- if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND)
+ if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND ||
+ (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
{
PROCLOCK_PRINT("LockCheckConflicts: conflicting (group)",
proclock);
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index fa07ddf..9938cdd 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1078,12 +1078,12 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
/*
* If group locking is in use, locks held by members of my locking group
* need to be included in myHeldLocks. This is not required for relation
- * extension lock which conflict among group members. However, including
- * them in myHeldLocks will give group members the priority to get those
- * locks as compared to other backends which are also trying to acquire
- * those locks. OTOH, we can avoid giving priority to group members for
- * that kind of locks, but there doesn't appear to be a clear advantage of
- * the same.
+ * extension or page locks which conflict among group members. However,
+ * including them in myHeldLocks will give group members the priority to
+ * get those locks as compared to other backends which are also trying to
+ * acquire those locks. OTOH, we can avoid giving priority to group
+ * members for that kind of locks, but there doesn't appear to be a clear
+ * advantage of the same.
*/
if (leader != NULL)
{
--
1.8.3.1
On Tue, Mar 17, 2020 at 5:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Mar 16, 2020 at 3:24 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
+
+ /*
+ * Indicate that the lock is released for certain types of locks
+ */
+#ifdef USE_ASSERT_CHECKING
+ CheckAndSetLockHeld(locallock, false);
+#endif
}
/*
@@ -1618,6 +1666,11 @@ GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner)
locallock->numLockOwners++;
if (owner != NULL)
ResourceOwnerRememberLock(owner, locallock);
+
+ /* Indicate that the lock is acquired for certain types of locks. */
+#ifdef USE_ASSERT_CHECKING
+ CheckAndSetLockHeld(locallock, true);
+#endif
}
There is no need to sprinkle USE_ASSERT_CHECKING at so many places;
having it inside the new function is sufficient. I have changed that,
added a few more comments, and made minor changes. See what you think
about the attached.
Your changes look fine to me. I have also verified all the tests and
everything works fine.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Tue, Mar 17, 2020 at 6:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
Your changes look fine to me. I have also verified all the tests and
everything works fine.
I have pushed the first patch. I will push the others in coming days.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com