LWLock deadlock in brinRevmapDesummarizeRange

Started by Tomas Vondra · almost 3 years ago · 5 messages
#1 Tomas Vondra
tomas.vondra@enterprisedb.com
2 attachment(s)

Hi,

While finalizing some fixes in BRIN, I decided to stress-test the
relevant part of the code to check if I missed something. Imagine a
simple script that builds BRIN indexes on random data, does random
changes and cross-checks the results with/without the index.

But instead, I almost immediately ran into an LWLock deadlock :-(

I've managed to reproduce this on PG13+, but I believe it's been there
since brinRevmapDesummarizeRange was introduced in PG10; I just haven't
tried the pre-13 releases.

The stress-test-2.sh script (in the attached .tgz) builds a table, fills
it with random data, and then runs a mix of updates and (de)summarization
DDL on a random fraction of the index. The lockup is usually triggered
within a couple of minutes, but it might take longer (I guess it depends
on the parameters used to generate the random data, so it may take a
couple of runs to hit the right combination).

The root cause is that brin_doupdate and brinRevmapDesummarizeRange end
up locking buffers in different orders. Attached is also a .patch that
adds a bunch of LOG messages for buffer locking in BRIN code (it's for
PG13, but should work on newer releases too).

Here's a fairly typical example of the interaction between brin_doupdate
and brinRevmapDesummarizeRange:

brin_doupdate (from UPDATE query):

LOG: brin_doupdate: samepage 0
LOG: number of LWLocks held: 0
LOG: brin_getinsertbuffer: locking 898 lock 0x7f9a99a5af64
LOG: brin_getinsertbuffer: buffer locked
LOG: brin_getinsertbuffer B: locking 899 lock 0x7f9a99a5afa4
LOG: brin_getinsertbuffer B: buffer locked
LOG: number of LWLocks held: 2
LOG: lock 0 => 0x7f9a99a5af64
LOG: lock 1 => 0x7f9a99a5afa4
LOG: brin_doupdate: locking buffer for update
LOG: brinLockRevmapPageForUpdate: locking 158 lock 0x7f9a99a4f664

brinRevmapDesummarizeRange (from brin_desummarize_range):

LOG: starting brinRevmapDesummarizeRange
LOG: number of LWLocks held: 0
LOG: brinLockRevmapPageForUpdate: locking 158 lock 0x7f9a99a4f664
LOG: brinLockRevmapPageForUpdate: buffer locked
LOG: number of LWLocks held: 1
LOG: lock 0 => 0x7f9a99a4f664
LOG: brinRevmapDesummarizeRange: locking 898 lock 0x7f9a99a5af64

So, brin_doupdate starts with no LWLocks, and locks buffers 898, 899
(through getinsertbuffer). And then tries to lock 158.

Meanwhile brinRevmapDesummarizeRange locks 158 first, and then tries to
lock 898.

So, a LWLock deadlock :-(

I've now seen a bunch of these traces, with only minor differences. For
example, brinRevmapDesummarizeRange might get stuck on the second buffer
locked by getinsertbuffer (not the first one, as here).

I don't have a great idea how to fix this - I guess we need to ensure
the buffers are locked in the same order, but that seems tricky.

Obviously, people don't call brin_desummarize_range() very often, which
likely explains the lack of reports.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

lock-debug.patch (text/x-patch)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 0becfde113..65e4ed76b4 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -689,7 +689,9 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	meta = ReadBuffer(index, P_NEW);
 	Assert(BufferGetBlockNumber(meta) == BRIN_METAPAGE_BLKNO);
+	elog(LOG, "brinbuild: locking buffer %d lock %p", meta, BufferDescriptorGetContentLock2(meta));
 	LockBuffer(meta, BUFFER_LOCK_EXCLUSIVE);
+	elog(LOG, "brinbuild: buffer locked");
 
 	brin_metapage_init(BufferGetPage(meta), BrinGetPagesPerRange(index),
 					   BRIN_CURRENT_VERSION);
@@ -756,7 +758,9 @@ brinbuildempty(Relation index)
 	/* An empty BRIN index has a metapage only. */
 	metabuf =
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+	elog(LOG, "brinbuildempty: locking buffer %d lock %p", metabuf, BufferDescriptorGetContentLock2(metabuf));
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+	elog(LOG, "brinbuildempty: buffer locked");
 
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index d5b19293a0..d4a118eb54 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -78,6 +78,10 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		return false;			/* keep compiler quiet */
 	}
 
+	elog(LOG, "brin_doupdate: samepage %d", samepage);
+
+	DumpHeldLWLocks();
+
 	/* make sure the revmap is long enough to contain the entry we need */
 	brinRevmapExtend(revmap, heapBlk);
 
@@ -106,13 +110,18 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	}
 	else
 	{
+		elog(LOG, "brin_doupdate: locking buffer %d lock %p", oldbuf, BufferDescriptorGetContentLock2(oldbuf));
 		LockBuffer(oldbuf, BUFFER_LOCK_EXCLUSIVE);
+		elog(LOG, "brin_doupdate: buffer locked");
 		newbuf = InvalidBuffer;
 		extended = false;
 	}
 	oldpage = BufferGetPage(oldbuf);
 	oldlp = PageGetItemId(oldpage, oldoff);
 
+
+	DumpHeldLWLocks();
+
 	/*
 	 * Check that the old tuple wasn't updated concurrently: it might have
 	 * moved someplace else entirely, and for that matter the whole page
@@ -125,6 +134,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		oldoff > PageGetMaxOffsetNumber(oldpage) ||
 		!ItemIdIsNormal(oldlp))
 	{
+		elog(LOG, "brin_doupdate A: unlock %d new %d", oldbuf, newbuf);
 		LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
 
 		/*
@@ -151,6 +161,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	 */
 	if (!brin_tuples_equal(oldtup, oldsz, origtup, origsz))
 	{
+		elog(LOG, "brin_doupdate B: unlock %d new %d", oldbuf, newbuf);
 		LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
 		if (BufferIsValid(newbuf))
 		{
@@ -223,6 +234,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		 * Not enough space, but caller said that there was. Tell them to
 		 * start over.
 		 */
+		elog(LOG, "brin_doupdate C: unlock %d", oldbuf);
 		LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
 		return false;
 	}
@@ -238,7 +250,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		OffsetNumber newoff;
 		Size		freespace = 0;
 
+		elog(LOG, "brin_doupdate: locking buffer for update");
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
+		elog(LOG, "brin_doupdate: buffer locked");
 
 		START_CRIT_SECTION();
 
@@ -303,6 +317,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 		END_CRIT_SECTION();
 
+		elog(LOG, "brin_doupdate D: unlock %d %d %d", revmapbuf, oldbuf, newbuf);
 		LockBuffer(revmapbuf, BUFFER_LOCK_UNLOCK);
 		LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
 		UnlockReleaseBuffer(newbuf);
@@ -378,7 +393,9 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 		 * revmap over the page we held a pin on, so we cannot assume that
 		 * it's still a regular page.
 		 */
+		elog(LOG, "brin_doinsert: locking buffer %d lock %p", *buffer, BufferDescriptorGetContentLock2(*buffer));
 		LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+		elog(LOG, "brin_doinsert: buffer locked");
 		if (br_page_get_freespace(BufferGetPage(*buffer)) < itemsz)
 		{
 			UnlockReleaseBuffer(*buffer);
@@ -641,7 +658,9 @@ brin_page_cleanup(Relation idxrel, Buffer buf)
 		LockRelationForExtension(idxrel, ShareLock);
 		UnlockRelationForExtension(idxrel, ShareLock);
 
+		elog(LOG, "brin_page_cleanup: locking buffer %d lock %p", buf, BufferDescriptorGetContentLock2(buf));
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+		elog(LOG, "brin_page_cleanup: buffer locked");
 		if (PageIsNew(page))
 		{
 			brin_initialize_empty_new_buffer(idxrel, buf);
@@ -764,7 +783,9 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
 		 */
 		if (BufferIsValid(oldbuf) && oldblk < newblk)
 		{
+			elog(LOG, "brin_getinsertbuffer: locking buffer %d lock %p", oldbuf, BufferDescriptorGetContentLock2(oldbuf));
 			LockBuffer(oldbuf, BUFFER_LOCK_EXCLUSIVE);
+			elog(LOG, "brin_getinsertbuffer: buffer locked");
 			if (!BRIN_IS_REGULAR_PAGE(BufferGetPage(oldbuf)))
 			{
 				LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
@@ -796,7 +817,9 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
 			}
 		}
 
+		elog(LOG, "brin_getinsertbuffer B: locking buffer %d lock %p", buf, BufferDescriptorGetContentLock2(buf));
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+		elog(LOG, "brin_getinsertbuffer B: buffer locked");
 
 		if (extensionLockHeld)
 			UnlockRelationForExtension(irel, ExclusiveLock);
@@ -823,7 +846,9 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
 			 */
 			if (BufferIsValid(oldbuf) && oldblk > newblk)
 			{
+				elog(LOG, "brin_getinsertbuffer C: locking buffer %d lock %p", oldbuf, BufferDescriptorGetContentLock2(oldbuf));
 				LockBuffer(oldbuf, BUFFER_LOCK_EXCLUSIVE);
+	elog(LOG, "brin_getinsertbuffer C: buffer locked");
 				Assert(BRIN_IS_REGULAR_PAGE(BufferGetPage(oldbuf)));
 			}
 
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index 35746714a7..072d82514c 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -139,7 +139,9 @@ brinLockRevmapPageForUpdate(BrinRevmap *revmap, BlockNumber heapBlk)
 	Buffer		rmBuf;
 
 	rmBuf = revmap_get_buffer(revmap, heapBlk);
+	elog(LOG, "brinLockRevmapPageForUpdate: locking buffer %d lock %p", rmBuf, BufferDescriptorGetContentLock2(rmBuf));
 	LockBuffer(rmBuf, BUFFER_LOCK_EXCLUSIVE);
+	elog(LOG, "brinLockRevmapPageForUpdate: buffer locked");
 
 	return rmBuf;
 }
@@ -341,6 +343,9 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	OffsetNumber regOffset;
 	ItemId		lp;
 
+	elog(LOG, "starting brinRevmapDesummarizeRange");
+	DumpHeldLWLocks();
+
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange, NULL);
 
 	revmapBlk = revmap_get_blkno(revmap, heapBlk);
@@ -368,8 +373,12 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 		return true;
 	}
 
+	DumpHeldLWLocks();
+
 	regBuf = ReadBuffer(idxrel, ItemPointerGetBlockNumber(iptr));
+	elog(LOG, "brinRevmapDesummarizeRange: locking buffer %d lock %p", regBuf, BufferDescriptorGetContentLock2(regBuf));
 	LockBuffer(regBuf, BUFFER_LOCK_EXCLUSIVE);
+	elog(LOG, "brinRevmapDesummarizeRange: buffer locked");
 	regPg = BufferGetPage(regBuf);
 	/*
 	 * We're only removing data, not reading it, so there's no need to
@@ -544,7 +553,9 @@ revmap_physical_extend(BrinRevmap *revmap)
 	 * but note that we still need to grab the relation extension lock because
 	 * another backend can extend the index with regular BRIN pages.
 	 */
+	elog(LOG, "revmap_physical_extend: locking buffer %d lock %p", revmap->rm_metaBuf, BufferDescriptorGetContentLock2(revmap->rm_metaBuf));
 	LockBuffer(revmap->rm_metaBuf, BUFFER_LOCK_EXCLUSIVE);
+	elog(LOG, "revmap_physical_extend: buffer locked");
 	metapage = BufferGetPage(revmap->rm_metaBuf);
 	metadata = (BrinMetaPageData *) PageGetContents(metapage);
 
@@ -564,7 +575,9 @@ revmap_physical_extend(BrinRevmap *revmap)
 	if (mapBlk < nblocks)
 	{
 		buf = ReadBuffer(irel, mapBlk);
+		elog(LOG, "revmap_physical_extend B: locking buffer %d lock %p", buf, BufferDescriptorGetContentLock2(buf));
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+		elog(LOG, "revmap_physical_extend B: buffer locked");
 		page = BufferGetPage(buf);
 	}
 	else
@@ -587,7 +600,9 @@ revmap_physical_extend(BrinRevmap *revmap)
 			ReleaseBuffer(buf);
 			return;
 		}
+		elog(LOG, "revmap_physical_extend C: locking buffer %d lock %p", buf, BufferDescriptorGetContentLock2(buf));
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+		elog(LOG, "revmap_physical_extend C: buffer locked");
 		page = BufferGetPage(buf);
 
 		if (needLock)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 69a1401654..0a9231f649 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3740,6 +3740,16 @@ LockBuffer(Buffer buffer, int mode)
 		elog(ERROR, "unrecognized buffer lock mode: %d", mode);
 }
 
+void *
+BufferDescriptorGetContentLock2(Buffer buffer)
+{
+	BufferDesc *buf;
+
+	buf = GetBufferDescriptor(buffer - 1);
+
+	return BufferDescriptorGetContentLock(buf);
+}
+
 /*
  * Acquire the content_lock for the buffer, but only if we don't have to wait.
  *
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 365d215f2b..8952226e1d 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -1984,3 +1984,11 @@ LWLockHeldByMeInMode(LWLock *l, LWLockMode mode)
 	}
 	return false;
 }
+
+void
+DumpHeldLWLocks(void)
+{
+	elog(LOG, "number of LWLocks held: %d", num_held_lwlocks);
+	for (int i = 0; i < num_held_lwlocks; i++)
+		elog(LOG, "lock %d => %p", i, held_lwlocks[i].lock);
+}
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ee91b8fa26..d1325154ab 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -168,6 +168,8 @@ extern PGDLLIMPORT int32 *LocalRefCount;
  */
 #define BufferGetPage(buffer) ((Page)BufferGetBlock(buffer))
 
+extern void *BufferDescriptorGetContentLock2(Buffer desc);
+
 /*
  * prototypes for functions in bufmgr.c
  */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index cdbfbed118..bd34994013 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -187,6 +187,8 @@ extern int	LWLockNewTrancheId(void);
 extern void LWLockRegisterTranche(int tranche_id, const char *tranche_name);
 extern void LWLockInitialize(LWLock *lock, int tranche_id);
 
+extern void DumpHeldLWLocks(void);
+
 /*
  * Every tranche ID less than NUM_INDIVIDUAL_LWLOCKS is reserved; also,
  * we reserve additional tranche IDs for builtin tranches not included in
stress-test.tgz (application/x-compressed-tar)
#2 Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Tomas Vondra (#1)
Re: LWLock deadlock in brinRevmapDesummarizeRange

On 2023-Feb-22, Tomas Vondra wrote:

But instead, I almost immediately ran into an LWLock deadlock :-(

Ouch.

I've managed to reproduce this on PG13+, but I believe it's been there
since brinRevmapDesummarizeRange was introduced in PG10; I just haven't
tried the pre-13 releases.

Hmm, I think that might just be an "easy" way to hit it, but the problem
is actually older than that, since AFAICS brin_doupdate is careless
regarding locking order of revmap page vs. regular page.

Sadly, the README doesn't cover locking considerations. I had that in a
file called 'minmax-proposal' in version 16 of the patch here
/messages/by-id/20140820225133.GB6343@eldon.alvh.no-ip.org
but by version 18 (when 'minmax' became BRIN) I seem to have removed
that file and replaced it with the README and apparently I didn't copy
this material over.

... and in there, I wrote that we would first write the brin tuple in
the regular page, unlock that, and then lock the revmap for the update,
without holding lock on the data page. I don't remember why we do it
differently now, but maybe the fix is just to release the regular page
lock before locking the revmap page? One very important change is that
in previous versions the revmap used a separate fork, and we had to
introduce an "evacuation protocol" when we integrated the revmap into
the main fork, which may have changed the locking considerations.

Another point: to desummarize a range, just unlinking the entry from
revmap should suffice, from the POV of other index scanners. Maybe we
can simplify the whole procedure to: lock revmap, remove entry, remember
page number, unlock revmap; lock regular page, delete entry, unlock.
Then there are no two locks held at the same time during desummarize.

This comes from v16:

+ Locking considerations
+ ----------------------
+ 
+ To read the TID during an index scan, we follow this protocol:
+ 
+ * read revmap page
+ * obtain share lock on the revmap buffer
+ * read the TID
+ * obtain share lock on buffer of main fork
+ * LockTuple the TID (using the index as relation).  A shared lock is
+   sufficient.  We need the LockTuple to prevent VACUUM from recycling
+   the index tuple; see below.
+ * release revmap buffer lock
+ * read the index tuple
+ * release the tuple lock
+ * release main fork buffer lock
+ 
+ 
+ To update the summary tuple for a page range, we use this protocol:
+ 
+ * insert a new index tuple somewhere in the main fork; note its TID
+ * read revmap page
+ * obtain exclusive lock on revmap buffer
+ * write the TID
+ * release lock
+ 
+ This ensures no concurrent reader can obtain a partially-written TID.
+ Note we don't need a tuple lock here.  Concurrent scans don't have to
+ worry about whether they got the old or new index tuple: if they get the
+ old one, the tighter values are okay from a correctness standpoint because
+ due to MVCC they can't possibly see the just-inserted heap tuples anyway.
+
+ [vacuum stuff elided]

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"Escucha y olvidarás; ve y recordarás; haz y entenderás" (Confucio)

#3 Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Alvaro Herrera (#2)
Re: LWLock deadlock in brinRevmapDesummarizeRange

On 2/22/23 12:35, Alvaro Herrera wrote:

On 2023-Feb-22, Tomas Vondra wrote:

But instead, I almost immediately ran into an LWLock deadlock :-(

Ouch.

I've managed to reproduce this on PG13+, but I believe it's been there
since brinRevmapDesummarizeRange was introduced in PG10; I just haven't
tried the pre-13 releases.

Hmm, I think that might just be an "easy" way to hit it, but the problem
is actually older than that, since AFAICS brin_doupdate is careless
regarding locking order of revmap page vs. regular page.

That's certainly possible, although I ran a lot of BRIN stress tests and
it only started failing after I added the desummarization. Although, the
tests are "randomized" like this:

UPDATE t SET a = '...' WHERE random() < 0.05;

which is fairly sequential. Maybe reordering the CTIDs a bit would hit
additional deadlocks; I'll probably give it a try. OTOH, that'd probably
be much more likely to be hit by users, and I don't recall any such reports.

Sadly, the README doesn't cover locking considerations. I had that in a
file called 'minmax-proposal' in version 16 of the patch here
/messages/by-id/20140820225133.GB6343@eldon.alvh.no-ip.org
but by version 18 (when 'minmax' became BRIN) I seem to have removed
that file and replaced it with the README and apparently I didn't copy
this material over.

Yeah :-( There are a couple more things missing from the README, like
what oi_regular_nulls means.

... and in there, I wrote that we would first write the brin tuple in
the regular page, unlock that, and then lock the revmap for the update,
without holding lock on the data page. I don't remember why we do it
differently now, but maybe the fix is just to release the regular page
lock before locking the revmap page? One very important change is that
in previous versions the revmap used a separate fork, and we had to
introduce an "evacuation protocol" when we integrated the revmap into
the main fork, which may have changed the locking considerations.

What would happen if two processes built the summary concurrently? How
would they find the other tuple, so that we don't end up with two BRIN
tuples for the same range?

Another point: to desummarize a range, just unlinking the entry from
revmap should suffice, from the POV of other index scanners. Maybe we
can simplify the whole procedure to: lock revmap, remove entry, remember
page number, unlock revmap; lock regular page, delete entry, unlock.
Then there are no two locks held at the same time during desummarize.

Perhaps, as long as it doesn't confuse anything else.

This comes from v16:

I don't follow - what do you mean by v16? I don't see anything like that
anywhere in the repository.

+ Locking considerations
+ ----------------------
+ 
+ To read the TID during an index scan, we follow this protocol:
+ 
+ * read revmap page
+ * obtain share lock on the revmap buffer
+ * read the TID
+ * obtain share lock on buffer of main fork
+ * LockTuple the TID (using the index as relation).  A shared lock is
+   sufficient.  We need the LockTuple to prevent VACUUM from recycling
+   the index tuple; see below.
+ * release revmap buffer lock
+ * read the index tuple
+ * release the tuple lock
+ * release main fork buffer lock
+ 
+ 
+ To update the summary tuple for a page range, we use this protocol:
+ 
+ * insert a new index tuple somewhere in the main fork; note its TID
+ * read revmap page
+ * obtain exclusive lock on revmap buffer
+ * write the TID
+ * release lock
+ 
+ This ensures no concurrent reader can obtain a partially-written TID.
+ Note we don't need a tuple lock here.  Concurrent scans don't have to
+ worry about whether they got the old or new index tuple: if they get the
+ old one, the tighter values are okay from a correctness standpoint because
+ due to MVCC they can't possibly see the just-inserted heap tuples anyway.
+
+ [vacuum stuff elided]

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4 Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Tomas Vondra (#3)
Re: LWLock deadlock in brinRevmapDesummarizeRange

On 2023-Feb-22, Tomas Vondra wrote:

... and in there, I wrote that we would first write the brin tuple in
the regular page, unlock that, and then lock the revmap for the update,
without holding lock on the data page. I don't remember why we do it
differently now, but maybe the fix is just to release the regular page
lock before locking the revmap page? One very important change is that
in previous versions the revmap used a separate fork, and we had to
introduce an "evacuation protocol" when we integrated the revmap into
the main fork, which may have changed the locking considerations.

What would happen if two processes built the summary concurrently? How
would they find the other tuple, so that we don't end up with two BRIN
tuples for the same range?

Well, the revmap can only keep track of one tuple per range; if two
processes build summary tuples, and each tries to insert its tuple in a
regular page, that part may succeed; but then only one of them is going
to successfully register the summary tuple in the revmap: when the other
goes to do the same, it would find that a CTID is already present.

... Looking at the code (brinSetHeapBlockItemptr), I think what happens
here is that the second process would overwrite the TID with its own.
Not sure if it would work to see whether the item is empty and bail out
if it's not.

But in any case, it seems to me that the update of the regular page is
pretty much independent of the update of the revmap.

Another point: to desummarize a range, just unlinking the entry from
revmap should suffice, from the POV of other index scanners. Maybe we
can simplify the whole procedure to: lock revmap, remove entry, remember
page number, unlock revmap; lock regular page, delete entry, unlock.
Then there are no two locks held at the same time during desummarize.

Perhaps, as long as it doesn't confuse anything else.

Well, I don't have the details fresh in mind, but I think it shouldn't,
because the only way to reach a regular tuple is coming from the revmap;
and we reuse "items" (lines) in a regular page only when they are
empty (so vacuuming should also be OK).

This comes from v16:

I don't follow - what do you mean by v16? I don't see anything like that
anywhere in the repository.

I meant the minmax-proposal file in patch v16, the one that I linked to.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"You're _really_ hosed if the person doing the hiring doesn't understand
relational systems: you end up with a whole raft of programmers, none of
whom has had a Date with the clue stick." (Andrew Sullivan)

#5 Arseniy Mukhin
arseniy.mukhin.dev@gmail.com
In reply to: Alvaro Herrera (#4)
4 attachment(s)
Re: LWLock deadlock in brinRevmapDesummarizeRange

Hello Tomas and Alvaro,

I came across this old thread while searching for information about
BRIN and decided to try to fix it.

On 22.02.2023 17:41, Alvaro Herrera wrote:

On 2023-Feb-22, Tomas Vondra wrote:

... and in there, I wrote that we would first write the brin tuple in
the regular page, unlock that, and then lock the revmap for the update,
without holding lock on the data page.  I don't remember why we do it
differently now, but maybe the fix is just to release the regular page
lock before locking the revmap page?  One very important change is that
in previous versions the revmap used a separate fork, and we had to
introduce an "evacuation protocol" when we integrated the revmap into
the main fork, which may have changed the locking considerations.

What would happen if two processes built the summary concurrently? How
would they find the other tuple, so that we don't end up with two BRIN
tuples for the same range?

Well, the revmap can only keep track of one tuple per range; if two
processes build summary tuples, and each tries to insert its tuple in a
regular page, that part may succeed; but then only one of them is going
to successfully register the summary tuple in the revmap: when the other
goes to do the same, it would find that a CTID is already present.

... Looking at the code (brinSetHeapBlockItemptr), I think what happens
here is that the second process would overwrite the TID with its own.
Not sure if it would work to see whether the item is empty and bail out
if it's not.

AFAICS it's impossible to have two processes summarizing or
desummarizing ranges simultaneously, because you need
ShareUpdateExclusiveLock to do it. Usual inserts/updates can't lead to
summarization; they affect only those ranges that are already
summarized.

Another point: to desummarize a range, just unlinking the entry from
revmap should suffice, from the POV of other index scanners.  Maybe we
can simplify the whole procedure to: lock revmap, remove entry, remember
page number, unlock revmap; lock regular page, delete entry, unlock.
Then there are no two locks held at the same time during desummarize.

Perhaps, as long as it doesn't confuse anything else.

Well, I don't have the details fresh in mind, but I think it shouldn't,
because the only way to reach a regular tuple is coming from the revmap;
and we reuse "items" (lines) in a regular page only when they are
empty (so vacuuming should also be OK).

I think it would work, but there's one thing to handle in this case:
you may have a concurrent update that moved the index tuple to another
regular page and wants to update the revmap page with a new pointer.
This may happen after the revmap entry is deleted, so you can get a
kind of lost update. But this seems easy to detect: when you try to
delete the index tuple, you'll see an unused item. That means a
concurrent update moved the index tuple and updated the revmap page,
and you need to retry the desummarization.

Also, if you don't hold locks on both the revmap page and the regular
page, you will have to split the WAL record and rework recovery for
desummarization. And in case of a crash between the revmap update and
the regular page update you could probably end up with a dangling
index tuple, so it seems that having one WAL record for the update
simplifies recovery.

I think there is a straightforward way to implement desummarization and
fix the deadlock. Desummarization is actually very similar to a usual
update, so we can just use the update flow (like brininsert, or the
second part of summarize_range where you need to merge the summary with
the placeholder tuple):

- lock revmap page
- save pointer to the index tuple
- unlock revmap page
- lock regular page
- check if the pointer is still valid (blkno is as expected, NORMAL tuple, etc.);
  if we see that the pointer is not valid anymore -> unlock reg page + retry
- lock revmap page
- update reg page and revmap page

The first five steps are already implemented in
brinGetTupleForHeapBlock, so we can use it (as every update actually
does, if I see correctly).

Please find the attached patch implementing the above idea.

I tried Tomas's stress tests with the patch applied and didn't hit any deadlocks.

There is also another way the issue can be reproduced:
- apply deadlock-sleep.patch (it just adds a 15-second sleep in
  brin_doupdate and a few logs)
- execute deadlock.sql
- execute deadlock2.sql from another client (there are 15 seconds to do
  it while the first process is sleeping)

Best regards,
Arseniy Mukhin

Attachments:

0001-brin-desummarization-deadlock.patch (text/x-patch)
From 31a8a558b845cda6b1ec1d18829e0f7ff0ec350d Mon Sep 17 00:00:00 2001
From: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com>
Date: Wed, 2 Apr 2025 21:55:24 +0300
Subject: [PATCH] brin desummarization deadlock

---
 src/backend/access/brin/brin.c        |  8 +--
 src/backend/access/brin/brin_revmap.c | 71 ++++++++++++---------------
 src/include/access/brin_revmap.h      |  2 +-
 3 files changed, 33 insertions(+), 48 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 01e1db7f856..c52677534ed 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -1496,7 +1496,6 @@ brin_desummarize_range(PG_FUNCTION_ARGS)
 	Oid			heapoid;
 	Relation	heapRel;
 	Relation	indexRel;
-	bool		done;
 
 	if (RecoveryInProgress())
 		ereport(ERROR,
@@ -1555,12 +1554,7 @@ brin_desummarize_range(PG_FUNCTION_ARGS)
 	/* see gin_clean_pending_list() */
 	if (indexRel->rd_index->indisvalid)
 	{
-		/* the revmap does the hard work */
-		do
-		{
-			done = brinRevmapDesummarizeRange(indexRel, heapBlk);
-		}
-		while (!done);
+		brinRevmapDesummarizeRange(indexRel, heapBlk);
 	}
 	else
 		ereport(DEBUG1,
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index 4e380ecc710..d3ef03d34ae 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -316,10 +316,8 @@ brinGetTupleForHeapBlock(BrinRevmap *revmap, BlockNumber heapBlk,
  * Delete an index tuple, marking a page range as unsummarized.
  *
  * Index must be locked in ShareUpdateExclusiveLock mode.
- *
- * Return false if caller should retry.
  */
-bool
+void
 brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 {
 	BrinRevmap *revmap;
@@ -327,26 +325,37 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	RevmapContents *contents;
 	ItemPointerData *iptr;
 	ItemPointerData invalidIptr;
-	BlockNumber revmapBlk;
 	Buffer		revmapBuf;
-	Buffer		regBuf;
+	Buffer		regBuf = InvalidBuffer;
 	Page		revmapPg;
 	Page		regPg;
 	OffsetNumber revmapOffset;
 	OffsetNumber regOffset;
-	ItemId		lp;
+	BrinTuple  *btup;
+	Size		size;
 
 	revmap = brinRevmapInitialize(idxrel, &pagesPerRange);
 
-	revmapBlk = revmap_get_blkno(revmap, heapBlk);
-	if (!BlockNumberIsValid(revmapBlk))
+	/* Try to get index tuple for the target block number */
+	btup = brinGetTupleForHeapBlock(revmap, heapBlk, &regBuf, &regOffset, &size, BUFFER_LOCK_EXCLUSIVE);
+
+	/* If range is not summarized, release resources and we are done */
+	if (!btup)
 	{
-		/* revmap page doesn't exist: range not summarized, we're done */
+		if (BufferIsValid(regBuf))
+		{
+			ReleaseBuffer(regBuf);
+		}
 		brinRevmapTerminate(revmap);
-		return true;
+
+		return;
 	}
 
-	/* Lock the revmap page, obtain the index tuple pointer from it */
+	/*
+	 * We know that range is summarized (or it's a dangling placeholder), and
+	 * we have the regular page with the index tuple locked exclusively. Now
+	 * lock the revmap page with pointer to our index tuple
+	 */
 	revmapBuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 	revmapPg = BufferGetPage(revmapBuf);
 	revmapOffset = HEAPBLK_TO_REVMAP_INDEX(revmap->rm_pagesPerRange, heapBlk);
@@ -355,38 +364,22 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	iptr = contents->rm_tids;
 	iptr += revmapOffset;
 
-	if (!ItemPointerIsValid(iptr))
-	{
-		/* no index tuple: range not summarized, we're done */
-		LockBuffer(revmapBuf, BUFFER_LOCK_UNLOCK);
-		brinRevmapTerminate(revmap);
-		return true;
-	}
-
-	regBuf = ReadBuffer(idxrel, ItemPointerGetBlockNumber(iptr));
-	LockBuffer(regBuf, BUFFER_LOCK_EXCLUSIVE);
-	regPg = BufferGetPage(regBuf);
-
-	/* if this is no longer a regular page, tell caller to start over */
-	if (!BRIN_IS_REGULAR_PAGE(regPg))
+	/*
+	 * No concurrent updates for the range are possible as any update should
+	 * lock regular page first, and we already hold the lock. We know that
+	 * index tuple for the range exists, so revmap item pointer couldn't be
+	 * invalid and should point to index tuple btup.
+	 */
+	if (!ItemPointerIsValid(iptr) ||
+		ItemPointerGetBlockNumber(iptr) != BufferGetBlockNumber(regBuf) ||
+		ItemPointerGetOffsetNumber(iptr) != regOffset)
 	{
-		LockBuffer(revmapBuf, BUFFER_LOCK_UNLOCK);
-		LockBuffer(regBuf, BUFFER_LOCK_UNLOCK);
-		brinRevmapTerminate(revmap);
-		return false;
-	}
-
-	regOffset = ItemPointerGetOffsetNumber(iptr);
-	if (regOffset > PageGetMaxOffsetNumber(regPg))
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("corrupted BRIN index: inconsistent range map")));
+	}
 
-	lp = PageGetItemId(regPg, regOffset);
-	if (!ItemIdIsUsed(lp))
-		ereport(ERROR,
-				(errcode(ERRCODE_INDEX_CORRUPTED),
-				 errmsg("corrupted BRIN index: inconsistent range map")));
+	regPg = BufferGetPage(regBuf);
 
 	/*
 	 * Placeholder tuples only appear during unfinished summarization, and we
@@ -429,8 +422,6 @@ brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk)
 	UnlockReleaseBuffer(regBuf);
 	LockBuffer(revmapBuf, BUFFER_LOCK_UNLOCK);
 	brinRevmapTerminate(revmap);
-
-	return true;
 }
 
 /*
diff --git a/src/include/access/brin_revmap.h b/src/include/access/brin_revmap.h
index b88473c9860..322109e903d 100644
--- a/src/include/access/brin_revmap.h
+++ b/src/include/access/brin_revmap.h
@@ -36,6 +36,6 @@ extern void brinSetHeapBlockItemptr(Buffer buf, BlockNumber pagesPerRange,
 extern BrinTuple *brinGetTupleForHeapBlock(BrinRevmap *revmap,
 										   BlockNumber heapBlk, Buffer *buf, OffsetNumber *off,
 										   Size *size, int mode);
-extern bool brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk);
+extern void brinRevmapDesummarizeRange(Relation idxrel, BlockNumber heapBlk);
 
 #endif							/* BRIN_REVMAP_H */
-- 
2.43.0

deadlock.sql (application/sql)
deadlock2.sql (application/sql)
deadlock-sleep.patch (text/x-patch)
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 6d8dd1512d6..9c3ad1f0bab 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -237,8 +237,11 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		OffsetNumber newoff;
 		Size		freespace = 0;
 
+        elog(DEBUG1, "sleep while update");
+        sleep(15);
+        elog(DEBUG1, "Try to revmap for update reg page - update");
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
-
+        elog(DEBUG1, "Locked revmap page for update - update");
 		START_CRIT_SECTION();
 
 		/*