md.c vs elog.c vs smgrreleaseall() in barrier

Started by Andres Freund10 months ago14 messages
#1Andres Freund
andres@anarazel.de

Hi,

If there is any LOG/DEBUG elog inside functions like mdreadv(), mdwritev(),
mdextend(), be it directly or indirectly, the functions become unsafe. The
problem is that at the end of errfinish() there is a CFI:

/*
* Check for cancel/die interrupt first --- this is so that the user can
* stop a query emitting tons of notice or warning messages, even if it's
* in a loop that otherwise fails to check for interrupts.
*/
CHECK_FOR_INTERRUPTS();

that CFI in turn can call
ProcessProcSignalBarrier() ->
ProcessBarrierSmgrRelease() ->
smgrreleaseall()

which will free the MdfdVec that was opened at the start of mdreadv() etc.

I of course originally hit this with AIO, but it's not really AIO specific. If
I #define FDDEBUG I can quickly trigger crashes in master too. E.g.

(gdb) bt
#4 0x00005563a1bcb266 in ExceptionalCondition (conditionName=0x5563a10e2aae "FileIsValid(file)",
fileName=0x5563a10e24c8 "../../../../../home/andres/src/postgresql/src/backend/storage/file/fd.c", lineNumber=2485)
at ../../../../../home/andres/src/postgresql/src/backend/utils/error/assert.c:66
#5 0x00005563a198d10b in FilePathName (file=-917820408) at ../../../../../home/andres/src/postgresql/src/backend/storage/file/fd.c:2485
#6 0x00005563a19cd40a in mdextend (reln=0x5563c952d568, forknum=MAIN_FORKNUM, blocknum=18, buffer=0x7fff13dda000, skipFsync=true)
at ../../../../../home/andres/src/postgresql/src/backend/storage/smgr/md.c:505
#7 0x00005563a19d0848 in smgrextend (reln=0x5563c952d568, forknum=MAIN_FORKNUM, blocknum=18, buffer=0x7fff13dda000, skipFsync=true)
at ../../../../../home/andres/src/postgresql/src/backend/storage/smgr/smgr.c:541
#8 0x00005563a1981800 in RelationCopyStorageUsingBuffer (srclocator=..., dstlocator=..., forkNum=MAIN_FORKNUM, permanent=true)
at ../../../../../home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:4667
#9 0x00005563a1981b8f in CreateAndCopyRelationData (src_rlocator=..., dst_rlocator=..., permanent=true)
at ../../../../../home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:4760
#10 0x00005563a16561c7 in CreateDatabaseUsingWalLog (src_dboid=4, dst_dboid=1414794, src_tsid=1663, dst_tsid=1663)
at ../../../../../home/andres/src/postgresql/src/backend/commands/dbcommands.c:216
#11 0x00005563a16594de in createdb (pstate=0x5563c942f8b0, stmt=0x5563c9476fd8)
at ../../../../../home/andres/src/postgresql/src/backend/commands/dbcommands.c:1522
#12 0x00005563a19e05b2 in standard_ProcessUtility (pstmt=0x5563c9477068,

Right now I don't really see a solution other than putting
HOLD_INTERRUPTS()/RESUME_INTERRUPTS() into all those functions.

If we go for that, I can see an argument for doing that in smgr.c instead of
md.c, afaict any plausible smgr implementation would run into this, as the
smgrcloseall() is triggered from the smgr level.

Greetings,

Andres Freund

#2Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#1)
Re: md.c vs elog.c vs smgrreleaseall() in barrier

On Fri, Mar 14, 2025 at 5:03 AM Andres Freund <andres@anarazel.de> wrote:

If there is any LOG/DEBUG elog inside functions like mdreadv(), mdwritev(),
mdextend(), be it directly or indirectly, the functions become unsafe. The
problem is that at the end of errfinish() there is a CFI:

/*
* Check for cancel/die interrupt first --- this is so that the user can
* stop a query emitting tons of notice or warning messages, even if it's
* in a loop that otherwise fails to check for interrupts.
*/
CHECK_FOR_INTERRUPTS();

that CFI in turn can call
ProcessProcSignalBarrier() ->
ProcessBarrierSmgrRelease() ->
smgrreleaseall()

which will free the MdfdVec that was opened at the start of mdreadv() etc.

Right now I don't really see a solution other than putting
HOLD_INTERRUPTS()/RESUME_INTERRUPTS() into all those functions.

Some other ideas:

1. An smgr-local smgr-release-hold flag, which would just cause
ProcessBarrierSmgrReleaseAll() to return false. Main complication
seems to be making sure you reset it on error, which is a bit
annoying, elog.c probably needs to know how to clean it up. Yuck.

2. Somehow tell elog.c not to call CHECK_FOR_INTERRUPTS() instead.
Also yuck, but at least elog.c is already the right place to clean
state up on non-local exit...

3. Considering errfinish()'s stated goal, a sort of backstop to help
you cancel looping message-spewing code only, I wonder if we should
just restrict the kinds of interrupts it processes, so that it only
calls CFI() if we're definitely throwing ERROR or FATAL? Something
like CHECK_FOR_CANCEL_OR_DIE() which would enter the regular code only
if it can already determine that, which would probably be defined as
something like: if (INTERRUPTS_PENDING_CONDITION() &&
INTERRUPTS_CAN_BE_PROCESSED() && (QueryCancelPending ||
ProcDiePending)) ProcessInterrupts(). I realise that what I'm
describing is essentially what CHECK_FOR_INTERRUPTS() used to be like,
before a bunch of non-cancel, non-die stuff moved into
CHECK_FOR_INTERRUPTS(), but that's probably the conditions under which
that code was written. Could anything else be accidentally on elog()
CFIs? Seems pretty weird?

4. Bigger refactoring of the interrupts system so you can express
more precisely what kinds of stuff you want to handle. Well, we have
some stuff like that coming in the nearby procsignal/interrupt
refactoring thread. I don't feel enthusiastic about messing with it
in the back branches, hence easier suggestion in #3.

If we go for that, I can see an argument for doing that in smgr.c instead of
md.c, afaict any plausible smgr implementation would run into this, as the
smgrcloseall() is triggered from the smgr level.

Seems like maybe not a great idea to assume that future smgrs will be
OK with being non-interruptible, if it can be avoided?

#3Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#2)
Re: md.c vs elog.c vs smgrreleaseall() in barrier

Hi,

On 2025-03-14 11:31:47 +1300, Thomas Munro wrote:

2. Somehow tell elog.c not to call CHECK_FOR_INTERRUPTS() instead.
Also yuck, but at least elog.c is already the right place to clean
state up on non-local exit...

How would that differ from HOLD_INTERRUPTS?

3. Considering errfinish()'s stated goal, a sort of backstop to help
you cancel looping message-spewing code only, I wonder if we should
just restrict the kinds of interrupts it processes, so that it only
calls CFI() if we're definitely throwing ERROR or FATAL?

I'm not even sure it'd be safe to ERROR out in all the relevant places...

E.g.
/* implementation-specific initialization */
smgrsw[reln->smgr_which].smgr_open(reln);

/* it is not pinned yet */
reln->pincount = 0;
dlist_push_tail(&unpinned_relns, &reln->node);

doesn't this mean that ->pincount is uninitialized in case smgr_open() errors
out and that we'd loose track of the reln because it wasn't yet added to
unpinned_rels?

We should probably make sure it's safe against ERROR, given that there are
several paths towards that...

Could anything else be accidentally on elog() CFIs? Seems pretty weird?

Hm, can't quite parse this...

If we go for that, I can see an argument for doing that in smgr.c instead of
md.c, afaict any plausible smgr implementation would run into this, as the
smgrcloseall() is triggered from the smgr level.

Seems like maybe not a great idea to assume that future smgrs will be
OK with being non-interruptible, if it can be avoided?

I think you'd need a fairly large surgery of smgr.c to make that viable - I
rather doubt that smgr.c itself is actually safe against interrupts.

I can see some arguable uses of smgr handling interrupts, say an smgr
implementation based on a network backed store, but you'd need rather massive
changes to actually be sure that smgr.c is called while accepting interrupts -
e.g. today sgmrwritev() will largely be called with interrupts held. Plenty
reads too. I doubt that undoing a few HOLD_INTERRUPTS() is a meaningful part
of the necessary work.

Greetings,

Andres Freund

#4Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#3)
Re: md.c vs elog.c vs smgrreleaseall() in barrier

On Fri, Mar 14, 2025 at 12:23 PM Andres Freund <andres@anarazel.de> wrote:

On 2025-03-14 11:31:47 +1300, Thomas Munro wrote:

2. Somehow tell elog.c not to call CHECK_FOR_INTERRUPTS() instead.
Also yuck, but at least elog.c is already the right place to clean
state up on non-local exit...

How would that differ from HOLD_INTERRUPTS?

Well I was just saying that if you try to make something similar you
face the same problems, hence "yuck". The motivation for wanting to
avoid full-scale HOLD_INTERRUPTS() is that there could be some other
code path that does want interrupt processing of some limited kind,
but it's all a bit hypothetical...

3. Considering errfinish()'s stated goal, a sort of backstop to help
you cancel looping message-spewing code only, I wonder if we should
just restrict the kinds of interrupts it processes, so that it only
calls CFI() if we're definitely throwing ERROR or FATAL?

I'm not even sure it'd be safe to ERROR out in all the relevant places...

E.g.
/* implementation-specific initialization */
smgrsw[reln->smgr_which].smgr_open(reln);

/* it is not pinned yet */
reln->pincount = 0;
dlist_push_tail(&unpinned_relns, &reln->node);

doesn't this mean that ->pincount is uninitialized in case smgr_open() errors
out and that we'd loose track of the reln because it wasn't yet added to
unpinned_rels?

Ugh, right.

We should probably make sure it's safe against ERROR, given that there are
several paths towards that...

Yeah.

Could anything else be accidentally on elog() CFIs? Seems pretty weird?

Hm, can't quite parse this...

Sorry accidentally *depending* on.

If we go for that, I can see an argument for doing that in smgr.c instead of
md.c, afaict any plausible smgr implementation would run into this, as the
smgrcloseall() is triggered from the smgr level.

Seems like maybe not a great idea to assume that future smgrs will be
OK with being non-interruptible, if it can be avoided?

I think you'd need a fairly large surgery of smgr.c to make that viable - I
rather doubt that smgr.c itself is actually safe against interrupts.

I can see some arguable uses of smgr handling interrupts, say an smgr
implementation based on a network backed store, but you'd need rather massive
changes to actually be sure that smgr.c is called while accepting interrupts -
e.g. today sgmrwritev() will largely be called with interrupts held. Plenty
reads too. I doubt that undoing a few HOLD_INTERRUPTS() is a meaningful part
of the necessary work.

Right, exactly the case I was thinking of: if some hypothetical future
network smgr wants to be able to process *some* kinds of things
carefully while waiting for the network. I don't think we want to
constrain ourselves to NFS-style "pretend it's local and make it
non-interruptible" without any escape hatches, but you're quite right
that that's probably a whole research project of its own and we just
haven't refined the interrupt system enough for that yet, so yeah I
see how you arrived at that conclusion and it makes sense.

#5Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#4)
2 attachment(s)
Re: md.c vs elog.c vs smgrreleaseall() in barrier

Hi,

On 2025-03-14 12:57:51 +1300, Thomas Munro wrote:

On Fri, Mar 14, 2025 at 12:23 PM Andres Freund <andres@anarazel.de> wrote:

3. Considering errfinish()'s stated goal, a sort of backstop to help
you cancel looping message-spewing code only, I wonder if we should
just restrict the kinds of interrupts it processes, so that it only
calls CFI() if we're definitely throwing ERROR or FATAL?

I'm not even sure it'd be safe to ERROR out in all the relevant places...

E.g.
/* implementation-specific initialization */
smgrsw[reln->smgr_which].smgr_open(reln);

/* it is not pinned yet */
reln->pincount = 0;
dlist_push_tail(&unpinned_relns, &reln->node);

doesn't this mean that ->pincount is uninitialized in case smgr_open() errors
out and that we'd loose track of the reln because it wasn't yet added to
unpinned_rels?

Ugh, right.

Patch for that attached.

If we go for that, I can see an argument for doing that in smgr.c instead of
md.c, afaict any plausible smgr implementation would run into this, as the
smgrcloseall() is triggered from the smgr level.

Seems like maybe not a great idea to assume that future smgrs will be
OK with being non-interruptible, if it can be avoided?

I think you'd need a fairly large surgery of smgr.c to make that viable - I
rather doubt that smgr.c itself is actually safe against interrupts.

I can see some arguable uses of smgr handling interrupts, say an smgr
implementation based on a network backed store, but you'd need rather massive
changes to actually be sure that smgr.c is called while accepting interrupts -
e.g. today sgmrwritev() will largely be called with interrupts held. Plenty
reads too. I doubt that undoing a few HOLD_INTERRUPTS() is a meaningful part
of the necessary work.

Right, exactly the case I was thinking of: if some hypothetical future
network smgr wants to be able to process *some* kinds of things
carefully while waiting for the network. I don't think we want to
constrain ourselves to NFS-style "pretend it's local and make it
non-interruptible" without any escape hatches, but you're quite right
that that's probably a whole research project of its own and we just
haven't refined the interrupt system enough for that yet, so yeah I
see how you arrived at that conclusion and it makes sense.

Here's a proposed patch for this. It turns out that the bug might already be
reachable, even without defining FDDEBUG. There's a debug ereport() in
register_dirty_segment() - but it's hard to reach in practice.

I don't really know whether that means we ought to backpatch or not - which
makes me want to be conservative and not backpatch.

Greetings,

Andres Freund

Attachments:

v2-0001-smgr-Hold-interrupts-in-most-smgr-functions.patchtext/x-diff; charset=us-asciiDownload
From 79c33df9e53038b7ebc9e1e6d5e575f6ef46e4f2 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 17 Mar 2025 19:41:54 -0400
Subject: [PATCH v2 1/2] smgr: Hold interrupts in most smgr functions

We need to hold interrupts across most of the smgr.c/md.c functions, as
otherwise interrupt processing, e.g. due to a debug elog/ereport, can trigger
procsignal processing, which in turn can trigger smgrreleaseall(). As the
relevant code is not reentrant, we quickly end up in a bad situation.

The only reason we haven't noticed this before is that there is only one
non-error ereport called in affected routines, in register_dirty_segments(),
and that one is extremely rarely reached. If one enables fd.c's FDDEBUG it's
easy to reproduce crashes.

It seems better to put the HOLD_INTERRUPTS()/RESUME_INTERRUPTS() in smgr.c,
instead of trying to push them down to md.c where possible: For one, every
smgr implementation would be vulnerable, for another, a good bit of smgr.c
code itself is affected too.

Eventually we might want a more targeted solution, allowing e.g. a networked
smgr implementation to be interrupted, but many other, more complicated,
problems would need to be fixed for that to be viable (e.g. smgr.c is often
called with interrupts already held).

Discussion: https://postgr.es/m/3vae7l5ozvqtxmd7rr7zaeq3qkuipz365u3rtim5t5wdkr6f4g@vkgf2fogjirl
---
 src/backend/storage/smgr/smgr.c | 97 ++++++++++++++++++++++++++++++++-
 1 file changed, 94 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index ebe35c04de5..53a09fe4aaa 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -40,6 +40,18 @@
  * themselves, as there could pointers to them in active use.  See
  * smgrrelease() and smgrreleaseall().
  *
+ * NB: We need to hold interrupts across most of the functions in this file,
+ * as otherwise interrupt processing, e.g. due to a debug elog/ereport, can
+ * trigger procsignal processing, which in turn can trigger
+ * smgrreleaseall(). None of the relevant code is reentrant.  It seems better
+ * to put the HOLD_INTERRUPTS()/RESUME_INTERRUPTS() here, instead of trying to
+ * push them down to md.c where possible: For one, every smgr implementation
+ * would be vulnerable, for another, a good bit of smgr.c code itself is
+ * affected too.  Eventually we might want a more targeted solution, allowing
+ * e.g. a networked smgr implementation to be interrupted, but many other,
+ * more complicated, problems would need to be fixed for that to be viable
+ * (e.g. smgr.c is often called with interrupts already held).
+ *
  * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
@@ -53,6 +65,7 @@
 
 #include "access/xlogutils.h"
 #include "lib/ilist.h"
+#include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/ipc.h"
 #include "storage/md.h"
@@ -158,12 +171,16 @@ smgrinit(void)
 {
 	int			i;
 
+	HOLD_INTERRUPTS();
+
 	for (i = 0; i < NSmgr; i++)
 	{
 		if (smgrsw[i].smgr_init)
 			smgrsw[i].smgr_init();
 	}
 
+	RESUME_INTERRUPTS();
+
 	/* register the shutdown proc */
 	on_proc_exit(smgrshutdown, 0);
 }
@@ -176,11 +193,13 @@ smgrshutdown(int code, Datum arg)
 {
 	int			i;
 
+	HOLD_INTERRUPTS();
 	for (i = 0; i < NSmgr; i++)
 	{
 		if (smgrsw[i].smgr_shutdown)
 			smgrsw[i].smgr_shutdown();
 	}
+	RESUME_INTERRUPTS();
 }
 
 /*
@@ -206,6 +225,8 @@ smgropen(RelFileLocator rlocator, ProcNumber backend)
 
 	Assert(RelFileNumberIsValid(rlocator.relNumber));
 
+	HOLD_INTERRUPTS();
+
 	if (SMgrRelationHash == NULL)
 	{
 		/* First time through: initialize the hash table */
@@ -242,6 +263,8 @@ smgropen(RelFileLocator rlocator, ProcNumber backend)
 		dlist_push_tail(&unpinned_relns, &reln->node);
 	}
 
+	RESUME_INTERRUPTS();
+
 	return reln;
 }
 
@@ -283,6 +306,8 @@ smgrdestroy(SMgrRelation reln)
 
 	Assert(reln->pincount == 0);
 
+	HOLD_INTERRUPTS();
+
 	for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 		smgrsw[reln->smgr_which].smgr_close(reln, forknum);
 
@@ -292,6 +317,8 @@ smgrdestroy(SMgrRelation reln)
 					&(reln->smgr_rlocator),
 					HASH_REMOVE, NULL) == NULL)
 		elog(ERROR, "SMgrRelation hashtable corrupted");
+
+	RESUME_INTERRUPTS();
 }
 
 /*
@@ -302,12 +329,16 @@ smgrdestroy(SMgrRelation reln)
 void
 smgrrelease(SMgrRelation reln)
 {
+	HOLD_INTERRUPTS();
+
 	for (ForkNumber forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 	{
 		smgrsw[reln->smgr_which].smgr_close(reln, forknum);
 		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
 	}
 	reln->smgr_targblock = InvalidBlockNumber;
+
+	RESUME_INTERRUPTS();
 }
 
 /*
@@ -336,6 +367,8 @@ smgrdestroyall(void)
 {
 	dlist_mutable_iter iter;
 
+	HOLD_INTERRUPTS();
+
 	/*
 	 * Zap all unpinned SMgrRelations.  We rely on smgrdestroy() to remove
 	 * each one from the list.
@@ -347,6 +380,8 @@ smgrdestroyall(void)
 
 		smgrdestroy(rel);
 	}
+
+	RESUME_INTERRUPTS();
 }
 
 /*
@@ -362,12 +397,16 @@ smgrreleaseall(void)
 	if (SMgrRelationHash == NULL)
 		return;
 
+	HOLD_INTERRUPTS();
+
 	hash_seq_init(&status, SMgrRelationHash);
 
 	while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
 	{
 		smgrrelease(reln);
 	}
+
+	RESUME_INTERRUPTS();
 }
 
 /*
@@ -400,7 +439,13 @@ smgrreleaserellocator(RelFileLocatorBackend rlocator)
 bool
 smgrexists(SMgrRelation reln, ForkNumber forknum)
 {
-	return smgrsw[reln->smgr_which].smgr_exists(reln, forknum);
+	bool		ret;
+
+	HOLD_INTERRUPTS();
+	ret = smgrsw[reln->smgr_which].smgr_exists(reln, forknum);
+	RESUME_INTERRUPTS();
+
+	return ret;
 }
 
 /*
@@ -413,7 +458,9 @@ smgrexists(SMgrRelation reln, ForkNumber forknum)
 void
 smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
 {
+	HOLD_INTERRUPTS();
 	smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo);
+	RESUME_INTERRUPTS();
 }
 
 /*
@@ -434,6 +481,8 @@ smgrdosyncall(SMgrRelation *rels, int nrels)
 	if (nrels == 0)
 		return;
 
+	HOLD_INTERRUPTS();
+
 	FlushRelationsAllBuffers(rels, nrels);
 
 	/*
@@ -449,6 +498,8 @@ smgrdosyncall(SMgrRelation *rels, int nrels)
 				smgrsw[which].smgr_immedsync(rels[i], forknum);
 		}
 	}
+
+	RESUME_INTERRUPTS();
 }
 
 /*
@@ -471,6 +522,8 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
 	if (nrels == 0)
 		return;
 
+	HOLD_INTERRUPTS();
+
 	/*
 	 * Get rid of any remaining buffers for the relations.  bufmgr will just
 	 * drop them without bothering to write the contents.
@@ -522,6 +575,8 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
 	}
 
 	pfree(rlocators);
+
+	RESUME_INTERRUPTS();
 }
 
 
@@ -538,6 +593,8 @@ void
 smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		   const void *buffer, bool skipFsync)
 {
+	HOLD_INTERRUPTS();
+
 	smgrsw[reln->smgr_which].smgr_extend(reln, forknum, blocknum,
 										 buffer, skipFsync);
 
@@ -550,6 +607,8 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		reln->smgr_cached_nblocks[forknum] = blocknum + 1;
 	else
 		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
+
+	RESUME_INTERRUPTS();
 }
 
 /*
@@ -563,6 +622,8 @@ void
 smgrzeroextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 			   int nblocks, bool skipFsync)
 {
+	HOLD_INTERRUPTS();
+
 	smgrsw[reln->smgr_which].smgr_zeroextend(reln, forknum, blocknum,
 											 nblocks, skipFsync);
 
@@ -575,6 +636,8 @@ smgrzeroextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		reln->smgr_cached_nblocks[forknum] = blocknum + nblocks;
 	else
 		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
+
+	RESUME_INTERRUPTS();
 }
 
 /*
@@ -588,7 +651,13 @@ bool
 smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 			 int nblocks)
 {
-	return smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum, nblocks);
+	bool		ret;
+
+	HOLD_INTERRUPTS();
+	ret = smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum, nblocks);
+	RESUME_INTERRUPTS();
+
+	return ret;
 }
 
 /*
@@ -601,7 +670,13 @@ uint32
 smgrmaxcombine(SMgrRelation reln, ForkNumber forknum,
 			   BlockNumber blocknum)
 {
-	return smgrsw[reln->smgr_which].smgr_maxcombine(reln, forknum, blocknum);
+	uint32		ret;
+
+	HOLD_INTERRUPTS();
+	ret = smgrsw[reln->smgr_which].smgr_maxcombine(reln, forknum, blocknum);
+	RESUME_INTERRUPTS();
+
+	return ret;
 }
 
 /*
@@ -619,8 +694,10 @@ void
 smgrreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		  void **buffers, BlockNumber nblocks)
 {
+	HOLD_INTERRUPTS();
 	smgrsw[reln->smgr_which].smgr_readv(reln, forknum, blocknum, buffers,
 										nblocks);
+	RESUME_INTERRUPTS();
 }
 
 /*
@@ -653,8 +730,10 @@ void
 smgrwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		   const void **buffers, BlockNumber nblocks, bool skipFsync)
 {
+	HOLD_INTERRUPTS();
 	smgrsw[reln->smgr_which].smgr_writev(reln, forknum, blocknum,
 										 buffers, nblocks, skipFsync);
+	RESUME_INTERRUPTS();
 }
 
 /*
@@ -665,8 +744,10 @@ void
 smgrwriteback(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 			  BlockNumber nblocks)
 {
+	HOLD_INTERRUPTS();
 	smgrsw[reln->smgr_which].smgr_writeback(reln, forknum, blocknum,
 											nblocks);
+	RESUME_INTERRUPTS();
 }
 
 /*
@@ -683,10 +764,14 @@ smgrnblocks(SMgrRelation reln, ForkNumber forknum)
 	if (result != InvalidBlockNumber)
 		return result;
 
+	HOLD_INTERRUPTS();
+
 	result = smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum);
 
 	reln->smgr_cached_nblocks[forknum] = result;
 
+	RESUME_INTERRUPTS();
+
 	return result;
 }
 
@@ -731,6 +816,8 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks,
 {
 	int			i;
 
+	Assert(!INTERRUPTS_CAN_BE_PROCESSED());
+
 	/*
 	 * Get rid of any buffers for the about-to-be-deleted blocks. bufmgr will
 	 * just drop them without bothering to write the contents.
@@ -784,7 +871,9 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks,
 void
 smgrregistersync(SMgrRelation reln, ForkNumber forknum)
 {
+	HOLD_INTERRUPTS();
 	smgrsw[reln->smgr_which].smgr_registersync(reln, forknum);
+	RESUME_INTERRUPTS();
 }
 
 /*
@@ -816,7 +905,9 @@ smgrregistersync(SMgrRelation reln, ForkNumber forknum)
 void
 smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
 {
+	HOLD_INTERRUPTS();
 	smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
+	RESUME_INTERRUPTS();
 }
 
 /*
-- 
2.48.1.76.g4e746b1a31.dirty

v2-0002-smgr-Make-SMgrRelation-initialization-safer-again.patchtext/x-diff; charset=us-asciiDownload
From ae6b57a919e92af228748a3c3a60d00b455ccf12 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 17 Mar 2025 19:46:01 -0400
Subject: [PATCH v2 2/2] smgr: Make SMgrRelation initialization safer against
 errors

In case the smgr_open callback failed, the ->pincount field would not be
initialized and the relation would not be put onto the unpinned_relns list.

This buglet was introduced in 21d9c3ee4ef7. As that commit is only in HEAD, no
need to backpatch.

Discussion: https://postgr.es/m/3vae7l5ozvqtxmd7rr7zaeq3qkuipz365u3rtim5t5wdkr6f4g@vkgf2fogjirl
---
 src/backend/storage/smgr/smgr.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 53a09fe4aaa..24971304b85 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -255,12 +255,12 @@ smgropen(RelFileLocator rlocator, ProcNumber backend)
 			reln->smgr_cached_nblocks[i] = InvalidBlockNumber;
 		reln->smgr_which = 0;	/* we only have md.c at present */
 
-		/* implementation-specific initialization */
-		smgrsw[reln->smgr_which].smgr_open(reln);
-
 		/* it is not pinned yet */
 		reln->pincount = 0;
 		dlist_push_tail(&unpinned_relns, &reln->node);
+
+		/* implementation-specific initialization */
+		smgrsw[reln->smgr_which].smgr_open(reln);
 	}
 
 	RESUME_INTERRUPTS();
-- 
2.48.1.76.g4e746b1a31.dirty

#6Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#5)
Re: md.c vs elog.c vs smgrreleaseall() in barrier

Hi,

On 2025-03-17 19:52:02 -0400, Andres Freund wrote:

On 2025-03-14 12:57:51 +1300, Thomas Munro wrote:

On Fri, Mar 14, 2025 at 12:23 PM Andres Freund <andres@anarazel.de> wrote:

3. Considering errfinish()'s stated goal, a sort of backstop to help
you cancel looping message-spewing code only, I wonder if we should
just restrict the kinds of interrupts it processes, so that it only
calls CFI() if we're definitely throwing ERROR or FATAL?

I'm not even sure it'd be safe to ERROR out in all the relevant places...

E.g.
/* implementation-specific initialization */
smgrsw[reln->smgr_which].smgr_open(reln);

/* it is not pinned yet */
reln->pincount = 0;
dlist_push_tail(&unpinned_relns, &reln->node);

doesn't this mean that ->pincount is uninitialized in case smgr_open() errors
out and that we'd loose track of the reln because it wasn't yet added to
unpinned_rels?

Ugh, right.

Patch for that attached.

The patch claimed:

This buglet was introduced in 21d9c3ee4ef7. As that commit is only in HEAD, no
need to backpatch.

But looking at it again that's unfortunately not true. Turns out it's not
2024 anymore... Planning to backpatch that part to 17 soon.

Greetings,

Andres Freund

#7Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#5)
Re: md.c vs elog.c vs smgrreleaseall() in barrier

On Mon, Mar 17, 2025 at 07:52:02PM -0400, Andres Freund wrote:

Here's a proposed patch for this. It turns out that the bug might already be
reachable, even without defining FDDEBUG. There's a debug ereport() in
register_dirty_segment() - but it's hard to reach in practice.

I don't really know whether that means we ought to backpatch or not - which
makes me want to be conservative and not backpatch.

Non-backpatch sounds fine.

Subject: [PATCH v2 1/2] smgr: Hold interrupts in most smgr functions

It seems better to put the HOLD_INTERRUPTS()/RESUME_INTERRUPTS() in smgr.c,
instead of trying to push them down to md.c where possible: For one, every
smgr implementation would be vulnerable, for another, a good bit of smgr.c
code itself is affected too.

I'm fine with putting it in smgr.c. Academically, I don't see every smgr
implementation being vulnerable for most smgr entry points. For example, the
upthread backtrace happens because mdclose() undoes the fd-opening work of
mdnblocks(). Another smgr implementation could make its smgr_close callback a
no-op. smgrrelease() doesn't destroy anything important at the smgr level; it
is mdclose() that destroys state that md.c still needs.

@@ -362,12 +397,16 @@ smgrreleaseall(void)
if (SMgrRelationHash == NULL)
return;

+ HOLD_INTERRUPTS();

Likely not important, but it's not clear to me why smgrdestroyall() and
smgrreleaseall() get HOLD_INTERRUPTS(), as opposed to relying on the holds in
smgrdestroy() and smgrrelease(). In contrast, smgrreleaserellocator() does
rely on smgrrelease() for the hold.

+
hash_seq_init(&status, SMgrRelationHash);

while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
{
smgrrelease(reln);
}
+
+ RESUME_INTERRUPTS();
}

/*

@@ -434,6 +481,8 @@ smgrdosyncall(SMgrRelation *rels, int nrels)
if (nrels == 0)
return;

+	HOLD_INTERRUPTS();
+
FlushRelationsAllBuffers(rels, nrels);

FlushRelationsAllBuffers() isn't part of smgr or md.c, so it's unlikely to
become sensitive to smgrrelease(). It may do a ton of I/O. Hence, I'd
HOLD_INTERRUPTS() after FlushRelationsAllBuffers(), not before.

/*
@@ -449,6 +498,8 @@ smgrdosyncall(SMgrRelation *rels, int nrels)
smgrsw[which].smgr_immedsync(rels[i], forknum);

Long-term, someone might change this to hold interrupts once per immedsync
with a CFI between immedsync calls. That would be safe. It's not this
change's job, though. I'm mentioning it for the archives.

}
}
+
+ RESUME_INTERRUPTS();
}

/*
@@ -471,6 +522,8 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
if (nrels == 0)
return;

+ HOLD_INTERRUPTS();

I would move this below DropRelationsAllBuffers(), for reasons like
FlushRelationsAllBuffers() above.

Subject: [PATCH v2 2/2] smgr: Make SMgrRelation initialization safer against
errors

In case the smgr_open callback failed, the ->pincount field would not be
initialized and the relation would not be put onto the unpinned_relns list.

This buglet was introduced in 21d9c3ee4ef7. As that commit is only in HEAD, no
need to backpatch.

Discussion: /messages/by-id/3vae7l5ozvqtxmd7rr7zaeq3qkuipz365u3rtim5t5wdkr6f4g@vkgf2fogjirl
---
src/backend/storage/smgr/smgr.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 53a09fe4aaa..24971304b85 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -255,12 +255,12 @@ smgropen(RelFileLocator rlocator, ProcNumber backend)
reln->smgr_cached_nblocks[i] = InvalidBlockNumber;
reln->smgr_which = 0;	/* we only have md.c at present */
-		/* implementation-specific initialization */
-		smgrsw[reln->smgr_which].smgr_open(reln);
-
/* it is not pinned yet */
reln->pincount = 0;
dlist_push_tail(&unpinned_relns, &reln->node);
+
+		/* implementation-specific initialization */
+		smgrsw[reln->smgr_which].smgr_open(reln);
}

I struggle to speculate about the merits of this, because mdopen() can't fail.
If mdopen() started to do things that could fail, mdnblocks() would be
reasonable in assuming those things are done. Hence, the long-term direction
should be more like destroying the new smgr entry in the event of error.

I would not make this change. I'd maybe add a comment that smgr_open
callbacks currently aren't allowed to fail, since smgropen() isn't ready to
clean up smgr-level state from a failed open. How do you see it?

#8Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#7)
Re: md.c vs elog.c vs smgrreleaseall() in barrier

Hi,

On 2025-03-19 12:55:53 -0700, Noah Misch wrote:

On Mon, Mar 17, 2025 at 07:52:02PM -0400, Andres Freund wrote:

Here's a proposed patch for this. It turns out that the bug might already be
reachable, even without defining FDDEBUG. There's a debug ereport() in
register_dirty_segment() - but it's hard to reach in practice.

I don't really know whether that means we ought to backpatch or not - which
makes me want to be conservative and not backpatch.

Non-backpatch sounds fine.

Cool.

@@ -362,12 +397,16 @@ smgrreleaseall(void)
if (SMgrRelationHash == NULL)
return;

+ HOLD_INTERRUPTS();

Likely not important, but it's not clear to me why smgrdestroyall() and
smgrreleaseall() get HOLD_INTERRUPTS(), as opposed to relying on the holds in
smgrdestroy() and smgrrelease(). In contrast, smgrreleaserellocator() does
rely on smgrrelease() for the hold.

It didn't seem particularly safe to allow interrupts, which in turn could
change the list of open relations, while iterating over a linked list / a
hashtable.

@@ -434,6 +481,8 @@ smgrdosyncall(SMgrRelation *rels, int nrels)
if (nrels == 0)
return;

+	HOLD_INTERRUPTS();
+
FlushRelationsAllBuffers(rels, nrels);

FlushRelationsAllBuffers() isn't part of smgr or md.c, so it's unlikely to
become sensitive to smgrrelease(). It may do a ton of I/O. Hence, I'd
HOLD_INTERRUPTS() after FlushRelationsAllBuffers(), not before.

Hm - we never would want to process interrupts while in
FlushRelationsAllBuffers() or such, would we? I'm ok with changing it, I
guess I just didn't see a reason not to use a wider scope.

I guess I am a bit paranoid because gave me flashbacks to issues around
smgrtruncate() failing after doing DropRelationBuffers(), that Thomas recently
fixed (and I had worked on a few years before). But of course
DropRelationBuffers() is way more dangerous than FlushRelationsAllBuffers().

}
}
+
+ RESUME_INTERRUPTS();
}

/*
@@ -471,6 +522,8 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
if (nrels == 0)
return;

+ HOLD_INTERRUPTS();

I would move this below DropRelationsAllBuffers(), for reasons like
FlushRelationsAllBuffers() above.

I think that'd be unsafe. Once we dropped buffers from the buffer pool we
can't just continue without also unlinking the underlying relation, otherwise
an older view of the data later can be "revived from the dead" after an error,
causing all manner of corruption.

I suspect it's always called with interrupts held already though.

But unfortunately I think this probably needs to be done in a critical
section, not just run with interrupts held.

We really really shouldn't ever palloc() after doing something like
DropRelationsAllBuffers(). Thomas just spent a lot of time fixing corruption
issues arising for related issues around relation truncations...

I think this may mean that an error during smgr_unlink() leads to a cluster in
a corrupted state?

Subject: [PATCH v2 2/2] smgr: Make SMgrRelation initialization safer against
errors

In case the smgr_open callback failed, the ->pincount field would not be
initialized and the relation would not be put onto the unpinned_relns list.

This buglet was introduced in 21d9c3ee4ef7. As that commit is only in HEAD, no
need to backpatch.

Discussion: /messages/by-id/3vae7l5ozvqtxmd7rr7zaeq3qkuipz365u3rtim5t5wdkr6f4g@vkgf2fogjirl
---
src/backend/storage/smgr/smgr.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 53a09fe4aaa..24971304b85 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -255,12 +255,12 @@ smgropen(RelFileLocator rlocator, ProcNumber backend)
reln->smgr_cached_nblocks[i] = InvalidBlockNumber;
reln->smgr_which = 0;	/* we only have md.c at present */
-		/* implementation-specific initialization */
-		smgrsw[reln->smgr_which].smgr_open(reln);
-
/* it is not pinned yet */
reln->pincount = 0;
dlist_push_tail(&unpinned_relns, &reln->node);
+
+		/* implementation-specific initialization */
+		smgrsw[reln->smgr_which].smgr_open(reln);
}

I struggle to speculate about the merits of this, because mdopen() can't fail.
If mdopen() started to do things that could fail, mdnblocks() would be
reasonable in assuming those things are done. Hence, the long-term direction
should be more like destroying the new smgr entry in the event of error.

I would not make this change. I'd maybe add a comment that smgr_open
callbacks currently aren't allowed to fail, since smgropen() isn't ready to
clean up smgr-level state from a failed open. How do you see it?

I see no disadvantage in the change - it seems strictly better to initialize
pincount earlier. I agree that it might be a good idea to explicitly handle
errors eventually, but that'd not be made harder by this change...

Greetings,

Andres Freund

#9Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#8)
Re: md.c vs elog.c vs smgrreleaseall() in barrier

Hi,

On 2025-03-19 18:45:20 -0400, Andres Freund wrote:

On 2025-03-19 12:55:53 -0700, Noah Misch wrote:

On Mon, Mar 17, 2025 at 07:52:02PM -0400, Andres Freund wrote:

@@ -471,6 +522,8 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
if (nrels == 0)
return;

+ HOLD_INTERRUPTS();

I would move this below DropRelationsAllBuffers(), for reasons like
FlushRelationsAllBuffers() above.

I think that'd be unsafe. Once we dropped buffers from the buffer pool we
can't just continue without also unlinking the underlying relation, otherwise
an older view of the data later can be "revived from the dead" after an error,
causing all manner of corruption.

I suspect it's always called with interrupts held already though.

But unfortunately I think this probably needs to be done in a critical
section, not just run with interrupts held.

We really really shouldn't ever palloc() after doing something like
DropRelationsAllBuffers(). Thomas just spent a lot of time fixing corruption
issues arising for related issues around relation truncations...

I think this may mean that an error during smgr_unlink() leads to a cluster in
a corrupted state?

Ah - it effectively is already in a critical section, just a weirdly spelled one:

2025-03-19 19:00:06.398 EDT [2156613][client backend][0/3:0][psql] LOG: statement: DROP TABLE foo;
2025-03-19 19:00:06.404 EDT [2156613][client backend][0/0:43139][psql] ERROR: muahahaha
2025-03-19 19:00:06.404 EDT [2156613][client backend][0/0:43139][psql] STATEMENT: DROP TABLE foo;
2025-03-19 19:00:06.404 EDT [2156613][client backend][0/0:43139][psql] WARNING: AbortTransaction while in COMMIT state
2025-03-19 19:00:06.404 EDT [2156613][client backend][0/0:43139][psql] PANIC: cannot abort transaction 43139, it was already committed

Obviously not great, but better than corruption.

Until the IO issue preventing smgrdounlinkall() to work are fixed, the DB
doesn't start up again...

2025-03-19 19:00:57.711 EDT [2156761][startup][:0][] FATAL: muahahaha
2025-03-19 19:00:57.711 EDT [2156761][startup][:0][] CONTEXT: WAL redo at B/66D6E650 for Transaction/COMMIT: 2025-03-19 19:00:06.400748-04; rels: base/5/25449; dropped stats: 2/5/25449; inval msgs: ...
2025-03-19 19:00:57.715 EDT [2156758][postmaster][:0][] LOG: startup process (PID 2156761) exited with exit code 1
2025-03-19 19:00:57.715 EDT [2156758][postmaster][:0][] LOG: terminating any other active server processes

If one uses a large s_b and the system is busy, it's not even that unlikely
that we would fail in that spot:
mdunlinkfork()->
register_forget_request()->
RegisterSyncRequest()->
ForwardSyncRequest()->
CompactCheckpointerRequestQueue()

can require a NBuffers * sizeof(bool) array and hashtable that fits all the
entries...

Greetings,

Andres Freund

#10Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#9)
Re: md.c vs elog.c vs smgrreleaseall() in barrier

On Thu, Mar 20, 2025 at 12:06 PM Andres Freund <andres@anarazel.de> wrote:

Ah - it effectively is already in a critical section, just a weirdly spelled one:

2025-03-19 19:00:06.398 EDT [2156613][client backend][0/3:0][psql] LOG: statement: DROP TABLE foo;
2025-03-19 19:00:06.404 EDT [2156613][client backend][0/0:43139][psql] ERROR: muahahaha
2025-03-19 19:00:06.404 EDT [2156613][client backend][0/0:43139][psql] STATEMENT: DROP TABLE foo;
2025-03-19 19:00:06.404 EDT [2156613][client backend][0/0:43139][psql] WARNING: AbortTransaction while in COMMIT state
2025-03-19 19:00:06.404 EDT [2156613][client backend][0/0:43139][psql] PANIC: cannot abort transaction 43139, it was already committed

Obviously not great, but better than corruption.

Yeah, I called that a crypto-critical-section over in this thread:

/messages/by-id/ZYw8gVOMF9gfp6i5@pryzbyj2023

#11Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#8)
Re: md.c vs elog.c vs smgrreleaseall() in barrier

On Wed, Mar 19, 2025 at 06:45:20PM -0400, Andres Freund wrote:

On 2025-03-19 12:55:53 -0700, Noah Misch wrote:

On Mon, Mar 17, 2025 at 07:52:02PM -0400, Andres Freund wrote:

@@ -362,12 +397,16 @@ smgrreleaseall(void)
if (SMgrRelationHash == NULL)
return;

+ HOLD_INTERRUPTS();

Likely not important, but it's not clear to me why smgrdestroyall() and
smgrreleaseall() get HOLD_INTERRUPTS(), as opposed to relying on the holds in
smgrdestroy() and smgrrelease(). In contrast, smgrreleaserellocator() does
rely on smgrrelease() for the hold.

It didn't seem particularly safe to allow interrupts, which in turn could
change the list of open relations, while iterating over a linked list / a
hashtable.

Fair.

@@ -434,6 +481,8 @@ smgrdosyncall(SMgrRelation *rels, int nrels)
if (nrels == 0)
return;

+	HOLD_INTERRUPTS();
+
FlushRelationsAllBuffers(rels, nrels);

FlushRelationsAllBuffers() isn't part of smgr or md.c, so it's unlikely to
become sensitive to smgrrelease(). It may do a ton of I/O. Hence, I'd
HOLD_INTERRUPTS() after FlushRelationsAllBuffers(), not before.

Hm - we never would want to process interrupts while in
FlushRelationsAllBuffers() or such, would we? I'm ok with changing it, I
guess I just didn't see a reason not to use a wider scope.

If we get a query cancel or fast shutdown, it's better for the user to abort
the transaction rather than keep flushing. smgrDoPendingSyncs() calls here
rather late in the pre-commit actions, so failing is still supposed to be
fine. I think the code succeeds at making it fine to fail here.

I guess I am a bit paranoid because gave me flashbacks to issues around
smgrtruncate() failing after doing DropRelationBuffers(), that Thomas recently
fixed (and I had worked on a few years before). But of course
DropRelationBuffers() is way more dangerous than FlushRelationsAllBuffers().

Fair.

}
}
+
+ RESUME_INTERRUPTS();
}

/*
@@ -471,6 +522,8 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
if (nrels == 0)
return;

+ HOLD_INTERRUPTS();

I would move this below DropRelationsAllBuffers(), for reasons like
FlushRelationsAllBuffers() above.

I think that'd be unsafe. Once we dropped buffers from the buffer pool we
can't just continue without also unlinking the underlying relation, otherwise
an older view of the data later can be "revived from the dead" after an error,
causing all manner of corruption.

I suspect it's always called with interrupts held already though.

Ah, confirmed. If I put this assert at the top of smgrdounlinkall(),
check-world passes:

Assert(IsBinaryUpgrade || InRecovery || !INTERRUPTS_CAN_BE_PROCESSED());

Subject: [PATCH v2 2/2] smgr: Make SMgrRelation initialization safer against
errors

In case the smgr_open callback failed, the ->pincount field would not be
initialized and the relation would not be put onto the unpinned_relns list.

This buglet was introduced in 21d9c3ee4ef7. As that commit is only in HEAD, no
need to backpatch.

Discussion: /messages/by-id/3vae7l5ozvqtxmd7rr7zaeq3qkuipz365u3rtim5t5wdkr6f4g@vkgf2fogjirl
---
src/backend/storage/smgr/smgr.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 53a09fe4aaa..24971304b85 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -255,12 +255,12 @@ smgropen(RelFileLocator rlocator, ProcNumber backend)
reln->smgr_cached_nblocks[i] = InvalidBlockNumber;
reln->smgr_which = 0;	/* we only have md.c at present */
-		/* implementation-specific initialization */
-		smgrsw[reln->smgr_which].smgr_open(reln);
-
/* it is not pinned yet */
reln->pincount = 0;
dlist_push_tail(&unpinned_relns, &reln->node);
+
+		/* implementation-specific initialization */
+		smgrsw[reln->smgr_which].smgr_open(reln);
}

I struggle to speculate about the merits of this, because mdopen() can't fail.
If mdopen() started to do things that could fail, mdnblocks() would be
reasonable in assuming those things are done. Hence, the long-term direction
should be more like destroying the new smgr entry in the event of error.

I would not make this change. I'd maybe add a comment that smgr_open
callbacks currently aren't allowed to fail, since smgropen() isn't ready to
clean up smgr-level state from a failed open. How do you see it?

I see no disadvantage in the change - it seems strictly better to initialize
pincount earlier. I agree that it might be a good idea to explicitly handle
errors eventually, but that'd not be made harder by this change...

Okay. I suppose if mdopen() gained the ability to fail and mdnblocks() also
gained the ability to cure said failure, this change would make that okay.

#12Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#11)
1 attachment(s)
Re: md.c vs elog.c vs smgrreleaseall() in barrier

Hi,

I updated the patch with the following changes:

- Remove the assertion from smgrtruncate() - it would need to assert that it's
called in a critical section.

Not sure why it's not already asserting that?

The function header says:
* ... This function should normally
* be called in a critical section, but the current size must be checked
* outside the critical section, and no interrupts or smgr functions relating
* to this relation should be called in between.

The "should normally" is bit weird imo, when would it be safe to *not* use
it in a critical section?

- added comments about the reason for HOLD_INTERRUPTS to smgrdounlinkall(),
smgrdestroyall() and smgrreleaseall()

- moved the HOLD_INTERRUPTS after FlushRelationsAllBuffers() in smgrdosyncall

- updated the comment & commit message talking about the problem being due to
debug elogs/ereports - it's also LOG/WARNING.

I was writing a remark in the commit message, explaining that the only < ERROR
elog/ereport that can be reached is very unlikely to be reachable, and
therefore it's not worth the risk of backpatching. But it turns out there
also are WARNINGs in mdunlinkfork(), which seem a lot easier to reach than the
DEBUG1 in register_dirty_segment().

I still am leaning against backpatching, but I'm not sure that's not just
laziness.

On 2025-03-19 17:45:14 -0700, Noah Misch wrote:

On Wed, Mar 19, 2025 at 06:45:20PM -0400, Andres Freund wrote:

On 2025-03-19 12:55:53 -0700, Noah Misch wrote:

On Mon, Mar 17, 2025 at 07:52:02PM -0400, Andres Freund wrote:

@@ -434,6 +481,8 @@ smgrdosyncall(SMgrRelation *rels, int nrels)
if (nrels == 0)
return;

+	HOLD_INTERRUPTS();
+
FlushRelationsAllBuffers(rels, nrels);

FlushRelationsAllBuffers() isn't part of smgr or md.c, so it's unlikely to
become sensitive to smgrrelease(). It may do a ton of I/O. Hence, I'd
HOLD_INTERRUPTS() after FlushRelationsAllBuffers(), not before.

Hm - we never would want to process interrupts while in
FlushRelationsAllBuffers() or such, would we? I'm ok with changing it, I
guess I just didn't see a reason not to use a wider scope.

If we get a query cancel or fast shutdown, it's better for the user to abort
the transaction rather than keep flushing. smgrDoPendingSyncs() calls here
rather late in the pre-commit actions, so failing is still supposed to be
fine. I think the code succeeds at making it fine to fail here.

But we don't actually intentionally accept interrupts in
FlushRelationsAllBuffers()? It would only happen as a side-effect of a
non-error elog/ereport() processing interrupts, right?

It also looks like we couldn't accept interrupts when called by
AbortTransaction(), because there we already are in a HOLD_INTERRUPTS()
region. I'm pretty sure an error would trigger at least an assertion. But
that's really an independent issue.

Moved.

@@ -471,6 +522,8 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
if (nrels == 0)
return;

+ HOLD_INTERRUPTS();

I would move this below DropRelationsAllBuffers(), for reasons like
FlushRelationsAllBuffers() above.

I think that'd be unsafe. Once we dropped buffers from the buffer pool we
can't just continue without also unlinking the underlying relation, otherwise
an older view of the data later can be "revived from the dead" after an error,
causing all manner of corruption.

I suspect it's always called with interrupts held already though.

Ah, confirmed. If I put this assert at the top of smgrdounlinkall(),
check-world passes:

Assert(IsBinaryUpgrade || InRecovery || !INTERRUPTS_CAN_BE_PROCESSED());

I just made it hold interrupts for now, hope that makes sense?

I struggle to speculate about the merits of this, because mdopen() can't fail.
If mdopen() started to do things that could fail, mdnblocks() would be
reasonable in assuming those things are done. Hence, the long-term direction
should be more like destroying the new smgr entry in the event of error.

I would not make this change. I'd maybe add a comment that smgr_open
callbacks currently aren't allowed to fail, since smgropen() isn't ready to
clean up smgr-level state from a failed open. How do you see it?

I see no disadvantage in the change - it seems strictly better to initialize
pincount earlier. I agree that it might be a good idea to explicitly handle
errors eventually, but that'd not be made harder by this change...

Okay. I suppose if mdopen() gained the ability to fail and mdnblocks() also
gained the ability to cure said failure, this change would make that okay.

FWIW, if mdopen() could fail, it should probably only do fallible operations
after doing the non-fallible initialization. Then mdnblocks() wouldn't need to
cure anything.

Greetings,

Andres Freund

Attachments:

v3-0001-smgr-Hold-interrupts-in-most-smgr-functions.patchtext/x-diff; charset=us-asciiDownload
From 316777b4b749d404e3ae40cdb3f844847987ffe8 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 18 Mar 2025 14:40:05 -0400
Subject: [PATCH v3] smgr: Hold interrupts in most smgr functions

We need to hold interrupts across most of the smgr.c/md.c functions, as
otherwise interrupt processing, e.g. due to a < ERROR elog/ereport, can
trigger procsignal processing, which in turn can trigger smgrreleaseall(). As
the relevant code is not reentrant, we quickly end up in a bad situation.

The only reason we haven't noticed this before is that there is only one
non-error ereport called in affected routines, in register_dirty_segments(),
and that one is extremely rarely reached. If one enables fd.c's FDDEBUG it's
easy to reproduce crashes.

It seems better to put the HOLD_INTERRUPTS()/RESUME_INTERRUPTS() in smgr.c,
instead of trying to push them down to md.c where possible: For one, every
smgr implementation would be vulnerable, for another, a good bit of smgr.c
code itself is affected too.

Eventually we might want a more targeted solution, allowing e.g. a networked
smgr implementation to be interrupted, but many other, more complicated,
problems would need to be fixed for that to be viable (e.g. smgr.c is often
called with interrupts already held).

One could argue this should be backpatched, but the existing < ERROR
elog/ereports that can be reached with unmodified sources are unlikely to be
reached. On balance the risk of backpatching seems higher than the gain - at
least for now.

Reviewed-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/3vae7l5ozvqtxmd7rr7zaeq3qkuipz365u3rtim5t5wdkr6f4g@vkgf2fogjirl
---
 src/backend/storage/smgr/smgr.c | 104 +++++++++++++++++++++++++++++++-
 1 file changed, 101 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index e6cbb9b4ca2..af74f54b43b 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -40,6 +40,18 @@
  * themselves, as there could pointers to them in active use.  See
  * smgrrelease() and smgrreleaseall().
  *
+ * NB: We need to hold interrupts across most of the functions in this file,
+ * as otherwise interrupt processing, e.g. due to a < ERROR elog/ereport, can
+ * trigger procsignal processing, which in turn can trigger
+ * smgrreleaseall(). Most of the relevant code is not reentrant.  It seems
+ * better to put the HOLD_INTERRUPTS()/RESUME_INTERRUPTS() here, instead of
+ * trying to push them down to md.c where possible: For one, every smgr
+ * implementation would be vulnerable, for another, a good bit of smgr.c code
+ * itself is affected too.  Eventually we might want a more targeted solution,
+ * allowing e.g. a networked smgr implementation to be interrupted, but many
+ * other, more complicated, problems would need to be fixed for that to be
+ * viable (e.g. smgr.c is often called with interrupts already held).
+ *
  * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
@@ -53,6 +65,7 @@
 
 #include "access/xlogutils.h"
 #include "lib/ilist.h"
+#include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/ipc.h"
 #include "storage/md.h"
@@ -158,12 +171,16 @@ smgrinit(void)
 {
 	int			i;
 
+	HOLD_INTERRUPTS();
+
 	for (i = 0; i < NSmgr; i++)
 	{
 		if (smgrsw[i].smgr_init)
 			smgrsw[i].smgr_init();
 	}
 
+	RESUME_INTERRUPTS();
+
 	/* register the shutdown proc */
 	on_proc_exit(smgrshutdown, 0);
 }
@@ -176,11 +193,15 @@ smgrshutdown(int code, Datum arg)
 {
 	int			i;
 
+	HOLD_INTERRUPTS();
+
 	for (i = 0; i < NSmgr; i++)
 	{
 		if (smgrsw[i].smgr_shutdown)
 			smgrsw[i].smgr_shutdown();
 	}
+
+	RESUME_INTERRUPTS();
 }
 
 /*
@@ -206,6 +227,8 @@ smgropen(RelFileLocator rlocator, ProcNumber backend)
 
 	Assert(RelFileNumberIsValid(rlocator.relNumber));
 
+	HOLD_INTERRUPTS();
+
 	if (SMgrRelationHash == NULL)
 	{
 		/* First time through: initialize the hash table */
@@ -242,6 +265,8 @@ smgropen(RelFileLocator rlocator, ProcNumber backend)
 		smgrsw[reln->smgr_which].smgr_open(reln);
 	}
 
+	RESUME_INTERRUPTS();
+
 	return reln;
 }
 
@@ -283,6 +308,8 @@ smgrdestroy(SMgrRelation reln)
 
 	Assert(reln->pincount == 0);
 
+	HOLD_INTERRUPTS();
+
 	for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 		smgrsw[reln->smgr_which].smgr_close(reln, forknum);
 
@@ -292,6 +319,8 @@ smgrdestroy(SMgrRelation reln)
 					&(reln->smgr_rlocator),
 					HASH_REMOVE, NULL) == NULL)
 		elog(ERROR, "SMgrRelation hashtable corrupted");
+
+	RESUME_INTERRUPTS();
 }
 
 /*
@@ -302,12 +331,16 @@ smgrdestroy(SMgrRelation reln)
 void
 smgrrelease(SMgrRelation reln)
 {
+	HOLD_INTERRUPTS();
+
 	for (ForkNumber forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 	{
 		smgrsw[reln->smgr_which].smgr_close(reln, forknum);
 		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
 	}
 	reln->smgr_targblock = InvalidBlockNumber;
+
+	RESUME_INTERRUPTS();
 }
 
 /*
@@ -336,6 +369,9 @@ smgrdestroyall(void)
 {
 	dlist_mutable_iter iter;
 
+	/* seems unsafe to accept interrupts while in a dlist_foreach_modify() */
+	HOLD_INTERRUPTS();
+
 	/*
 	 * Zap all unpinned SMgrRelations.  We rely on smgrdestroy() to remove
 	 * each one from the list.
@@ -347,6 +383,8 @@ smgrdestroyall(void)
 
 		smgrdestroy(rel);
 	}
+
+	RESUME_INTERRUPTS();
 }
 
 /*
@@ -362,12 +400,17 @@ smgrreleaseall(void)
 	if (SMgrRelationHash == NULL)
 		return;
 
+	/* seems unsafe to accept interrupts while iterating */
+	HOLD_INTERRUPTS();
+
 	hash_seq_init(&status, SMgrRelationHash);
 
 	while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
 	{
 		smgrrelease(reln);
 	}
+
+	RESUME_INTERRUPTS();
 }
 
 /*
@@ -400,7 +443,13 @@ smgrreleaserellocator(RelFileLocatorBackend rlocator)
 bool
 smgrexists(SMgrRelation reln, ForkNumber forknum)
 {
-	return smgrsw[reln->smgr_which].smgr_exists(reln, forknum);
+	bool		ret;
+
+	HOLD_INTERRUPTS();
+	ret = smgrsw[reln->smgr_which].smgr_exists(reln, forknum);
+	RESUME_INTERRUPTS();
+
+	return ret;
 }
 
 /*
@@ -413,7 +462,9 @@ smgrexists(SMgrRelation reln, ForkNumber forknum)
 void
 smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
 {
+	HOLD_INTERRUPTS();
 	smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo);
+	RESUME_INTERRUPTS();
 }
 
 /*
@@ -436,6 +487,8 @@ smgrdosyncall(SMgrRelation *rels, int nrels)
 
 	FlushRelationsAllBuffers(rels, nrels);
 
+	HOLD_INTERRUPTS();
+
 	/*
 	 * Sync the physical file(s).
 	 */
@@ -449,6 +502,8 @@ smgrdosyncall(SMgrRelation *rels, int nrels)
 				smgrsw[which].smgr_immedsync(rels[i], forknum);
 		}
 	}
+
+	RESUME_INTERRUPTS();
 }
 
 /*
@@ -471,6 +526,13 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
 	if (nrels == 0)
 		return;
 
+	/*
+	 * It would be unsafe to process interrupts between DropRelationBuffers()
+	 * and unlinking the underlying files. This probably should be a critical
+	 * section, but we're not there yet.
+	 */
+	HOLD_INTERRUPTS();
+
 	/*
 	 * Get rid of any remaining buffers for the relations.  bufmgr will just
 	 * drop them without bothering to write the contents.
@@ -522,6 +584,8 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
 	}
 
 	pfree(rlocators);
+
+	RESUME_INTERRUPTS();
 }
 
 
@@ -538,6 +602,8 @@ void
 smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		   const void *buffer, bool skipFsync)
 {
+	HOLD_INTERRUPTS();
+
 	smgrsw[reln->smgr_which].smgr_extend(reln, forknum, blocknum,
 										 buffer, skipFsync);
 
@@ -550,6 +616,8 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		reln->smgr_cached_nblocks[forknum] = blocknum + 1;
 	else
 		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
+
+	RESUME_INTERRUPTS();
 }
 
 /*
@@ -563,6 +631,8 @@ void
 smgrzeroextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 			   int nblocks, bool skipFsync)
 {
+	HOLD_INTERRUPTS();
+
 	smgrsw[reln->smgr_which].smgr_zeroextend(reln, forknum, blocknum,
 											 nblocks, skipFsync);
 
@@ -575,6 +645,8 @@ smgrzeroextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		reln->smgr_cached_nblocks[forknum] = blocknum + nblocks;
 	else
 		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
+
+	RESUME_INTERRUPTS();
 }
 
 /*
@@ -588,7 +660,13 @@ bool
 smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 			 int nblocks)
 {
-	return smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum, nblocks);
+	bool		ret;
+
+	HOLD_INTERRUPTS();
+	ret = smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum, nblocks);
+	RESUME_INTERRUPTS();
+
+	return ret;
 }
 
 /*
@@ -601,7 +679,13 @@ uint32
 smgrmaxcombine(SMgrRelation reln, ForkNumber forknum,
 			   BlockNumber blocknum)
 {
-	return smgrsw[reln->smgr_which].smgr_maxcombine(reln, forknum, blocknum);
+	uint32		ret;
+
+	HOLD_INTERRUPTS();
+	ret = smgrsw[reln->smgr_which].smgr_maxcombine(reln, forknum, blocknum);
+	RESUME_INTERRUPTS();
+
+	return ret;
 }
 
 /*
@@ -619,8 +703,10 @@ void
 smgrreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		  void **buffers, BlockNumber nblocks)
 {
+	HOLD_INTERRUPTS();
 	smgrsw[reln->smgr_which].smgr_readv(reln, forknum, blocknum, buffers,
 										nblocks);
+	RESUME_INTERRUPTS();
 }
 
 /*
@@ -653,8 +739,10 @@ void
 smgrwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		   const void **buffers, BlockNumber nblocks, bool skipFsync)
 {
+	HOLD_INTERRUPTS();
 	smgrsw[reln->smgr_which].smgr_writev(reln, forknum, blocknum,
 										 buffers, nblocks, skipFsync);
+	RESUME_INTERRUPTS();
 }
 
 /*
@@ -665,8 +753,10 @@ void
 smgrwriteback(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 			  BlockNumber nblocks)
 {
+	HOLD_INTERRUPTS();
 	smgrsw[reln->smgr_which].smgr_writeback(reln, forknum, blocknum,
 											nblocks);
+	RESUME_INTERRUPTS();
 }
 
 /*
@@ -683,10 +773,14 @@ smgrnblocks(SMgrRelation reln, ForkNumber forknum)
 	if (result != InvalidBlockNumber)
 		return result;
 
+	HOLD_INTERRUPTS();
+
 	result = smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum);
 
 	reln->smgr_cached_nblocks[forknum] = result;
 
+	RESUME_INTERRUPTS();
+
 	return result;
 }
 
@@ -784,7 +878,9 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks,
 void
 smgrregistersync(SMgrRelation reln, ForkNumber forknum)
 {
+	HOLD_INTERRUPTS();
 	smgrsw[reln->smgr_which].smgr_registersync(reln, forknum);
+	RESUME_INTERRUPTS();
 }
 
 /*
@@ -816,7 +912,9 @@ smgrregistersync(SMgrRelation reln, ForkNumber forknum)
 void
 smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
 {
+	HOLD_INTERRUPTS();
 	smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
+	RESUME_INTERRUPTS();
 }
 
 /*
-- 
2.48.1.76.g4e746b1a31.dirty

#13Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#12)
Re: md.c vs elog.c vs smgrreleaseall() in barrier

On Thu, Mar 20, 2025 at 03:53:11PM -0400, Andres Freund wrote:

I updated the patch with the following changes:

- Remove the assertion from smgrtruncate() - it would need to assert that it's
called in a critical section.

Not sure why it's not already asserting that?

The function header says:
* ... This function should normally
* be called in a critical section, but the current size must be checked
* outside the critical section, and no interrupts or smgr functions relating
* to this relation should be called in between.

The "should normally" is bit weird imo, when would it be safe to *not* use
it in a critical section?

I expect it would be okay in recovery, which is a crypto-critical-section
IIRC. All callers, including smgr_redo(), do have an explicit critical
section around the call. Hence, I gather we're no longer relying on any
exceptions to this one's need for a critical section.

- added comments about the reason for HOLD_INTERRUPTS to smgrdounlinkall(),
smgrdestroyall() and smgrreleaseall()

Perfect.

I still am leaning against backpatching, but I'm not sure that's not just
laziness.

It's also some risk reduction. One of these smgr APIs might have a useful
interruptibility that we're now blocking. (I'm not aware of one.)

On 2025-03-19 17:45:14 -0700, Noah Misch wrote:

On Wed, Mar 19, 2025 at 06:45:20PM -0400, Andres Freund wrote:

On 2025-03-19 12:55:53 -0700, Noah Misch wrote:

On Mon, Mar 17, 2025 at 07:52:02PM -0400, Andres Freund wrote:

@@ -434,6 +481,8 @@ smgrdosyncall(SMgrRelation *rels, int nrels)
if (nrels == 0)
return;

+	HOLD_INTERRUPTS();
+
FlushRelationsAllBuffers(rels, nrels);

FlushRelationsAllBuffers() isn't part of smgr or md.c, so it's unlikely to
become sensitive to smgrrelease(). It may do a ton of I/O. Hence, I'd
HOLD_INTERRUPTS() after FlushRelationsAllBuffers(), not before.

Hm - we never would want to process interrupts while in
FlushRelationsAllBuffers() or such, would we? I'm ok with changing it, I
guess I just didn't see a reason not to use a wider scope.

If we get a query cancel or fast shutdown, it's better for the user to abort
the transaction rather than keep flushing. smgrDoPendingSyncs() calls here
rather late in the pre-commit actions, so failing is still supposed to be
fine. I think the code succeeds at making it fine to fail here.

But we don't actually intentionally accept interrupts in
FlushRelationsAllBuffers()?

Yes. It would be reasonable for future work to add that.

It would only happen as a side-effect of a
non-error elog/ereport() processing interrupts, right?

Likely yes.

It also looks like we couldn't accept interrupts when called by
AbortTransaction(), because there we already are in a HOLD_INTERRUPTS()
region. I'm pretty sure an error would trigger at least an assertion. But
that's really an independent issue.

The only smgrdosyncall() caller is smgrDoPendingSyncs(), which doesn't call it
in the abort case. So I think we're good.

Moved.

Thanks.

I suspect it's always called with interrupts held already though.

Ah, confirmed. If I put this assert at the top of smgrdounlinkall(),
check-world passes:

Assert(IsBinaryUpgrade || InRecovery || !INTERRUPTS_CAN_BE_PROCESSED());

I just made it hold interrupts for now, hope that makes sense?

Yep.

Patch looks perfect.

#14Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#13)
Re: md.c vs elog.c vs smgrreleaseall() in barrier

Hi,

On 2025-03-20 13:16:44 -0700, Noah Misch wrote:

Patch looks perfect.

Thanks for the reviews!

Pushed.

Greetings,

Andres Freund