Extending SMgrRelation lifetimes

Started by Thomas Munroover 2 years ago10 messages
#1Thomas Munro
thomas.munro@gmail.com
2 attachment(s)

Hi,

SMgrRelationData objects don't currently have a defined lifetime, so
it's hard to know when the result of smgropen() might become a
dangling pointer. This has caused a few bugs in the past, and the
usual fix is to just call smgropen() more often and not hold onto
pointers. If you're doing that frequently enough, the hash table
lookups can show up in profiles. I'm interested in this topic for
more than just micro-optimisations, though: in order to be able to
batch/merge smgr operations, I'd like to be able to collect them in
data structures that survive more than just a few lines of code.
(Examples to follow in later emails).

The simplest idea seems to be to tie object lifetime to transactions
using the existing AtEOXact_SMgr() mechanism. In recovery, the
obvious corresponding time would be the commit/abort record that
destroys the storage.

This could be achieved by extending smgrrelease(). That was a
solution to the same problem in a narrower context: we didn't want
CFIs to randomly free SMgrRelations, but we needed to be able to
force-close fds in other backends, to fix various edge cases.

The new idea is to overload smgrrelease(it) so that it also clears the
owner, which means that AtEOXact_SMgr() will eventually smgrclose(it),
unless it is re-owned by a relation before then. That choice stems
from the complete lack of information available via sinval in the case
of an overflow. We must (1) close all descriptors because any file
might have been unlinked, (2) keep all pointers valid and yet (3) not
leak dropped smgr objects forever. In this patch, smgrreleaseall()
achieves those goals.

Proof-of-concept patch attached. Are there holes in this scheme?
Better ideas welcome. In terms of spelling, another option would be
to change the behaviour of smgrclose() to work as described, ie it
would call smgrrelease() and then also disown, so we don't have to
change most of those callers, and then add a new function
smgrdestroy() for the few places that truly need it. Or something
like that.

Other completely different ideas I've bounced around with various
hackers and decided against: references counts, "holder" objects that
can be an "owner" (like Relation, but when you don't have a Relation)
but can re-open on demand. Seemed needlessly complicated.

While studying this I noticed a minor thinko in smgrrelease() in
15+16, so here's a fix for that also. I haven't figured out a
sequence that makes anything bad happen, but we should really
invalidate smgr_targblock when a relfilenode is reused, since it might
point past the end. This becomes more obvious once smgrrelease() is
used for truncation, as proposed here.

Attachments:

0001-Invalidate-smgr_targblock-on-release.patchtext/x-patch; charset=US-ASCII; name=0001-Invalidate-smgr_targblock-on-release.patchDownload
From 844e56e9ea9a56e04813886a6f7ded19a3af7f90 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 14 Aug 2023 11:01:28 +1200
Subject: [PATCH 1/2] Invalidate smgr_targblock on release.

In rare circumstances involving relfilenode reuse, it may be possible
for smgr_targblock to finish up pointing past the end.

Oversight in b74e94dc.  Back-patch to 15.
---
 src/backend/storage/smgr/smgr.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index f76c4605db..65e7436306 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -295,6 +295,7 @@ smgrrelease(SMgrRelation reln)
 	{
 		smgrsw[reln->smgr_which].smgr_close(reln, forknum);
 		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
+		reln->smgr_targblock = InvalidBlockNumber;
 	}
 }
 
-- 
2.39.2

0002-Give-SMgrRelation-pointers-a-well-defined-lifetime.patchtext/x-patch; charset=US-ASCII; name=0002-Give-SMgrRelation-pointers-a-well-defined-lifetime.patchDownload
From b18fea5f263c46f87b84e242ff228cf1ab97b3e8 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 10 Aug 2023 18:16:31 +1200
Subject: [PATCH 2/2] Give SMgrRelation pointers a well-defined lifetime.

After calling smgropen(), it was not clear how long you could continue
to use the result, because various code paths including cache
invalidation could call smgrclose().

Guarantee that the object won't be freed until the end of the current
transaction, or in recovery, the commit/abort record that destroys the
storage.

This is achieved by making most places that previously called
smgrclose() use smgrrelease() instead, and giving smgrrelease() the
additional task of 'disowning' the relation so that it is closed at
end-of-transaction.  That allows all pointers to remain valid, even in
the case of sinval overflow where we have no information about which
relations have been dropped or truncated.
---
 src/backend/access/heap/heapam_handler.c |  6 +++---
 src/backend/access/heap/visibilitymap.c  |  2 +-
 src/backend/access/transam/xlogutils.c   |  6 +++---
 src/backend/catalog/storage.c            |  2 +-
 src/backend/commands/cluster.c           |  4 ++--
 src/backend/commands/dbcommands.c        |  2 +-
 src/backend/commands/sequence.c          |  2 +-
 src/backend/commands/tablecmds.c         |  4 ++--
 src/backend/storage/buffer/bufmgr.c      |  4 ++--
 src/backend/storage/smgr/smgr.c          | 21 ++++++++++++++------
 src/backend/utils/cache/inval.c          |  2 +-
 src/backend/utils/cache/relcache.c       | 25 ++++++++++++------------
 src/include/storage/smgr.h               |  2 +-
 src/include/utils/rel.h                  | 16 +++++----------
 src/include/utils/relcache.h             |  2 +-
 15 files changed, 52 insertions(+), 48 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 5a17112c91..83673da2fb 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -623,7 +623,7 @@ heapam_relation_set_new_filelocator(Relation rel,
 		smgrimmedsync(srel, INIT_FORKNUM);
 	}
 
-	smgrclose(srel);
+	smgrrelease(srel);
 }
 
 static void
@@ -682,9 +682,9 @@ heapam_relation_copy_data(Relation rel, const RelFileLocator *newrlocator)
 	}
 
 
-	/* drop old relation, and close new one */
+	/* drop old relation, and release new one */
 	RelationDropStorage(rel);
-	smgrclose(dstrel);
+	smgrrelease(dstrel);
 }
 
 static void
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 7d54ec9c0f..06b16bf2f1 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -635,7 +635,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
 							  RBM_ZERO_ON_ERROR);
 
 	/*
-	 * Send a shared-inval message to force other backends to close any smgr
+	 * Send a shared-inval message to force other backends to release any smgr
 	 * references they may have for this rel, which we are about to change.
 	 * This is a useful optimization because it means that backends don't have
 	 * to keep checking for creation or extension of the file, which happens
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index e174a2a891..1bbd65376e 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -654,12 +654,12 @@ void
 XLogDropDatabase(Oid dbid)
 {
 	/*
-	 * This is unnecessarily heavy-handed, as it will close SMgrRelation
+	 * This is unnecessarily heavy-handed, as it will release SMgrRelation
 	 * objects for other databases as well. DROP DATABASE occurs seldom enough
 	 * that it's not worth introducing a variant of smgrclose for just this
-	 * purpose. XXX: Or should we rather leave the smgr entries dangling?
+	 * purpose.
 	 */
-	smgrcloseall();
+	smgrreleaseall();
 
 	forget_invalid_pages_db(dbid);
 }
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 2add053489..7d2697455f 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -226,7 +226,7 @@ RelationDropStorage(Relation rel)
 	 * for now I'll keep the logic simple.
 	 */
 
-	RelationCloseSmgr(rel);
+	RelationReleaseSmgr(rel);
 }
 
 /*
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index a3bef6ac34..6bf0431c62 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1439,8 +1439,8 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
 	 * itself, the smgr close on pg_class must happen after all accesses in
 	 * this function.
 	 */
-	RelationCloseSmgrByOid(r1);
-	RelationCloseSmgrByOid(r2);
+	RelationReleaseSmgrByOid(r1);
+	RelationReleaseSmgrByOid(r2);
 }
 
 /*
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 307729ab7e..911ef2aa19 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -278,7 +278,7 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
 
 	smgr = smgropen(rlocator, InvalidBackendId);
 	nblocks = smgrnblocks(smgr, MAIN_FORKNUM);
-	smgrclose(smgr);
+	smgrrelease(smgr);
 
 	/* Use a buffer access strategy since this is a bulk read operation. */
 	bstrategy = GetAccessStrategy(BAS_BULKREAD);
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index fc4f77e787..64fd1564ae 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -360,7 +360,7 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 		log_smgrcreate(&rel->rd_locator, INIT_FORKNUM);
 		fill_seq_fork_with_data(rel, tuple, INIT_FORKNUM);
 		FlushRelationBuffers(rel);
-		smgrclose(srel);
+		smgrrelease(srel);
 	}
 }
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 727f151750..1eee2e743c 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14825,9 +14825,9 @@ index_copy_data(Relation rel, RelFileLocator newrlocator)
 		}
 	}
 
-	/* drop old relation, and close new one */
+	/* drop old relation, and release new one */
 	RelationDropStorage(rel);
-	smgrclose(dstrel);
+	smgrrelease(dstrel);
 }
 
 /*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index df22aaa1c5..79ffa25cb1 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4392,10 +4392,10 @@ CreateAndCopyRelationData(RelFileLocator src_rlocator,
 	rlocator.backend = InvalidBackendId;
 
 	rlocator.locator = src_rlocator;
-	smgrcloserellocator(rlocator);
+	smgrreleaserellocator(rlocator);
 
 	rlocator.locator = dst_rlocator;
-	smgrcloserellocator(rlocator);
+	smgrreleaserellocator(rlocator);
 }
 
 /* ---------------------------------------------------------------------
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 65e7436306..c3728b713f 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -286,7 +286,9 @@ smgrclose(SMgrRelation reln)
 /*
  * smgrrelease() -- Release all resources used by this object.
  *
- * The object remains valid.
+ * The object remains valid, but is moved to the unknown list where it will
+ * be closed by AtEOXact_SMgr().  It may be re-owned if it is accessed by
+ * a relation before then.
  */
 void
 smgrrelease(SMgrRelation reln)
@@ -296,6 +298,13 @@ smgrrelease(SMgrRelation reln)
 		smgrsw[reln->smgr_which].smgr_close(reln, forknum);
 		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
 		reln->smgr_targblock = InvalidBlockNumber;
+
+		if (reln->smgr_owner)
+		{
+			*reln->smgr_owner = NULL;
+			reln->smgr_owner = NULL;
+			dlist_push_tail(&unowned_relns, &reln->node);
+		}
 	}
 }
 
@@ -340,15 +349,15 @@ smgrcloseall(void)
 }
 
 /*
- * smgrcloserellocator() -- Close SMgrRelation object for given RelFileLocator,
+ * smgrreleaserellocator() -- Close SMgrRelation object for given RelFileLocator,
  *							if one exists.
  *
- * This has the same effects as smgrclose(smgropen(rlocator)), but it avoids
+ * This has the same effects as smgrrelease(smgropen(rlocator)), but it avoids
  * uselessly creating a hashtable entry only to drop it again when no
  * such entry exists already.
  */
 void
-smgrcloserellocator(RelFileLocatorBackend rlocator)
+smgrreleaserellocator(RelFileLocatorBackend rlocator)
 {
 	SMgrRelation reln;
 
@@ -360,7 +369,7 @@ smgrcloserellocator(RelFileLocatorBackend rlocator)
 									  &rlocator,
 									  HASH_FIND, NULL);
 	if (reln != NULL)
-		smgrclose(reln);
+		smgrrelease(reln);
 }
 
 /*
@@ -664,7 +673,7 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks, BlockNumber *nb
 	DropRelationBuffers(reln, forknum, nforks, nblocks);
 
 	/*
-	 * Send a shared-inval message to force other backends to close any smgr
+	 * Send a shared-inval message to force other backends to release any smgr
 	 * references they may have for this rel.  This is useful because they
 	 * might have open file pointers to segments that got removed, and/or
 	 * smgr_targblock variables pointing past the new rel end.  (The inval
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 0008826f67..b755d2fe9e 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -665,7 +665,7 @@ LocalExecuteInvalidationMessage(SharedInvalidationMessage *msg)
 
 		rlocator.locator = msg->sm.rlocator;
 		rlocator.backend = (msg->sm.backend_hi << 16) | (int) msg->sm.backend_lo;
-		smgrcloserellocator(rlocator);
+		smgrreleaserellocator(rlocator);
 	}
 	else if (msg->id == SHAREDINVALRELMAP_ID)
 	{
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 8e08ca1c68..68f001dc80 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2226,8 +2226,8 @@ RelationReloadIndexInfo(Relation relation)
 		   !relation->rd_isvalid &&
 		   relation->rd_droppedSubid == InvalidSubTransactionId);
 
-	/* Ensure it's closed at smgr level */
-	RelationCloseSmgr(relation);
+	/* Ensure it's released at smgr level */
+	RelationReleaseSmgr(relation);
 
 	/* Must free any AM cached data upon relcache flush */
 	if (relation->rd_amcache)
@@ -2409,7 +2409,7 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
 	 * weren't closed already.  (This was probably done by caller, but let's
 	 * just be real sure.)
 	 */
-	RelationCloseSmgr(relation);
+	RelationReleaseSmgr(relation);
 
 	/* break mutual link with stats entry */
 	pgstat_unlink_relation(relation);
@@ -2512,7 +2512,7 @@ RelationClearRelation(Relation relation, bool rebuild)
 	 * that the low-level file access state is updated after, say, a vacuum
 	 * truncation.
 	 */
-	RelationCloseSmgr(relation);
+	RelationReleaseSmgr(relation);
 
 	/* Free AM cached data, if any */
 	if (relation->rd_amcache)
@@ -2953,8 +2953,9 @@ RelationCacheInvalidate(bool debug_discard)
 	{
 		relation = idhentry->reldesc;
 
-		/* Must close all smgr references to avoid leaving dangling ptrs */
-		RelationCloseSmgr(relation);
+		/* Disconnect smgr references to avoid leaving dangling ptrs */
+		if (relation->rd_smgr)
+			smgrclearowner(&relation->rd_smgr, relation->rd_smgr);
 
 		/*
 		 * Ignore new relations; no other backend will manipulate them before
@@ -3007,11 +3008,11 @@ RelationCacheInvalidate(bool debug_discard)
 	}
 
 	/*
-	 * Now zap any remaining smgr cache entries.  This must happen before we
+	 * Now release remaining smgr cache entries.  This must happen before we
 	 * start to rebuild entries, since that may involve catalog fetches which
 	 * will re-open catalog files.
 	 */
-	smgrcloseall();
+	smgrreleaseall();
 
 	/* Phase 2: rebuild the items found to need rebuild in phase 1 */
 	foreach(l, rebuildFirstList)
@@ -3034,13 +3035,13 @@ RelationCacheInvalidate(bool debug_discard)
 }
 
 /*
- * RelationCloseSmgrByOid - close a relcache entry's smgr link
+ * RelationReleaseSmgrByOid - close a relcache entry's smgr link
  *
  * Needed in some cases where we are changing a relation's physical mapping.
  * The link will be automatically reopened on next use.
  */
 void
-RelationCloseSmgrByOid(Oid relationId)
+RelationReleaseSmgrByOid(Oid relationId)
 {
 	Relation	relation;
 
@@ -3049,7 +3050,7 @@ RelationCloseSmgrByOid(Oid relationId)
 	if (!PointerIsValid(relation))
 		return;					/* not in cache, nothing to do */
 
-	RelationCloseSmgr(relation);
+	RelationReleaseSmgr(relation);
 }
 
 static void
@@ -3816,7 +3817,7 @@ RelationSetNewRelfilenumber(Relation relation, char persistence)
 		SMgrRelation srel;
 
 		srel = RelationCreateStorage(newrlocator, persistence, true);
-		smgrclose(srel);
+		smgrrelease(srel);
 	}
 	else
 	{
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a9a179aaba..9f4ced6721 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -84,9 +84,9 @@ extern void smgrsetowner(SMgrRelation *owner, SMgrRelation reln);
 extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
 extern void smgrclose(SMgrRelation reln);
 extern void smgrcloseall(void);
-extern void smgrcloserellocator(RelFileLocatorBackend rlocator);
 extern void smgrrelease(SMgrRelation reln);
 extern void smgrreleaseall(void);
+extern void smgrreleaserellocator(RelFileLocatorBackend rlocator);
 extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdosyncall(SMgrRelation *rels, int nrels);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 1426a353cd..210c5aa5b3 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -561,12 +561,6 @@ typedef struct ViewOptions
  *
  * Very little code is authorized to touch rel->rd_smgr directly.  Instead
  * use this function to fetch its value.
- *
- * Note: since a relcache flush can cause the file handle to be closed again,
- * it's unwise to hold onto the pointer returned by this function for any
- * long period.  Recommended practice is to just re-execute RelationGetSmgr
- * each time you need to access the SMgrRelation.  It's quite cheap in
- * comparison to whatever an smgr function is going to do.
  */
 static inline SMgrRelation
 RelationGetSmgr(Relation rel)
@@ -577,16 +571,16 @@ RelationGetSmgr(Relation rel)
 }
 
 /*
- * RelationCloseSmgr
- *		Close the relation at the smgr level, if not already done.
+ * RelationReleaseSmgr
+ *		Release the relation at the smgr level, if not already done.
  */
 static inline void
-RelationCloseSmgr(Relation relation)
+RelationReleaseSmgr(Relation relation)
 {
 	if (relation->rd_smgr != NULL)
-		smgrclose(relation->rd_smgr);
+		smgrrelease(relation->rd_smgr);
 
-	/* smgrclose should unhook from owner pointer */
+	/* smgrrelease should unhook from owner pointer */
 	Assert(relation->rd_smgr == NULL);
 }
 #endif							/* !FRONTEND */
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index 38524641f4..c02ebab649 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -130,7 +130,7 @@ extern void RelationCacheInvalidateEntry(Oid relationId);
 
 extern void RelationCacheInvalidate(bool debug_discard);
 
-extern void RelationCloseSmgrByOid(Oid relationId);
+extern void RelationReleaseSmgrByOid(Oid relationId);
 
 #ifdef USE_ASSERT_CHECKING
 extern void AssertPendingSyncs_RelationCache(void);
-- 
2.39.2

#2Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Thomas Munro (#1)
Re: Extending SMgrRelation lifetimes

On 14/08/2023 05:41, Thomas Munro wrote:

The new idea is to overload smgrrelease(it) so that it also clears the
owner, which means that AtEOXact_SMgr() will eventually smgrclose(it),
unless it is re-owned by a relation before then. That choice stems
from the complete lack of information available via sinval in the case
of an overflow. We must (1) close all descriptors because any file
might have been unlinked, (2) keep all pointers valid and yet (3) not
leak dropped smgr objects forever. In this patch, smgrreleaseall()
achieves those goals.

Makes sense.

Some of the smgrclose() calls could perhaps still be smgrclose(). If you
can guarantee that no-one is holding the SMgrRelation, it's still OK to
call smgrclose(). There's little gain from closing earlier, though.

Proof-of-concept patch attached. Are there holes in this scheme?
Better ideas welcome. In terms of spelling, another option would be
to change the behaviour of smgrclose() to work as described, ie it
would call smgrrelease() and then also disown, so we don't have to
change most of those callers, and then add a new function
smgrdestroy() for the few places that truly need it. Or something
like that.

If you change smgrclose() to do what smgrrelease() does now, then it
will apply automatically to extensions.

If an extension is currently using smgropen()/smgrclose() correctly,
this patch alone won't make it wrong, so it's not very critical for
extensions to adopt the change. However, if after this we consider it OK
to hold a pointer to SMgrRelation for longer, and start doing that in
the backend, then extensions need to be adapted too.

While studying this I noticed a minor thinko in smgrrelease() in
15+16, so here's a fix for that also. I haven't figured out a
sequence that makes anything bad happen, but we should really
invalidate smgr_targblock when a relfilenode is reused, since it might
point past the end. This becomes more obvious once smgrrelease() is
used for truncation, as proposed here.

+1. You can move the smgr_targblock clearing out of the loop, though.

--
Heikki Linnakangas
Neon (https://neon.tech)

#3Thomas Munro
thomas.munro@gmail.com
In reply to: Heikki Linnakangas (#2)
1 attachment(s)
Re: Extending SMgrRelation lifetimes

On Wed, Aug 16, 2023 at 4:11 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Makes sense.

Thanks for looking!

If you change smgrclose() to do what smgrrelease() does now, then it
will apply automatically to extensions.

If an extension is currently using smgropen()/smgrclose() correctly,
this patch alone won't make it wrong, so it's not very critical for
extensions to adopt the change. However, if after this we consider it OK
to hold a pointer to SMgrRelation for longer, and start doing that in
the backend, then extensions need to be adapted too.

Yeah, that sounds quite compelling. Let's try that way:

* smgrrelease() is removed
* smgrclose() now releases resources, but doesn't destroy until EOX
* smgrdestroy() now frees memory, and should rarely be used

Still WIP while I think about edge cases, but so far I think this is
the better option.

While studying this I noticed a minor thinko in smgrrelease() in
15+16, so here's a fix for that also. I haven't figured out a
sequence that makes anything bad happen, but we should really
invalidate smgr_targblock when a relfilenode is reused, since it might
point past the end. This becomes more obvious once smgrrelease() is
used for truncation, as proposed here.

+1. You can move the smgr_targblock clearing out of the loop, though.

Right, thanks. Pushed.

Attachments:

v2-0001-Give-SMgrRelation-pointers-a-well-defined-lifetim.patchtext/x-patch; charset=US-ASCII; name=v2-0001-Give-SMgrRelation-pointers-a-well-defined-lifetim.patchDownload
From 71648a5b137540320fe782116e81f80c053c7f2c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 10 Aug 2023 18:16:31 +1200
Subject: [PATCH v2] Give SMgrRelation pointers a well-defined lifetime.

After calling smgropen(), it was not clear how long you could continue
to use the result, because various code paths including cache
invalidation could call smgrclose(), which freed the memory.

Guarantee that the object won't be destroyed until the end of the
current transaction, or in recovery, the commit/abort record that
destroys the underlying storage.

The existing smgrclose() function now closes files and forgets all state
except the rlocator, like smgrrelease() used to do, and additionally
"disowns" the object so that it is disconnected from any Relation and
queued for destruction at AtEOXact_SMgr(), unless it is re-owned by a
Relation again before then.

A new smgrdestroy() function is used by rare places that know there
should be no other references to the SMgrRelation.

The short version:

 * smgrrelease() is removed
 * smgrclose() now releases resources, but doesn't destroy until EOX
 * smgrdestroy() now frees memory, and should rarely be used

Existing code should be unaffected, but it is now possible for code that
has an SMgrRelation object to use it repeatedly during a transaction as
long as the storage hasn't been physically dropped.  Such code would
normally hold a lock on the relation.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/CA%2BhUKGJ8NTvqLHz6dqbQnt2c8XCki4r2QvXjBQcXpVwxTY_pvA%40mail.gmail.com
---
 src/backend/access/transam/xlogutils.c |  2 +-
 src/backend/postmaster/bgwriter.c      |  4 +--
 src/backend/postmaster/checkpointer.c  |  4 +--
 src/backend/storage/smgr/md.c          |  2 +-
 src/backend/storage/smgr/smgr.c        | 46 ++++++++++++++++----------
 src/include/storage/smgr.h             |  4 +--
 src/include/utils/rel.h                |  6 ----
 7 files changed, 37 insertions(+), 31 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index e174a2a891..bed15dad0f 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -657,7 +657,7 @@ XLogDropDatabase(Oid dbid)
 	 * This is unnecessarily heavy-handed, as it will close SMgrRelation
 	 * objects for other databases as well. DROP DATABASE occurs seldom enough
 	 * that it's not worth introducing a variant of smgrclose for just this
-	 * purpose. XXX: Or should we rather leave the smgr entries dangling?
+	 * purpose.
 	 */
 	smgrcloseall();
 
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index caad642ec9..a0ecc7f841 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -202,7 +202,7 @@ BackgroundWriterMain(void)
 		 * where holding deleted files open causes various strange errors.
 		 * It's not clear we need it elsewhere, but shouldn't hurt.
 		 */
-		smgrcloseall();
+		smgrdestroyall();
 
 		/* Report wait end here, when there is no further possibility of wait */
 		pgstat_report_wait_end();
@@ -248,7 +248,7 @@ BackgroundWriterMain(void)
 			 * After any checkpoint, close all smgr files.  This is so we
 			 * won't hang onto smgr references to deleted files indefinitely.
 			 */
-			smgrcloseall();
+			smgrdestroyall();
 		}
 
 		/*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index ace9893d95..6daccaab78 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -316,7 +316,7 @@ CheckpointerMain(void)
 		 * where holding deleted files open causes various strange errors.
 		 * It's not clear we need it elsewhere, but shouldn't hurt.
 		 */
-		smgrcloseall();
+		smgrdestroyall();
 	}
 
 	/* We can now handle ereport(ERROR) */
@@ -461,7 +461,7 @@ CheckpointerMain(void)
 			 * After any checkpoint, close all smgr files.  This is so we
 			 * won't hang onto smgr references to deleted files indefinitely.
 			 */
-			smgrcloseall();
+			smgrdestroyall();
 
 			/*
 			 * Indicate checkpoint completion to any waiting backends.
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index fdecbad170..c54cbccb4c 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1272,7 +1272,7 @@ DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo)
 	smgrdounlinkall(srels, ndelrels, isRedo);
 
 	for (i = 0; i < ndelrels; i++)
-		smgrclose(srels[i]);
+		smgrdestroy(srels[i]);
 	pfree(srels);
 }
 
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index f76c4605db..63ebcd3a39 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -254,10 +254,10 @@ smgrexists(SMgrRelation reln, ForkNumber forknum)
 }
 
 /*
- * smgrclose() -- Close and delete an SMgrRelation object.
+ * smgrdestroy() -- Delete an SMgrRelation object.
  */
 void
-smgrclose(SMgrRelation reln)
+smgrdestroy(SMgrRelation reln)
 {
 	SMgrRelation *owner;
 	ForkNumber	forknum;
@@ -284,27 +284,35 @@ smgrclose(SMgrRelation reln)
 }
 
 /*
- * smgrrelease() -- Release all resources used by this object.
+ * smgrclose() -- Release all resources used by this object.
  *
- * The object remains valid.
+ * The object remains valid, but is moved to the unknown list where it will
+ * be destroy by AtEOXact_SMgr().  It may be re-owned if it is accessed by
+ * a relation before then.
  */
 void
-smgrrelease(SMgrRelation reln)
+smgrclose(SMgrRelation reln)
 {
 	for (ForkNumber forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 	{
 		smgrsw[reln->smgr_which].smgr_close(reln, forknum);
 		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
 	}
+	reln->smgr_targblock = InvalidBlockNumber;
+
+	if (reln->smgr_owner)
+	{
+		*reln->smgr_owner = NULL;
+		reln->smgr_owner = NULL;
+		dlist_push_tail(&unowned_relns, &reln->node);
+	}
 }
 
 /*
- * smgrreleaseall() -- Release resources used by all objects.
- *
- * This is called for PROCSIGNAL_BARRIER_SMGRRELEASE.
+ * smgrcloseall() -- Close all objects.
  */
 void
-smgrreleaseall(void)
+smgrcloseall(void)
 {
 	HASH_SEQ_STATUS status;
 	SMgrRelation reln;
@@ -316,14 +324,17 @@ smgrreleaseall(void)
 	hash_seq_init(&status, SMgrRelationHash);
 
 	while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
-		smgrrelease(reln);
+		smgrclose(reln);
 }
 
 /*
- * smgrcloseall() -- Close all existing SMgrRelation objects.
+ * smgrdestroyall() -- Destroy all SMgrRelation objects.
+ *
+ * It must be known that there are no pointers to SMgrRelations, other than
+ * those registered with smgrsetowner().
  */
 void
-smgrcloseall(void)
+smgrdestroyall(void)
 {
 	HASH_SEQ_STATUS status;
 	SMgrRelation reln;
@@ -335,7 +346,7 @@ smgrcloseall(void)
 	hash_seq_init(&status, SMgrRelationHash);
 
 	while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
-		smgrclose(reln);
+		smgrdestroy(reln);
 }
 
 /*
@@ -726,7 +737,8 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
  * AtEOXact_SMgr
  *
  * This routine is called during transaction commit or abort (it doesn't
- * particularly care which).  All transient SMgrRelation objects are closed.
+ * particularly care which).  All transient SMgrRelation objects are
+ * destroyed.
  *
  * We do this as a compromise between wanting transient SMgrRelations to
  * live awhile (to amortize the costs of blind writes of multiple blocks)
@@ -740,7 +752,7 @@ AtEOXact_SMgr(void)
 	dlist_mutable_iter iter;
 
 	/*
-	 * Zap all unowned SMgrRelations.  We rely on smgrclose() to remove each
+	 * Zap all unowned SMgrRelations.  We rely on smgrdestroy() to remove each
 	 * one from the list.
 	 */
 	dlist_foreach_modify(iter, &unowned_relns)
@@ -750,7 +762,7 @@ AtEOXact_SMgr(void)
 
 		Assert(rel->smgr_owner == NULL);
 
-		smgrclose(rel);
+		smgrdestroy(rel);
 	}
 }
 
@@ -761,6 +773,6 @@ AtEOXact_SMgr(void)
 bool
 ProcessBarrierSmgrRelease(void)
 {
-	smgrreleaseall();
+	smgrcloseall();
 	return true;
 }
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a9a179aaba..18dd8168df 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -85,8 +85,8 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
 extern void smgrclose(SMgrRelation reln);
 extern void smgrcloseall(void);
 extern void smgrcloserellocator(RelFileLocatorBackend rlocator);
-extern void smgrrelease(SMgrRelation reln);
-extern void smgrreleaseall(void);
+extern void smgrdestroy(SMgrRelation reln);
+extern void smgrdestroyall(void);
 extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdosyncall(SMgrRelation *rels, int nrels);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 1426a353cd..a1402c1dd2 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -561,12 +561,6 @@ typedef struct ViewOptions
  *
  * Very little code is authorized to touch rel->rd_smgr directly.  Instead
  * use this function to fetch its value.
- *
- * Note: since a relcache flush can cause the file handle to be closed again,
- * it's unwise to hold onto the pointer returned by this function for any
- * long period.  Recommended practice is to just re-execute RelationGetSmgr
- * each time you need to access the SMgrRelation.  It's quite cheap in
- * comparison to whatever an smgr function is going to do.
  */
 static inline SMgrRelation
 RelationGetSmgr(Relation rel)
-- 
2.39.2

#4Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#3)
Re: Extending SMgrRelation lifetimes

On Thu, Aug 17, 2023 at 1:11 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Still WIP while I think about edge cases, but so far I think this is
the better option.

I think this direction makes a lot of sense. The lack of a defined
lifetime for SMgrRelation objects makes correct programming difficult,
makes efficient programming difficult, and doesn't really have any
advantages. I know this is just a WIP patch but for the final version
I think it would make sense to try to do a bit more work on the
comments. For instance:

- In src/include/utils/rel.h, instead of just deleting that comment,
how about documenting the new object lifetime? Or maybe that comment
belongs elsewhere, but I think it would definitely good to spell it
out very explicitly at some suitable place.

- When we change smgrcloseall() to smgrdestroyall(), maybe it's worth
spelling out why destroying is needed and not just closing. For
example, the second hunk in bgwriter.c includes a comment that says
"After any checkpoint, close all smgr files. This is so we won't hang
onto smgr references to deleted files indefinitely." But maybe it
should say something like "After any checkpoint, close all smgr files
and destroy the associated smgr objects. This guarantees that we close
the actual file descriptors, that we close the File objects as managed
by fd.c, and that we also destroy the smgr objects. We don't want any
of these resources to stick around indefinitely after a relation file
has been deleted."

- Maybe it's worth adding comments around some of the smgrclose[all]
calls to mentioning that in those cases we want the file descriptors
(and File objects?) to get closed but don't want to invalidate
pointers. But I'm less sure that this is necessary. I don't want to
have a million comments everywhere, just enough that someone looking
at this stuff in the future can orient themselves about what's going
on without too much difficulty.

--
Robert Haas
EDB: http://www.enterprisedb.com

#5Thomas Munro
thomas.munro@gmail.com
In reply to: Robert Haas (#4)
2 attachment(s)
Re: Extending SMgrRelation lifetimes

On Fri, Aug 18, 2023 at 2:30 AM Robert Haas <robertmhaas@gmail.com> wrote:

I think this direction makes a lot of sense. The lack of a defined
lifetime for SMgrRelation objects makes correct programming difficult,
makes efficient programming difficult, and doesn't really have any
advantages.

Thanks for looking!

I know this is just a WIP patch but for the final version
I think it would make sense to try to do a bit more work on the
comments. For instance:

- In src/include/utils/rel.h, instead of just deleting that comment,
how about documenting the new object lifetime? Or maybe that comment
belongs elsewhere, but I think it would definitely good to spell it
out very explicitly at some suitable place.

Right, let's one find one good place. I think smgropen() would be best.

- When we change smgrcloseall() to smgrdestroyall(), maybe it's worth
spelling out why destroying is needed and not just closing. For
example, the second hunk in bgwriter.c includes a comment that says
"After any checkpoint, close all smgr files. This is so we won't hang
onto smgr references to deleted files indefinitely." But maybe it
should say something like "After any checkpoint, close all smgr files
and destroy the associated smgr objects. This guarantees that we close
the actual file descriptors, that we close the File objects as managed
by fd.c, and that we also destroy the smgr objects. We don't want any
of these resources to stick around indefinitely after a relation file
has been deleted."

There are several similar comments. I think they can be divided into
two categories:

1. The error-path ones, that we should now just delete along with the
code they describe, because the "various strange errors" should have
been fixed comprehensively by PROCSIGNAL_BARRIER_SMGRRELEASE. Here is
a separate patch to do that.

2. The per-checkpoint ones that still make sense to avoid unbounded
resource usage. Here is a new attempt at explaining:

                        /*
-                        * After any checkpoint, close all smgr files.
This is so we
-                        * won't hang onto smgr references to deleted
files indefinitely.
+                        * After any checkpoint, free all smgr
objects.  Otherwise we
+                        * would never do so for dropped relations, as
the checkpointer
+                        * does not process shared invalidation messages or call
+                        * AtEOXact_SMgr().
                         */
-                       smgrcloseall();
+                       smgrdestroyall();

- Maybe it's worth adding comments around some of the smgrclose[all]
calls to mentioning that in those cases we want the file descriptors
(and File objects?) to get closed but don't want to invalidate
pointers. But I'm less sure that this is necessary. I don't want to
have a million comments everywhere, just enough that someone looking
at this stuff in the future can orient themselves about what's going
on without too much difficulty.

I covered that with the following comment for smgrclose():

+ * The object remains valid, but is moved to the unknown list where it will
+ * be destroyed by AtEOXact_SMgr().  It may be re-owned if it is accessed by a
+ * relation before then.

Attachments:

v3-0001-Remove-some-obsolete-smgrcloseall-calls.patchtext/x-patch; charset=US-ASCII; name=v3-0001-Remove-some-obsolete-smgrcloseall-calls.patchDownload
From fa3beb19e6fafccb0d75d2e3e60d75412757b326 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 23 Aug 2023 14:38:38 +1200
Subject: [PATCH v3 1/2] Remove some obsolete smgrcloseall() calls.

Before the advent of PROCSIGNAL_BARRIER_SMGRRELEASE, we didn't have a
comprehensive way to deal with Windows file handles that get in the way
of unlinking directories.  We had smgrcloseall() calls in a few places
to try to mitigate.

It's still a good idea for bgwriter and checkpointer to do that once per
checkpoint so they don't accumulate unbounded SMgrRelation objects, but
there is no longer any reason to close them at other random places such
as the error path, and the explanation as given in the comments is now
obsolete.

Discussion: https://postgr.es/m/message-id/CA%2BhUKGJ8NTvqLHz6dqbQnt2c8XCki4r2QvXjBQcXpVwxTY_pvA%40mail.gmail.com
---
 src/backend/postmaster/bgwriter.c     | 7 -------
 src/backend/postmaster/checkpointer.c | 7 -------
 src/backend/postmaster/walwriter.c    | 7 -------
 3 files changed, 21 deletions(-)

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index caad642ec9..093cd034ea 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -197,13 +197,6 @@ BackgroundWriterMain(void)
 		 */
 		pg_usleep(1000000L);
 
-		/*
-		 * Close all open files after any error.  This is helpful on Windows,
-		 * where holding deleted files open causes various strange errors.
-		 * It's not clear we need it elsewhere, but shouldn't hurt.
-		 */
-		smgrcloseall();
-
 		/* Report wait end here, when there is no further possibility of wait */
 		pgstat_report_wait_end();
 	}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index ace9893d95..0e1c60ca72 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -310,13 +310,6 @@ CheckpointerMain(void)
 		 * fast as we can.
 		 */
 		pg_usleep(1000000L);
-
-		/*
-		 * Close all open files after any error.  This is helpful on Windows,
-		 * where holding deleted files open causes various strange errors.
-		 * It's not clear we need it elsewhere, but shouldn't hurt.
-		 */
-		smgrcloseall();
 	}
 
 	/* We can now handle ereport(ERROR) */
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 266fbc2339..74526cf133 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -189,13 +189,6 @@ WalWriterMain(void)
 		 * fast as we can.
 		 */
 		pg_usleep(1000000L);
-
-		/*
-		 * Close all open files after any error.  This is helpful on Windows,
-		 * where holding deleted files open causes various strange errors.
-		 * It's not clear we need it elsewhere, but shouldn't hurt.
-		 */
-		smgrcloseall();
 	}
 
 	/* We can now handle ereport(ERROR) */
-- 
2.39.2

v3-0002-Give-SMgrRelation-pointers-a-well-defined-lifetim.patchtext/x-patch; charset=US-ASCII; name=v3-0002-Give-SMgrRelation-pointers-a-well-defined-lifetim.patchDownload
From 620f328d645742d94a352258d081dad1a55f495f Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 10 Aug 2023 18:16:31 +1200
Subject: [PATCH v3 2/2] Give SMgrRelation pointers a well-defined lifetime.

After calling smgropen(), it was not clear how long you could continue
to use the result, because various code paths including cache
invalidation could call smgrclose(), which freed the memory.

Guarantee that the object won't be destroyed until the end of the
current transaction, or in recovery, the commit/abort record that
destroys the underlying storage.

The existing smgrclose() function now closes files and forgets all state
except the rlocator, like smgrrelease() used to do, and additionally
"disowns" the object so that it is disconnected from any Relation and
queued for destruction at AtEOXact_SMgr(), unless it is re-owned by a
Relation again before then.

A new smgrdestroy() function is used by rare places that know there
should be no other references to the SMgrRelation.

The short version:

 * smgrrelease() is removed
 * smgrclose() now releases resources, but doesn't destroy until EOX
 * smgrdestroy() now frees memory, and should rarely be used

Existing code should be unaffected, but it is now possible for code that
has an SMgrRelation object to use it repeatedly during a transaction as
long as the storage hasn't been physically dropped.  Such code would
normally hold a lock on the relation.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ8NTvqLHz6dqbQnt2c8XCki4r2QvXjBQcXpVwxTY_pvA%40mail.gmail.com
---
 src/backend/access/transam/xlogutils.c |  2 +-
 src/backend/postmaster/bgwriter.c      |  8 +++--
 src/backend/postmaster/checkpointer.c  | 15 ++++----
 src/backend/storage/smgr/smgr.c        | 49 ++++++++++++++++----------
 src/include/storage/smgr.h             |  4 +--
 src/include/utils/rel.h                |  6 ----
 6 files changed, 46 insertions(+), 38 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 43f7b31205..bdc0f7e1da 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -657,7 +657,7 @@ XLogDropDatabase(Oid dbid)
 	 * This is unnecessarily heavy-handed, as it will close SMgrRelation
 	 * objects for other databases as well. DROP DATABASE occurs seldom enough
 	 * that it's not worth introducing a variant of smgrclose for just this
-	 * purpose. XXX: Or should we rather leave the smgr entries dangling?
+	 * purpose.
 	 */
 	smgrcloseall();
 
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 093cd034ea..5eea8b7af0 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -238,10 +238,12 @@ BackgroundWriterMain(void)
 		if (FirstCallSinceLastCheckpoint())
 		{
 			/*
-			 * After any checkpoint, close all smgr files.  This is so we
-			 * won't hang onto smgr references to deleted files indefinitely.
+			 * After any checkpoint, free all smgr objects.  Otherwise we
+			 * would never do so for dropped relations, as the bgwriter does
+			 * not process shared invalidation messages or call
+			 * AtEOXact_SMgr().
 			 */
-			smgrcloseall();
+			smgrdestroyall();
 		}
 
 		/*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 0e1c60ca72..0947ec62e2 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -451,10 +451,12 @@ CheckpointerMain(void)
 				ckpt_performed = CreateRestartPoint(flags);
 
 			/*
-			 * After any checkpoint, close all smgr files.  This is so we
-			 * won't hang onto smgr references to deleted files indefinitely.
+			 * After any checkpoint, free all smgr objects.  Otherwise we
+			 * would never do so for dropped relations, as the checkpointer
+			 * does not process shared invalidation messages or call
+			 * AtEOXact_SMgr().
 			 */
-			smgrcloseall();
+			smgrdestroyall();
 
 			/*
 			 * Indicate checkpoint completion to any waiting backends.
@@ -937,11 +939,8 @@ RequestCheckpoint(int flags)
 		 */
 		CreateCheckPoint(flags | CHECKPOINT_IMMEDIATE);
 
-		/*
-		 * After any checkpoint, close all smgr files.  This is so we won't
-		 * hang onto smgr references to deleted files indefinitely.
-		 */
-		smgrcloseall();
+		/* Free all smgr objects, as CheckpointerMain() normally would. */
+		smgrdestroyall();
 
 		return;
 	}
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 5d0f3d515c..d8e28c543e 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -144,7 +144,9 @@ smgrshutdown(int code, Datum arg)
 /*
  * smgropen() -- Return an SMgrRelation object, creating it if need be.
  *
- * This does not attempt to actually open the underlying file.
+ * This does not attempt to actually open the underlying files.  The returned
+ * object remains valid at least until AtEOXact_SMgr() is called, or until
+ * smgrdestroy() is called in non-transactional backends.
  */
 SMgrRelation
 smgropen(RelFileLocator rlocator, BackendId backend)
@@ -254,10 +256,10 @@ smgrexists(SMgrRelation reln, ForkNumber forknum)
 }
 
 /*
- * smgrclose() -- Close and delete an SMgrRelation object.
+ * smgrdestroy() -- Delete an SMgrRelation object.
  */
 void
-smgrclose(SMgrRelation reln)
+smgrdestroy(SMgrRelation reln)
 {
 	SMgrRelation *owner;
 	ForkNumber	forknum;
@@ -284,12 +286,14 @@ smgrclose(SMgrRelation reln)
 }
 
 /*
- * smgrrelease() -- Release all resources used by this object.
+ * smgrclose() -- Release all resources used by this object.
  *
- * The object remains valid.
+ * The object remains valid, but is moved to the unknown list where it will
+ * be destroyed by AtEOXact_SMgr().  It may be re-owned if it is accessed by a
+ * relation before then.
  */
 void
-smgrrelease(SMgrRelation reln)
+smgrclose(SMgrRelation reln)
 {
 	for (ForkNumber forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 	{
@@ -297,15 +301,20 @@ smgrrelease(SMgrRelation reln)
 		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
 	}
 	reln->smgr_targblock = InvalidBlockNumber;
+
+	if (reln->smgr_owner)
+	{
+		*reln->smgr_owner = NULL;
+		reln->smgr_owner = NULL;
+		dlist_push_tail(&unowned_relns, &reln->node);
+	}
 }
 
 /*
- * smgrreleaseall() -- Release resources used by all objects.
- *
- * This is called for PROCSIGNAL_BARRIER_SMGRRELEASE.
+ * smgrcloseall() -- Close all objects.
  */
 void
-smgrreleaseall(void)
+smgrcloseall(void)
 {
 	HASH_SEQ_STATUS status;
 	SMgrRelation reln;
@@ -317,14 +326,17 @@ smgrreleaseall(void)
 	hash_seq_init(&status, SMgrRelationHash);
 
 	while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
-		smgrrelease(reln);
+		smgrclose(reln);
 }
 
 /*
- * smgrcloseall() -- Close all existing SMgrRelation objects.
+ * smgrdestroyall() -- Destroy all SMgrRelation objects.
+ *
+ * It must be known that there are no pointers to SMgrRelations, other than
+ * those registered with smgrsetowner().
  */
 void
-smgrcloseall(void)
+smgrdestroyall(void)
 {
 	HASH_SEQ_STATUS status;
 	SMgrRelation reln;
@@ -336,7 +348,7 @@ smgrcloseall(void)
 	hash_seq_init(&status, SMgrRelationHash);
 
 	while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
-		smgrclose(reln);
+		smgrdestroy(reln);
 }
 
 /*
@@ -727,7 +739,8 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
  * AtEOXact_SMgr
  *
  * This routine is called during transaction commit or abort (it doesn't
- * particularly care which).  All transient SMgrRelation objects are closed.
+ * particularly care which).  All transient SMgrRelation objects are
+ * destroyed.
  *
  * We do this as a compromise between wanting transient SMgrRelations to
  * live awhile (to amortize the costs of blind writes of multiple blocks)
@@ -741,7 +754,7 @@ AtEOXact_SMgr(void)
 	dlist_mutable_iter iter;
 
 	/*
-	 * Zap all unowned SMgrRelations.  We rely on smgrclose() to remove each
+	 * Zap all unowned SMgrRelations.  We rely on smgrdestroy() to remove each
 	 * one from the list.
 	 */
 	dlist_foreach_modify(iter, &unowned_relns)
@@ -751,7 +764,7 @@ AtEOXact_SMgr(void)
 
 		Assert(rel->smgr_owner == NULL);
 
-		smgrclose(rel);
+		smgrdestroy(rel);
 	}
 }
 
@@ -762,6 +775,6 @@ AtEOXact_SMgr(void)
 bool
 ProcessBarrierSmgrRelease(void)
 {
-	smgrreleaseall();
+	smgrcloseall();
 	return true;
 }
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a9a179aaba..18dd8168df 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -85,8 +85,8 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
 extern void smgrclose(SMgrRelation reln);
 extern void smgrcloseall(void);
 extern void smgrcloserellocator(RelFileLocatorBackend rlocator);
-extern void smgrrelease(SMgrRelation reln);
-extern void smgrreleaseall(void);
+extern void smgrdestroy(SMgrRelation reln);
+extern void smgrdestroyall(void);
 extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdosyncall(SMgrRelation *rels, int nrels);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 1426a353cd..a1402c1dd2 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -561,12 +561,6 @@ typedef struct ViewOptions
  *
  * Very little code is authorized to touch rel->rd_smgr directly.  Instead
  * use this function to fetch its value.
- *
- * Note: since a relcache flush can cause the file handle to be closed again,
- * it's unwise to hold onto the pointer returned by this function for any
- * long period.  Recommended practice is to just re-execute RelationGetSmgr
- * each time you need to access the SMgrRelation.  It's quite cheap in
- * comparison to whatever an smgr function is going to do.
  */
 static inline SMgrRelation
 RelationGetSmgr(Relation rel)
-- 
2.39.2

#6Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#5)
Re: Extending SMgrRelation lifetimes

I think that if you believe 0001 to be correct you should go ahead and
commit it sooner rather than later. If you're wrong and something
weird starts happening we'll then have a chance to notice that before
too much other stuff gets changed on top of this and confuses the
matter.

On Wed, Aug 23, 2023 at 12:55 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Right, let's one find one good place. I think smgropen() would be best.

I think it would be a good idea to give this comment a bit more oomph.
In lieu of this:

+ * This does not attempt to actually open the underlying files.  The returned
+ * object remains valid at least until AtEOXact_SMgr() is called, or until
+ * smgrdestroy() is called in non-transactional backends.

I would leave the existing "This does not attempt to actually open the
underlying files." comment as a separate comment, and add something
like this as a new paragraph:

In versions of PostgreSQL prior to 17, this function returned an
object with no defined lifetime. Now, however, the object remains
valid for the lifetime of the transaction, up to the point where
AtEOXact_SMgr() is called, making it much easier for callers to know
for how long they can hold on to a pointer to the returned object. If
this function is called outside of a transaction, the object remains
valid until smgrdestroy() or smgrdestroyall() is called. Background
processes that use smgr but not transactions typically do this once
per checkpoint cycle.

Apart from that, the main thing that is bothering me is that the
justification for redefining smgrclose() to do what smgrrelease() now
does isn't really spelled out anywhere. You mentioned some reasons and
Heikki mentioned the benefit to extensions, but I feel like somebody
should be able to understand the reasoning clearly from reading the
commit message or the comments in the patch, rather than having to
visit the mailing list discussion, and I'm not sure we're there yet. I
feel like I understood why we were doing this and was convinced it was
a good idea at some point, but now the reasoning has gone out of my
head and I can't recover it. If somebody does smgropen() .. stuff ...
smgrclose() as in heapam_relation_copy_data() or index_copy_data(),
this change has the effect of making the SmgrRelation remain valid
until eoxact whereas before it would have been destroyed instantly. Is
that what we want? Presumably yes, or this patch wouldn't be shaped
like this, but I don't know why we want that...

Another thing that seems a bit puzzling is how this is intended to, or
does, interact with the ownership mechanism. Intuitively, I feel like
a Relation owns an SMgrRelation *because* the SMgrRelation has no
defined lifetime. But I think that's not quite right. I guess the
ownership mechanism doesn't guarantee anything about the lifetime of
the object, just that the pointer in the Relation won't hang around
any longer than the object to which it's pointing. So does that mean
that we're free to redefine the object lifetime to be pretty much
anything we want and that mechanism doesn't really care? Huh,
interesting.

--
Robert Haas
EDB: http://www.enterprisedb.com

#7Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Robert Haas (#6)
3 attachment(s)
Re: Extending SMgrRelation lifetimes

I spent some more time digging into this, experimenting with different
approaches. Came up with pretty significant changes; see below:

On 18/09/2023 18:19, Robert Haas wrote:

I think that if you believe 0001 to be correct you should go ahead and
commit it sooner rather than later. If you're wrong and something
weird starts happening we'll then have a chance to notice that before
too much other stuff gets changed on top of this and confuses the
matter.

+1

On Wed, Aug 23, 2023 at 12:55 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Right, let's one find one good place. I think smgropen() would be best.

I think it would be a good idea to give this comment a bit more oomph.
In lieu of this:

+ * This does not attempt to actually open the underlying files.  The returned
+ * object remains valid at least until AtEOXact_SMgr() is called, or until
+ * smgrdestroy() is called in non-transactional backends.

I would leave the existing "This does not attempt to actually open the
underlying files." comment as a separate comment, and add something
like this as a new paragraph:

In versions of PostgreSQL prior to 17, this function returned an
object with no defined lifetime. Now, however, the object remains
valid for the lifetime of the transaction, up to the point where
AtEOXact_SMgr() is called, making it much easier for callers to know
for how long they can hold on to a pointer to the returned object. If
this function is called outside of a transaction, the object remains
valid until smgrdestroy() or smgrdestroyall() is called. Background
processes that use smgr but not transactions typically do this once
per checkpoint cycle.

+1

Apart from that, the main thing that is bothering me is that the
justification for redefining smgrclose() to do what smgrrelease() now
does isn't really spelled out anywhere. You mentioned some reasons and
Heikki mentioned the benefit to extensions, but I feel like somebody
should be able to understand the reasoning clearly from reading the
commit message or the comments in the patch, rather than having to
visit the mailing list discussion, and I'm not sure we're there yet. I
feel like I understood why we were doing this and was convinced it was
a good idea at some point, but now the reasoning has gone out of my
head and I can't recover it. If somebody does smgropen() .. stuff ...
smgrclose() as in heapam_relation_copy_data() or index_copy_data(),
this change has the effect of making the SmgrRelation remain valid
until eoxact whereas before it would have been destroyed instantly. Is
that what we want? Presumably yes, or this patch wouldn't be shaped
like this, but I don't know why we want that...

Fair. I tried to address that by adding an overview comment at top of
smgr.c, explaining how this stuff work. I hope that helps.

Another thing that seems a bit puzzling is how this is intended to, or
does, interact with the ownership mechanism. Intuitively, I feel like
a Relation owns an SMgrRelation *because* the SMgrRelation has no
defined lifetime. But I think that's not quite right. I guess the
ownership mechanism doesn't guarantee anything about the lifetime of
the object, just that the pointer in the Relation won't hang around
any longer than the object to which it's pointing. So does that mean
that we're free to redefine the object lifetime to be pretty much
anything we want and that mechanism doesn't really care? Huh,
interesting.

Yeah that owner mechanism is weird. It guarantees that the pointer to
the SMgrRelation is cleared when the SMgrRelation is destroyed. But it
also prevents the SMgrRelation from being destroyed at end of
transaction. That's how it is in 'master' too.

But with this patch, we don't normally call smgrdestroy() on an
SMgrRelation that came from the relation cache. We do call
smgrdestroyall() in the aux processes, but they don't have a relcache.
So the real effect of setting the owner now is to prevent the
SMgrRelation from being destroyed at end of transaction; the mechanism
of clearing the pointer is unused.

I found two exceptions to that, though, by adding some extra assertions
and running the regression tests:

1. The smgrdestroyall() in a single-user backend in RequestCheckpoint().
It destroys SMgrRelations belonging to relcache entries, and the owner
mechanism clears the pointers from the relcache. I think smgrcloseall(),
or doing nothing, would actually be more appropriate there.

2. A funny case with foreign tables: ANALYZE on a foreign table calls
visibilitymap_count(). A foreign table has no visibility map so it
returns 0, but before doing so it calls RelationGetSmgr on the foreign
table, which has 0/0/0 rellocator. That creates an SMgrRelation for
0/0/0, and sets the foreign table's relcache entry as its owner. If you
then call ANALYZE on another foreign table, it also calls
RelationGetSmgr with 0/0/0 rellocator, returning the same SMgrRelation
entry, and changes its owner to the new relcache entry. That doesn't
make much sense and it's pretty accidental that it works at all, so
attached is a patch to avoid calling visibilitymap_count() on foreign
tables.

I propose that we replace the single owner with a "pin counter". One
SMgrRelation can have more than one pin on it, and the guarantee is that
as long as the pin counter is non-zero, the SMgrRelation cannot be
destroyed and the pointer remains valid. We don't really need the
capability for more than one pin at the moment (the regression tests
pass with an extra assertion that pincount <= 1 after fixing the foreign
table issue), but it seems more straightforward than tracking an owner.

Here's another reason to do that: I noticed this at the end of
swap_relation_files():

/*
* Close both relcache entries' smgr links. We need this kluge because
* both links will be invalidated during upcoming CommandCounterIncrement.
* Whichever of the rels is the second to be cleared will have a dangling
* reference to the other's smgr entry. Rather than trying to avoid this
* by ordering operations just so, it's easiest to close the links first.
* (Fortunately, since one of the entries is local in our transaction,
* it's sufficient to clear out our own relcache this way; the problem
* cannot arise for other backends when they see our update on the
* non-transient relation.)
*
* Caution: the placement of this step interacts with the decision to
* handle toast rels by recursion. When we are trying to rebuild pg_class
* itself, the smgr close on pg_class must happen after all accesses in
* this function.
*/
RelationCloseSmgrByOid(r1);
RelationCloseSmgrByOid(r2);

If RelationCloseSmgrByOid() merely closes the underlying file descriptor
but doesn't destroy the SMgrRelation object - as it does with these
patches - I think we reintroduce the dangling reference problem that the
comment mentions. But if we allow the same SMgrRelation to be pinned by
two different relcache entries, the problem goes away and we can remove
that kluge.

I think we're missing test coverage for that though. I commented out
those calls in 'master' and ran the regression tests, but got no
failures. I don't fully understand the problem anyway. Or does it not
exist anymore? Is there a moment where we have two relcache entries
pointing to the same SMgrRelation? I don't see it. In any case, with a
pin counter mechanism, I believe it would be fine.

Summary of the changes to the attached main patch:

* Added an overview comment at top of smgr.c

* Added the comment Robert suggested to smgropen()

* Replaced the single owner with a pin count and smgrpin() / smgrunpin()
functions. smgrdestroyall() now only destroys unpinned entries

* Removed that kluge from swap_relation_files(). It should not be needed
anymore with the pin counter.

* Changed a few places in bufmgr.c where we called RelationGetSmgr() on
every smgr call to keep the SMgrRelation in a local variable. That was
not safe before, but is now. I don't think we should go on a spree to
change all callers - RelationGetSmgr() is still cheap - but in these few
places it seems worthwhile.

* I kept the separate smgrclose() and smgrrelease() functions. I know I
suggested to just change smgrclose() to do what smgrrelease() did, but
on second thoughts keeping them separate seems nicer. However,
sgmgrclose() just calls smgrrelease() now, so the distinction is just
pro forma. The idea is that if you call smgrclose(), you hint that you
don't need that SMgrRelation pointer anymore, although there might be
other pointers to the same object and they stay valid.

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachments:

0001-Remove-some-obsolete-smgrcloseall-calls.patchtext/x-patch; charset=UTF-8; name=0001-Remove-some-obsolete-smgrcloseall-calls.patchDownload
From 23530bcdd36ba7dafbaa7aaf29890dc528d07fad Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 23 Aug 2023 14:38:38 +1200
Subject: [PATCH 1/3] Remove some obsolete smgrcloseall() calls.

Before the advent of PROCSIGNAL_BARRIER_SMGRRELEASE, we didn't have a
comprehensive way to deal with Windows file handles that get in the way
of unlinking directories.  We had smgrcloseall() calls in a few places
to try to mitigate.

It's still a good idea for bgwriter and checkpointer to do that once per
checkpoint so they don't accumulate unbounded SMgrRelation objects, but
there is no longer any reason to close them at other random places such
as the error path, and the explanation as given in the comments is now
obsolete.

Discussion: https://postgr.es/m/message-id/CA%2BhUKGJ8NTvqLHz6dqbQnt2c8XCki4r2QvXjBQcXpVwxTY_pvA%40mail.gmail.com
---
 src/backend/postmaster/bgwriter.c     | 7 -------
 src/backend/postmaster/checkpointer.c | 7 -------
 src/backend/postmaster/walwriter.c    | 7 -------
 3 files changed, 21 deletions(-)

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index d02dc17b9c1..e14c88ac060 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -197,13 +197,6 @@ BackgroundWriterMain(void)
 		 */
 		pg_usleep(1000000L);
 
-		/*
-		 * Close all open files after any error.  This is helpful on Windows,
-		 * where holding deleted files open causes various strange errors.
-		 * It's not clear we need it elsewhere, but shouldn't hurt.
-		 */
-		smgrcloseall();
-
 		/* Report wait end here, when there is no further possibility of wait */
 		pgstat_report_wait_end();
 	}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 42c807d3528..89ab97bdbc7 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -301,13 +301,6 @@ CheckpointerMain(void)
 		 * fast as we can.
 		 */
 		pg_usleep(1000000L);
-
-		/*
-		 * Close all open files after any error.  This is helpful on Windows,
-		 * where holding deleted files open causes various strange errors.
-		 * It's not clear we need it elsewhere, but shouldn't hurt.
-		 */
-		smgrcloseall();
 	}
 
 	/* We can now handle ereport(ERROR) */
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 48bc92205b5..8789ffb01d0 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -189,13 +189,6 @@ WalWriterMain(void)
 		 * fast as we can.
 		 */
 		pg_usleep(1000000L);
-
-		/*
-		 * Close all open files after any error.  This is helpful on Windows,
-		 * where holding deleted files open causes various strange errors.
-		 * It's not clear we need it elsewhere, but shouldn't hurt.
-		 */
-		smgrcloseall();
 	}
 
 	/* We can now handle ereport(ERROR) */
-- 
2.39.2

0002-Don-t-try-to-open-visibilitymap-when-analyzing-a-for.patchtext/x-patch; charset=UTF-8; name=0002-Don-t-try-to-open-visibilitymap-when-analyzing-a-for.patchDownload
From 7892cb9642286b81cdb45526f8d23117ed892578 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 29 Nov 2023 14:07:52 +0200
Subject: [PATCH 2/3] Don't try to open visibilitymap when analyzing a foreign
 table.

Also add an assertion in smgropen() that the relnumber is valid.
---
 src/backend/commands/analyze.c  | 5 ++++-
 src/backend/storage/smgr/smgr.c | 2 ++
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index e264ffdcf28..1f4a9516814 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -634,7 +634,10 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 	{
 		BlockNumber relallvisible;
 
-		visibilitymap_count(onerel, &relallvisible, NULL);
+		if (RELKIND_HAS_STORAGE(onerel->rd_rel->relkind))
+			visibilitymap_count(onerel, &relallvisible, NULL);
+		else
+			relallvisible = 0;
 
 		/* Update pg_class for table relation */
 		vac_update_relstats(onerel,
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 5d0f3d515c3..4c552649336 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -153,6 +153,8 @@ smgropen(RelFileLocator rlocator, BackendId backend)
 	SMgrRelation reln;
 	bool		found;
 
+	Assert(RelFileNumberIsValid(rlocator.relNumber));
+
 	if (SMgrRelationHash == NULL)
 	{
 		/* First time through: initialize the hash table */
-- 
2.39.2

0003-Give-SMgrRelation-pointers-a-well-defined-lifetime.patchtext/x-patch; charset=UTF-8; name=0003-Give-SMgrRelation-pointers-a-well-defined-lifetime.patchDownload
From 0c62df4fa2be891d61e380b354e6df6c7f3c5aff Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 10 Aug 2023 18:16:31 +1200
Subject: [PATCH 3/3] Give SMgrRelation pointers a well-defined lifetime.

After calling smgropen(), it was not clear how long you could continue
to use the result, because various code paths including cache
invalidation could call smgrclose(), which freed the memory.

Guarantee that the object won't be destroyed until the end of the
current transaction, or in recovery, the commit/abort record that
destroys the underlying storage.

smgrclose() is now just an alias for smgrrelease(). It closes files
and forgets all state except the rlocator, but keeps the SMgrRelation
object valid.

A new smgrdestroy() function is used by rare places that know there
should be no other references to the SMgrRelation.

The short version:

 * smgrclose() is now just an alias for smgrrelease(). It releases
   resources, but doesn't destroy until EOX
 * smgrdestroy() now frees memory, and should rarely be used.

Existing code should be unaffected, but it is now possible for code that
has an SMgrRelation object to use it repeatedly during a transaction as
long as the storage hasn't been physically dropped.  Such code would
normally hold a lock on the relation.

This also replaces the "ownership" mechanism of SMgrRelations with a
pin counter.  An SMgrRelation can now be "pinned", which prevents it
from being destroyed at end of transaction.  There can be multiple pins
on the same SMgrRelation.  In practice, the pin mechanism is only used
by the relcache, so there cannot be more than one pin on the same
SMgrRelation.  Except with swap_relation_files XXX

Author: Thomas Munro, Heikki Linnakangas
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ8NTvqLHz6dqbQnt2c8XCki4r2QvXjBQcXpVwxTY_pvA%40mail.gmail.com
---
 src/backend/access/transam/xlogutils.c    |  15 +-
 src/backend/commands/cluster.c            |  19 --
 src/backend/postmaster/bgwriter.c         |   8 +-
 src/backend/postmaster/checkpointer.c     |  15 +-
 src/backend/storage/buffer/bufmgr.c       |  32 +--
 src/backend/storage/freespace/freespace.c |   9 +-
 src/backend/storage/smgr/smgr.c           | 234 ++++++++++++----------
 src/backend/utils/cache/inval.c           |   2 +-
 src/backend/utils/cache/relcache.c        |  21 +-
 src/include/storage/smgr.h                |  36 ++--
 src/include/utils/rel.h                   |  18 +-
 src/include/utils/relcache.h              |   2 -
 12 files changed, 184 insertions(+), 227 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 43f7b31205d..e7d55dd9b03 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -616,7 +616,11 @@ CreateFakeRelcacheEntry(RelFileLocator rlocator)
 	rel->rd_lockInfo.lockRelId.dbId = rlocator.dbOid;
 	rel->rd_lockInfo.lockRelId.relId = rlocator.relNumber;
 
-	rel->rd_smgr = NULL;
+	/*
+	 * Set up a non-pinned SMgrRelation reference, so that we don't need to
+	 * worry about unpinning it on error.
+	 */
+	rel->rd_smgr = smgropen(rlocator, InvalidBackendId);
 
 	return rel;
 }
@@ -627,9 +631,6 @@ CreateFakeRelcacheEntry(RelFileLocator rlocator)
 void
 FreeFakeRelcacheEntry(Relation fakerel)
 {
-	/* make sure the fakerel is not referenced by the SmgrRelation anymore */
-	if (fakerel->rd_smgr != NULL)
-		smgrclearowner(&fakerel->rd_smgr, fakerel->rd_smgr);
 	pfree(fakerel);
 }
 
@@ -656,10 +657,10 @@ XLogDropDatabase(Oid dbid)
 	/*
 	 * This is unnecessarily heavy-handed, as it will close SMgrRelation
 	 * objects for other databases as well. DROP DATABASE occurs seldom enough
-	 * that it's not worth introducing a variant of smgrclose for just this
-	 * purpose. XXX: Or should we rather leave the smgr entries dangling?
+	 * that it's not worth introducing a variant of smgrdestroy for just this
+	 * purpose.
 	 */
-	smgrcloseall();
+	smgrdestroyall();
 
 	forget_invalid_pages_db(dbid);
 }
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index a3bef6ac34f..7113362eb36 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1422,25 +1422,6 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
 	heap_freetuple(reltup2);
 
 	table_close(relRelation, RowExclusiveLock);
-
-	/*
-	 * Close both relcache entries' smgr links.  We need this kluge because
-	 * both links will be invalidated during upcoming CommandCounterIncrement.
-	 * Whichever of the rels is the second to be cleared will have a dangling
-	 * reference to the other's smgr entry.  Rather than trying to avoid this
-	 * by ordering operations just so, it's easiest to close the links first.
-	 * (Fortunately, since one of the entries is local in our transaction,
-	 * it's sufficient to clear out our own relcache this way; the problem
-	 * cannot arise for other backends when they see our update on the
-	 * non-transient relation.)
-	 *
-	 * Caution: the placement of this step interacts with the decision to
-	 * handle toast rels by recursion.  When we are trying to rebuild pg_class
-	 * itself, the smgr close on pg_class must happen after all accesses in
-	 * this function.
-	 */
-	RelationCloseSmgrByOid(r1);
-	RelationCloseSmgrByOid(r2);
 }
 
 /*
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index e14c88ac060..ffdcbe2d902 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -239,10 +239,12 @@ BackgroundWriterMain(void)
 		if (FirstCallSinceLastCheckpoint())
 		{
 			/*
-			 * After any checkpoint, close all smgr files.  This is so we
-			 * won't hang onto smgr references to deleted files indefinitely.
+			 * After any checkpoint, free all smgr objects.  Otherwise we
+			 * would never do so for dropped relations, as the bgwriter does
+			 * not process shared invalidation messages or call
+			 * AtEOXact_SMgr().
 			 */
-			smgrcloseall();
+			smgrdestroyall();
 		}
 
 		/*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 89ab97bdbc7..46c3f034d5d 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -442,10 +442,12 @@ CheckpointerMain(void)
 				ckpt_performed = CreateRestartPoint(flags);
 
 			/*
-			 * After any checkpoint, close all smgr files.  This is so we
-			 * won't hang onto smgr references to deleted files indefinitely.
+			 * After any checkpoint, free all smgr objects.  Otherwise we
+			 * would never do so for dropped relations, as the checkpointer
+			 * does not process shared invalidation messages or call
+			 * AtEOXact_SMgr().
 			 */
-			smgrcloseall();
+			smgrdestroyall();
 
 			/*
 			 * Indicate checkpoint completion to any waiting backends.
@@ -928,11 +930,8 @@ RequestCheckpoint(int flags)
 		 */
 		CreateCheckPoint(flags | CHECKPOINT_IMMEDIATE);
 
-		/*
-		 * After any checkpoint, close all smgr files.  This is so we won't
-		 * hang onto smgr references to deleted files indefinitely.
-		 */
-		smgrcloseall();
+		/* Free all smgr objects, as CheckpointerMain() normally would. */
+		smgrdestroyall();
 
 		return;
 	}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f7c67d504cd..9beeab9d04e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -934,10 +934,6 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 	{
 		LockRelationForExtension(bmr.rel, ExclusiveLock);
 
-		/* could have been closed while waiting for lock */
-		if (bmr.rel)
-			bmr.smgr = RelationGetSmgr(bmr.rel);
-
 		/* recheck, fork might have been created concurrently */
 		if (!smgrexists(bmr.smgr, fork))
 			smgrcreate(bmr.smgr, fork, flags & EB_PERFORMING_RECOVERY);
@@ -1897,11 +1893,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 	 * we get the lock.
 	 */
 	if (!(flags & EB_SKIP_EXTENSION_LOCK))
-	{
 		LockRelationForExtension(bmr.rel, ExclusiveLock);
-		if (bmr.rel)
-			bmr.smgr = RelationGetSmgr(bmr.rel);
-	}
 
 	/*
 	 * If requested, invalidate size cache, so that smgrnblocks asks the
@@ -4155,6 +4147,7 @@ FlushRelationBuffers(Relation rel)
 {
 	int			i;
 	BufferDesc *bufHdr;
+	SMgrRelation srel = RelationGetSmgr(rel);
 
 	if (RelationUsesLocalBuffers(rel))
 	{
@@ -4183,7 +4176,7 @@ FlushRelationBuffers(Relation rel)
 
 				io_start = pgstat_prepare_io_time();
 
-				smgrwrite(RelationGetSmgr(rel),
+				smgrwrite(srel,
 						  BufTagGetForkNum(&bufHdr->tag),
 						  bufHdr->tag.blockNum,
 						  localpage,
@@ -4229,7 +4222,7 @@ FlushRelationBuffers(Relation rel)
 		{
 			PinBuffer_Locked(bufHdr);
 			LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-			FlushBuffer(bufHdr, RelationGetSmgr(rel), IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+			FlushBuffer(bufHdr, srel, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
 			LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
 			UnpinBuffer(bufHdr);
 		}
@@ -4442,13 +4435,17 @@ void
 CreateAndCopyRelationData(RelFileLocator src_rlocator,
 						  RelFileLocator dst_rlocator, bool permanent)
 {
-	RelFileLocatorBackend rlocator;
 	char		relpersistence;
+	SMgrRelation src_rel;
+	SMgrRelation dst_rel;
 
 	/* Set the relpersistence. */
 	relpersistence = permanent ?
 		RELPERSISTENCE_PERMANENT : RELPERSISTENCE_UNLOGGED;
 
+	src_rel = smgropen(src_rlocator, InvalidBackendId);
+	dst_rel = smgropen(dst_rlocator, InvalidBackendId);
+
 	/*
 	 * Create and copy all forks of the relation.  During create database we
 	 * have a separate cleanup mechanism which deletes complete database
@@ -4465,9 +4462,9 @@ CreateAndCopyRelationData(RelFileLocator src_rlocator,
 	for (ForkNumber forkNum = MAIN_FORKNUM + 1;
 		 forkNum <= MAX_FORKNUM; forkNum++)
 	{
-		if (smgrexists(smgropen(src_rlocator, InvalidBackendId), forkNum))
+		if (smgrexists(src_rel, forkNum))
 		{
-			smgrcreate(smgropen(dst_rlocator, InvalidBackendId), forkNum, false);
+			smgrcreate(dst_rel, forkNum, false);
 
 			/*
 			 * WAL log creation if the relation is persistent, or this is the
@@ -4481,15 +4478,6 @@ CreateAndCopyRelationData(RelFileLocator src_rlocator,
 										   permanent);
 		}
 	}
-
-	/* close source and destination smgr if exists. */
-	rlocator.backend = InvalidBackendId;
-
-	rlocator.locator = src_rlocator;
-	smgrcloserellocator(rlocator);
-
-	rlocator.locator = dst_rlocator;
-	smgrcloserellocator(rlocator);
 }
 
 /* ---------------------------------------------------------------------
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index fb9440ff72f..71b386137f0 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -532,14 +532,7 @@ fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
 {
 	BlockNumber blkno = fsm_logical_to_physical(addr);
 	Buffer		buf;
-	SMgrRelation reln;
-
-	/*
-	 * Caution: re-using this smgr pointer could fail if the relcache entry
-	 * gets closed.  It's safe as long as we only do smgr-level operations
-	 * between here and the last use of the pointer.
-	 */
-	reln = RelationGetSmgr(rel);
+	SMgrRelation reln = RelationGetSmgr(rel);
 
 	/*
 	 * If we haven't cached the size of the FSM yet, check it first.  Also
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 4c552649336..419fba0b2b9 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -3,8 +3,42 @@
  * smgr.c
  *	  public interface routines to storage manager switch.
  *
- *	  All file system operations in POSTGRES dispatch through these
- *	  routines.
+ * All file system operations on relations dispatch through these routines.
+ * An SMgrRelation represents physical on-disk relation files that are open
+ * for reading and writing.
+ *
+ * When a relation is first accessed through the relation cache, the
+ * corresponding SMgrRelation entry is opened by calling smgropen(), and the
+ * reference is stored in the relation cache entry.
+ *
+ * Accesses that don't go through the relation cache open the SMgrRelation
+ * directly.  That includes flushing buffers from the buffer cache, as well as
+ * all accesses in auxiliary processes like the checkpointer or the WAL redo
+ * in the startup process.
+ *
+ * Operations like CREATE, DROP, ALTER TABLE also hold SMgrRelation references
+ * independent of the relation cache.  They need to prepare the physical files
+ * before updating the relation cache.
+ *
+ * There is a hash table that holds all the SMgrRelation entries in the
+ * backend.  If you call smgropen() twice for the same rel locator, you get a
+ * reference to the same SMgrRelation. The reference is valid until the end of
+ * transaction.  This makes repeated access to the same relation efficient,
+ * and allows caching things like the relation size in the SMgrRelation entry.
+ *
+ * At end of transaction, all SMgrRelation entries that haven't been pinned
+ * are removed.  An SMgrRelation can hold kernel file system descriptors for
+ * the underlying files, and we'd like to close those reasonably soon if the
+ * file gets deleted.  The SMgrRelations references held by the relcache are
+ * pinned to prevent them from being closed.
+ *
+ * There is another mechanism to close file descriptors early:
+ * PROCSIGNAL_BARRIER_SMGRRELEASE.  It is a request to immediately close all
+ * file descriptors.  Upon receiveing that signal, the backend closes all file
+ * descriptors held open by SMgrRelations, but because it can happen in the
+ * middle of a transaction, we cannot destroy the SMgrRelation objects
+ * themselves, as there could pointers to them in active use.  See
+ * smgrrelease() and smgrreleaseall().
  *
  * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -93,14 +127,15 @@ static const int NSmgr = lengthof(smgrsw);
 
 /*
  * Each backend has a hashtable that stores all extant SMgrRelation objects.
- * In addition, "unowned" SMgrRelation objects are chained together in a list.
+ * In addition, "unpinned" SMgrRelation objects are chained together in a list.
  */
 static HTAB *SMgrRelationHash = NULL;
 
-static dlist_head unowned_relns;
+static dlist_head unpinned_relns;
 
 /* local function prototypes */
 static void smgrshutdown(int code, Datum arg);
+static void smgrdestroy(SMgrRelation reln);
 
 
 /*
@@ -144,7 +179,16 @@ smgrshutdown(int code, Datum arg)
 /*
  * smgropen() -- Return an SMgrRelation object, creating it if need be.
  *
- * This does not attempt to actually open the underlying file.
+ * In versions of PostgreSQL prior to 17, this function returned an object
+ * with no defined lifetime.  Now, however, the object remains valid for the
+ * lifetime of the transaction, up to the point where AtEOXact_SMgr() is
+ * called, making it much easier for callers to know for how long they can
+ * hold on to a pointer to the returned object.  If this function is called
+ * outside of a transaction, the object remains valid until smgrdestroy() or
+ * smgrdestroyall() is called.  Background processes that use smgr but not
+ * transactions typically do this once per checkpoint cycle.
+ *
+ * This does not attempt to actually open the underlying files.
  */
 SMgrRelation
 smgropen(RelFileLocator rlocator, BackendId backend)
@@ -164,7 +208,7 @@ smgropen(RelFileLocator rlocator, BackendId backend)
 		ctl.entrysize = sizeof(SMgrRelationData);
 		SMgrRelationHash = hash_create("smgr relation table", 400,
 									   &ctl, HASH_ELEM | HASH_BLOBS);
-		dlist_init(&unowned_relns);
+		dlist_init(&unpinned_relns);
 	}
 
 	/* Look up or create an entry */
@@ -178,7 +222,6 @@ smgropen(RelFileLocator rlocator, BackendId backend)
 	if (!found)
 	{
 		/* hash_search already filled in the lookup key */
-		reln->smgr_owner = NULL;
 		reln->smgr_targblock = InvalidBlockNumber;
 		for (int i = 0; i <= MAX_FORKNUM; ++i)
 			reln->smgr_cached_nblocks[i] = InvalidBlockNumber;
@@ -187,102 +230,61 @@ smgropen(RelFileLocator rlocator, BackendId backend)
 		/* implementation-specific initialization */
 		smgrsw[reln->smgr_which].smgr_open(reln);
 
-		/* it has no owner yet */
-		dlist_push_tail(&unowned_relns, &reln->node);
+		/* it is not pinned yet */
+		reln->pincount = 0;
+		dlist_push_tail(&unpinned_relns, &reln->node);
 	}
 
 	return reln;
 }
 
 /*
- * smgrsetowner() -- Establish a long-lived reference to an SMgrRelation object
- *
- * There can be only one owner at a time; this is sufficient since currently
- * the only such owners exist in the relcache.
+ * smgrpin() -- Prevent an SMgrRelation object from being destroyed at end of
+ *				of transaction
  */
 void
-smgrsetowner(SMgrRelation *owner, SMgrRelation reln)
+smgrpin(SMgrRelation reln)
 {
-	/* We don't support "disowning" an SMgrRelation here, use smgrclearowner */
-	Assert(owner != NULL);
-
-	/*
-	 * First, unhook any old owner.  (Normally there shouldn't be any, but it
-	 * seems possible that this can happen during swap_relation_files()
-	 * depending on the order of processing.  It's ok to close the old
-	 * relcache entry early in that case.)
-	 *
-	 * If there isn't an old owner, then the reln should be in the unowned
-	 * list, and we need to remove it.
-	 */
-	if (reln->smgr_owner)
-		*(reln->smgr_owner) = NULL;
-	else
+	if (reln->pincount == 0)
 		dlist_delete(&reln->node);
-
-	/* Now establish the ownership relationship. */
-	reln->smgr_owner = owner;
-	*owner = reln;
+	reln->pincount++;
 }
 
 /*
- * smgrclearowner() -- Remove long-lived reference to an SMgrRelation object
- *					   if one exists
+ * smgrunpin() -- Allow an SMgrRelation object to be desroyed at end of
+ *				  transaction
+ *
+ * The object remains valid, but if there are no other pins on it, it is moved
+ * to the unpinned list where it will be destroyed by AtEOXact_SMgr().
  */
 void
-smgrclearowner(SMgrRelation *owner, SMgrRelation reln)
-{
-	/* Do nothing if the SMgrRelation object is not owned by the owner */
-	if (reln->smgr_owner != owner)
-		return;
-
-	/* unset the owner's reference */
-	*owner = NULL;
-
-	/* unset our reference to the owner */
-	reln->smgr_owner = NULL;
-
-	/* add to list of unowned relations */
-	dlist_push_tail(&unowned_relns, &reln->node);
-}
-
-/*
- * smgrexists() -- Does the underlying file for a fork exist?
- */
-bool
-smgrexists(SMgrRelation reln, ForkNumber forknum)
+smgrunpin(SMgrRelation reln)
 {
-	return smgrsw[reln->smgr_which].smgr_exists(reln, forknum);
+	Assert(reln->pincount > 0);
+	reln->pincount--;
+	if (reln->pincount == 0)
+		dlist_push_tail(&unpinned_relns, &reln->node);
 }
 
 /*
- * smgrclose() -- Close and delete an SMgrRelation object.
+ * smgrdestroy() -- Delete an SMgrRelation object.
  */
-void
-smgrclose(SMgrRelation reln)
+static void
+smgrdestroy(SMgrRelation reln)
 {
-	SMgrRelation *owner;
 	ForkNumber	forknum;
 
+	Assert(reln->pincount == 0);
+
 	for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 		smgrsw[reln->smgr_which].smgr_close(reln, forknum);
 
-	owner = reln->smgr_owner;
-
-	if (!owner)
-		dlist_delete(&reln->node);
+	dlist_delete(&reln->node);
 
 	if (hash_search(SMgrRelationHash,
 					&(reln->smgr_rlocator),
 					HASH_REMOVE, NULL) == NULL)
 		elog(ERROR, "SMgrRelation hashtable corrupted");
-
-	/*
-	 * Unhook the owner pointer, if any.  We do this last since in the remote
-	 * possibility of failure above, the SMgrRelation object will still exist.
-	 */
-	if (owner)
-		*owner = NULL;
 }
 
 /*
@@ -302,31 +304,51 @@ smgrrelease(SMgrRelation reln)
 }
 
 /*
- * smgrreleaseall() -- Release resources used by all objects.
+ * smgrclose() -- Close an SMgrRelation object.
  *
- * This is called for PROCSIGNAL_BARRIER_SMGRRELEASE.
+ * The SMgrRelation reference should not be used after this call.  However,
+ * because we don't keep track of the references returned by smgropen(), we
+ * don't know if there are other references still pointing to the same object,
+ * so we cannot remove the SMgrRelation object yet.  Therefore, this is just a
+ * synonym for smgrrelease() at the moment.
  */
 void
-smgrreleaseall(void)
+smgrclose(SMgrRelation reln)
 {
-	HASH_SEQ_STATUS status;
-	SMgrRelation reln;
+	smgrrelease(reln);
+}
 
-	/* Nothing to do if hashtable not set up */
-	if (SMgrRelationHash == NULL)
-		return;
+/*
+ * smgrdestroyall() -- Release resources used by all unpinned objects.
+ *
+ * It must be known that there are no pointers to SMgrRelations, other than
+ * those pinned with smgrpin().
+ */
+void
+smgrdestroyall(void)
+{
+	dlist_mutable_iter iter;
 
-	hash_seq_init(&status, SMgrRelationHash);
+	/*
+	 * Zap all unpinned SMgrRelations.  We rely on smgrdestroy() to remove
+	 * each one from the list.
+	 */
+	dlist_foreach_modify(iter, &unpinned_relns)
+	{
+		SMgrRelation rel = dlist_container(SMgrRelationData, node,
+										   iter.cur);
 
-	while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
-		smgrrelease(reln);
+		smgrdestroy(rel);
+	}
 }
 
 /*
- * smgrcloseall() -- Close all existing SMgrRelation objects.
+ * smgrreleaseall() -- Release resources used by all objects.
+ *
+ * This is called for PROCSIGNAL_BARRIER_SMGRRELEASE.
  */
 void
-smgrcloseall(void)
+smgrreleaseall(void)
 {
 	HASH_SEQ_STATUS status;
 	SMgrRelation reln;
@@ -338,19 +360,21 @@ smgrcloseall(void)
 	hash_seq_init(&status, SMgrRelationHash);
 
 	while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
-		smgrclose(reln);
+	{
+		smgrrelease(reln);
+	}
 }
 
 /*
- * smgrcloserellocator() -- Close SMgrRelation object for given RelFileLocator,
- *							if one exists.
+ * smgrreleaserellocator() -- Release resources for given RelFileLocator,
+ *							if it's open.
  *
- * This has the same effects as smgrclose(smgropen(rlocator)), but it avoids
+ * This has the same effects as smgrrelease(smgropen(rlocator)), but it avoids
  * uselessly creating a hashtable entry only to drop it again when no
  * such entry exists already.
  */
 void
-smgrcloserellocator(RelFileLocatorBackend rlocator)
+smgrreleaserellocator(RelFileLocatorBackend rlocator)
 {
 	SMgrRelation reln;
 
@@ -362,7 +386,16 @@ smgrcloserellocator(RelFileLocatorBackend rlocator)
 									  &rlocator,
 									  HASH_FIND, NULL);
 	if (reln != NULL)
-		smgrclose(reln);
+		smgrrelease(reln);
+}
+
+/*
+ * smgrexists() -- Does the underlying file for a fork exist?
+ */
+bool
+smgrexists(SMgrRelation reln, ForkNumber forknum)
+{
+	return smgrsw[reln->smgr_which].smgr_exists(reln, forknum);
 }
 
 /*
@@ -729,7 +762,8 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
  * AtEOXact_SMgr
  *
  * This routine is called during transaction commit or abort (it doesn't
- * particularly care which).  All transient SMgrRelation objects are closed.
+ * particularly care which).  All transient SMgrRelation objects are
+ * destroyed.
  *
  * We do this as a compromise between wanting transient SMgrRelations to
  * live awhile (to amortize the costs of blind writes of multiple blocks)
@@ -740,21 +774,7 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
 void
 AtEOXact_SMgr(void)
 {
-	dlist_mutable_iter iter;
-
-	/*
-	 * Zap all unowned SMgrRelations.  We rely on smgrclose() to remove each
-	 * one from the list.
-	 */
-	dlist_foreach_modify(iter, &unowned_relns)
-	{
-		SMgrRelation rel = dlist_container(SMgrRelationData, node,
-										   iter.cur);
-
-		Assert(rel->smgr_owner == NULL);
-
-		smgrclose(rel);
-	}
+	smgrdestroyall();
 }
 
 /*
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index a041d7d6047..edaa56194ff 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -756,7 +756,7 @@ LocalExecuteInvalidationMessage(SharedInvalidationMessage *msg)
 
 		rlocator.locator = msg->sm.rlocator;
 		rlocator.backend = (msg->sm.backend_hi << 16) | (int) msg->sm.backend_lo;
-		smgrcloserellocator(rlocator);
+		smgrreleaserellocator(rlocator);
 	}
 	else if (msg->id == SHAREDINVALRELMAP_ID)
 	{
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index b3faccbefe5..b4c7dacdc9e 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -3044,7 +3044,7 @@ RelationCacheInvalidate(bool debug_discard)
 	 * start to rebuild entries, since that may involve catalog fetches which
 	 * will re-open catalog files.
 	 */
-	smgrcloseall();
+	smgrdestroyall();
 
 	/* Phase 2: rebuild the items found to need rebuild in phase 1 */
 	foreach(l, rebuildFirstList)
@@ -3066,25 +3066,6 @@ RelationCacheInvalidate(bool debug_discard)
 			in_progress_list[i].invalidated = true;
 }
 
-/*
- * RelationCloseSmgrByOid - close a relcache entry's smgr link
- *
- * Needed in some cases where we are changing a relation's physical mapping.
- * The link will be automatically reopened on next use.
- */
-void
-RelationCloseSmgrByOid(Oid relationId)
-{
-	Relation	relation;
-
-	RelationIdCacheLookup(relationId, relation);
-
-	if (!PointerIsValid(relation))
-		return;					/* not in cache, nothing to do */
-
-	RelationCloseSmgr(relation);
-}
-
 static void
 RememberToFreeTupleDescAtEOX(TupleDesc td)
 {
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a9a179aabac..edccc4c89bb 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -21,29 +21,21 @@
 /*
  * smgr.c maintains a table of SMgrRelation objects, which are essentially
  * cached file handles.  An SMgrRelation is created (if not already present)
- * by smgropen(), and destroyed by smgrclose().  Note that neither of these
- * operations imply I/O, they just create or destroy a hashtable entry.
- * (But smgrclose() may release associated resources, such as OS-level file
+ * by smgropen(), and destroyed by smgrdestroy().  Note that neither of these
+ * operations imply I/O, they just create or destroy a hashtable entry.  (But
+ * smgrdestroy() may release associated resources, such as OS-level file
  * descriptors.)
  *
- * An SMgrRelation may have an "owner", which is just a pointer to it from
- * somewhere else; smgr.c will clear this pointer if the SMgrRelation is
- * closed.  We use this to avoid dangling pointers from relcache to smgr
- * without having to make the smgr explicitly aware of relcache.  There
- * can't be more than one "owner" pointer per SMgrRelation, but that's
- * all we need.
- *
- * SMgrRelations that do not have an "owner" are considered to be transient,
- * and are deleted at end of transaction.
+ * An SMgrRelation may be "pinned", to prevent it from being destroyed while
+ * it's in use.  We use this to prevent pointers relcache to smgr from being
+ * invalidated.  SMgrRelations that are not pinned are deleted at end of
+ * transaction.
  */
 typedef struct SMgrRelationData
 {
 	/* rlocator is the hashtable lookup key, so it must be first! */
 	RelFileLocatorBackend smgr_rlocator;	/* relation physical identifier */
 
-	/* pointer to owning pointer, or NULL if none */
-	struct SMgrRelationData **smgr_owner;
-
 	/*
 	 * The following fields are reset to InvalidBlockNumber upon a cache flush
 	 * event, and hold the last known size for each fork.  This information is
@@ -68,7 +60,11 @@ typedef struct SMgrRelationData
 	int			md_num_open_segs[MAX_FORKNUM + 1];
 	struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];
 
-	/* if unowned, list link in list of all unowned SMgrRelations */
+	/*
+	 * Pinning support.  If unpinned (pincount == 0), 'node' is a list link in
+	 * list of all unpinned SMgrRelations.
+	 */
+	int			pincount;
 	dlist_node	node;
 } SMgrRelationData;
 
@@ -80,13 +76,13 @@ typedef SMgrRelationData *SMgrRelation;
 extern void smgrinit(void);
 extern SMgrRelation smgropen(RelFileLocator rlocator, BackendId backend);
 extern bool smgrexists(SMgrRelation reln, ForkNumber forknum);
-extern void smgrsetowner(SMgrRelation *owner, SMgrRelation reln);
-extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
+extern void smgrpin(SMgrRelation reln);
+extern void smgrunpin(SMgrRelation reln);
 extern void smgrclose(SMgrRelation reln);
-extern void smgrcloseall(void);
-extern void smgrcloserellocator(RelFileLocatorBackend rlocator);
+extern void smgrdestroyall(void);
 extern void smgrrelease(SMgrRelation reln);
 extern void smgrreleaseall(void);
+extern void smgrreleaserellocator(RelFileLocatorBackend rlocator);
 extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdosyncall(SMgrRelation *rels, int nrels);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 0ad613c4b88..5ab7d54d897 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -561,18 +561,15 @@ typedef struct ViewOptions
  *
  * Very little code is authorized to touch rel->rd_smgr directly.  Instead
  * use this function to fetch its value.
- *
- * Note: since a relcache flush can cause the file handle to be closed again,
- * it's unwise to hold onto the pointer returned by this function for any
- * long period.  Recommended practice is to just re-execute RelationGetSmgr
- * each time you need to access the SMgrRelation.  It's quite cheap in
- * comparison to whatever an smgr function is going to do.
  */
 static inline SMgrRelation
 RelationGetSmgr(Relation rel)
 {
 	if (unlikely(rel->rd_smgr == NULL))
-		smgrsetowner(&(rel->rd_smgr), smgropen(rel->rd_locator, rel->rd_backend));
+	{
+		rel->rd_smgr = smgropen(rel->rd_locator, rel->rd_backend);
+		smgrpin(rel->rd_smgr);
+	}
 	return rel->rd_smgr;
 }
 
@@ -584,10 +581,11 @@ static inline void
 RelationCloseSmgr(Relation relation)
 {
 	if (relation->rd_smgr != NULL)
+	{
+		smgrunpin(relation->rd_smgr);
 		smgrclose(relation->rd_smgr);
-
-	/* smgrclose should unhook from owner pointer */
-	Assert(relation->rd_smgr == NULL);
+		relation->rd_smgr = NULL;
+	}
 }
 #endif							/* !FRONTEND */
 
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index f04b2477e34..c1c31baa429 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -129,8 +129,6 @@ extern void RelationCacheInvalidateEntry(Oid relationId);
 
 extern void RelationCacheInvalidate(bool debug_discard);
 
-extern void RelationCloseSmgrByOid(Oid relationId);
-
 #ifdef USE_ASSERT_CHECKING
 extern void AssertPendingSyncs_RelationCache(void);
 #else
-- 
2.39.2

#8Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#7)
Re: Extending SMgrRelation lifetimes

On 29/11/2023 14:41, Heikki Linnakangas wrote:

2. A funny case with foreign tables: ANALYZE on a foreign table calls
visibilitymap_count(). A foreign table has no visibility map so it
returns 0, but before doing so it calls RelationGetSmgr on the foreign
table, which has 0/0/0 rellocator. That creates an SMgrRelation for
0/0/0, and sets the foreign table's relcache entry as its owner. If you
then call ANALYZE on another foreign table, it also calls
RelationGetSmgr with 0/0/0 rellocator, returning the same SMgrRelation
entry, and changes its owner to the new relcache entry. That doesn't
make much sense and it's pretty accidental that it works at all, so
attached is a patch to avoid calling visibilitymap_count() on foreign
tables.

This patch seems uncontroversial and independent of the others, so I
committed it.

--
Heikki Linnakangas
Neon (https://neon.tech)

#9Thomas Munro
thomas.munro@gmail.com
In reply to: Heikki Linnakangas (#7)
Re: Extending SMgrRelation lifetimes

On Wed, Nov 29, 2023 at 1:42 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I spent some more time digging into this, experimenting with different
approaches. Came up with pretty significant changes; see below:

Hi Heikki,

I think this approach is good. As I wrote in the first email, I had
briefly considered reference counting, but at the time I figured there
wasn't much point if it's only ever going to be 0 or 1, so I was
trying to find the smallest change. But as you explained, there is
already an interesting case where it goes to 2, and modelling it that
way removes a weird hack, so it's a net improvement over the unusual
'owner' concept. +1 for your version. Are there any further tidying
or other improvements you want to make?

Typos in comments:

s/desroyed/destroyed/
s/receiveing/receiving/

#10Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Thomas Munro (#9)
Re: Extending SMgrRelation lifetimes

On 31/01/2024 10:54, Thomas Munro wrote:

On Wed, Nov 29, 2023 at 1:42 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I spent some more time digging into this, experimenting with different
approaches. Came up with pretty significant changes; see below:

Hi Heikki,

I think this approach is good. As I wrote in the first email, I had
briefly considered reference counting, but at the time I figured there
wasn't much point if it's only ever going to be 0 or 1, so I was
trying to find the smallest change. But as you explained, there is
already an interesting case where it goes to 2, and modelling it that
way removes a weird hack, so it's a net improvement over the unusual
'owner' concept. +1 for your version. Are there any further tidying
or other improvements you want to make?

Ok, no, this is good to go then. I'll rebase, fix the typos, run the
regression tests again, and push this shortly. Thanks!

--
Heikki Linnakangas
Neon (https://neon.tech)