Extensible storage manager API - SMGR hook Redux

Started by Matthias van de Meent, over 2 years ago, 17 messages
#1 Matthias van de Meent
boekewurm+postgres@gmail.com
2 attachment(s)

Hi hackers,

At Neon, we've been working on removing the file system dependency
from PostgreSQL and replacing it with a distributed storage layer. For
now, we've had the most success by replacing the implementation
of the smgr API, but it did require some core modifications like those
proposed early last year by Anastasia [0].

As mentioned in the previous thread, there are several reasons why you
would want to use a non-default storage manager: storage-level
compression, encryption, and disk limit quotas [0]; offloading of cold
relation data was also mentioned [1].

In the thread on Anastasia's patch, Yura Sokolov mentioned that
instead of a hook-based smgr extension, a registration-based smgr
would be preferred, with integration into namespaces. Please find
attached an as-yet incomplete patch that starts to do that.

The patch is still incomplete (as it isn't derived from Anastasia's
patch), but I would like comments on this regardless, as this is a
fairly fundamental component of PostgreSQL that is being modified, and
it is often better to get comments early in the development cycle. One
significant issue that I've seen so far is that catcache is not
guaranteed to be available in all backends that need to do smgr
operations, and I've not yet found a good solution.

Changes compared to HEAD:
- smgrsw is now dynamically allocated and grows as new storage
managers are loaded (during shared_preload_libraries)
- CREATE TABLESPACE has new optional syntax USING smgrname (option [, ...])
- tablespace storage is (planned to be) fully managed by smgr through
some new smgr APIs
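
To make the first point above concrete, here is a minimal standalone
sketch of a registration table that grows as managers are added. It is
only a toy model of what smgr_register() in 0001 does: every Toy*/toy_*
name is hypothetical, realloc() stands in for palloc/repalloc in
TopMemoryContext, failures return -1 instead of elog(FATAL), and the
lookup-by-name helper is an assumption about how USING smgrname might
be resolved, not something 0001 contains.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-ins for illustration only; NOT the PostgreSQL definitions. */
#define TOY_MAX_SMGR_ID UINT8_MAX

typedef struct ToySMgr
{
	const char *name;				/* must be non-empty */
	void		(*open) (void *reln);	/* required callback */
	void		(*shutdown) (void);		/* may be NULL */
} ToySMgr;

static ToySMgr *toy_smgrsw = NULL;		/* grows by one slot per manager */
static int	toy_nsmgr = 0;
static size_t toy_largest_reln_size = 0;
static int	toy_registration_closed = 0;	/* set once preloading is done */

/* Register a manager; returns its id, or -1 if registration is refused. */
static int
toy_smgr_register(const ToySMgr *smgr, size_t reln_size)
{
	ToySMgr    *grown;

	if (toy_registration_closed)
		return -1;				/* too late: the preload phase is over */
	if (toy_nsmgr == TOY_MAX_SMGR_ID)
		return -1;				/* the id space is exhausted */
	if (smgr->name == NULL || smgr->name[0] == '\0' || smgr->open == NULL)
		return -1;				/* incomplete definition */

	grown = realloc(toy_smgrsw, sizeof(ToySMgr) * (toy_nsmgr + 1));
	if (grown == NULL)
		return -1;
	toy_smgrsw = grown;
	toy_smgrsw[toy_nsmgr] = *smgr;	/* copy, so callers may use a local */

	/* Remember the largest per-relation struct any manager asked for. */
	if (reln_size > toy_largest_reln_size)
		toy_largest_reln_size = reln_size;

	return toy_nsmgr++;
}

/* Hypothetical helper: resolve a manager name to its id, -1 if unknown. */
static int
toy_smgr_lookup(const char *name)
{
	for (int i = 0; i < toy_nsmgr; i++)
		if (strcmp(toy_smgrsw[i].name, name) == 0)
			return i;
	return -1;
}

static void
toy_noop_open(void *reln)
{
	(void) reln;
}

#ifdef TOY_SMGR_DEMO
int
main(void)
{
	ToySMgr		md = {.name = "md", .open = toy_noop_open};
	ToySMgr		remote = {.name = "remote", .open = toy_noop_open};

	printf("md     -> id %d\n", toy_smgr_register(&md, 64));
	printf("remote -> id %d\n", toy_smgr_register(&remote, 128));
	printf("lookup(\"remote\") = %d\n", toy_smgr_lookup("remote"));
	return 0;
}
#endif
```

Building with -DTOY_SMGR_DEMO gives a small runnable program. The
struct is copied into the table, which is why mdsmgr_register() in
0001 can pass a pointer to a local variable.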

Changes compared to Anastasia's patch:
- extensions do not get to hook and replace the API of the smgr code
directly; they are hidden behind the smgr registry.
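
Since each manager sits behind the registry, it also needs a place for
its own per-relation state. In 0001, md.c does this by embedding the
generic SMgrRelationData as the first member of MdSMgrRelationData and
casting inside its callbacks, while smgr.c sizes hash entries for the
largest registered struct. The sketch below is a toy model of that
layout only: all Toy*/toy_* names are hypothetical, and calloc()
stands in for the dynahash entry allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define TOY_MAX_FORKNUM 3

/* Generic part: the only thing the toy "smgr.c" knows about. */
typedef struct ToyRelationData
{
	unsigned	relnumber;
	int			which;			/* id of the owning manager */
} ToyRelationData;

/* Manager-private part: the generic struct MUST be the first member. */
typedef struct ToyMdRelationData
{
	ToyRelationData reln;
	int			num_open_segs[TOY_MAX_FORKNUM + 1];
} ToyMdRelationData;

/* Entry size starts at the generic size and only ever grows. */
static size_t toy_entry_size = sizeof(ToyRelationData);

/* Called once per manager at registration time. */
static void
toy_note_size(size_t sz)
{
	if (sz > toy_entry_size)
		toy_entry_size = sz;
}

/* Allocate one entry that is big enough for any registered manager. */
static ToyRelationData *
toy_alloc_entry(unsigned relnumber, int which)
{
	ToyRelationData *r = calloc(1, toy_entry_size);

	if (r != NULL)
	{
		r->relnumber = relnumber;
		r->which = which;
	}
	return r;
}

/* Manager callback: downcast to reach the private fields. */
static void
toy_md_open(ToyRelationData *reln)
{
	ToyMdRelationData *md = (ToyMdRelationData *) reln;

	for (int f = 0; f <= TOY_MAX_FORKNUM; f++)
		md->num_open_segs[f] = 0;	/* mark every fork as not open */
}

#ifdef TOY_LAYOUT_DEMO
int
main(void)
{
	ToyRelationData *r;

	toy_note_size(sizeof(ToyMdRelationData));
	r = toy_alloc_entry(16384, 0);
	if (r == NULL)
		return 1;
	toy_md_open(r);
	printf("entry size: %zu bytes\n", toy_entry_size);
	free(r);
	return 0;
}
#endif
```

The cast is only valid because the generic struct is the first member
and because every entry is allocated at the largest registered size;
the Assert(reln->smgr_which == MdSMgrId) in mdcreate() guards the same
assumption in the real patch.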

Successes:
- 0001 passes tests (make check-world)
- 0002 builds without warnings (make)

TODO:
- fix dependency failures when catcache is unavailable
- tablespace redo is currently broken with 0002
- fix tests for 0002
- ensure that pg_dump etc. works with the new tablespace storage manager options

Looking forward to any comments, suggestions and reviews.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)

[0]: /messages/by-id/CAP4vRV6JKXyFfEOf=n+v5RGsZywAQ3CTM8ESWvgq+S87Tmgx_g@mail.gmail.com
[1]: /messages/by-id/D365F19F-BC3E-4F96-A91E-8DB13049749E@yandex-team.ru

Attachments:

v1-0001-Expose-f_smgr-to-extensions-for-manual-implementa.patch (application/octet-stream)
From bc4f8f9b43dc050ac2fa92d0770eb63c822838b7 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Tue, 27 Jun 2023 15:59:23 +0200
Subject: [PATCH v1 1/2] Expose f_smgr to extensions for manual implementation

There are various reasons why one would want to create their own
implementation of a storage manager, among which are block-level compression,
encryption and offloading to cold storage. This patch is a first patch that
allows extensions to register their own SMgr.

Note, however, that this SMgr is not yet used - only the first SMgr to register
is used, and this is currently the md.c smgr. Future commits will include
facilities to select an SMgr for each tablespace.
---
 src/backend/postmaster/postmaster.c |   5 +
 src/backend/storage/smgr/md.c       | 164 ++++++++++++++++++----------
 src/backend/storage/smgr/smgr.c     | 126 ++++++++++-----------
 src/backend/utils/init/miscinit.c   |  12 ++
 src/include/miscadmin.h             |   1 +
 src/include/storage/md.h            |   4 +
 src/include/storage/smgr.h          |  56 ++++++++--
 7 files changed, 242 insertions(+), 126 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 4c49393fc5..8685b9fde6 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1002,6 +1002,11 @@ PostmasterMain(int argc, char *argv[])
 	 */
 	ApplyLauncherRegister();
 
+	/*
+	 * Register built-in managers that are not part of static arrays
+	 */
+	register_builtin_dynamic_managers();
+
 	/*
 	 * process any libraries that should be preloaded at postmaster start
 	 */
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 30dbc02f82..690bdd27c5 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -86,6 +86,21 @@ typedef struct _MdfdVec
 } MdfdVec;
 
 static MemoryContext MdCxt;		/* context for all MdfdVec objects */
+SMgrId MdSMgrId;
+
+typedef struct MdSMgrRelationData
+{
+	/* parent data */
+	SMgrRelationData reln;
+	/*
+	 * for md.c; per-fork arrays of the number of open segments
+	 * (md_num_open_segs) and the segments themselves (md_seg_fds).
+	 */
+	int			md_num_open_segs[MAX_FORKNUM + 1];
+	struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];
+} MdSMgrRelationData;
+
+typedef MdSMgrRelationData *MdSMgrRelation;
 
 
 /* Populate a file tag describing an md.c segment file. */
@@ -120,26 +135,52 @@ static MemoryContext MdCxt;		/* context for all MdfdVec objects */
 #define EXTENSION_DONT_OPEN			(1 << 5)
 
 
+void mdsmgr_register(void)
+{
+	/* magnetic disk */
+	f_smgr md_smgr = (f_smgr) {
+		.name = "md",
+		.smgr_init = mdinit,
+		.smgr_shutdown = NULL,
+		.smgr_open = mdopen,
+		.smgr_close = mdclose,
+		.smgr_create = mdcreate,
+		.smgr_exists = mdexists,
+		.smgr_unlink = mdunlink,
+		.smgr_extend = mdextend,
+		.smgr_zeroextend = mdzeroextend,
+		.smgr_prefetch = mdprefetch,
+		.smgr_read = mdread,
+		.smgr_write = mdwrite,
+		.smgr_writeback = mdwriteback,
+		.smgr_nblocks = mdnblocks,
+		.smgr_truncate = mdtruncate,
+		.smgr_immedsync = mdimmedsync,
+	};
+
+	MdSMgrId = smgr_register(&md_smgr, sizeof(MdSMgrRelationData));
+}
+
 /* local routines */
 static void mdunlinkfork(RelFileLocatorBackend rlocator, ForkNumber forknum,
 						 bool isRedo);
-static MdfdVec *mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior);
-static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum,
+static MdfdVec *mdopenfork(MdSMgrRelation reln, ForkNumber forknum, int behavior);
+static void register_dirty_segment(MdSMgrRelation reln, ForkNumber forknum,
 								   MdfdVec *seg);
 static void register_unlink_segment(RelFileLocatorBackend rlocator, ForkNumber forknum,
 									BlockNumber segno);
 static void register_forget_request(RelFileLocatorBackend rlocator, ForkNumber forknum,
 									BlockNumber segno);
-static void _fdvec_resize(SMgrRelation reln,
+static void _fdvec_resize(MdSMgrRelation reln,
 						  ForkNumber forknum,
 						  int nseg);
-static char *_mdfd_segpath(SMgrRelation reln, ForkNumber forknum,
+static char *_mdfd_segpath(MdSMgrRelation reln, ForkNumber forknum,
 						   BlockNumber segno);
-static MdfdVec *_mdfd_openseg(SMgrRelation reln, ForkNumber forknum,
+static MdfdVec *_mdfd_openseg(MdSMgrRelation reln, ForkNumber forknum,
 							  BlockNumber segno, int oflags);
-static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forknum,
+static MdfdVec *_mdfd_getseg(MdSMgrRelation reln, ForkNumber forknum,
 							 BlockNumber blkno, bool skipFsync, int behavior);
-static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
+static BlockNumber _mdnblocks(MdSMgrRelation reln, ForkNumber forknum,
 							  MdfdVec *seg);
 
 static inline int
@@ -194,11 +235,13 @@ mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
 	MdfdVec    *mdfd;
 	char	   *path;
 	File		fd;
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
+	Assert(reln->smgr_which == MdSMgrId);
 
-	if (isRedo && reln->md_num_open_segs[forknum] > 0)
+	if (isRedo && mdreln->md_num_open_segs[forknum] > 0)
 		return;					/* created and opened already... */
 
-	Assert(reln->md_num_open_segs[forknum] == 0);
+	Assert(mdreln->md_num_open_segs[forknum] == 0);
 
 	/*
 	 * We may be using the target table space for the first time in this
@@ -235,8 +278,8 @@ mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
 
 	pfree(path);
 
-	_fdvec_resize(reln, forknum, 1);
-	mdfd = &reln->md_seg_fds[forknum][0];
+	_fdvec_resize(mdreln, forknum, 1);
+	mdfd = &mdreln->md_seg_fds[forknum][0];
 	mdfd->mdfd_vfd = fd;
 	mdfd->mdfd_segno = 0;
 }
@@ -462,6 +505,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	off_t		seekpos;
 	int			nbytes;
 	MdfdVec    *v;
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 
 	/* If this build supports direct I/O, the buffer must be I/O aligned. */
 	if (PG_O_DIRECT != 0 && PG_IO_ALIGN_SIZE <= BLCKSZ)
@@ -485,7 +529,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 						relpath(reln->smgr_rlocator, forknum),
 						InvalidBlockNumber)));
 
-	v = _mdfd_getseg(reln, forknum, blocknum, skipFsync, EXTENSION_CREATE);
+	v = _mdfd_getseg(mdreln, forknum, blocknum, skipFsync, EXTENSION_CREATE);
 
 	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
@@ -509,9 +553,9 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	}
 
 	if (!skipFsync && !SmgrIsTemp(reln))
-		register_dirty_segment(reln, forknum, v);
+		register_dirty_segment(mdreln, forknum, v);
 
-	Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
+	Assert(_mdnblocks(mdreln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
 }
 
 /*
@@ -527,6 +571,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 	MdfdVec    *v;
 	BlockNumber curblocknum = blocknum;
 	int			remblocks = nblocks;
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 
 	Assert(nblocks > 0);
 
@@ -558,7 +603,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 		else
 			numblocks = remblocks;
 
-		v = _mdfd_getseg(reln, forknum, curblocknum, skipFsync, EXTENSION_CREATE);
+		v = _mdfd_getseg(mdreln, forknum, curblocknum, skipFsync, EXTENSION_CREATE);
 
 		Assert(segstartblock < RELSEG_SIZE);
 		Assert(segstartblock + numblocks <= RELSEG_SIZE);
@@ -613,9 +658,9 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 		}
 
 		if (!skipFsync && !SmgrIsTemp(reln))
-			register_dirty_segment(reln, forknum, v);
+			register_dirty_segment(mdreln, forknum, v);
 
-		Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
+		Assert(_mdnblocks(mdreln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
 
 		remblocks -= numblocks;
 		curblocknum += numblocks;
@@ -633,7 +678,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
  * invent one out of whole cloth.
  */
 static MdfdVec *
-mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior)
+mdopenfork(MdSMgrRelation reln, ForkNumber forknum, int behavior)
 {
 	MdfdVec    *mdfd;
 	char	   *path;
@@ -643,7 +688,7 @@ mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior)
 	if (reln->md_num_open_segs[forknum] > 0)
 		return &reln->md_seg_fds[forknum][0];
 
-	path = relpath(reln->smgr_rlocator, forknum);
+	path = relpath(reln->reln.smgr_rlocator, forknum);
 
 	fd = PathNameOpenFile(path, _mdfd_open_flags());
 
@@ -678,9 +723,10 @@ mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior)
 void
 mdopen(SMgrRelation reln)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	/* mark it not open */
 	for (int forknum = 0; forknum <= MAX_FORKNUM; forknum++)
-		reln->md_num_open_segs[forknum] = 0;
+		mdreln->md_num_open_segs[forknum] = 0;
 }
 
 /*
@@ -689,7 +735,8 @@ mdopen(SMgrRelation reln)
 void
 mdclose(SMgrRelation reln, ForkNumber forknum)
 {
-	int			nopensegs = reln->md_num_open_segs[forknum];
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
+	int			nopensegs = mdreln->md_num_open_segs[forknum];
 
 	/* No work if already closed */
 	if (nopensegs == 0)
@@ -698,10 +745,10 @@ mdclose(SMgrRelation reln, ForkNumber forknum)
 	/* close segments starting from the end */
 	while (nopensegs > 0)
 	{
-		MdfdVec    *v = &reln->md_seg_fds[forknum][nopensegs - 1];
+		MdfdVec    *v = &mdreln->md_seg_fds[forknum][nopensegs - 1];
 
 		FileClose(v->mdfd_vfd);
-		_fdvec_resize(reln, forknum, nopensegs - 1);
+		_fdvec_resize(mdreln, forknum, nopensegs - 1);
 		nopensegs--;
 	}
 }
@@ -715,10 +762,11 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 #ifdef USE_PREFETCH
 	off_t		seekpos;
 	MdfdVec    *v;
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 
 	Assert((io_direct_flags & IO_DIRECT_DATA) == 0);
 
-	v = _mdfd_getseg(reln, forknum, blocknum, false,
+	v = _mdfd_getseg(mdreln, forknum, blocknum, false,
 					 InRecovery ? EXTENSION_RETURN_NULL : EXTENSION_FAIL);
 	if (v == NULL)
 		return false;
@@ -743,6 +791,7 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	off_t		seekpos;
 	int			nbytes;
 	MdfdVec    *v;
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 
 	/* If this build supports direct I/O, the buffer must be I/O aligned. */
 	if (PG_O_DIRECT != 0 && PG_IO_ALIGN_SIZE <= BLCKSZ)
@@ -754,7 +803,7 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 										reln->smgr_rlocator.locator.relNumber,
 										reln->smgr_rlocator.backend);
 
-	v = _mdfd_getseg(reln, forknum, blocknum, false,
+	v = _mdfd_getseg(mdreln, forknum, blocknum, false,
 					 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
 
 	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
@@ -812,6 +861,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	off_t		seekpos;
 	int			nbytes;
 	MdfdVec    *v;
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 
 	/* If this build supports direct I/O, the buffer must be I/O aligned. */
 	if (PG_O_DIRECT != 0 && PG_IO_ALIGN_SIZE <= BLCKSZ)
@@ -828,7 +878,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 										 reln->smgr_rlocator.locator.relNumber,
 										 reln->smgr_rlocator.backend);
 
-	v = _mdfd_getseg(reln, forknum, blocknum, skipFsync,
+	v = _mdfd_getseg(mdreln, forknum, blocknum, skipFsync,
 					 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
 
 	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
@@ -863,7 +913,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	}
 
 	if (!skipFsync && !SmgrIsTemp(reln))
-		register_dirty_segment(reln, forknum, v);
+		register_dirty_segment(mdreln, forknum, v);
 }
 
 /*
@@ -876,6 +926,7 @@ void
 mdwriteback(SMgrRelation reln, ForkNumber forknum,
 			BlockNumber blocknum, BlockNumber nblocks)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	Assert((io_direct_flags & IO_DIRECT_DATA) == 0);
 
 	/*
@@ -890,7 +941,7 @@ mdwriteback(SMgrRelation reln, ForkNumber forknum,
 		int			segnum_start,
 					segnum_end;
 
-		v = _mdfd_getseg(reln, forknum, blocknum, true /* not used */ ,
+		v = _mdfd_getseg(mdreln, forknum, blocknum, true /* not used */ ,
 						 EXTENSION_DONT_OPEN);
 
 		/*
@@ -937,11 +988,12 @@ mdnblocks(SMgrRelation reln, ForkNumber forknum)
 	MdfdVec    *v;
 	BlockNumber nblocks;
 	BlockNumber segno;
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 
-	mdopenfork(reln, forknum, EXTENSION_FAIL);
+	mdopenfork(mdreln, forknum, EXTENSION_FAIL);
 
 	/* mdopen has opened the first segment */
-	Assert(reln->md_num_open_segs[forknum] > 0);
+	Assert(mdreln->md_num_open_segs[forknum] > 0);
 
 	/*
 	 * Start from the last open segments, to avoid redundant seeks.  We have
@@ -956,12 +1008,12 @@ mdnblocks(SMgrRelation reln, ForkNumber forknum)
 	 * that's OK because the checkpointer never needs to compute relation
 	 * size.)
 	 */
-	segno = reln->md_num_open_segs[forknum] - 1;
-	v = &reln->md_seg_fds[forknum][segno];
+	segno = mdreln->md_num_open_segs[forknum] - 1;
+	v = &mdreln->md_seg_fds[forknum][segno];
 
 	for (;;)
 	{
-		nblocks = _mdnblocks(reln, forknum, v);
+		nblocks = _mdnblocks(mdreln, forknum, v);
 		if (nblocks > ((BlockNumber) RELSEG_SIZE))
 			elog(FATAL, "segment too big");
 		if (nblocks < ((BlockNumber) RELSEG_SIZE))
@@ -979,7 +1031,7 @@ mdnblocks(SMgrRelation reln, ForkNumber forknum)
 		 * undermines _mdfd_getseg's attempts to notice and report an error
 		 * upon access to a missing segment.
 		 */
-		v = _mdfd_openseg(reln, forknum, segno, 0);
+		v = _mdfd_openseg(mdreln, forknum, segno, 0);
 		if (v == NULL)
 			return segno * ((BlockNumber) RELSEG_SIZE);
 	}
@@ -994,6 +1046,7 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
 	BlockNumber curnblk;
 	BlockNumber priorblocks;
 	int			curopensegs;
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 
 	/*
 	 * NOTE: mdnblocks makes sure we have opened all active segments, so that
@@ -1017,14 +1070,14 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
 	 * Truncate segments, starting at the last one. Starting at the end makes
 	 * managing the memory for the fd array easier, should there be errors.
 	 */
-	curopensegs = reln->md_num_open_segs[forknum];
+	curopensegs = mdreln->md_num_open_segs[forknum];
 	while (curopensegs > 0)
 	{
 		MdfdVec    *v;
 
 		priorblocks = (curopensegs - 1) * RELSEG_SIZE;
 
-		v = &reln->md_seg_fds[forknum][curopensegs - 1];
+		v = &mdreln->md_seg_fds[forknum][curopensegs - 1];
 
 		if (priorblocks > nblocks)
 		{
@@ -1039,13 +1092,13 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
 								FilePathName(v->mdfd_vfd))));
 
 			if (!SmgrIsTemp(reln))
-				register_dirty_segment(reln, forknum, v);
+				register_dirty_segment(mdreln, forknum, v);
 
 			/* we never drop the 1st segment */
-			Assert(v != &reln->md_seg_fds[forknum][0]);
+			Assert(v != &mdreln->md_seg_fds[forknum][0]);
 
 			FileClose(v->mdfd_vfd);
-			_fdvec_resize(reln, forknum, curopensegs - 1);
+			_fdvec_resize(mdreln, forknum, curopensegs - 1);
 		}
 		else if (priorblocks + ((BlockNumber) RELSEG_SIZE) > nblocks)
 		{
@@ -1065,7 +1118,7 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
 								FilePathName(v->mdfd_vfd),
 								nblocks)));
 			if (!SmgrIsTemp(reln))
-				register_dirty_segment(reln, forknum, v);
+				register_dirty_segment(mdreln, forknum, v);
 		}
 		else
 		{
@@ -1095,6 +1148,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 {
 	int			segno;
 	int			min_inactive_seg;
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 
 	/*
 	 * NOTE: mdnblocks makes sure we have opened all active segments, so that
@@ -1102,7 +1156,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 	 */
 	mdnblocks(reln, forknum);
 
-	min_inactive_seg = segno = reln->md_num_open_segs[forknum];
+	min_inactive_seg = segno = mdreln->md_num_open_segs[forknum];
 
 	/*
 	 * Temporarily open inactive segments, then close them after sync.  There
@@ -1110,12 +1164,12 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 	 * is harmless.  We don't bother to clean them up and take a risk of
 	 * further trouble.  The next mdclose() will soon close them.
 	 */
-	while (_mdfd_openseg(reln, forknum, segno, 0) != NULL)
+	while (_mdfd_openseg(mdreln, forknum, segno, 0) != NULL)
 		segno++;
 
 	while (segno > 0)
 	{
-		MdfdVec    *v = &reln->md_seg_fds[forknum][segno - 1];
+		MdfdVec    *v = &mdreln->md_seg_fds[forknum][segno - 1];
 
 		/*
 		 * fsyncs done through mdimmedsync() should be tracked in a separate
@@ -1136,7 +1190,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 		if (segno > min_inactive_seg)
 		{
 			FileClose(v->mdfd_vfd);
-			_fdvec_resize(reln, forknum, segno - 1);
+			_fdvec_resize(mdreln, forknum, segno - 1);
 		}
 
 		segno--;
@@ -1153,14 +1207,14 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
  * enough to be a performance problem).
  */
 static void
-register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
+register_dirty_segment(MdSMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 {
 	FileTag		tag;
 
-	INIT_MD_FILETAG(tag, reln->smgr_rlocator.locator, forknum, seg->mdfd_segno);
+	INIT_MD_FILETAG(tag, reln->reln.smgr_rlocator.locator, forknum, seg->mdfd_segno);
 
 	/* Temp relations should never be fsync'd */
-	Assert(!SmgrIsTemp(reln));
+	Assert(!SmgrIsTemp(&reln->reln));
 
 	if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
 	{
@@ -1278,7 +1332,7 @@ DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo)
  * _fdvec_resize() -- Resize the fork's open segments array
  */
 static void
-_fdvec_resize(SMgrRelation reln,
+_fdvec_resize(MdSMgrRelation reln,
 			  ForkNumber forknum,
 			  int nseg)
 {
@@ -1316,12 +1370,12 @@ _fdvec_resize(SMgrRelation reln,
  * returned string is palloc'd.
  */
 static char *
-_mdfd_segpath(SMgrRelation reln, ForkNumber forknum, BlockNumber segno)
+_mdfd_segpath(MdSMgrRelation reln, ForkNumber forknum, BlockNumber segno)
 {
 	char	   *path,
 			   *fullpath;
 
-	path = relpath(reln->smgr_rlocator, forknum);
+	path = relpath(reln->reln.smgr_rlocator, forknum);
 
 	if (segno > 0)
 	{
@@ -1339,7 +1393,7 @@ _mdfd_segpath(SMgrRelation reln, ForkNumber forknum, BlockNumber segno)
  * and make a MdfdVec object for it.  Returns NULL on failure.
  */
 static MdfdVec *
-_mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
+_mdfd_openseg(MdSMgrRelation reln, ForkNumber forknum, BlockNumber segno,
 			  int oflags)
 {
 	MdfdVec    *v;
@@ -1384,7 +1438,7 @@ _mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
  * EXTENSION_CREATE case.
  */
 static MdfdVec *
-_mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
+_mdfd_getseg(MdSMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 			 bool skipFsync, int behavior)
 {
 	MdfdVec    *v;
@@ -1458,7 +1512,7 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 				char	   *zerobuf = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE,
 													 MCXT_ALLOC_ZERO);
 
-				mdextend(reln, forknum,
+				mdextend((SMgrRelation) reln, forknum,
 						 nextsegno * ((BlockNumber) RELSEG_SIZE) - 1,
 						 zerobuf, skipFsync);
 				pfree(zerobuf);
@@ -1515,7 +1569,7 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
  * Get number of blocks present in a single disk file
  */
 static BlockNumber
-_mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
+_mdnblocks(MdSMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 {
 	off_t		len;
 
@@ -1538,7 +1592,7 @@ _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 int
 mdsyncfiletag(const FileTag *ftag, char *path)
 {
-	SMgrRelation reln = smgropen(ftag->rlocator, InvalidBackendId);
+	MdSMgrRelation reln = (MdSMgrRelation) smgropen(ftag->rlocator, InvalidBackendId);
 	File		file;
 	instr_time	io_start;
 	bool		need_to_close;
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index f76c4605db..d37202609f 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -19,77 +19,23 @@
 
 #include "access/xlogutils.h"
 #include "lib/ilist.h"
+#include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/md.h"
 #include "storage/smgr.h"
+#include "port/atomics.h"
 #include "utils/hsearch.h"
 #include "utils/inval.h"
+#include "utils/memutils.h"
 
 
-/*
- * This struct of function pointers defines the API between smgr.c and
- * any individual storage manager module.  Note that smgr subfunctions are
- * generally expected to report problems via elog(ERROR).  An exception is
- * that smgr_unlink should use elog(WARNING), rather than erroring out,
- * because we normally unlink relations during post-commit/abort cleanup,
- * and so it's too late to raise an error.  Also, various conditions that
- * would normally be errors should be allowed during bootstrap and/or WAL
- * recovery --- see comments in md.c for details.
- */
-typedef struct f_smgr
-{
-	void		(*smgr_init) (void);	/* may be NULL */
-	void		(*smgr_shutdown) (void);	/* may be NULL */
-	void		(*smgr_open) (SMgrRelation reln);
-	void		(*smgr_close) (SMgrRelation reln, ForkNumber forknum);
-	void		(*smgr_create) (SMgrRelation reln, ForkNumber forknum,
-								bool isRedo);
-	bool		(*smgr_exists) (SMgrRelation reln, ForkNumber forknum);
-	void		(*smgr_unlink) (RelFileLocatorBackend rlocator, ForkNumber forknum,
-								bool isRedo);
-	void		(*smgr_extend) (SMgrRelation reln, ForkNumber forknum,
-								BlockNumber blocknum, const void *buffer, bool skipFsync);
-	void		(*smgr_zeroextend) (SMgrRelation reln, ForkNumber forknum,
-									BlockNumber blocknum, int nblocks, bool skipFsync);
-	bool		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
-								  BlockNumber blocknum);
-	void		(*smgr_read) (SMgrRelation reln, ForkNumber forknum,
-							  BlockNumber blocknum, void *buffer);
-	void		(*smgr_write) (SMgrRelation reln, ForkNumber forknum,
-							   BlockNumber blocknum, const void *buffer, bool skipFsync);
-	void		(*smgr_writeback) (SMgrRelation reln, ForkNumber forknum,
-								   BlockNumber blocknum, BlockNumber nblocks);
-	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
-	void		(*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
-								  BlockNumber nblocks);
-	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
-} f_smgr;
-
-static const f_smgr smgrsw[] = {
-	/* magnetic disk */
-	{
-		.smgr_init = mdinit,
-		.smgr_shutdown = NULL,
-		.smgr_open = mdopen,
-		.smgr_close = mdclose,
-		.smgr_create = mdcreate,
-		.smgr_exists = mdexists,
-		.smgr_unlink = mdunlink,
-		.smgr_extend = mdextend,
-		.smgr_zeroextend = mdzeroextend,
-		.smgr_prefetch = mdprefetch,
-		.smgr_read = mdread,
-		.smgr_write = mdwrite,
-		.smgr_writeback = mdwriteback,
-		.smgr_nblocks = mdnblocks,
-		.smgr_truncate = mdtruncate,
-		.smgr_immedsync = mdimmedsync,
-	}
-};
+static f_smgr *smgrsw;
 
-static const int NSmgr = lengthof(smgrsw);
+static int NSmgr = 0;
+
+static Size LargestSMgrRelationSize = 0;
 
 /*
  * Each backend has a hashtable that stores all extant SMgrRelation objects.
@@ -102,6 +48,57 @@ static dlist_head unowned_relns;
 /* local function prototypes */
 static void smgrshutdown(int code, Datum arg);
 
+SMgrId
+smgr_register(const f_smgr *smgr, Size smgrrelation_size)
+{
+	SMgrId my_id;
+	MemoryContext old;
+
+	if (process_shared_preload_libraries_done)
+		elog(FATAL, "SMgrs must be registered in the shared_preload_libraries phase");
+	if (NSmgr == MaxSMgrId)
+		elog(FATAL, "Too many smgrs registered");
+	if (smgr->name == NULL || *smgr->name == 0)
+		elog(FATAL, "smgr registered with invalid name");
+
+	Assert(smgr->smgr_open != NULL);
+	Assert(smgr->smgr_close != NULL);
+	Assert(smgr->smgr_create != NULL);
+	Assert(smgr->smgr_exists != NULL);
+	Assert(smgr->smgr_unlink != NULL);
+	Assert(smgr->smgr_extend != NULL);
+	Assert(smgr->smgr_zeroextend != NULL);
+	Assert(smgr->smgr_prefetch != NULL);
+	Assert(smgr->smgr_read != NULL);
+	Assert(smgr->smgr_write != NULL);
+	Assert(smgr->smgr_writeback != NULL);
+	Assert(smgr->smgr_nblocks != NULL);
+	Assert(smgr->smgr_truncate != NULL);
+	Assert(smgr->smgr_immedsync != NULL);
+	old = MemoryContextSwitchTo(TopMemoryContext);
+
+	my_id = NSmgr++;
+	if (my_id == 0)
+		smgrsw = palloc(sizeof(f_smgr));
+	else
+		smgrsw = repalloc(smgrsw, sizeof(f_smgr) * NSmgr);
+
+	MemoryContextSwitchTo(old);
+
+	pg_compiler_barrier();
+
+	if (!smgrsw)
+	{
+		NSmgr--;
+		elog(FATAL, "Failed to extend smgr array");
+	}
+
+	memcpy(&smgrsw[my_id], smgr, sizeof(f_smgr));
+
+	LargestSMgrRelationSize = Max(LargestSMgrRelationSize, smgrrelation_size);
+
+	return my_id;
+}
 
 /*
  * smgrinit(), smgrshutdown() -- Initialize or shut down storage
@@ -157,9 +154,11 @@ smgropen(RelFileLocator rlocator, BackendId backend)
 	{
 		/* First time through: initialize the hash table */
 		HASHCTL		ctl;
+		LargestSMgrRelationSize = MAXALIGN(LargestSMgrRelationSize);
+		Assert(NSmgr > 0);
 
 		ctl.keysize = sizeof(RelFileLocatorBackend);
-		ctl.entrysize = sizeof(SMgrRelationData);
+		ctl.entrysize = LargestSMgrRelationSize;
 		SMgrRelationHash = hash_create("smgr relation table", 400,
 									   &ctl, HASH_ELEM | HASH_BLOBS);
 		dlist_init(&unowned_relns);
@@ -180,7 +179,8 @@ smgropen(RelFileLocator rlocator, BackendId backend)
 		reln->smgr_targblock = InvalidBlockNumber;
 		for (int i = 0; i <= MAX_FORKNUM; ++i)
 			reln->smgr_cached_nblocks[i] = InvalidBlockNumber;
-		reln->smgr_which = 0;	/* we only have md.c at present */
+
+		reln->smgr_which = MdSMgrId;	/* we only have md.c at present */
 
 		/* implementation-specific initialization */
 		smgrsw[reln->smgr_which].smgr_open(reln);
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index a604432126..dab4be80c9 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -42,6 +42,7 @@
 #include "postmaster/postmaster.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
+#include "storage/md.h"
 #include "storage/latch.h"
 #include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
@@ -199,6 +200,9 @@ InitStandaloneProcess(const char *argv0)
 	InitProcessLocalLatch();
 	InitializeLatchWaitSet();
 
+	/* Initialize smgrs */
+	register_builtin_dynamic_managers();
+
 	/*
 	 * For consistency with InitPostmasterChild, initialize signal mask here.
 	 * But we don't unblock SIGQUIT or provide a default handler for it.
@@ -1868,6 +1872,14 @@ process_session_preload_libraries(void)
 				   true);
 }
 
+/*
+ * Register any internal managers.
+ */
+void register_builtin_dynamic_managers(void)
+{
+	mdsmgr_register();
+}
+
 /*
  * process any shared memory requests from preloaded libraries
  */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14bd574fc2..8f53b6351c 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -488,6 +488,7 @@ extern void TouchSocketLockFiles(void);
 extern void AddToDataDirLockFile(int target_line, const char *str);
 extern bool RecheckDataDirLockFile(void);
 extern void ValidatePgVersion(const char *path);
+extern void register_builtin_dynamic_managers(void);
 extern void process_shared_preload_libraries(void);
 extern void process_session_preload_libraries(void);
 extern void process_shmem_requests(void);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 941879ee6a..beeddfd373 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -19,6 +19,10 @@
 #include "storage/smgr.h"
 #include "storage/sync.h"
 
+/* registration function for md storage manager */
+extern void mdsmgr_register(void);
+extern SMgrId MdSMgrId;
+
 /* md storage manager functionality */
 extern void mdinit(void);
 extern void mdopen(SMgrRelation reln);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a9a179aaba..5ad1d50e0c 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,10 @@
 #include "storage/block.h"
 #include "storage/relfilelocator.h"
 
+typedef uint8 SMgrId;
+
+#define MaxSMgrId UINT8_MAX
+
 /*
  * smgr.c maintains a table of SMgrRelation objects, which are essentially
  * cached file handles.  An SMgrRelation is created (if not already present)
@@ -59,14 +63,8 @@ typedef struct SMgrRelationData
 	 * Fields below here are intended to be private to smgr.c and its
 	 * submodules.  Do not touch them from elsewhere.
 	 */
-	int			smgr_which;		/* storage manager selector */
-
-	/*
-	 * for md.c; per-fork arrays of the number of open segments
-	 * (md_num_open_segs) and the segments themselves (md_seg_fds).
-	 */
-	int			md_num_open_segs[MAX_FORKNUM + 1];
-	struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];
+	SMgrId		smgr_which;		/* storage manager selector */
+	int			smgrrelation_size;	/* size of this struct, incl. smgr-specific data */
 
 	/* if unowned, list link in list of all unowned SMgrRelations */
 	dlist_node	node;
@@ -77,6 +75,48 @@ typedef SMgrRelationData *SMgrRelation;
 #define SmgrIsTemp(smgr) \
 	RelFileLocatorBackendIsTemp((smgr)->smgr_rlocator)
 
+/*
+ * This struct of function pointers defines the API between smgr.c and
+ * any individual storage manager module.  Note that smgr subfunctions are
+ * generally expected to report problems via elog(ERROR).  An exception is
+ * that smgr_unlink should use elog(WARNING), rather than erroring out,
+ * because we normally unlink relations during post-commit/abort cleanup,
+ * and so it's too late to raise an error.  Also, various conditions that
+ * would normally be errors should be allowed during bootstrap and/or WAL
+ * recovery --- see comments in md.c for details.
+ */
+typedef struct f_smgr
+{
+	const char *name;
+	void		(*smgr_init) (void);		/* may be NULL */
+	void		(*smgr_shutdown) (void);	/* may be NULL */
+	void		(*smgr_open) (SMgrRelation reln);
+	void		(*smgr_close) (SMgrRelation reln, ForkNumber forknum);
+	void		(*smgr_create) (SMgrRelation reln, ForkNumber forknum,
+								bool isRedo);
+	bool		(*smgr_exists) (SMgrRelation reln, ForkNumber forknum);
+	void		(*smgr_unlink) (RelFileLocatorBackend rlocator, ForkNumber forknum,
+								bool isRedo);
+	void		(*smgr_extend) (SMgrRelation reln, ForkNumber forknum,
+								BlockNumber blocknum, const void *buffer, bool skipFsync);
+	void		(*smgr_zeroextend) (SMgrRelation reln, ForkNumber forknum,
+									BlockNumber blocknum, int nblocks, bool skipFsync);
+	bool		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
+								  BlockNumber blocknum);
+	void		(*smgr_read) (SMgrRelation reln, ForkNumber forknum,
+							  BlockNumber blocknum, void *buffer);
+	void		(*smgr_write) (SMgrRelation reln, ForkNumber forknum,
+							   BlockNumber blocknum, const void *buffer, bool skipFsync);
+	void		(*smgr_writeback) (SMgrRelation reln, ForkNumber forknum,
+								   BlockNumber blocknum, BlockNumber nblocks);
+	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
+	void		(*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
+								  BlockNumber nblocks);
+	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+} f_smgr;
+
+extern SMgrId smgr_register(const f_smgr *smgr, Size smgrrelation_size);
+
 extern void smgrinit(void);
 extern SMgrRelation smgropen(RelFileLocator rlocator, BackendId backend);
 extern bool smgrexists(SMgrRelation reln, ForkNumber forknum);
-- 
2.39.0
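
For readers skimming the f_smgr struct in the patch above: the sketch below
shows the general shape of a registration-based dispatch table, that is, a
table of function pointers that grows as managers register and hands back a
small integer ID. It is a standalone illustration only, not the patch's
smgr_register(). Every demo_* name, the toy nblocks callbacks, and the
realloc-based growth are assumptions made for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* All demo_* names are hypothetical; none of them exist in PostgreSQL. */
typedef int DemoSMgrId;

typedef struct DemoSmgr
{
	const char *name;
	void		(*init) (void);			/* may be NULL, like smgr_init */
	int			(*nblocks) (int relid);	/* required callback */
} DemoSmgr;

static DemoSmgr *demo_smgrsw = NULL;	/* grows as managers register */
static int	demo_nsmgr = 0;

/* Copy *smgr into the table; its index becomes the manager's ID. */
DemoSMgrId
demo_smgr_register(const DemoSmgr *smgr)
{
	DemoSmgr   *grown;

	assert(smgr != NULL && smgr->name != NULL && smgr->nblocks != NULL);

	grown = realloc(demo_smgrsw, (size_t) (demo_nsmgr + 1) * sizeof(DemoSmgr));
	if (grown == NULL)
		abort();				/* out of memory: give up in this sketch */

	demo_smgrsw = grown;
	demo_smgrsw[demo_nsmgr] = *smgr;
	return demo_nsmgr++;
}

/* Dispatch through the table slot selected by the ID. */
int
demo_smgr_nblocks(DemoSMgrId id, int relid)
{
	assert(id >= 0 && id < demo_nsmgr);
	return demo_smgrsw[id].nblocks(relid);
}

/* Two toy managers with distinguishable behaviour. */
static int
demo_md_nblocks(int relid)
{
	return relid * 2;
}

static int
demo_remote_nblocks(int relid)
{
	return relid + 100;
}

DemoSMgrId
demo_register_md(void)
{
	static const DemoSmgr md = {.name = "md", .nblocks = demo_md_nblocks};

	return demo_smgr_register(&md);
}

DemoSMgrId
demo_register_remote(void)
{
	static const DemoSmgr remote = {.name = "remote", .nblocks = demo_remote_nblocks};

	return demo_smgr_register(&remote);
}
```

Calling demo_register_md() and then demo_register_remote() hands out
consecutive IDs, and demo_smgr_nblocks() routes each call through the
matching table slot. The IDs depend on registration order, so they are only
meaningful within one process configuration.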

Attachment: v1-0002-Prototype-Allow-tablespaces-to-specify-which-SMGR.patch (application/octet-stream)
From 8db3e73a6fe60c114335a47432a80ecb447b9357 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 30 Jun 2023 14:15:36 +0200
Subject: [PATCH v1 2/2] Prototype: Allow tablespaces to specify which SMGR
 they use

This allows for tablespaces that are not present on the local file system.

For now, the default tablespaces (pg_default and pg_global) are still
dependent on the md.c smgr, but in the future this may change as well.
---
 src/backend/access/rmgrdesc/tblspcdesc.c |   2 +-
 src/backend/commands/tablespace.c        | 182 +++----------------
 src/backend/parser/gram.y                |  32 +++-
 src/backend/storage/smgr/md.c            | 214 ++++++++++++++++++++++-
 src/backend/storage/smgr/smgr.c          |  72 +++++++-
 src/backend/utils/cache/spccache.c       |  38 +++-
 src/include/catalog/pg_tablespace.dat    |   6 +-
 src/include/catalog/pg_tablespace.h      |   1 +
 src/include/commands/tablespace.h        |   3 +-
 src/include/nodes/parsenodes.h           |   3 +-
 src/include/storage/md.h                 |  10 ++
 src/include/storage/smgr.h               |  20 ++-
 src/include/utils/spccache.h             |   2 +
 13 files changed, 407 insertions(+), 178 deletions(-)
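
Since this patch stores the storage manager's name in pg_tablespace and
resolves it to a volatile ID at run time, here is a minimal standalone sketch
of that kind of lookup: a linear scan by name with a one-entry "most recent"
cache and an invalid-ID return for unknown names. It only mirrors the idea of
get_smgr_by_name() in the smgr.c changes below. The fixed demo_names array
and all demo_* / DEMO_* identifiers are assumptions for illustration, and the
sketch always behaves as if missing_ok were true.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; these identifiers are not part of the patch. */
#define DEMO_INVALID_SMGR_ID (-1)

static const char *const demo_names[] = {"md", "remote"};
#define DEMO_NSMGR ((int) (sizeof(demo_names) / sizeof(demo_names[0])))

/* One-entry cache of the most recent successful lookup. */
static const char *demo_recent_name = NULL;
static int	demo_recent_id = DEMO_INVALID_SMGR_ID;

/* Return the ID registered under "name", or DEMO_INVALID_SMGR_ID. */
int
demo_get_smgr_id(const char *name)
{
	assert(name != NULL);

	/* Fast path: same name as last time. */
	if (demo_recent_name != NULL && strcmp(name, demo_recent_name) == 0)
		return demo_recent_id;

	for (int id = 0; id < DEMO_NSMGR; id++)
	{
		if (strcmp(name, demo_names[id]) == 0)
		{
			/* Cache the table's own copy of the name, not the caller's. */
			demo_recent_name = demo_names[id];
			demo_recent_id = id;
			return id;
		}
	}

	/* Unknown names leave the cache untouched. */
	return DEMO_INVALID_SMGR_ID;
}
```

Caching the table's own string rather than the caller's pointer keeps the
cache valid after the caller's buffer goes away, which matters when the name
comes from a catalog tuple.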

diff --git a/src/backend/access/rmgrdesc/tblspcdesc.c b/src/backend/access/rmgrdesc/tblspcdesc.c
index b8c89f8c54..04cc15e121 100644
--- a/src/backend/access/rmgrdesc/tblspcdesc.c
+++ b/src/backend/access/rmgrdesc/tblspcdesc.c
@@ -27,7 +27,7 @@ tblspc_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_tblspc_create_rec *xlrec = (xl_tblspc_create_rec *) rec;
 
-		appendStringInfo(buf, "%u \"%s\"", xlrec->ts_id, xlrec->ts_path);
+		appendStringInfo(buf, "%u \"%s\"", xlrec->ts_id, NameStr(xlrec->ts_smgr));
 	}
 	else if (info == XLOG_TBLSPC_DROP)
 	{
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 13b0dee146..b3da4a1b93 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -74,6 +74,7 @@
 #include "miscadmin.h"
 #include "postmaster/bgwriter.h"
 #include "storage/fd.h"
+#include "storage/md.h"
 #include "storage/lmgr.h"
 #include "storage/standby.h"
 #include "utils/acl.h"
@@ -92,8 +93,6 @@ bool		allow_in_place_tablespaces = false;
 
 Oid			binary_upgrade_next_pg_tablespace_oid = InvalidOid;
 
-static void create_tablespace_directories(const char *location,
-										  const Oid tablespaceoid);
 static bool destroy_tablespace_directories(Oid tablespaceoid, bool redo);
 
 
@@ -218,10 +217,8 @@ CreateTableSpace(CreateTableSpaceStmt *stmt)
 	bool		nulls[Natts_pg_tablespace] = {0};
 	HeapTuple	tuple;
 	Oid			tablespaceoid;
-	char	   *location;
 	Oid			ownerId;
 	Datum		newOptions;
-	bool		in_place;
 
 	/* Must be superuser */
 	if (!superuser())
@@ -237,47 +234,7 @@ CreateTableSpace(CreateTableSpaceStmt *stmt)
 	else
 		ownerId = GetUserId();
 
-	/* Unix-ify the offered path, and strip any trailing slashes */
-	location = pstrdup(stmt->location);
-	canonicalize_path(location);
-
-	/* disallow quotes, else CREATE DATABASE would be at risk */
-	if (strchr(location, '\''))
-		ereport(ERROR,
-				(errcode(ERRCODE_INVALID_NAME),
-				 errmsg("tablespace location cannot contain single quotes")));
-
-	in_place = allow_in_place_tablespaces && strlen(location) == 0;
-
-	/*
-	 * Allowing relative paths seems risky
-	 *
-	 * This also helps us ensure that location is not empty or whitespace,
-	 * unless specifying a developer-only in-place tablespace.
-	 */
-	if (!in_place && !is_absolute_path(location))
-		ereport(ERROR,
-				(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
-				 errmsg("tablespace location must be an absolute path")));
-
-	/*
-	 * Check that location isn't too long. Remember that we're going to append
-	 * 'PG_XXX/<dboid>/<relid>_<fork>.<nnn>'.  FYI, we never actually
-	 * reference the whole path here, but MakePGDirectory() uses the first two
-	 * parts.
-	 */
-	if (strlen(location) + 1 + strlen(TABLESPACE_VERSION_DIRECTORY) + 1 +
-		OIDCHARS + 1 + OIDCHARS + 1 + FORKNAMECHARS + 1 + OIDCHARS > MAXPGPATH)
-		ereport(ERROR,
-				(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
-				 errmsg("tablespace location \"%s\" is too long",
-						location)));
-
-	/* Warn if the tablespace is in the data directory. */
-	if (path_is_prefix_of_path(DataDir, location))
-		ereport(WARNING,
-				(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
-				 errmsg("tablespace location should not be inside the data directory")));
+	smgrvalidatetspopts(stmt->smgr, stmt->smgropts);
 
 	/*
 	 * Disallow creation of tablespaces named "pg_xxx"; we reserve this
@@ -334,6 +291,8 @@ CreateTableSpace(CreateTableSpaceStmt *stmt)
 	values[Anum_pg_tablespace_oid - 1] = ObjectIdGetDatum(tablespaceoid);
 	values[Anum_pg_tablespace_spcname - 1] =
 		DirectFunctionCall1(namein, CStringGetDatum(stmt->tablespacename));
+	values[Anum_pg_tablespace_spcsmgr - 1] =
+		DirectFunctionCall1(namein, CStringGetDatum(stmt->smgr));
 	values[Anum_pg_tablespace_spcowner - 1] =
 		ObjectIdGetDatum(ownerId);
 	nulls[Anum_pg_tablespace_spcacl - 1] = true;
@@ -360,18 +319,22 @@ CreateTableSpace(CreateTableSpaceStmt *stmt)
 	/* Post creation hook for new tablespace */
 	InvokeObjectPostCreateHook(TableSpaceRelationId, tablespaceoid, 0);
 
-	create_tablespace_directories(location, tablespaceoid);
+	smgrcreatetsp(stmt->smgr, tablespaceoid, stmt->smgropts, 0);
 
 	/* Record the filesystem change in XLOG */
 	{
-		xl_tblspc_create_rec xlrec;
+		xl_tblspc_create_rec xlrec = {0};
+		Datum	smgropts;
 
 		xlrec.ts_id = tablespaceoid;
+		strlcpy(NameStr(xlrec.ts_smgr), stmt->smgr, NAMEDATALEN);
+		smgropts = transformRelOptions((Datum) 0, stmt->smgropts,
+									   NULL, NULL, false, false);
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec,
-						 offsetof(xl_tblspc_create_rec, ts_path));
-		XLogRegisterData((char *) location, strlen(location) + 1);
+						 offsetof(xl_tblspc_create_rec, ts_smgropts));
+		XLogRegisterData((char *) smgropts, VARSIZE_ANY(smgropts));
 
 		(void) XLogInsert(RM_TBLSPC_ID, XLOG_TBLSPC_CREATE);
 	}
@@ -384,8 +347,6 @@ CreateTableSpace(CreateTableSpaceStmt *stmt)
 	 */
 	ForceSyncCommit();
 
-	pfree(location);
-
 	/* We keep the lock on pg_tablespace until commit */
 	table_close(rel, NoLock);
 
@@ -401,6 +362,7 @@ void
 DropTableSpace(DropTableSpaceStmt *stmt)
 {
 	char	   *tablespacename = stmt->tablespacename;
+	char	   *smgrname;
 	TableScanDesc scandesc;
 	Relation	rel;
 	HeapTuple	tuple;
@@ -444,6 +406,7 @@ DropTableSpace(DropTableSpaceStmt *stmt)
 
 	spcform = (Form_pg_tablespace) GETSTRUCT(tuple);
 	tablespaceoid = spcform->oid;
+	smgrname = pstrdup(NameStr(spcform->spcsmgr));
 
 	/* Must be tablespace owner */
 	if (!object_ownercheck(TableSpaceRelationId, tablespaceoid, GetUserId()))
@@ -492,6 +455,8 @@ DropTableSpace(DropTableSpaceStmt *stmt)
 	 */
 	LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE);
 
+	smgrdroptsp(smgrname, tablespaceoid, false);
+
 	/*
 	 * Try to remove the physical infrastructure.
 	 */
@@ -567,114 +532,6 @@ DropTableSpace(DropTableSpaceStmt *stmt)
 	table_close(rel, NoLock);
 }
 
-
-/*
- * create_tablespace_directories
- *
- *	Attempt to create filesystem infrastructure linking $PGDATA/pg_tblspc/
- *	to the specified directory
- */
-static void
-create_tablespace_directories(const char *location, const Oid tablespaceoid)
-{
-	char	   *linkloc;
-	char	   *location_with_version_dir;
-	struct stat st;
-	bool		in_place;
-
-	linkloc = psprintf("pg_tblspc/%u", tablespaceoid);
-
-	/*
-	 * If we're asked to make an 'in place' tablespace, create the directory
-	 * directly where the symlink would normally go.  This is a developer-only
-	 * option for now, to facilitate regression testing.
-	 */
-	in_place = strlen(location) == 0;
-
-	if (in_place)
-	{
-		if (MakePGDirectory(linkloc) < 0 && errno != EEXIST)
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not create directory \"%s\": %m",
-							linkloc)));
-	}
-
-	location_with_version_dir = psprintf("%s/%s", in_place ? linkloc : location,
-										 TABLESPACE_VERSION_DIRECTORY);
-
-	/*
-	 * Attempt to coerce target directory to safe permissions.  If this fails,
-	 * it doesn't exist or has the wrong owner.  Not needed for in-place mode,
-	 * because in that case we created the directory with the desired
-	 * permissions.
-	 */
-	if (!in_place && chmod(location, pg_dir_create_mode) != 0)
-	{
-		if (errno == ENOENT)
-			ereport(ERROR,
-					(errcode(ERRCODE_UNDEFINED_FILE),
-					 errmsg("directory \"%s\" does not exist", location),
-					 InRecovery ? errhint("Create this directory for the tablespace before "
-										  "restarting the server.") : 0));
-		else
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not set permissions on directory \"%s\": %m",
-							location)));
-	}
-
-	/*
-	 * The creation of the version directory prevents more than one tablespace
-	 * in a single location.  This imitates TablespaceCreateDbspace(), but it
-	 * ignores concurrency and missing parent directories.  The chmod() would
-	 * have failed in the absence of a parent.  pg_tablespace_spcname_index
-	 * prevents concurrency.
-	 */
-	if (stat(location_with_version_dir, &st) < 0)
-	{
-		if (errno != ENOENT)
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not stat directory \"%s\": %m",
-							location_with_version_dir)));
-		else if (MakePGDirectory(location_with_version_dir) < 0)
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not create directory \"%s\": %m",
-							location_with_version_dir)));
-	}
-	else if (!S_ISDIR(st.st_mode))
-		ereport(ERROR,
-				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-				 errmsg("\"%s\" exists but is not a directory",
-						location_with_version_dir)));
-	else if (!InRecovery)
-		ereport(ERROR,
-				(errcode(ERRCODE_OBJECT_IN_USE),
-				 errmsg("directory \"%s\" already in use as a tablespace",
-						location_with_version_dir)));
-
-	/*
-	 * In recovery, remove old symlink, in case it points to the wrong place.
-	 */
-	if (!in_place && InRecovery)
-		remove_tablespace_symlink(linkloc);
-
-	/*
-	 * Create the symlink under PGDATA
-	 */
-	if (!in_place && symlink(location, linkloc) < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not create symbolic link \"%s\": %m",
-						linkloc)));
-
-	pfree(linkloc);
-	pfree(location_with_version_dir);
-}
-
-
 /*
  * destroy_tablespace_directories
  *
@@ -1524,9 +1381,12 @@ tblspc_redo(XLogReaderState *record)
 	if (info == XLOG_TBLSPC_CREATE)
 	{
 		xl_tblspc_create_rec *xlrec = (xl_tblspc_create_rec *) XLogRecGetData(record);
-		char	   *location = xlrec->ts_path;
+		smgrcreatetsp(NameStr(xlrec->ts_smgr), xlrec->ts_id,
+					  untransformRelOptions((Datum) &xlrec->ts_smgropts), true);
 
-		create_tablespace_directories(location, xlrec->ts_id);
+		/*
+		 * create_tablespace_directories(location, xlrec->ts_id);
+		 */
 	}
 	else if (info == XLOG_TBLSPC_DROP)
 	{
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 39ab7eac0d..49742553d4 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -60,6 +60,7 @@
 #include "nodes/nodeFuncs.h"
 #include "parser/parser.h"
 #include "storage/lmgr.h"
+#include "storage/md.h"
 #include "utils/date.h"
 #include "utils/datetime.h"
 #include "utils/numeric.h"
@@ -394,6 +395,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 				opt_inline_handler opt_validator validator_clause
 				opt_collate
 
+%type <node>	OptTableSpaceStorage
+
 %type <range>	qualified_name insert_target OptConstrFromTable
 
 %type <str>		all_Op MathOp
@@ -4931,18 +4934,35 @@ opt_procedural:
 /*****************************************************************************
  *
  *		QUERY:
- *             CREATE TABLESPACE tablespace LOCATION '/path/to/tablespace/'
+ *             CREATE TABLESPACE tablespace
+ *                 [ OWNER role ]
+ *                 [ LOCATION '/path/to/tablespace/' | USING smgr ( option [, ...] ) ]
+ *                 [ WITH ( option [ , ... ] ) ]
  *
  *****************************************************************************/
 
-CreateTableSpaceStmt: CREATE TABLESPACE name OptTableSpaceOwner LOCATION Sconst opt_reloptions
+CreateTableSpaceStmt: CREATE TABLESPACE name OptTableSpaceOwner OptTableSpaceStorage opt_reloptions
 				{
-					CreateTableSpaceStmt *n = makeNode(CreateTableSpaceStmt);
-
+					CreateTableSpaceStmt *n = (CreateTableSpaceStmt *) $5;
 					n->tablespacename = $3;
 					n->owner = $4;
-					n->location = $6;
-					n->options = $7;
+					n->options = $6;
+					$$ = (Node *) n;
+				}
+		;
+
+OptTableSpaceStorage: LOCATION Sconst
+				{
+					CreateTableSpaceStmt *n = makeNode(CreateTableSpaceStmt);
+					n->smgr = MD_SMGR_NAME;
+					n->smgropts = list_make1(makeDefElem("location", (Node *) makeString($2), @1));
+					$$ = (Node *) n;
+				}
+			| USING name '(' utility_option_list ')'
+				{
+					CreateTableSpaceStmt *n = makeNode(CreateTableSpaceStmt);
+					n->smgr = $2;
+					n->smgropts = $4;
 					$$ = (Node *) n;
 				}
 		;
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 690bdd27c5..dfc5a11da4 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -22,12 +22,16 @@
 #include "postgres.h"
 
 #include <unistd.h>
+#include <dirent.h>
 #include <fcntl.h>
 #include <sys/file.h>
+#include <sys/stat.h>
 
 #include "access/xlog.h"
 #include "access/xlogutils.h"
+#include "commands/defrem.h"
 #include "commands/tablespace.h"
+#include "common/file_perm.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "pgstat.h"
@@ -156,6 +160,9 @@ void mdsmgr_register(void)
 		.smgr_nblocks = mdnblocks,
 		.smgr_truncate = mdtruncate,
 		.smgr_immedsync = mdimmedsync,
+		.smgr_validate_tspopts = mdvalidatetspopts,
+		.smgr_create_tsp = mdcreatetsp,
+		.smgr_drop_tsp = mddroptsp,
 	};
 
 	MdSMgrId = smgr_register(&md_smgr, sizeof(MdSMgrRelationData));
@@ -213,6 +220,7 @@ mdinit(void)
 bool
 mdexists(SMgrRelation reln, ForkNumber forknum)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	/*
 	 * Close it first, to ensure that we notice if the fork has been unlinked
 	 * since we opened it.  As an optimization, we can skip that in recovery,
@@ -221,7 +229,7 @@ mdexists(SMgrRelation reln, ForkNumber forknum)
 	if (!InRecovery)
 		mdclose(reln, forknum);
 
-	return (mdopenfork(reln, forknum, EXTENSION_RETURN_NULL) != NULL);
+	return (mdopenfork(mdreln, forknum, EXTENSION_RETURN_NULL) != NULL);
 }
 
 /*
@@ -1672,3 +1680,207 @@ mdfiletagmatches(const FileTag *ftag, const FileTag *candidate)
 	 */
 	return ftag->rlocator.dbOid == candidate->rlocator.dbOid;
 }
+
+void mdvalidatetspopts(List *opts)
+{
+	ListCell   *option;
+	char	   *location;
+	bool		in_place;
+
+	if (list_length(opts) != 1)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_NAME),
+				 errmsg("the %s storage manager requires exactly one storage option", MD_SMGR_NAME),
+				 errhint("Only \"location\" is supported.")));
+
+	foreach(option, opts)
+	{
+		DefElem    *defel = lfirst_node(DefElem, option);
+
+		if (strcmp(defel->defname, "location") == 0)
+		{
+			location = pstrdup(defGetString(defel));
+		}
+		else
+		{
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("unrecognized option \"%s\" for the %s storage manager",
+							defel->defname, MD_SMGR_NAME),
+					 errhint("Only \"location\" is supported."),
+					 errposition(defel->location)));
+		}
+	}
+
+	/* Unix-ify the offered path, and strip any trailing slashes */
+	canonicalize_path(location);
+
+	/* disallow quotes, else CREATE DATABASE would be at risk */
+	if (strchr(location, '\''))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_NAME),
+					errmsg("tablespace location cannot contain single quotes")));
+
+	in_place = allow_in_place_tablespaces && strlen(location) == 0;
+
+	/*
+	 * Allowing relative paths seems risky
+	 *
+	 * This also helps us ensure that location is not empty or whitespace,
+	 * unless specifying a developer-only in-place tablespace.
+	 */
+	if (!in_place && !is_absolute_path(location))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+					errmsg("tablespace location must be an absolute path")));
+
+	/*
+	 * Check that location isn't too long. Remember that we're going to append
+	 * 'PG_XXX/<dboid>/<relid>_<fork>.<nnn>'.  FYI, we never actually
+	 * reference the whole path here, but MakePGDirectory() uses the first two
+	 * parts.
+	 */
+	if (strlen(location) + 1 + strlen(TABLESPACE_VERSION_DIRECTORY) + 1 +
+		OIDCHARS + 1 + OIDCHARS + 1 + FORKNAMECHARS + 1 + OIDCHARS > MAXPGPATH)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+					errmsg("tablespace location \"%s\" is too long",
+						   location)));
+
+	/* Warn if the tablespace is in the data directory. */
+	if (path_is_prefix_of_path(DataDir, location))
+		ereport(WARNING,
+				(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+					errmsg("tablespace location should not be inside the data directory")));
+
+	pfree(location);
+}
+
+void mdcreatetsp(Oid tablespaceoid, List *opts, bool isredo)
+{
+	char	   *location;
+	DefElem	   *defel = (DefElem *) linitial_node(DefElem, opts);
+
+	Assert(strcmp(defel->defname, "location") == 0);
+	Assert(list_length(opts) == 1);
+
+	location = pstrdup(defGetString(defel));
+
+	/* Unix-ify the offered path, and strip any trailing slashes */
+	canonicalize_path(location);
+
+	create_tablespace_directories(location, tablespaceoid);
+
+	pfree(location);
+}
+
+void mddroptsp(Oid tsp, bool isredo)
+{
+	/* no-op for now; DropTableSpace() still removes the directories */
+}
+
+/*
+ * create_tablespace_directories
+ *
+ *	Attempt to create filesystem infrastructure linking $PGDATA/pg_tblspc/
+ *	to the specified directory
+ */
+void
+create_tablespace_directories(const char *location, const Oid tablespaceoid)
+{
+	char	   *linkloc;
+	char	   *location_with_version_dir;
+	struct stat st;
+	bool		in_place;
+
+	linkloc = psprintf("pg_tblspc/%u", tablespaceoid);
+
+	/*
+	 * If we're asked to make an 'in place' tablespace, create the directory
+	 * directly where the symlink would normally go.  This is a developer-only
+	 * option for now, to facilitate regression testing.
+	 */
+	in_place = strlen(location) == 0;
+
+	if (in_place)
+	{
+		if (MakePGDirectory(linkloc) < 0 && errno != EEXIST)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+						errmsg("could not create directory \"%s\": %m",
+							   linkloc)));
+	}
+
+	location_with_version_dir = psprintf("%s/%s", in_place ? linkloc : location,
+										 TABLESPACE_VERSION_DIRECTORY);
+
+	/*
+	 * Attempt to coerce target directory to safe permissions.  If this fails,
+	 * it doesn't exist or has the wrong owner.  Not needed for in-place mode,
+	 * because in that case we created the directory with the desired
+	 * permissions.
+	 */
+	if (!in_place && chmod(location, pg_dir_create_mode) != 0)
+	{
+		if (errno == ENOENT)
+			ereport(ERROR,
+					(errcode(ERRCODE_UNDEFINED_FILE),
+						errmsg("directory \"%s\" does not exist", location),
+						InRecovery ? errhint("Create this directory for the tablespace before "
+											 "restarting the server.") : 0));
+		else
+			ereport(ERROR,
+					(errcode_for_file_access(),
+						errmsg("could not set permissions on directory \"%s\": %m",
+							   location)));
+	}
+
+	/*
+	 * The creation of the version directory prevents more than one tablespace
+	 * in a single location.  This imitates TablespaceCreateDbspace(), but it
+	 * ignores concurrency and missing parent directories.  The chmod() would
+	 * have failed in the absence of a parent.  pg_tablespace_spcname_index
+	 * prevents concurrency.
+	 */
+	if (stat(location_with_version_dir, &st) < 0)
+	{
+		if (errno != ENOENT)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+						errmsg("could not stat directory \"%s\": %m",
+							   location_with_version_dir)));
+		else if (MakePGDirectory(location_with_version_dir) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+						errmsg("could not create directory \"%s\": %m",
+							   location_with_version_dir)));
+	}
+	else if (!S_ISDIR(st.st_mode))
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+					errmsg("\"%s\" exists but is not a directory",
+						   location_with_version_dir)));
+	else if (!InRecovery)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_IN_USE),
+					errmsg("directory \"%s\" already in use as a tablespace",
+						   location_with_version_dir)));
+
+	/*
+	 * In recovery, remove old symlink, in case it points to the wrong place.
+	 */
+	if (!in_place && InRecovery)
+		remove_tablespace_symlink(linkloc);
+
+	/*
+	 * Create the symlink under PGDATA
+	 */
+	if (!in_place && symlink(location, linkloc) < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+					errmsg("could not create symbolic link \"%s\": %m",
+						   linkloc)));
+
+	pfree(linkloc);
+	pfree(location_with_version_dir);
+}
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index d37202609f..b5cb720064 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -18,6 +18,7 @@
 #include "postgres.h"
 
 #include "access/xlogutils.h"
+#include "catalog/pg_tablespace_d.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -29,7 +30,7 @@
 #include "utils/hsearch.h"
 #include "utils/inval.h"
 #include "utils/memutils.h"
-
+#include "utils/spccache.h"
 
 static f_smgr *smgrsw;
 
@@ -174,13 +175,25 @@ smgropen(RelFileLocator rlocator, BackendId backend)
 	/* Initialize it if not present before */
 	if (!found)
 	{
+		Oid		tspid = reln->smgr_rlocator.locator.spcOid;
 		/* hash_search already filled in the lookup key */
 		reln->smgr_owner = NULL;
 		reln->smgr_targblock = InvalidBlockNumber;
 		for (int i = 0; i <= MAX_FORKNUM; ++i)
 			reln->smgr_cached_nblocks[i] = InvalidBlockNumber;
 
-		reln->smgr_which = MdSMgrId;	/* we only have md.c at present */
+		/*
+		 * There is a chicken-and-egg problem in determining which storage
+		 * manager to use for the built-in tablespaces: pg_global holds
+		 * pg_tablespace, the very catalog we would consult for the answer.
+		 *
+		 * For now, the built-in tablespaces (pg_global and pg_default)
+		 * therefore always use the md.c smgr (MD_SMGR_NAME).
+		 */
+		if (tspid == GLOBALTABLESPACE_OID || tspid == DEFAULTTABLESPACE_OID)
+			reln->smgr_which = get_smgr_id(MD_SMGR_NAME, false);
+		else
+			reln->smgr_which = get_tablespace_smgrid(tspid);
 
 		/* implementation-specific initialization */
 		smgrsw[reln->smgr_which].smgr_open(reln);
@@ -722,6 +735,61 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
 	smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
 }
 
+static const char *recent_smgrname = NULL;
+static SMgrId recent_smgrid = InvalidSmgrId;
+
+static SMgrId get_smgr_by_name(const char *smgrname, bool missing_ok)
+{
+	if (recent_smgrname != NULL && strcmp(smgrname, recent_smgrname) == 0)
+		return recent_smgrid;
+
+	for (SMgrId id = 0; id < NSmgr; id++)
+	{
+		f_smgr *smgr = &smgrsw[id];
+
+		if (strcmp(smgrname, smgr->name) == 0)
+		{
+			recent_smgrname = smgr->name;
+			recent_smgrid = id;
+			return id;
+		}
+	}
+
+	if (missing_ok)
+		return InvalidSmgrId;
+
+	ereport(ERROR,
+			(errcode(ERRCODE_INVALID_NAME),
+			 errmsg("invalid smgr '%s'", smgrname)));
+}
+
+
+SMgrId get_smgr_id(const char *smgrname, bool missing_ok)
+{
+	return get_smgr_by_name(smgrname, missing_ok);
+}
+
+void smgrvalidatetspopts(const char *smgrname, List *opts)
+{
+	SMgrId smgrid = get_smgr_by_name(smgrname, false);
+
+	smgrsw[smgrid].smgr_validate_tspopts(opts);
+}
+
+void smgrcreatetsp(const char *smgrname, Oid tsp, List *opts, bool isredo)
+{
+	SMgrId smgrid = get_smgr_by_name(smgrname, false);
+
+	smgrsw[smgrid].smgr_create_tsp(tsp, opts, isredo);
+}
+
+void smgrdroptsp(const char *smgrname, Oid tsp, bool isredo)
+{
+	SMgrId smgrid = get_smgr_by_name(smgrname, false);
+
+	smgrsw[smgrid].smgr_drop_tsp(tsp, isredo);
+}
+
 /*
  * AtEOXact_SMgr
  *
diff --git a/src/backend/utils/cache/spccache.c b/src/backend/utils/cache/spccache.c
index 136fd737d3..ce7e403b53 100644
--- a/src/backend/utils/cache/spccache.c
+++ b/src/backend/utils/cache/spccache.c
@@ -24,6 +24,8 @@
 #include "miscadmin.h"
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
+#include "storage/smgr.h"
+#include "storage/md.h"
 #include "utils/catcache.h"
 #include "utils/hsearch.h"
 #include "utils/inval.h"
@@ -38,6 +40,7 @@ static HTAB *TableSpaceCacheHash = NULL;
 typedef struct
 {
 	Oid			oid;			/* lookup key - must be first */
+	SMgrId		smgrid;			/* cached storage manager id */
 	TableSpaceOpts *opts;		/* options, or NULL if none */
 } TableSpaceCacheEntry;
 
@@ -98,7 +101,7 @@ InitializeTableSpaceCache(void)
 
 /*
  * get_tablespace
- *		Fetch TableSpaceCacheEntry structure for a specified table OID.
+ *		Fetch TableSpaceCacheEntry structure for a specified tablespace OID.
  *
  * Pointers returned by this function should not be stored, since a cache
  * flush will invalidate them.
@@ -109,6 +112,7 @@ get_tablespace(Oid spcid)
 	TableSpaceCacheEntry *spc;
 	HeapTuple	tp;
 	TableSpaceOpts *opts;
+	SMgrId		smgrid;
 
 	/*
 	 * Since spcid is always from a pg_class tuple, InvalidOid implies the
@@ -135,18 +139,32 @@ get_tablespace(Oid spcid)
 	 */
 	tp = SearchSysCache1(TABLESPACEOID, ObjectIdGetDatum(spcid));
 	if (!HeapTupleIsValid(tp))
+	{
 		opts = NULL;
+		smgrid = InvalidSmgrId;
+	}
 	else
 	{
 		Datum		datum;
 		bool		isNull;
+		char	   *smgrname;
+
+		smgrname = NameStr(*DatumGetName(SysCacheGetAttr(TABLESPACEOID,
+														 tp,
+														 Anum_pg_tablespace_spcsmgr,
+														 &isNull)));
+
+		Assert(!isNull);
+		smgrid = get_smgr_id(smgrname, false);
 
 		datum = SysCacheGetAttr(TABLESPACEOID,
 								tp,
 								Anum_pg_tablespace_spcoptions,
 								&isNull);
 		if (isNull)
+		{
 			opts = NULL;
+		}
 		else
 		{
 			bytea	   *bytea_opts = tablespace_reloptions(datum, false);
@@ -167,6 +185,8 @@ get_tablespace(Oid spcid)
 											   HASH_ENTER,
 											   NULL);
 	spc->opts = opts;
+	spc->smgrid = smgrid;
+
 	return spc;
 }
 
@@ -235,3 +255,19 @@ get_tablespace_maintenance_io_concurrency(Oid spcid)
 	else
 		return spc->opts->maintenance_io_concurrency;
 }
+
+/*
+ * get_tablespace_smgrid
+ */
+SMgrId
+get_tablespace_smgrid(Oid spcid)
+{
+	TableSpaceCacheEntry *spc;
+
+	if (spcid == GLOBALTABLESPACE_OID || spcid == DEFAULTTABLESPACE_OID)
+		return get_smgr_id(MD_SMGR_NAME, false);
+
+	spc = get_tablespace(spcid);
+
+	return spc->smgrid;
+}
diff --git a/src/include/catalog/pg_tablespace.dat b/src/include/catalog/pg_tablespace.dat
index 9fbc98a44d..5e20429619 100644
--- a/src/include/catalog/pg_tablespace.dat
+++ b/src/include/catalog/pg_tablespace.dat
@@ -13,8 +13,10 @@
 [
 
 { oid => '1663', oid_symbol => 'DEFAULTTABLESPACE_OID',
-  spcname => 'pg_default', spcacl => '_null_', spcoptions => '_null_' },
+  spcname => 'pg_default', spcacl => '_null_', spcsmgr => 'md',
+  spcoptions => '_null_' },
 { oid => '1664', oid_symbol => 'GLOBALTABLESPACE_OID',
-  spcname => 'pg_global', spcacl => '_null_', spcoptions => '_null_' },
+  spcname => 'pg_global', spcacl => '_null_', spcsmgr => 'md',
+  spcoptions => '_null_' },
 
 ]
diff --git a/src/include/catalog/pg_tablespace.h b/src/include/catalog/pg_tablespace.h
index ea1593d874..9385933c05 100644
--- a/src/include/catalog/pg_tablespace.h
+++ b/src/include/catalog/pg_tablespace.h
@@ -30,6 +30,7 @@ CATALOG(pg_tablespace,1213,TableSpaceRelationId) BKI_SHARED_RELATION
 {
 	Oid			oid;			/* oid */
 	NameData	spcname;		/* tablespace name */
+	NameData	spcsmgr;		/* tablespace storage manager */
 
 	/* owner of tablespace */
 	Oid			spcowner BKI_DEFAULT(POSTGRES) BKI_LOOKUP(pg_authid);
diff --git a/src/include/commands/tablespace.h b/src/include/commands/tablespace.h
index f1961c1813..15220ffb99 100644
--- a/src/include/commands/tablespace.h
+++ b/src/include/commands/tablespace.h
@@ -28,7 +28,8 @@ extern PGDLLIMPORT bool allow_in_place_tablespaces;
 typedef struct xl_tblspc_create_rec
 {
 	Oid			ts_id;
-	char		ts_path[FLEXIBLE_ARRAY_MEMBER]; /* null-terminated string */
+	NameData	ts_smgr;
+	char		ts_smgropts[FLEXIBLE_ARRAY_MEMBER];
 } xl_tblspc_create_rec;
 
 typedef struct xl_tblspc_drop_rec
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index b3bec90e52..e167acec7d 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2613,8 +2613,9 @@ typedef struct CreateTableSpaceStmt
 {
 	NodeTag		type;
 	char	   *tablespacename;
+	char	   *smgr;
+	List	   *smgropts; /* list of DefElem nodes */
 	RoleSpec   *owner;
-	char	   *location;
 	List	   *options;
 } CreateTableSpaceStmt;
 
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index beeddfd373..a397aa1c10 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -21,6 +21,8 @@
 
 /* registration function for md storage manager */
 extern void mdsmgr_register(void);
+
+#define MD_SMGR_NAME "md"
 extern SMgrId MdSMgrId;
 
 /* md storage manager functionality */
@@ -55,4 +57,12 @@ extern int	mdsyncfiletag(const FileTag *ftag, char *path);
 extern int	mdunlinkfiletag(const FileTag *ftag, char *path);
 extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate);
 
+/* md tablespace callbacks */
+extern void mdvalidatetspopts(List *opts);
+extern void mdcreatetsp(Oid tsp, List *opts, bool isredo);
+extern void mddroptsp(Oid tsp, bool isredo);
+extern void create_tablespace_directories(const char *location,
+										  const Oid tablespaceoid);
+
+
 #endif							/* MD_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 5ad1d50e0c..12a9b5f00e 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -15,12 +15,18 @@
 #define SMGR_H
 
 #include "lib/ilist.h"
+#include "nodes/pg_list.h"
 #include "storage/block.h"
 #include "storage/relfilelocator.h"
 
-typedef uint8 SMgrId;
+/*
+ * volatile ID of the smgr. Across various configurations IDs may vary,
+ * true identity is the name of each smgr. 
+ */
+typedef int SMgrId;
 
-#define MaxSMgrId UINT8_MAX
+#define MaxSMgrId		INT_MAX
+#define InvalidSmgrId	(-1)
 
 /*
  * smgr.c maintains a table of SMgrRelation objects, which are essentially
@@ -113,8 +119,13 @@ typedef struct f_smgr
 	void		(*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
 								  BlockNumber nblocks);
 	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+
+	void		(*smgr_validate_tspopts) (List *tspopts);
+	void		(*smgr_create_tsp) (Oid tspoid, List *tspopts, bool isredo);
+	void		(*smgr_drop_tsp) (Oid tspoid, bool isredo);
 } f_smgr;
 
+extern SMgrId get_smgr_id(const char *smgrname, bool missing_ok);
 extern SMgrId smgr_register(const f_smgr *smgr, Size smgrrelation_size);
 
 extern void smgrinit(void);
@@ -147,6 +158,11 @@ extern BlockNumber smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum);
 extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum,
 						 int nforks, BlockNumber *nblocks);
 extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
+
+extern void smgrvalidatetspopts(const char *smgrname, List *opts);
+extern void smgrcreatetsp(const char *smgrname, Oid tsp, List *opts, bool isredo);
+extern void smgrdroptsp(const char *smgrname, Oid tsp, bool isredo);
+
 extern void AtEOXact_SMgr(void);
 extern bool ProcessBarrierSmgrRelease(void);
 
diff --git a/src/include/utils/spccache.h b/src/include/utils/spccache.h
index c6c754a2ec..6569452e91 100644
--- a/src/include/utils/spccache.h
+++ b/src/include/utils/spccache.h
@@ -12,10 +12,12 @@
  */
 #ifndef SPCCACHE_H
 #define SPCCACHE_H
+#include "storage/smgr.h"
 
 extern void get_tablespace_page_costs(Oid spcid, float8 *spc_random_page_cost,
 									  float8 *spc_seq_page_cost);
 extern int	get_tablespace_io_concurrency(Oid spcid);
 extern int	get_tablespace_maintenance_io_concurrency(Oid spcid);
+extern SMgrId get_tablespace_smgrid(Oid spcid);
 
 #endif							/* SPCCACHE_H */
-- 
2.39.0

#2Andres Freund
andres@anarazel.de
In reply to: Matthias van de Meent (#1)
Re: Extensible storage manager API - SMGR hook Redux

Hi,

On 2023-06-30 14:26:44 +0200, Matthias van de Meent wrote:

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 4c49393fc5..8685b9fde6 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1002,6 +1002,11 @@ PostmasterMain(int argc, char *argv[])
*/
ApplyLauncherRegister();
+	/*
+	 * Register built-in managers that are not part of static arrays
+	 */
+	register_builtin_dynamic_managers();
+
/*
* process any libraries that should be preloaded at postmaster start
*/

That doesn't strike me as a good place to initialize this, we'll need it in
multiple places that way. How about putting it into BaseInit()?

-static const f_smgr smgrsw[] = {
+static f_smgr *smgrsw;

This adds another level of indirection. I would rather limit the number of
registerable smgrs than do that.
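
To make the alternative concrete, here is a minimal sketch of a registry with a fixed upper bound, so the table stays a statically sized array and indexing it does not first have to load a pointer. Everything in it (`demo_smgr`, `demo_smgr_register`, the cap of 8) is hypothetical and only loosely modelled on `f_smgr` and `smgr_register`; it is not code from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-in for f_smgr. */
typedef struct demo_smgr
{
	const char *name;
	void		(*smgr_init) (void);
} demo_smgr;

typedef int DemoSMgrId;

#define DEMO_MAX_SMGRS		8	/* assumed cap, not a value from the thread */
#define DEMO_INVALID_SMGR	(-1)

/* Statically sized table: no separate allocation, no base pointer to load. */
demo_smgr	demo_smgrsw[DEMO_MAX_SMGRS];
int			demo_nsmgrs = 0;

/*
 * Copy the callbacks into the next free slot and return its index.
 * Returns DEMO_INVALID_SMGR when the table is full or the name is taken.
 */
DemoSMgrId
demo_smgr_register(const demo_smgr *smgr)
{
	assert(demo_nsmgrs >= 0 && demo_nsmgrs <= DEMO_MAX_SMGRS);

	if (smgr == NULL || smgr->name == NULL)
		return DEMO_INVALID_SMGR;
	if (demo_nsmgrs >= DEMO_MAX_SMGRS)
		return DEMO_INVALID_SMGR;

	for (int i = 0; i < demo_nsmgrs; i++)
	{
		if (strcmp(demo_smgrsw[i].name, smgr->name) == 0)
			return DEMO_INVALID_SMGR;
	}

	demo_smgrsw[demo_nsmgrs] = *smgr;
	return demo_nsmgrs++;
}
```

The trade-off is the one named above: a hard limit on the number of registered managers in exchange for one less dereference on every smgr call.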

+SMgrId
+smgr_register(const f_smgr *smgr, Size smgrrelation_size)
+{
+	MemoryContextSwitchTo(old);
+
+	pg_compiler_barrier();

Huh, what's that about?

@@ -59,14 +63,8 @@ typedef struct SMgrRelationData
* Fields below here are intended to be private to smgr.c and its
* submodules.  Do not touch them from elsewhere.
*/
-	int			smgr_which;		/* storage manager selector */
-
-	/*
-	 * for md.c; per-fork arrays of the number of open segments
-	 * (md_num_open_segs) and the segments themselves (md_seg_fds).
-	 */
-	int			md_num_open_segs[MAX_FORKNUM + 1];
-	struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];
+	SMgrId		smgr_which;		/* storage manager selector */
+	int			smgrrelation_size;	/* size of this struct, incl. smgr-specific data */

It looked to me like you determined this globally - why do we need it in every
entry then?

Greetings,

Andres Freund

#3Tristan Partin
tristan@neon.tech
In reply to: Matthias van de Meent (#1)
Re: Extensible storage manager API - SMGR hook Redux

Subject: [PATCH v1 1/2] Expose f_smgr to extensions for manual implementation

From what I can see, all the md* APIs that were exposed in md.h can now
be made static in md.c. The only other references to those APIs were in
smgr.c.

Subject: [PATCH v1 2/2] Prototype: Allow tablespaces to specify which SMGR
they use

-typedef uint8 SMgrId;
+/*
+ * volatile ID of the smgr. Across various configurations IDs may vary,
+ * true identity is the name of each smgr.
+ */
+typedef int SMgrId;
-#define MaxSMgrId UINT8_MAX
+#define MaxSMgrId              INT_MAX

In a future revision of this patch, seems worthwhile to just start as
int instead of a uint8 to avoid this song and dance. Maybe int8 instead
of int?

+static SMgrId recent_smgrid = -1;

You could use InvalidSmgrId here.

+void smgrvalidatetspopts(const char *smgrname, List *opts)
+{
+       SMgrId smgrid = get_smgr_by_name(smgrname, false);
+
+       smgrsw[smgrid].smgr_validate_tspopts(opts);
+}
+
+void smgrcreatetsp(const char *smgrname, Oid tsp, List *opts, bool isredo)
+{
+       SMgrId smgrid = get_smgr_by_name(smgrname, false);
+
+       smgrsw[smgrid].smgr_create_tsp(tsp, opts, isredo);
+}
+
+void smgrdroptsp(const char *smgrname, Oid tsp, bool isredo)
+{
+       SMgrId smgrid = get_smgr_by_name(smgrname, false);
+
+       smgrsw[smgrid].smgr_drop_tsp(tsp, isredo);
+}

Do you not need to check if smgrid is the InvalidSmgrId? I didn't see
any other validation anywhere.
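
For illustration, a minimal sketch of the kind of check I mean, assuming the lookup can hand back the invalid ID instead of raising an error itself (if the lookup with missing_ok = false already errors out, the check is redundant). The names (`demo_get_smgr_id`, `demo_smgrvalidatetspopts`, the string-typed options) are made up for the sketch and are not the patch's API; it returns false where real code would presumably ereport(ERROR).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef int DemoSMgrId;

#define DEMO_INVALID_SMGR	(-1)

/* Hypothetical stand-in for f_smgr with a single optional callback. */
typedef struct demo_smgr
{
	const char *name;
	bool		(*validate_tspopts) (const char *opts);	/* may be NULL */
} demo_smgr;

/* Toy validator: accept any non-empty option string. */
static bool
md_validate(const char *opts)
{
	return opts != NULL && opts[0] != '\0';
}

static const demo_smgr demo_smgrsw[] = {
	{.name = "md", .validate_tspopts = md_validate},
	{.name = "noopts", .validate_tspopts = NULL},
};

#define DEMO_NSMGRS ((int) (sizeof(demo_smgrsw) / sizeof(demo_smgrsw[0])))

/* Look up an smgr by name; DEMO_INVALID_SMGR when no such smgr exists. */
DemoSMgrId
demo_get_smgr_id(const char *name)
{
	for (int i = 0; i < DEMO_NSMGRS; i++)
	{
		if (strcmp(demo_smgrsw[i].name, name) == 0)
			return i;
	}
	return DEMO_INVALID_SMGR;
}

/* Validate the ID before indexing the table or calling through it. */
bool
demo_smgrvalidatetspopts(const char *name, const char *opts)
{
	DemoSMgrId	id = demo_get_smgr_id(name);

	if (id == DEMO_INVALID_SMGR)
		return false;			/* unknown smgr: never index with -1 */
	assert(id >= 0 && id < DEMO_NSMGRS);

	/* Assumption: a missing callback means there is nothing to validate. */
	if (demo_smgrsw[id].validate_tspopts == NULL)
		return true;
	return demo_smgrsw[id].validate_tspopts(opts);
}
```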

+       char       *smgr;
+       List       *smgropts; /* list of DefElem nodes */

smgrname would probably work better alongside tablespacename in that
struct.

@@ -221,7 +229,7 @@ mdexists(SMgrRelation reln, ForkNumber forknum)
if (!InRecovery)
mdclose(reln, forknum);

-       return (mdopenfork(reln, forknum, EXTENSION_RETURN_NULL) != NULL);
+       return (mdopenfork(mdreln, forknum, EXTENSION_RETURN_NULL) != NULL);
}

Was this a victim of a bad rebase? Seems like it belongs in the previous
patch.

+void mddroptsp(Oid tsp, bool isredo)
+{
+
+}

Some functions in this file have the return type on the previous line.

This is a pretty slick patchset. Excited to read more discussion and see
how it evolves.

--
Tristan Partin
Neon (https://neon.tech)

#4Tristan Partin
tristan@neon.tech
In reply to: Matthias van de Meent (#1)
1 attachment(s)
Re: Extensible storage manager API - SMGR hook Redux

Found these warnings when compiling with only 0001 applied.

[1166/2337] Compiling C object src/backend/postgres_lib.a.p/storage_smgr_md.c.o
../src/backend/storage/smgr/md.c: In function ‘mdexists’:
../src/backend/storage/smgr/md.c:224:28: warning: passing argument 1 of ‘mdopenfork’ from incompatible pointer type [-Wincompatible-pointer-types]
224 | return (mdopenfork(reln, forknum, EXTENSION_RETURN_NULL) != NULL);
| ^~~~
| |
| SMgrRelation {aka SMgrRelationData *}
../src/backend/storage/smgr/md.c:167:43: note: expected ‘MdSMgrRelation’ {aka ‘MdSMgrRelationData *’} but argument is of type ‘SMgrRelation’ {aka ‘SMgrRelationData *’}
167 | static MdfdVec *mdopenfork(MdSMgrRelation reln, ForkNumber forknum, int behavior);
| ~~~~~~~~~~~~~~~^~~~
../src/backend/storage/smgr/md.c: In function ‘mdcreate’:
../src/backend/storage/smgr/md.c:287:40: warning: passing argument 1 of ‘register_dirty_segment’ from incompatible pointer type [-Wincompatible-pointer-types]
287 | register_dirty_segment(reln, forknum, mdfd);
| ^~~~
| |
| SMgrRelation {aka SMgrRelationData *}
../src/backend/storage/smgr/md.c:168:51: note: expected ‘MdSMgrRelation’ {aka ‘MdSMgrRelationData *’} but argument is of type ‘SMgrRelation’ {aka ‘SMgrRelationData *’}
168 | static void register_dirty_segment(MdSMgrRelation reln, ForkNumber forknum,

Here is a diff to be applied to 0001 which fixes the warnings that get
generated when compiling. I did see that one of the warnings gets fixed
in 0002 (the mdexists() one). I am assuming that change was just missed
while rebasing the patchset or something. I did not see a fix for
mdcreate() in 0002, however.

--
Tristan Partin
Neon (https://neon.tech)

Attachments:

smgr.diff (text/x-patch; charset=utf-8)
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index f3e4768160..fdc9f62fdf 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -213,6 +213,8 @@ mdinit(void)
 bool
 mdexists(SMgrRelation reln, ForkNumber forknum)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
+
 	/*
 	 * Close it first, to ensure that we notice if the fork has been unlinked
 	 * since we opened it.  As an optimization, we can skip that in recovery,
@@ -221,7 +223,7 @@ mdexists(SMgrRelation reln, ForkNumber forknum)
 	if (!InRecovery)
 		mdclose(reln, forknum);
 
-	return (mdopenfork(reln, forknum, EXTENSION_RETURN_NULL) != NULL);
+	return (mdopenfork(mdreln, forknum, EXTENSION_RETURN_NULL) != NULL);
 }
 
 /*
@@ -284,7 +286,7 @@ mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
 	mdfd->mdfd_segno = 0;
 
 	if (!SmgrIsTemp(reln))
-		register_dirty_segment(reln, forknum, mdfd);
+		register_dirty_segment(mdreln, forknum, mdfd);
 }
 
 /*
#5Kirill Reshke
reshkekirill@gmail.com
In reply to: Matthias van de Meent (#1)
Fwd: Extensible storage manager API - SMGR hook Redux

Sorry for double-posting, I accidentally replied to Matthias, not the
mailing list :(

---------- Forwarded message ---------
From: Kirill Reshke <reshkekirill@gmail.com>
Date: Mon, 4 Dec 2023 at 19:46
Subject: Re: Extensible storage manager API - SMGR hook Redux
To: Matthias van de Meent <boekewurm+postgres@gmail.com>

Hi!

On Fri, 30 Jun 2023 at 15:27, Matthias van de Meent <
boekewurm+postgres@gmail.com> wrote:

Hi hackers,

At Neon, we've been working on removing the file system dependency
from PostgreSQL and replacing it with a distributed storage layer. For
now, we've seen most success in this by replacing the implementation
of the smgr API, but it did require some core modifications like those
proposed early last year by Anastasia [0].

As mentioned in the previous thread, there are several reasons why you
would want to use a non-default storage manager: storage-level
compression, encryption, and disk limit quotas [0]; offloading of cold
relation data was also mentioned [1].

In the thread on Anastasia's patch, Yura Sokolov mentioned that
instead of a hook-based smgr extension, a registration-based smgr
would be preferred, with integration into namespaces. Please find
attached an as of yet incomplete patch that starts to do that.

The patch is yet incomplete (as it isn't derived from Anastasia's
patch), but I would like comments on this regardless, as this is a
fairly fundamental component of PostgreSQL that is being modified, and
it is often better to get comments early in the development cycle. One
significant issue that I've seen so far is that catcache is not
guaranteed to be available in all backends that need to do smgr
operations, and I've not yet found a good solution.

Changes compared to HEAD:
- smgrsw is now dynamically allocated and grows as new storage
managers are loaded (during shared_preload_libraries)
- CREATE TABLESPACE has new optional syntax USING smgrname (option [, ...])
- tablespace storage is (planned) fully managed by smgr through some
new smgr apis

Changes compared to Anastasia's patch:
- extensions do not get to hook and replace the api of the smgr code
directly - they are hidden behind the smgr registry.

Successes:
- 0001 passes tests (make check-world)
- 0002 builds without warnings (make)

TODO:
- fix dependency failures when catcache is unavailable
- tablespace redo is currently broken with 0002
- fix tests for 0002
- ensure that pg_dump etc. works with the new tablespace storage manager
options

Looking forward to any comments, suggestions and reviews.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)

[0]
/messages/by-id/CAP4vRV6JKXyFfEOf=n+v5RGsZywAQ3CTM8ESWvgq+S87Tmgx_g@mail.gmail.com
[1]
/messages/by-id/D365F19F-BC3E-4F96-A91E-8DB13049749E@yandex-team.ru

So, the 0002 patch uses the `get_tablespace` function, which searches the
catalog to find a tablespace's SMGR id. I wonder how `smgr_redo` would work
with it? Is it possible to query the system catalog during crash recovery?
As far as I understand, the answer is "no"; correct me if I'm wrong.
Furthermore, why do we only allow a tablespace to have its own SMGR
implementation? Can we have a per-relation SMGR? Maybe we can do it in a way
similar to custom RMGRs (meaning, write the SMGR OID into WAL and use it in
crash recovery etc.)?

#6Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Kirill Reshke (#5)
Re: Extensible storage manager API - SMGR hook Redux

On Mon, 4 Dec 2023 at 17:51, Kirill Reshke <reshkekirill@gmail.com> wrote:

So, 0002 patch uses the `get_tablespace` function, which searches Catalog to tablespace SMGR id. I wonder how `smgr_redo` would work with it?

That's a very good point I hadn't considered in detail yet. Quite
clearly, the current code is wrong in assuming that the catalog is
accessible, and it should probably be stored in a way similar to
pg_filenode.map in a file managed outside the buffer pool.

Is it possible to query the system catalog during crash recovery? As far as I understand the answer is "no", correct me if I'm wrong.

Yes, you're correct, we can't access buffers like this during
recovery. That's going to need some more effort.

Furthermore, why do we only allow tablespace to have its own SMGR implementation, can we have per-relation SMGR? Maybe we can do it in a way similar to custom RMGR (meaning, write SMGR OID into WAL and use it in crash recovery etc.)?

AMs (and by extension, their RMGRs) that use Postgres' buffer pool
have control over how they want to layout their blocks and files, but
generally don't care about where those blocks and files are located,
as long as they _can_ be retrieved.

Tablespaces, however, describe 'drives' or 'storage pools' in which
the tables/relations are stored, which to me seems to be the more
logical place to configure the SMGR abstraction of how and where to
store the actual data, as SMGRs manage the low-level relation block IO
(= file accesses), and tablespaces manage where files are stored.

Note that nothing prevents you from using one tablespace (thus
different SMGR) per relation, apart from bloated catalogs and the
superuser permissions required for creating those tablespaces. It'd be
difficult to manage, but not impossible.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#7Kirill Reshke
reshkekirill@gmail.com
In reply to: Matthias van de Meent (#6)
Re: Extensible storage manager API - SMGR hook Redux

On Mon, 4 Dec 2023 at 22:21, Matthias van de Meent <
boekewurm+postgres@gmail.com> wrote:

On Mon, 4 Dec 2023 at 17:51, Kirill Reshke <reshkekirill@gmail.com> wrote:

So, 0002 patch uses the `get_tablespace` function, which searches
Catalog to tablespace SMGR id. I wonder how `smgr_redo` would work with it?

That's a very good point I hadn't considered in detail yet. Quite
clearly, the current code is wrong in assuming that the catalog is
accessible, and it should probably be stored in a way similar to
pg_filenode.map in a file managed outside the buffer pool.

Hmm, pg_filenode.map is a nice idea. So, simply maintain TableSpaceOId ->
smgr id mapping in a separate file and update the whole file on any
changes, right?
Looks reasonable to me, but it is clear that this solution can be really
slow in some patterns, like if we create many-many tablespaces (the way you
suggested it in the per-relation SMGR feature). Maybe we can store data in
files somehow separately, and only update one chunk per operation.

Anyway, if we use a `pg_filenode.map`-like solution, we need to reuse its
code infrastructure, right? For example, it seems that the code that
calculates checksums can be reused.
So, we need to refactor code here, define something like a FileMap API maybe.
Or is it not really worth it? We can just write similar code twice.
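
To make the idea concrete, here is a minimal sketch of such a map: a small fixed-size image holding tablespace OID -> smgr name entries, rewritten as a whole and protected by a checksum, in the spirit of pg_filenode.map. All names, the magic value, the capacity, and the layout are assumptions made for the sketch. The checksum is a plain FNV-1a stand-in rather than the CRC that PostgreSQL uses, and the sketch only builds and verifies the in-memory image; writing and fsyncing it is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAP_MAGIC	0x53504D31u	/* arbitrary value for this sketch */
#define DEMO_MAP_MAX	16			/* assumed capacity */
#define DEMO_NAMELEN	64			/* assumed maximum name length */

typedef struct DemoSpcMapEntry
{
	uint32_t	spcoid;					/* tablespace OID */
	char		smgrname[DEMO_NAMELEN];	/* zero-padded smgr name */
} DemoSpcMapEntry;

typedef struct DemoSpcMapFile
{
	uint32_t	magic;
	uint32_t	nentries;
	DemoSpcMapEntry entries[DEMO_MAP_MAX];
	uint32_t	checksum;		/* covers every byte before this field */
} DemoSpcMapFile;

/* FNV-1a, used here only as a simple stand-in for a real CRC. */
static uint32_t
demo_checksum(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint32_t	h = 2166136261u;

	for (size_t i = 0; i < len; i++)
	{
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

/* Recompute the checksum after any change to the image. */
static void
demo_map_seal(DemoSpcMapFile *map)
{
	map->checksum = demo_checksum(map, offsetof(DemoSpcMapFile, checksum));
}

void
demo_map_init(DemoSpcMapFile *map)
{
	memset(map, 0, sizeof(*map));	/* also zeroes any padding bytes */
	map->magic = DEMO_MAP_MAGIC;
	demo_map_seal(map);
}

/* Insert or update one mapping; the whole image is then resealed. */
bool
demo_map_set(DemoSpcMapFile *map, uint32_t spcoid, const char *smgrname)
{
	size_t		len = strlen(smgrname);
	DemoSpcMapEntry *entry = NULL;

	if (len >= DEMO_NAMELEN)
		return false;
	for (uint32_t i = 0; i < map->nentries; i++)
	{
		if (map->entries[i].spcoid == spcoid)
			entry = &map->entries[i];
	}
	if (entry == NULL)
	{
		if (map->nentries >= DEMO_MAP_MAX)
			return false;
		entry = &map->entries[map->nentries++];
		entry->spcoid = spcoid;
	}
	memset(entry->smgrname, 0, DEMO_NAMELEN);
	memcpy(entry->smgrname, smgrname, len);
	demo_map_seal(map);
	return true;
}

/* Check magic, bounds and checksum, as a reader would after loading. */
bool
demo_map_verify(const DemoSpcMapFile *map)
{
	return map->magic == DEMO_MAP_MAGIC &&
		map->nentries <= DEMO_MAP_MAX &&
		map->checksum == demo_checksum(map, offsetof(DemoSpcMapFile, checksum));
}

/* Return the smgr name for a tablespace, or NULL if it is not mapped. */
const char *
demo_map_lookup(const DemoSpcMapFile *map, uint32_t spcoid)
{
	assert(map->nentries <= DEMO_MAP_MAX);
	for (uint32_t i = 0; i < map->nentries; i++)
	{
		if (map->entries[i].spcoid == spcoid)
			return map->entries[i].smgrname;
	}
	return NULL;
}
```

Since the image is rewritten as a whole on every change, the cost per update grows with the capacity, which is the slowdown I am worried about with very many tablespaces.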

#8Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Kirill Reshke (#7)
Re: Extensible storage manager API - SMGR hook Redux

On Mon, 4 Dec 2023 at 22:03, Kirill Reshke <reshkekirill@gmail.com> wrote:

On Mon, 4 Dec 2023 at 22:21, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:

On Mon, 4 Dec 2023 at 17:51, Kirill Reshke <reshkekirill@gmail.com> wrote:

So, 0002 patch uses the `get_tablespace` function, which searches Catalog to tablespace SMGR id. I wonder how `smgr_redo` would work with it?

That's a very good point I hadn't considered in detail yet. Quite
clearly, the current code is wrong in assuming that the catalog is
accessible, and it should probably be stored in a way similar to
pg_filenode.map in a file managed outside the buffer pool.

Hmm, pg_filenode.map is a nice idea. So, simply maintain TableSpaceOId -> smgr id mapping in a separate file and update the whole file on any changes, right?
Looks reasonable to me, but it is clear that this solution can be really slow in some patterns, like if we create many-many tablespaces(the way you suggested it in the per-relation SMGR feature). Maybe we can store data in files somehow separately, and only update one chunk per operation.

Yes, but that's a later issue... I'm not sure many-many tablespaces is
actually a good thing. There are already very few reasons to store
tables in more than just the default tablespace. For temporary
relations, there is indeed a GUC to automatically put them into one
tablespace; and I can see a similar thing being useful for unlogged
relations, too. Then I can see high-performance local disks vs
lower-performance (but cheaper) local disks also as something
reasonable. But that only gets us to ~6 tablespaces, assuming separate
tablespaces for each combination of (normal, temp, unlogged) * (fast,
cheap). I'm not sure there are many other reasons to add tablespaces,
let alone making one for each table.

Note that you can select which tablespace a table is stored in, so I
see very little reason to actually do something about large numbers of
tablespaces being prohibitively expensive performance-wise.

Why do you want to have a whole new storage configuration for each of
your relations?

Anyway, if we use a `pg_filenode.map` - like solution, we need to reuse its code infrastructure, right? For example, it seems that code that calculates checksums can be reused.
So, we need to refactor code here, define something like FileMap API maybe. Or is it not really worth it? We can just write similar code twice.

I'm not sure about that. I really doubt we'll need things that are
that similar: right now, the tablespace->smgr mapping could be
considered to be implied by the symlinks in /pg_tblspc/. Non-MD
tablespaces could add a file <oid>.tblspc that details their
configuration, which would also fix the issue of spcoid->smgr mapping.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#9Tristan Partin
tristan@neon.tech
In reply to: Matthias van de Meent (#1)
4 attachment(s)
Re: Extensible storage manager API - SMGR hook Redux

Thought I would show off what is possible with this patchset.

Heikki, a couple of months ago in our internal Slack, said:

[I would like] a debugging tool that checks that we're not missing any
fsyncs. I bumped into a few missing fsync bugs with unlogged tables
lately and a tool like that would've been very helpful.

My task was to create such a tool, and off I went. I started with the
storage manager extension patch that Matthias sent to the list last
year [0].

Andres, in another thread [1], said:

I've been thinking that we need a validation layer for fsyncs, it's too hard
to get right without testing, and crash testing is not likely enough to catch
problems quickly / resource intensive.

My thought was that we could keep a shared hash table of all files created /
dirtied at the fd layer, with the filename as key and the value the current
LSN. We'd delete files from it when they're fsynced. At checkpoints we go
through the list and see if there's any files from before the redo that aren't
yet fsynced. All of this in assert builds only, of course.

I took this idea and ran with it. I call it the fsync_checker™️. It is an
extension that prints relations that haven't been fsynced prior to
a CHECKPOINT. Note that this idea doesn't work in practice because
relations might not be fsynced, but they might be WAL-logged, like in
the case of createdb. See log_smgrcreate(). I can't think of an easy way
to solve this problem looking at the codebase as it stands.
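
The bookkeeping in Andres' description can be sketched roughly as follows. This is not the code of the extension: the names are made up, a small linear array stands in for the shared hash table, and keeping the oldest LSN when a file is dirtied again is an assumption of the sketch. It also ignores the WAL-logged case mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_FILES	32		/* assumed capacity of the toy table */
#define DEMO_PATHLEN	128

typedef uint64_t DemoLSN;

typedef struct DemoDirtyFile
{
	bool		used;
	char		path[DEMO_PATHLEN];	/* key: file name */
	DemoLSN		dirty_lsn;			/* LSN when the file was first dirtied */
} DemoDirtyFile;

/* Linear array standing in for the shared hash table. */
DemoDirtyFile demo_dirty[DEMO_MAX_FILES];

/* Record that a file was created or dirtied at the given LSN. */
bool
demo_note_dirty(const char *path, DemoLSN lsn)
{
	DemoDirtyFile *free_slot = NULL;

	assert(path != NULL);
	if (strlen(path) >= DEMO_PATHLEN)
		return false;

	for (int i = 0; i < DEMO_MAX_FILES; i++)
	{
		if (demo_dirty[i].used && strcmp(demo_dirty[i].path, path) == 0)
			return true;		/* already tracked: keep the older LSN */
		if (!demo_dirty[i].used && free_slot == NULL)
			free_slot = &demo_dirty[i];
	}
	if (free_slot == NULL)
		return false;			/* table full */

	free_slot->used = true;
	strcpy(free_slot->path, path);
	free_slot->dirty_lsn = lsn;
	return true;
}

/* An fsync of the file removes it from the table. */
void
demo_note_fsync(const char *path)
{
	for (int i = 0; i < DEMO_MAX_FILES; i++)
	{
		if (demo_dirty[i].used && strcmp(demo_dirty[i].path, path) == 0)
			demo_dirty[i].used = false;
	}
}

/* At checkpoint: how many files dirtied before the redo point are unsynced? */
int
demo_count_unsynced_before(DemoLSN redo_lsn)
{
	int			count = 0;

	for (int i = 0; i < DEMO_MAX_FILES; i++)
	{
		if (demo_dirty[i].used && demo_dirty[i].dirty_lsn < redo_lsn)
			count++;
	}
	return count;
}
```

A non-zero count at checkpoint time is what the tool would report as a suspected missing fsync.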

Here is a description of the patches:

0001:

This is essentially just the patch that Matthias posted earlier, but
rebased and adjusted a little bit so storage managers can "inherit" from
other storage managers.
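
As a rough illustration of what "inherit" can mean here, the sketch below copies a parent's callback table, overrides one callback, and calls through to the parent from the override. The struct and function names are hypothetical and much smaller than the real f_smgr; the attached patch may well do this differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down callback table. */
typedef struct demo_smgr
{
	const char *name;
	int			(*smgr_write) (int blocknum);
	int			(*smgr_immedsync) (void);
} demo_smgr;

/* Toy "md" parent that only counts calls. */
int			demo_md_writes = 0;
int			demo_md_syncs = 0;

int
demo_md_write(int blocknum)
{
	(void) blocknum;
	demo_md_writes++;
	return 0;
}

int
demo_md_immedsync(void)
{
	demo_md_syncs++;
	return 0;
}

const demo_smgr demo_md = {"md", demo_md_write, demo_md_immedsync};

/* Child state: a copy of the parent's table plus its own counter. */
static demo_smgr demo_parent;
int			demo_tracked_writes = 0;

/* Overridden callback: do the extra work, then defer to the parent. */
int
demo_tracking_write(int blocknum)
{
	assert(demo_parent.smgr_write != NULL);
	demo_tracked_writes++;
	return demo_parent.smgr_write(blocknum);
}

/* Build a child table: start from the parent, override what differs. */
demo_smgr
demo_make_tracking_smgr(const demo_smgr *parent)
{
	demo_smgr	child = *parent;

	demo_parent = *parent;
	child.name = "tracking";
	child.smgr_write = demo_tracking_write;
	return child;
}
```

Callbacks that are not overridden, such as smgr_immedsync in this sketch, are simply the parent's functions.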

0002:

This is an extension of 0001, which allows for extensions to set
a global storage manager. This is pretty hacky, and if it was going to
be pulled into mainline, it would need some better protection. For
instance, only one extension should be able to set the global storage
manager. We wouldn't want extensions stepping over each other, etc.

0003:

Adds a hook for extensions to inspect a checkpoint before it actually
occurs. The fsync_checker uses it to iterate over all the relations the
extension tracks and find files that haven't been synced prior to the
completion of the checkpoint.
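
The shape of such a hook follows the usual PostgreSQL pattern of a global function pointer that an extension saves and replaces at load time. The sketch below uses made-up names (`demo_checkpoint_hook`, `demo_do_checkpoint`, `demo_ext_init`) and passes only a redo LSN; the actual hook name and arguments in 0003 are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Core side: a hook that is called before the checkpoint work is done. */
typedef void (*demo_checkpoint_hook_type) (uint64_t redo_lsn);

demo_checkpoint_hook_type demo_checkpoint_hook = NULL;
int			demo_checkpoints_done = 0;

void
demo_do_checkpoint(uint64_t redo_lsn)
{
	if (demo_checkpoint_hook != NULL)
		demo_checkpoint_hook(redo_lsn);
	demo_checkpoints_done++;	/* stands in for the real checkpoint work */
}

/* Extension side: remember the previous hook and chain to it. */
static demo_checkpoint_hook_type demo_prev_hook = NULL;
uint64_t	demo_last_seen_redo = 0;
int			demo_hook_calls = 0;

void
demo_ext_checkpoint_hook(uint64_t redo_lsn)
{
	if (demo_prev_hook != NULL)
		demo_prev_hook(redo_lsn);
	demo_last_seen_redo = redo_lsn;	/* an fsync check would run here */
	demo_hook_calls++;
}

/* Would be called from the extension's _PG_init() in a real module. */
void
demo_ext_init(void)
{
	/* Installing twice would make the hook chain to itself. */
	assert(demo_checkpoint_hook != demo_ext_checkpoint_hook);
	demo_prev_hook = demo_checkpoint_hook;
	demo_checkpoint_hook = demo_ext_checkpoint_hook;
}
```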

0004:

This is the actual fsync_checker extension itself. It must be preloaded.

Hopefully this is a good illustration of how the initial patch could be
used, even though it isn't perfect.

[0]: /messages/by-id/CAEze2WgMySu2suO_TLvFyGY3URa4mAx22WeoEicnK=PCNWEMrA@mail.gmail.com
[1]: /messages/by-id/20220127182838.ba3434dp2pe5vcia@alap3.anarazel.de

--
Tristan Partin
Neon (https://neon.tech)

Attachments:

v1-0001-Expose-f_smgr-to-extensions-for-manual-implementa.patch (text/x-patch; charset=utf-8)
From 5ffbc7c35bb3248501b2517d26f99afe02fb53d6 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Tue, 27 Jun 2023 15:59:23 +0200
Subject: [PATCH v1 1/5] Expose f_smgr to extensions for manual implementation

There are various reasons why one would want to create their own
implementation of a storage manager, among which are block-level compression,
encryption and offloading to cold storage. This patch is a first patch that
allows extensions to register their own SMgr.

Note, however, that this SMgr is not yet used - only the first SMgr to register
is used, and this is currently the md.c smgr. Future commits will include
facilities to select an SMgr for each tablespace.
---
 src/backend/postmaster/postmaster.c |   5 +
 src/backend/storage/smgr/md.c       | 172 +++++++++++++++++++---------
 src/backend/storage/smgr/smgr.c     | 129 ++++++++++-----------
 src/backend/utils/init/miscinit.c   |  13 +++
 src/include/miscadmin.h             |   1 +
 src/include/storage/md.h            |   4 +
 src/include/storage/smgr.h          |  59 ++++++++--
 7 files changed, 252 insertions(+), 131 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index feb471dd1d..a0e46fe1f2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1010,6 +1010,11 @@ PostmasterMain(int argc, char *argv[])
 	 */
 	ApplyLauncherRegister();
 
+	/*
+	 * Register built-in managers that are not part of static arrays
+	 */
+	register_builtin_dynamic_managers();
+
 	/*
 	 * process any libraries that should be preloaded at postmaster start
 	 */
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index b1e9932a29..66a93101ab 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -87,6 +87,21 @@ typedef struct _MdfdVec
 } MdfdVec;
 
 static MemoryContext MdCxt;		/* context for all MdfdVec objects */
+SMgrId MdSMgrId;
+
+typedef struct MdSMgrRelationData
+{
+	/* parent data */
+	SMgrRelationData reln;
+	/*
+	 * for md.c; per-fork arrays of the number of open segments
+	 * (md_num_open_segs) and the segments themselves (md_seg_fds).
+	 */
+	int			md_num_open_segs[MAX_FORKNUM + 1];
+	struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];
+} MdSMgrRelationData;
+
+typedef MdSMgrRelationData *MdSMgrRelation;
 
 
 /* Populate a file tag describing an md.c segment file. */
@@ -121,26 +136,52 @@ static MemoryContext MdCxt;		/* context for all MdfdVec objects */
 #define EXTENSION_DONT_OPEN			(1 << 5)
 
 
+void mdsmgr_register(void)
+{
+	/* magnetic disk */
+	f_smgr md_smgr = (f_smgr) {
+		.name = "md",
+		.smgr_init = mdinit,
+		.smgr_shutdown = NULL,
+		.smgr_open = mdopen,
+		.smgr_close = mdclose,
+		.smgr_create = mdcreate,
+		.smgr_exists = mdexists,
+		.smgr_unlink = mdunlink,
+		.smgr_extend = mdextend,
+		.smgr_zeroextend = mdzeroextend,
+		.smgr_prefetch = mdprefetch,
+		.smgr_readv = mdreadv,
+		.smgr_writev = mdwritev,
+		.smgr_writeback = mdwriteback,
+		.smgr_nblocks = mdnblocks,
+		.smgr_truncate = mdtruncate,
+		.smgr_immedsync = mdimmedsync,
+	};
+
+	MdSMgrId = smgr_register(&md_smgr, sizeof(MdSMgrRelationData));
+}
+
 /* local routines */
 static void mdunlinkfork(RelFileLocatorBackend rlocator, ForkNumber forknum,
 						 bool isRedo);
-static MdfdVec *mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior);
-static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum,
+static MdfdVec *mdopenfork(MdSMgrRelation reln, ForkNumber forknum, int behavior);
+static void register_dirty_segment(MdSMgrRelation reln, ForkNumber forknum,
 								   MdfdVec *seg);
 static void register_unlink_segment(RelFileLocatorBackend rlocator, ForkNumber forknum,
 									BlockNumber segno);
 static void register_forget_request(RelFileLocatorBackend rlocator, ForkNumber forknum,
 									BlockNumber segno);
-static void _fdvec_resize(SMgrRelation reln,
+static void _fdvec_resize(MdSMgrRelation reln,
 						  ForkNumber forknum,
 						  int nseg);
-static char *_mdfd_segpath(SMgrRelation reln, ForkNumber forknum,
+static char *_mdfd_segpath(MdSMgrRelation reln, ForkNumber forknum,
 						   BlockNumber segno);
-static MdfdVec *_mdfd_openseg(SMgrRelation reln, ForkNumber forknum,
+static MdfdVec *_mdfd_openseg(MdSMgrRelation reln, ForkNumber forknum,
 							  BlockNumber segno, int oflags);
-static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forknum,
+static MdfdVec *_mdfd_getseg(MdSMgrRelation reln, ForkNumber forknum,
 							 BlockNumber blkno, bool skipFsync, int behavior);
-static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
+static BlockNumber _mdnblocks(MdSMgrRelation reln, ForkNumber forknum,
 							  MdfdVec *seg);
 
 static inline int
@@ -173,6 +214,8 @@ mdinit(void)
 bool
 mdexists(SMgrRelation reln, ForkNumber forknum)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
+
 	/*
 	 * Close it first, to ensure that we notice if the fork has been unlinked
 	 * since we opened it.  As an optimization, we can skip that in recovery,
@@ -181,7 +224,7 @@ mdexists(SMgrRelation reln, ForkNumber forknum)
 	if (!InRecovery)
 		mdclose(reln, forknum);
 
-	return (mdopenfork(reln, forknum, EXTENSION_RETURN_NULL) != NULL);
+	return (mdopenfork(mdreln, forknum, EXTENSION_RETURN_NULL) != NULL);
 }
 
 /*
@@ -195,11 +238,13 @@ mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
 	MdfdVec    *mdfd;
 	char	   *path;
 	File		fd;
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
+	// Assert(reln->smgr_which == MdSMgrId);
 
-	if (isRedo && reln->md_num_open_segs[forknum] > 0)
+	if (isRedo && mdreln->md_num_open_segs[forknum] > 0)
 		return;					/* created and opened already... */
 
-	Assert(reln->md_num_open_segs[forknum] == 0);
+	Assert(mdreln->md_num_open_segs[forknum] == 0);
 
 	/*
 	 * We may be using the target table space for the first time in this
@@ -236,13 +281,13 @@ mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
 
 	pfree(path);
 
-	_fdvec_resize(reln, forknum, 1);
-	mdfd = &reln->md_seg_fds[forknum][0];
+	_fdvec_resize(mdreln, forknum, 1);
+	mdfd = &mdreln->md_seg_fds[forknum][0];
 	mdfd->mdfd_vfd = fd;
 	mdfd->mdfd_segno = 0;
 
 	if (!SmgrIsTemp(reln))
-		register_dirty_segment(reln, forknum, mdfd);
+		register_dirty_segment(mdreln, forknum, mdfd);
 }
 
 /*
@@ -466,6 +511,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	off_t		seekpos;
 	int			nbytes;
 	MdfdVec    *v;
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 
 	/* If this build supports direct I/O, the buffer must be I/O aligned. */
 	if (PG_O_DIRECT != 0 && PG_IO_ALIGN_SIZE <= BLCKSZ)
@@ -489,7 +535,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 						relpath(reln->smgr_rlocator, forknum),
 						InvalidBlockNumber)));
 
-	v = _mdfd_getseg(reln, forknum, blocknum, skipFsync, EXTENSION_CREATE);
+	v = _mdfd_getseg(mdreln, forknum, blocknum, skipFsync, EXTENSION_CREATE);
 
 	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
@@ -513,9 +559,9 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	}
 
 	if (!skipFsync && !SmgrIsTemp(reln))
-		register_dirty_segment(reln, forknum, v);
+		register_dirty_segment(mdreln, forknum, v);
 
-	Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
+	Assert(_mdnblocks(mdreln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
 }
 
 /*
@@ -531,6 +577,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 	MdfdVec    *v;
 	BlockNumber curblocknum = blocknum;
 	int			remblocks = nblocks;
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 
 	Assert(nblocks > 0);
 
@@ -562,7 +609,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 		else
 			numblocks = remblocks;
 
-		v = _mdfd_getseg(reln, forknum, curblocknum, skipFsync, EXTENSION_CREATE);
+		v = _mdfd_getseg(mdreln, forknum, curblocknum, skipFsync, EXTENSION_CREATE);
 
 		Assert(segstartblock < RELSEG_SIZE);
 		Assert(segstartblock + numblocks <= RELSEG_SIZE);
@@ -617,9 +664,9 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 		}
 
 		if (!skipFsync && !SmgrIsTemp(reln))
-			register_dirty_segment(reln, forknum, v);
+			register_dirty_segment(mdreln, forknum, v);
 
-		Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
+		Assert(_mdnblocks(mdreln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
 
 		remblocks -= numblocks;
 		curblocknum += numblocks;
@@ -637,7 +684,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
  * invent one out of whole cloth.
  */
 static MdfdVec *
-mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior)
+mdopenfork(MdSMgrRelation reln, ForkNumber forknum, int behavior)
 {
 	MdfdVec    *mdfd;
 	char	   *path;
@@ -647,7 +694,7 @@ mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior)
 	if (reln->md_num_open_segs[forknum] > 0)
 		return &reln->md_seg_fds[forknum][0];
 
-	path = relpath(reln->smgr_rlocator, forknum);
+	path = relpath(reln->reln.smgr_rlocator, forknum);
 
 	fd = PathNameOpenFile(path, _mdfd_open_flags());
 
@@ -682,9 +729,10 @@ mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior)
 void
 mdopen(SMgrRelation reln)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	/* mark it not open */
 	for (int forknum = 0; forknum <= MAX_FORKNUM; forknum++)
-		reln->md_num_open_segs[forknum] = 0;
+		mdreln->md_num_open_segs[forknum] = 0;
 }
 
 /*
@@ -693,7 +741,8 @@ mdopen(SMgrRelation reln)
 void
 mdclose(SMgrRelation reln, ForkNumber forknum)
 {
-	int			nopensegs = reln->md_num_open_segs[forknum];
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
+	int			nopensegs = mdreln->md_num_open_segs[forknum];
 
 	/* No work if already closed */
 	if (nopensegs == 0)
@@ -702,10 +751,10 @@ mdclose(SMgrRelation reln, ForkNumber forknum)
 	/* close segments starting from the end */
 	while (nopensegs > 0)
 	{
-		MdfdVec    *v = &reln->md_seg_fds[forknum][nopensegs - 1];
+		MdfdVec    *v = &mdreln->md_seg_fds[forknum][nopensegs - 1];
 
 		FileClose(v->mdfd_vfd);
-		_fdvec_resize(reln, forknum, nopensegs - 1);
+		_fdvec_resize(mdreln, forknum, nopensegs - 1);
 		nopensegs--;
 	}
 }
@@ -718,6 +767,7 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		   int nblocks)
 {
 #ifdef USE_PREFETCH
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 
 	Assert((io_direct_flags & IO_DIRECT_DATA) == 0);
 
@@ -730,7 +780,7 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		MdfdVec    *v;
 		int			nblocks_this_segment;
 
-		v = _mdfd_getseg(reln, forknum, blocknum, false,
+		v = _mdfd_getseg(mdreln, forknum, blocknum, false,
 						 InRecovery ? EXTENSION_RETURN_NULL : EXTENSION_FAIL);
 		if (v == NULL)
 			return false;
@@ -813,6 +863,8 @@ void
 mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		void **buffers, BlockNumber nblocks)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
+
 	while (nblocks > 0)
 	{
 		struct iovec iov[PG_IOV_MAX];
@@ -824,7 +876,7 @@ mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		size_t		transferred_this_segment;
 		size_t		size_this_segment;
 
-		v = _mdfd_getseg(reln, forknum, blocknum, false,
+		v = _mdfd_getseg(mdreln, forknum, blocknum, false,
 						 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
 
 		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
@@ -931,6 +983,8 @@ void
 mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		 const void **buffers, BlockNumber nblocks, bool skipFsync)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
+
 	/* This assert is too expensive to have on normally ... */
 #ifdef CHECK_WRITE_VS_EXTEND
 	Assert(blocknum < mdnblocks(reln, forknum));
@@ -947,7 +1001,7 @@ mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		size_t		transferred_this_segment;
 		size_t		size_this_segment;
 
-		v = _mdfd_getseg(reln, forknum, blocknum, skipFsync,
+		v = _mdfd_getseg(mdreln, forknum, blocknum, skipFsync,
 						 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
 
 		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
@@ -1014,7 +1068,7 @@ mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		}
 
 		if (!skipFsync && !SmgrIsTemp(reln))
-			register_dirty_segment(reln, forknum, v);
+			register_dirty_segment(mdreln, forknum, v);
 
 		nblocks -= nblocks_this_segment;
 		buffers += nblocks_this_segment;
@@ -1033,6 +1087,7 @@ void
 mdwriteback(SMgrRelation reln, ForkNumber forknum,
 			BlockNumber blocknum, BlockNumber nblocks)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	Assert((io_direct_flags & IO_DIRECT_DATA) == 0);
 
 	/*
@@ -1047,7 +1102,7 @@ mdwriteback(SMgrRelation reln, ForkNumber forknum,
 		int			segnum_start,
 					segnum_end;
 
-		v = _mdfd_getseg(reln, forknum, blocknum, true /* not used */ ,
+		v = _mdfd_getseg(mdreln, forknum, blocknum, true /* not used */ ,
 						 EXTENSION_DONT_OPEN);
 
 		/*
@@ -1094,11 +1149,12 @@ mdnblocks(SMgrRelation reln, ForkNumber forknum)
 	MdfdVec    *v;
 	BlockNumber nblocks;
 	BlockNumber segno;
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 
-	mdopenfork(reln, forknum, EXTENSION_FAIL);
+	mdopenfork(mdreln, forknum, EXTENSION_FAIL);
 
 	/* mdopen has opened the first segment */
-	Assert(reln->md_num_open_segs[forknum] > 0);
+	Assert(mdreln->md_num_open_segs[forknum] > 0);
 
 	/*
 	 * Start from the last open segments, to avoid redundant seeks.  We have
@@ -1113,12 +1169,12 @@ mdnblocks(SMgrRelation reln, ForkNumber forknum)
 	 * that's OK because the checkpointer never needs to compute relation
 	 * size.)
 	 */
-	segno = reln->md_num_open_segs[forknum] - 1;
-	v = &reln->md_seg_fds[forknum][segno];
+	segno = mdreln->md_num_open_segs[forknum] - 1;
+	v = &mdreln->md_seg_fds[forknum][segno];
 
 	for (;;)
 	{
-		nblocks = _mdnblocks(reln, forknum, v);
+		nblocks = _mdnblocks(mdreln, forknum, v);
 		if (nblocks > ((BlockNumber) RELSEG_SIZE))
 			elog(FATAL, "segment too big");
 		if (nblocks < ((BlockNumber) RELSEG_SIZE))
@@ -1136,7 +1192,7 @@ mdnblocks(SMgrRelation reln, ForkNumber forknum)
 		 * undermines _mdfd_getseg's attempts to notice and report an error
 		 * upon access to a missing segment.
 		 */
-		v = _mdfd_openseg(reln, forknum, segno, 0);
+		v = _mdfd_openseg(mdreln, forknum, segno, 0);
 		if (v == NULL)
 			return segno * ((BlockNumber) RELSEG_SIZE);
 	}
@@ -1151,6 +1207,7 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
 	BlockNumber curnblk;
 	BlockNumber priorblocks;
 	int			curopensegs;
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 
 	/*
 	 * NOTE: mdnblocks makes sure we have opened all active segments, so that
@@ -1174,14 +1231,14 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
 	 * Truncate segments, starting at the last one. Starting at the end makes
 	 * managing the memory for the fd array easier, should there be errors.
 	 */
-	curopensegs = reln->md_num_open_segs[forknum];
+	curopensegs = mdreln->md_num_open_segs[forknum];
 	while (curopensegs > 0)
 	{
 		MdfdVec    *v;
 
 		priorblocks = (curopensegs - 1) * RELSEG_SIZE;
 
-		v = &reln->md_seg_fds[forknum][curopensegs - 1];
+		v = &mdreln->md_seg_fds[forknum][curopensegs - 1];
 
 		if (priorblocks > nblocks)
 		{
@@ -1196,13 +1253,13 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
 								FilePathName(v->mdfd_vfd))));
 
 			if (!SmgrIsTemp(reln))
-				register_dirty_segment(reln, forknum, v);
+				register_dirty_segment(mdreln, forknum, v);
 
 			/* we never drop the 1st segment */
-			Assert(v != &reln->md_seg_fds[forknum][0]);
+			Assert(v != &mdreln->md_seg_fds[forknum][0]);
 
 			FileClose(v->mdfd_vfd);
-			_fdvec_resize(reln, forknum, curopensegs - 1);
+			_fdvec_resize(mdreln, forknum, curopensegs - 1);
 		}
 		else if (priorblocks + ((BlockNumber) RELSEG_SIZE) > nblocks)
 		{
@@ -1222,7 +1279,7 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
 								FilePathName(v->mdfd_vfd),
 								nblocks)));
 			if (!SmgrIsTemp(reln))
-				register_dirty_segment(reln, forknum, v);
+				register_dirty_segment(mdreln, forknum, v);
 		}
 		else
 		{
@@ -1252,6 +1309,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 {
 	int			segno;
 	int			min_inactive_seg;
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 
 	/*
 	 * NOTE: mdnblocks makes sure we have opened all active segments, so that
@@ -1259,7 +1317,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 	 */
 	mdnblocks(reln, forknum);
 
-	min_inactive_seg = segno = reln->md_num_open_segs[forknum];
+	min_inactive_seg = segno = mdreln->md_num_open_segs[forknum];
 
 	/*
 	 * Temporarily open inactive segments, then close them after sync.  There
@@ -1267,12 +1325,12 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 	 * is harmless.  We don't bother to clean them up and take a risk of
 	 * further trouble.  The next mdclose() will soon close them.
 	 */
-	while (_mdfd_openseg(reln, forknum, segno, 0) != NULL)
+	while (_mdfd_openseg(mdreln, forknum, segno, 0) != NULL)
 		segno++;
 
 	while (segno > 0)
 	{
-		MdfdVec    *v = &reln->md_seg_fds[forknum][segno - 1];
+		MdfdVec    *v = &mdreln->md_seg_fds[forknum][segno - 1];
 
 		/*
 		 * fsyncs done through mdimmedsync() should be tracked in a separate
@@ -1293,7 +1351,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 		if (segno > min_inactive_seg)
 		{
 			FileClose(v->mdfd_vfd);
-			_fdvec_resize(reln, forknum, segno - 1);
+			_fdvec_resize(mdreln, forknum, segno - 1);
 		}
 
 		segno--;
@@ -1310,14 +1368,14 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
  * enough to be a performance problem).
  */
 static void
-register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
+register_dirty_segment(MdSMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 {
 	FileTag		tag;
 
-	INIT_MD_FILETAG(tag, reln->smgr_rlocator.locator, forknum, seg->mdfd_segno);
+	INIT_MD_FILETAG(tag, reln->reln.smgr_rlocator.locator, forknum, seg->mdfd_segno);
 
 	/* Temp relations should never be fsync'd */
-	Assert(!SmgrIsTemp(reln));
+	Assert(!SmgrIsTemp(&reln->reln));
 
 	if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
 	{
@@ -1435,7 +1493,7 @@ DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo)
  * _fdvec_resize() -- Resize the fork's open segments array
  */
 static void
-_fdvec_resize(SMgrRelation reln,
+_fdvec_resize(MdSMgrRelation reln,
 			  ForkNumber forknum,
 			  int nseg)
 {
@@ -1473,12 +1531,12 @@ _fdvec_resize(SMgrRelation reln,
  * returned string is palloc'd.
  */
 static char *
-_mdfd_segpath(SMgrRelation reln, ForkNumber forknum, BlockNumber segno)
+_mdfd_segpath(MdSMgrRelation reln, ForkNumber forknum, BlockNumber segno)
 {
 	char	   *path,
 			   *fullpath;
 
-	path = relpath(reln->smgr_rlocator, forknum);
+	path = relpath(reln->reln.smgr_rlocator, forknum);
 
 	if (segno > 0)
 	{
@@ -1496,7 +1554,7 @@ _mdfd_segpath(SMgrRelation reln, ForkNumber forknum, BlockNumber segno)
  * and make a MdfdVec object for it.  Returns NULL on failure.
  */
 static MdfdVec *
-_mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
+_mdfd_openseg(MdSMgrRelation reln, ForkNumber forknum, BlockNumber segno,
 			  int oflags)
 {
 	MdfdVec    *v;
@@ -1541,7 +1599,7 @@ _mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
  * EXTENSION_CREATE case.
  */
 static MdfdVec *
-_mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
+_mdfd_getseg(MdSMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 			 bool skipFsync, int behavior)
 {
 	MdfdVec    *v;
@@ -1615,7 +1673,7 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 				char	   *zerobuf = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE,
 													 MCXT_ALLOC_ZERO);
 
-				mdextend(reln, forknum,
+				mdextend((SMgrRelation) reln, forknum,
 						 nextsegno * ((BlockNumber) RELSEG_SIZE) - 1,
 						 zerobuf, skipFsync);
 				pfree(zerobuf);
@@ -1672,7 +1730,7 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
  * Get number of blocks present in a single disk file
  */
 static BlockNumber
-_mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
+_mdnblocks(MdSMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 {
 	off_t		len;
 
@@ -1695,7 +1753,7 @@ _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 int
 mdsyncfiletag(const FileTag *ftag, char *path)
 {
-	SMgrRelation reln = smgropen(ftag->rlocator, InvalidBackendId);
+	MdSMgrRelation reln = (MdSMgrRelation) smgropen(ftag->rlocator, InvalidBackendId);
 	File		file;
 	instr_time	io_start;
 	bool		need_to_close;
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 563a0be5c7..b586e6e25a 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -19,80 +19,23 @@
 
 #include "access/xlogutils.h"
 #include "lib/ilist.h"
+#include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/md.h"
 #include "storage/smgr.h"
+#include "port/atomics.h"
 #include "utils/hsearch.h"
 #include "utils/inval.h"
+#include "utils/memutils.h"
 
 
-/*
- * This struct of function pointers defines the API between smgr.c and
- * any individual storage manager module.  Note that smgr subfunctions are
- * generally expected to report problems via elog(ERROR).  An exception is
- * that smgr_unlink should use elog(WARNING), rather than erroring out,
- * because we normally unlink relations during post-commit/abort cleanup,
- * and so it's too late to raise an error.  Also, various conditions that
- * would normally be errors should be allowed during bootstrap and/or WAL
- * recovery --- see comments in md.c for details.
- */
-typedef struct f_smgr
-{
-	void		(*smgr_init) (void);	/* may be NULL */
-	void		(*smgr_shutdown) (void);	/* may be NULL */
-	void		(*smgr_open) (SMgrRelation reln);
-	void		(*smgr_close) (SMgrRelation reln, ForkNumber forknum);
-	void		(*smgr_create) (SMgrRelation reln, ForkNumber forknum,
-								bool isRedo);
-	bool		(*smgr_exists) (SMgrRelation reln, ForkNumber forknum);
-	void		(*smgr_unlink) (RelFileLocatorBackend rlocator, ForkNumber forknum,
-								bool isRedo);
-	void		(*smgr_extend) (SMgrRelation reln, ForkNumber forknum,
-								BlockNumber blocknum, const void *buffer, bool skipFsync);
-	void		(*smgr_zeroextend) (SMgrRelation reln, ForkNumber forknum,
-									BlockNumber blocknum, int nblocks, bool skipFsync);
-	bool		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
-								  BlockNumber blocknum, int nblocks);
-	void		(*smgr_readv) (SMgrRelation reln, ForkNumber forknum,
-							   BlockNumber blocknum,
-							   void **buffers, BlockNumber nblocks);
-	void		(*smgr_writev) (SMgrRelation reln, ForkNumber forknum,
-								BlockNumber blocknum,
-								const void **buffers, BlockNumber nblocks,
-								bool skipFsync);
-	void		(*smgr_writeback) (SMgrRelation reln, ForkNumber forknum,
-								   BlockNumber blocknum, BlockNumber nblocks);
-	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
-	void		(*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
-								  BlockNumber nblocks);
-	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
-} f_smgr;
-
-static const f_smgr smgrsw[] = {
-	/* magnetic disk */
-	{
-		.smgr_init = mdinit,
-		.smgr_shutdown = NULL,
-		.smgr_open = mdopen,
-		.smgr_close = mdclose,
-		.smgr_create = mdcreate,
-		.smgr_exists = mdexists,
-		.smgr_unlink = mdunlink,
-		.smgr_extend = mdextend,
-		.smgr_zeroextend = mdzeroextend,
-		.smgr_prefetch = mdprefetch,
-		.smgr_readv = mdreadv,
-		.smgr_writev = mdwritev,
-		.smgr_writeback = mdwriteback,
-		.smgr_nblocks = mdnblocks,
-		.smgr_truncate = mdtruncate,
-		.smgr_immedsync = mdimmedsync,
-	}
-};
+static f_smgr *smgrsw;
 
-static const int NSmgr = lengthof(smgrsw);
+static int NSmgr = 0;
+
+static Size LargestSMgrRelationSize = 0;
 
 /*
  * Each backend has a hashtable that stores all extant SMgrRelation objects.
@@ -105,6 +48,57 @@ static dlist_head unowned_relns;
 /* local function prototypes */
 static void smgrshutdown(int code, Datum arg);
 
+SMgrId
+smgr_register(const f_smgr *smgr, Size smgrrelation_size)
+{
+	SMgrId my_id;
+	MemoryContext old;
+
+	if (process_shared_preload_libraries_done)
+		elog(FATAL, "SMgrs must be registered in the shared_preload_libraries phase");
+	if (NSmgr == MaxSMgrId)
+		elog(FATAL, "Too many smgrs registered");
+	if (smgr->name == NULL || *smgr->name == 0)
+		elog(FATAL, "smgr registered with invalid name");
+
+	Assert(smgr->smgr_open != NULL);
+	Assert(smgr->smgr_close != NULL);
+	Assert(smgr->smgr_create != NULL);
+	Assert(smgr->smgr_exists != NULL);
+	Assert(smgr->smgr_unlink != NULL);
+	Assert(smgr->smgr_extend != NULL);
+	Assert(smgr->smgr_zeroextend != NULL);
+	Assert(smgr->smgr_prefetch != NULL);
+	Assert(smgr->smgr_readv != NULL);
+	Assert(smgr->smgr_writev != NULL);
+	Assert(smgr->smgr_writeback != NULL);
+	Assert(smgr->smgr_nblocks != NULL);
+	Assert(smgr->smgr_truncate != NULL);
+	Assert(smgr->smgr_immedsync != NULL);
+	old = MemoryContextSwitchTo(TopMemoryContext);
+
+	my_id = NSmgr++;
+	if (my_id == 0)
+		smgrsw = palloc(sizeof(f_smgr));
+	else
+		smgrsw = repalloc(smgrsw, sizeof(f_smgr) * NSmgr);
+
+	MemoryContextSwitchTo(old);
+
+	pg_compiler_barrier();
+
+	if (!smgrsw)
+	{
+		NSmgr--;
+		elog(FATAL, "Failed to extend smgr array");
+	}
+
+	memcpy(&smgrsw[my_id], smgr, sizeof(f_smgr));
+
+	LargestSMgrRelationSize = Max(LargestSMgrRelationSize, smgrrelation_size);
+
+	return my_id;
+}
 
 /*
  * smgrinit(), smgrshutdown() -- Initialize or shut down storage
@@ -162,9 +156,11 @@ smgropen(RelFileLocator rlocator, BackendId backend)
 	{
 		/* First time through: initialize the hash table */
 		HASHCTL		ctl;
+		LargestSMgrRelationSize = MAXALIGN(LargestSMgrRelationSize);
+		Assert(NSmgr > 0);
 
 		ctl.keysize = sizeof(RelFileLocatorBackend);
-		ctl.entrysize = sizeof(SMgrRelationData);
+		ctl.entrysize = LargestSMgrRelationSize;
 		SMgrRelationHash = hash_create("smgr relation table", 400,
 									   &ctl, HASH_ELEM | HASH_BLOBS);
 		dlist_init(&unowned_relns);
@@ -185,7 +181,8 @@ smgropen(RelFileLocator rlocator, BackendId backend)
 		reln->smgr_targblock = InvalidBlockNumber;
 		for (int i = 0; i <= MAX_FORKNUM; ++i)
 			reln->smgr_cached_nblocks[i] = InvalidBlockNumber;
-		reln->smgr_which = 0;	/* we only have md.c at present */
+
+		reln->smgr_which = MdSMgrId;	/* we only have md.c at present */
 
 		/* implementation-specific initialization */
 		smgrsw[reln->smgr_which].smgr_open(reln);
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 23f77a59e5..4ec7619302 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -42,6 +42,7 @@
 #include "postmaster/postmaster.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
+#include "storage/md.h"
 #include "storage/latch.h"
 #include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
@@ -198,6 +199,9 @@ InitStandaloneProcess(const char *argv0)
 	InitProcessLocalLatch();
 	InitializeLatchWaitSet();
 
+	/* Initialize smgrs */
+	register_builtin_dynamic_managers();
+
 	/*
 	 * For consistency with InitPostmasterChild, initialize signal mask here.
 	 * But we don't unblock SIGQUIT or provide a default handler for it.
@@ -1860,6 +1864,15 @@ process_session_preload_libraries(void)
 				   true);
 }
 
+/*
+ * Register any internal managers.
+ */
+void
+register_builtin_dynamic_managers(void)
+{
+	mdsmgr_register();
+}
+
 /*
  * process any shared memory requests from preloaded libraries
  */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 0b01c1f093..d0d4ba38ef 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -493,6 +493,7 @@ extern void TouchSocketLockFiles(void);
 extern void AddToDataDirLockFile(int target_line, const char *str);
 extern bool RecheckDataDirLockFile(void);
 extern void ValidatePgVersion(const char *path);
+extern void register_builtin_dynamic_managers(void);
 extern void process_shared_preload_libraries(void);
 extern void process_session_preload_libraries(void);
 extern void process_shmem_requests(void);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 7c181e5a17..734bae07e1 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -19,6 +19,10 @@
 #include "storage/smgr.h"
 #include "storage/sync.h"
 
+/* registration function for md storage manager */
+extern void mdsmgr_register(void);
+extern SMgrId MdSMgrId;
+
 /* md storage manager functionality */
 extern void mdinit(void);
 extern void mdopen(SMgrRelation reln);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 527cd2a056..95927b8bdd 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,10 @@
 #include "storage/block.h"
 #include "storage/relfilelocator.h"
 
+typedef uint8 SMgrId;
+
+#define MaxSMgrId UINT8_MAX
+
 /*
  * smgr.c maintains a table of SMgrRelation objects, which are essentially
  * cached file handles.  An SMgrRelation is created (if not already present)
@@ -59,14 +63,8 @@ typedef struct SMgrRelationData
 	 * Fields below here are intended to be private to smgr.c and its
 	 * submodules.  Do not touch them from elsewhere.
 	 */
-	int			smgr_which;		/* storage manager selector */
-
-	/*
-	 * for md.c; per-fork arrays of the number of open segments
-	 * (md_num_open_segs) and the segments themselves (md_seg_fds).
-	 */
-	int			md_num_open_segs[MAX_FORKNUM + 1];
-	struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];
+	SMgrId		smgr_which;		/* storage manager selector */
+	int			smgrrelation_size;	/* size of this struct, incl. smgr-specific data */
 
 	/* if unowned, list link in list of all unowned SMgrRelations */
 	dlist_node	node;
@@ -77,6 +75,51 @@ typedef SMgrRelationData *SMgrRelation;
 #define SmgrIsTemp(smgr) \
 	RelFileLocatorBackendIsTemp((smgr)->smgr_rlocator)
 
+/*
+ * This struct of function pointers defines the API between smgr.c and
+ * any individual storage manager module.  Note that smgr subfunctions are
+ * generally expected to report problems via elog(ERROR).  An exception is
+ * that smgr_unlink should use elog(WARNING), rather than erroring out,
+ * because we normally unlink relations during post-commit/abort cleanup,
+ * and so it's too late to raise an error.  Also, various conditions that
+ * would normally be errors should be allowed during bootstrap and/or WAL
+ * recovery --- see comments in md.c for details.
+ */
+typedef struct f_smgr
+{
+	const char *name;
+	void		(*smgr_init) (void);	/* may be NULL */
+	void		(*smgr_shutdown) (void);	/* may be NULL */
+	void		(*smgr_open) (SMgrRelation reln);
+	void		(*smgr_close) (SMgrRelation reln, ForkNumber forknum);
+	void		(*smgr_create) (SMgrRelation reln, ForkNumber forknum,
+								bool isRedo);
+	bool		(*smgr_exists) (SMgrRelation reln, ForkNumber forknum);
+	void		(*smgr_unlink) (RelFileLocatorBackend rlocator, ForkNumber forknum,
+								bool isRedo);
+	void		(*smgr_extend) (SMgrRelation reln, ForkNumber forknum,
+								BlockNumber blocknum, const void *buffer, bool skipFsync);
+	void		(*smgr_zeroextend) (SMgrRelation reln, ForkNumber forknum,
+									BlockNumber blocknum, int nblocks, bool skipFsync);
+	bool		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
+								  BlockNumber blocknum, int nblocks);
+	void		(*smgr_readv) (SMgrRelation reln, ForkNumber forknum,
+							   BlockNumber blocknum,
+							   void **buffers, BlockNumber nblocks);
+	void		(*smgr_writev) (SMgrRelation reln, ForkNumber forknum,
+								BlockNumber blocknum,
+								const void **buffers, BlockNumber nblocks,
+								bool skipFsync);
+	void		(*smgr_writeback) (SMgrRelation reln, ForkNumber forknum,
+								   BlockNumber blocknum, BlockNumber nblocks);
+	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
+	void		(*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
+								  BlockNumber nblocks);
+	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+} f_smgr;
+
+extern SMgrId smgr_register(const f_smgr *smgr, Size smgrrelation_size);
+
 extern void smgrinit(void);
 extern SMgrRelation smgropen(RelFileLocator rlocator, BackendId backend);
 extern bool smgrexists(SMgrRelation reln, ForkNumber forknum);
-- 
Tristan Partin
Neon (https://neon.tech)
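
For readers skimming the diff, the registration scheme in 0001 can be modelled outside PostgreSQL roughly as below. This is only a standalone sketch under stated assumptions: Manager, BaseRelation, DiskRelation, manager_register() and relation_open() are hypothetical stand-ins for f_smgr, SMgrRelationData, md's private relation struct, smgr_register() and smgropen(). realloc/calloc stand in for repalloc and the hash table, and clamping the requested size to the base struct is an assumption of the sketch, not something the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Standalone model; none of these names are PostgreSQL declarations. */
typedef uint8_t ManagerId;
#define MAX_MANAGER_ID UINT8_MAX

typedef struct BaseRelation
{
	ManagerId	which;			/* index into manager_table */
	unsigned	locator;		/* stand-in for smgr_rlocator */
} BaseRelation;

typedef struct Manager
{
	const char *name;
	void		(*open) (BaseRelation *reln);
} Manager;

static Manager *manager_table = NULL;
static int	n_managers = 0;
static size_t largest_relation_size = 0;

/* Append a manager to the table; returns its id, or -1 on failure. */
static int
manager_register(const Manager *mgr, size_t relation_size)
{
	Manager    *grown;

	if (mgr == NULL || mgr->name == NULL || mgr->name[0] == '\0' ||
		mgr->open == NULL)
		return -1;
	if (n_managers == MAX_MANAGER_ID)
		return -1;

	grown = realloc(manager_table, sizeof(Manager) * ((size_t) n_managers + 1));
	if (grown == NULL)
		return -1;
	manager_table = grown;
	manager_table[n_managers] = *mgr;

	/* Assumption of this sketch: never track less than the base struct. */
	if (relation_size < sizeof(BaseRelation))
		relation_size = sizeof(BaseRelation);
	if (relation_size > largest_relation_size)
		largest_relation_size = relation_size;

	return n_managers++;
}

/* A manager-private relation: the base struct must be the first member. */
typedef struct DiskRelation
{
	BaseRelation base;
	int			open_segments;
} DiskRelation;

static void
disk_open(BaseRelation *reln)
{
	/* Valid because 'base' is the first member of DiskRelation. */
	DiskRelation *disk = (DiskRelation *) reln;

	disk->open_segments = 0;
}

/* Allocate room for the largest registered relation type, then open it. */
static BaseRelation *
relation_open(int which, unsigned locator)
{
	BaseRelation *reln;

	if (which < 0 || which >= n_managers)
		return NULL;
	reln = calloc(1, largest_relation_size);
	if (reln == NULL)
		return NULL;
	reln->which = (ManagerId) which;
	reln->locator = locator;
	manager_table[which].open(reln);
	return reln;
}
```

Sizing every entry for the largest registered struct is what lets a single table hold relations belonging to different managers, at the cost of some wasted space for the smaller ones.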

v1-0002-Allow-extensions-to-override-the-global-storage-m.patch (text/x-patch; charset=utf-8)
From 59a667f079c9b040c23921e4c43fae94b88776f2 Mon Sep 17 00:00:00 2001
From: Tristan Partin <tristan@neon.tech>
Date: Fri, 13 Oct 2023 14:00:44 -0500
Subject: [PATCH v1 2/5] Allow extensions to override the global storage
 manager

---
 src/backend/storage/smgr/md.c     | 2 +-
 src/backend/storage/smgr/smgr.c   | 5 ++++-
 src/backend/utils/init/miscinit.c | 2 ++
 src/include/storage/md.h          | 2 ++
 src/include/storage/smgr.h        | 2 ++
 5 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 66a93101ab..13ec9da236 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -140,7 +140,7 @@ void mdsmgr_register(void)
 {
 	/* magnetic disk */
 	f_smgr md_smgr = (f_smgr) {
-		.name = "md",
+		.name = MdSMgrName,
 		.smgr_init = mdinit,
 		.smgr_shutdown = NULL,
 		.smgr_open = mdopen,
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index b586e6e25a..0814330b8a 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -37,6 +37,9 @@ static int NSmgr = 0;
 
 static Size LargestSMgrRelationSize = 0;
 
+char *storage_manager_string;
+SMgrId storage_manager_id;
+
 /*
  * Each backend has a hashtable that stores all extant SMgrRelation objects.
  * In addition, "unowned" SMgrRelation objects are chained together in a list.
@@ -182,7 +185,7 @@ smgropen(RelFileLocator rlocator, BackendId backend)
 		for (int i = 0; i <= MAX_FORKNUM; ++i)
 			reln->smgr_cached_nblocks[i] = InvalidBlockNumber;
 
-		reln->smgr_which = MdSMgrId;	/* we only have md.c at present */
+		reln->smgr_which = storage_manager_id;
 
 		/* implementation-specific initialization */
 		smgrsw[reln->smgr_which].smgr_open(reln);
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 4ec7619302..f44f511f69 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -1871,6 +1871,8 @@ void
 register_builtin_dynamic_managers(void)
 {
 	mdsmgr_register();
+
+	storage_manager_id = MdSMgrId;
 }
 
 /*
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 734bae07e1..fdafb2c8e3 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -19,6 +19,8 @@
 #include "storage/smgr.h"
 #include "storage/sync.h"
 
+#define MdSMgrName "md"
+
 /* registration function for md storage manager */
 extern void mdsmgr_register(void);
 extern SMgrId MdSMgrId;
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 95927b8bdd..ee4fc27265 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -22,6 +22,8 @@ typedef uint8 SMgrId;
 
 #define MaxSMgrId UINT8_MAX
 
+extern PGDLLIMPORT SMgrId storage_manager_id;
+
 /*
  * smgr.c maintains a table of SMgrRelation objects, which are essentially
  * cached file handles.  An SMgrRelation is created (if not already present)
-- 
Tristan Partin
Neon (https://neon.tech)
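
The effect of 0002 can be illustrated with a small standalone model: an extension copies the built-in callback table, replaces only the callbacks it cares about, registers the result, and then overwrites the global default id. Everything here is hypothetical and not PostgreSQL code; default_manager_id plays the role of storage_manager_id, and the fixed-size table is a simplification chosen for the sketch.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Relation
{
	int			which;			/* manager chosen when the relation was opened */
	int			blocks_written;
} Relation;

typedef struct Manager
{
	const char *name;
	void		(*write) (Relation *reln, int nblocks);
	int			(*nblocks) (const Relation *reln);
} Manager;

#define MAX_MANAGERS 8

static Manager managers[MAX_MANAGERS];
static int	n_managers = 0;
static int	default_manager_id = 0;	/* stand-in for storage_manager_id */

/* Returns the new id, or -1 if the table is full. */
static int
register_manager(Manager mgr)
{
	if (n_managers == MAX_MANAGERS)
		return -1;
	managers[n_managers] = mgr;
	return n_managers++;
}

/* "Built-in" behaviour. */
static void
builtin_write(Relation *reln, int nblocks)
{
	reln->blocks_written += nblocks;
}

static int
builtin_nblocks(const Relation *reln)
{
	return reln->blocks_written;
}

/* Extension callback: observe the write, then defer to the built-in one. */
static int	intercepted_writes = 0;

static void
checking_write(Relation *reln, int nblocks)
{
	intercepted_writes++;
	builtin_write(reln, nblocks);
}

/* New relations pick up whichever manager is the default at open time. */
static Relation
open_relation(void)
{
	Relation	reln = {default_manager_id, 0};

	return reln;
}

static void
relation_write(Relation *reln, int nblocks)
{
	managers[reln->which].write(reln, nblocks);
}
```

In this model only relations opened after the override see the new manager, and if two extensions both overwrite the default, the last one to load wins.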

v1-0003-Add-checkpoint_create_hook.patch (text/x-patch; charset=utf-8)
From 9ed9b8ca36cdb75b44deccdfea619c7494fcc6ef Mon Sep 17 00:00:00 2001
From: Tristan Partin <tristan@neon.tech>
Date: Fri, 13 Oct 2023 13:57:18 -0500
Subject: [PATCH v1 3/5] Add checkpoint_create_hook

Allows an extension to hook into CreateCheckPoint().
---
 src/backend/access/transam/xlog.c | 5 +++++
 src/include/access/xlog.h         | 4 ++++
 2 files changed, 9 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 478377c4a2..61ae5b63b8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -212,6 +212,8 @@ const struct config_enum_entry archive_mode_options[] = {
  */
 CheckpointStatsData CheckpointStats;
 
+checkpoint_create_hook_type checkpoint_create_hook = NULL;
+
 /*
  * During recovery, lastFullPageWrites keeps track of full_page_writes that
  * the replayed WAL records indicate. It's initialized with full_page_writes
@@ -6875,6 +6877,9 @@ CreateCheckPoint(int flags)
 	 */
 	END_CRIT_SECTION();
 
+	if (checkpoint_create_hook != NULL)
+		checkpoint_create_hook(&checkPoint);
+
 	/*
 	 * In some cases there are groups of actions that must all occur on one
 	 * side or the other of a checkpoint record. Before flushing the
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 301c5fa11f..437f2a994b 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -13,6 +13,7 @@
 
 #include "access/xlogbackup.h"
 #include "access/xlogdefs.h"
+#include "catalog/pg_control.h"
 #include "datatype/timestamp.h"
 #include "lib/stringinfo.h"
 #include "nodes/pg_list.h"
@@ -57,6 +58,9 @@ extern PGDLLIMPORT int wal_decode_buffer_size;
 
 extern PGDLLIMPORT int CheckPointSegments;
 
+typedef void (*checkpoint_create_hook_type)(const CheckPoint *);
+extern PGDLLIMPORT checkpoint_create_hook_type checkpoint_create_hook;
+
 /* Archive modes */
 typedef enum ArchiveMode
 {
-- 
Tristan Partin
Neon (https://neon.tech)
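
The hook added in 0003 follows the usual save-and-chain convention: each extension remembers the previous hook value, installs its own, and calls the previous one from inside its callback. A minimal standalone sketch of that convention follows. CheckpointInfo, create_checkpoint() and the two "extensions" are hypothetical stand-ins, not the real CheckPoint struct or CreateCheckPoint().

```c
#include <assert.h>
#include <stddef.h>

typedef struct CheckpointInfo
{
	unsigned long redo;			/* stand-in for the checkpoint's redo pointer */
} CheckpointInfo;

typedef void (*checkpoint_hook_type) (const CheckpointInfo *info);

/* NULL means no extension has installed a hook. */
static checkpoint_hook_type checkpoint_hook = NULL;

/* Core code path: call the hook only if one is installed. */
static void
create_checkpoint(unsigned long redo)
{
	CheckpointInfo info = {redo};

	if (checkpoint_hook != NULL)
		checkpoint_hook(&info);
}

/* First extension: counts checkpoints. */
static checkpoint_hook_type first_prev_hook = NULL;
static int	first_calls = 0;

static void
first_hook(const CheckpointInfo *info)
{
	if (first_prev_hook != NULL)
		first_prev_hook(info);
	first_calls++;
}

static void
first_init(void)
{
	first_prev_hook = checkpoint_hook;
	checkpoint_hook = first_hook;
}

/* Second extension: remembers the most recent redo value. */
static checkpoint_hook_type second_prev_hook = NULL;
static unsigned long second_last_redo = 0;

static void
second_hook(const CheckpointInfo *info)
{
	if (second_prev_hook != NULL)
		second_prev_hook(info);
	second_last_redo = info->redo;
}

static void
second_init(void)
{
	second_prev_hook = checkpoint_hook;
	checkpoint_hook = second_hook;
}
```

Because each callback calls its predecessor first, both extensions run on every checkpoint regardless of the order in which they were loaded.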

v1-0004-Add-contrib-fsync_checker.patch (text/x-patch; charset=utf-8)
From d46b41d7c89deb23a6a1afec9d7fe3544b9a3327 Mon Sep 17 00:00:00 2001
From: Tristan Partin <tristan@neon.tech>
Date: Wed, 20 Sep 2023 14:23:38 -0500
Subject: [PATCH v1 4/5] Add contrib/fsync_checker

fsync_checker is an extension which overrides the global storage manager
to check for volatile relations, those which have been written but not
synced to disk.
---
 contrib/Makefile                            |   1 +
 contrib/fsync_checker/fsync_checker.control |   5 +
 contrib/fsync_checker/fsync_checker_smgr.c  | 249 ++++++++++++++++++++
 contrib/fsync_checker/meson.build           |  22 ++
 contrib/meson.build                         |   1 +
 5 files changed, 278 insertions(+)
 create mode 100644 contrib/fsync_checker/fsync_checker.control
 create mode 100644 contrib/fsync_checker/fsync_checker_smgr.c
 create mode 100644 contrib/fsync_checker/meson.build

diff --git a/contrib/Makefile b/contrib/Makefile
index da4e2316a3..c55ced6ec0 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -20,6 +20,7 @@ SUBDIRS = \
 		dict_int	\
 		dict_xsyn	\
 		earthdistance	\
+		fsync_checker	\
 		file_fdw	\
 		fuzzystrmatch	\
 		hstore		\
diff --git a/contrib/fsync_checker/fsync_checker.control b/contrib/fsync_checker/fsync_checker.control
new file mode 100644
index 0000000000..7d0e36434b
--- /dev/null
+++ b/contrib/fsync_checker/fsync_checker.control
@@ -0,0 +1,5 @@
+# fsync_checker extension
+comment = 'SMGR extension for checking volatile writes'
+default_version = '1.0'
+module_pathname = '$libdir/fsync_checker'
+relocatable = true
diff --git a/contrib/fsync_checker/fsync_checker_smgr.c b/contrib/fsync_checker/fsync_checker_smgr.c
new file mode 100644
index 0000000000..feef2f7d3e
--- /dev/null
+++ b/contrib/fsync_checker/fsync_checker_smgr.c
@@ -0,0 +1,249 @@
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "storage/md.h"
+#include "utils/hsearch.h"
+
+PG_MODULE_MAGIC;
+
+typedef struct volatileRelnKey
+{
+	RelFileLocator locator;
+	ForkNumber	forknum;
+}			volatileRelnKey;
+
+typedef struct volatileRelnEntry
+{
+	volatileRelnKey key;
+	XLogRecPtr	lsn;
+}			volatileRelnEntry;
+
+void		_PG_init(void);
+
+static void fsync_checker_extend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+								 const void *buffer, bool skipFsync);
+static void fsync_checker_immedsync(SMgrRelation reln, ForkNumber forknum);
+static void fsync_checker_writev(SMgrRelation reln, ForkNumber forknum,
+								 BlockNumber blocknum, const void **buffers,
+								 BlockNumber nblocks, bool skipFsync);
+static void fsync_checker_writeback(SMgrRelation reln, ForkNumber forknum,
+									BlockNumber blocknum, BlockNumber nblocks);
+static void fsync_checker_zeroextend(SMgrRelation reln, ForkNumber forknum,
+									 BlockNumber blocknum, int nblocks, bool skipFsync);
+
+static void fsync_checker_checkpoint_create(const CheckPoint *checkPoint);
+static void fsync_checker_shmem_request(void);
+static void fsync_checker_shmem_startup(void);
+
+static void add_reln(SMgrRelation reln, ForkNumber forknum);
+static void remove_reln(SMgrRelation reln, ForkNumber forknum);
+
+static SMgrId fsync_checker_smgr_id;
+static const struct f_smgr fsync_checker_smgr = {
+	.name = "fsync_checker",
+	.smgr_init = mdinit,
+	.smgr_shutdown = NULL,
+	.smgr_open = mdopen,
+	.smgr_close = mdclose,
+	.smgr_create = mdcreate,
+	.smgr_exists = mdexists,
+	.smgr_unlink = mdunlink,
+	.smgr_extend = fsync_checker_extend,
+	.smgr_zeroextend = fsync_checker_zeroextend,
+	.smgr_prefetch = mdprefetch,
+	.smgr_readv = mdreadv,
+	.smgr_writev = fsync_checker_writev,
+	.smgr_writeback = fsync_checker_writeback,
+	.smgr_nblocks = mdnblocks,
+	.smgr_truncate = mdtruncate,
+	.smgr_immedsync = fsync_checker_immedsync,
+};
+
+static HTAB *volatile_relns;
+static LWLock *volatile_relns_lock;
+static shmem_request_hook_type prev_shmem_request_hook;
+static shmem_startup_hook_type prev_shmem_startup_hook;
+static checkpoint_create_hook_type prev_checkpoint_create_hook;
+
+void
+_PG_init(void)
+{
+	prev_checkpoint_create_hook = checkpoint_create_hook;
+	checkpoint_create_hook = fsync_checker_checkpoint_create;
+
+	prev_shmem_request_hook = shmem_request_hook;
+	shmem_request_hook = fsync_checker_shmem_request;
+
+	prev_shmem_startup_hook = shmem_startup_hook;
+	shmem_startup_hook = fsync_checker_shmem_startup;
+
+	/*
+	 * Relation size of 0 means we can just defer to md, but it would be nice
+	 * to just expose this functionality, so if I needed my own relation, I
+	 * could use MdSmgrRelation as the parent.
+	 */
+	fsync_checker_smgr_id = smgr_register(&fsync_checker_smgr, 0);
+
+	storage_manager_id = fsync_checker_smgr_id;
+}
+
+static void
+fsync_checker_checkpoint_create(const CheckPoint *checkPoint)
+{
+	long		num_entries;
+	HASH_SEQ_STATUS status;
+	volatileRelnEntry *entry;
+
+	if (prev_checkpoint_create_hook)
+		prev_checkpoint_create_hook(checkPoint);
+
+	LWLockAcquire(volatile_relns_lock, LW_EXCLUSIVE);
+
+	hash_seq_init(&status, volatile_relns);
+
+	num_entries = hash_get_num_entries(volatile_relns);
+	elog(INFO, "Analyzing %ld volatile relations", num_entries);
+	while ((entry = hash_seq_search(&status)))
+	{
+		if (entry->lsn < checkPoint->redo)
+		{
+			char	   *path;
+
+			path = relpathperm(entry->key.locator, entry->key.forknum);
+
+			elog(WARNING, "Relation not previously synced: %s", path);
+
+			pfree(path);
+		}
+	}
+
+	LWLockRelease(volatile_relns_lock);
+}
+
+static void
+fsync_checker_shmem_request(void)
+{
+	if (prev_shmem_request_hook)
+		prev_shmem_request_hook();
+
+	RequestAddinShmemSpace(hash_estimate_size(1024, sizeof(volatileRelnEntry)));
+	RequestNamedLWLockTranche("fsync_checker volatile relns lock", 1);
+}
+
+static void
+fsync_checker_shmem_startup(void)
+{
+	HASHCTL		ctl;
+
+	if (prev_shmem_startup_hook)
+		prev_shmem_startup_hook();
+
+	ctl.keysize = sizeof(volatileRelnKey);
+	ctl.entrysize = sizeof(volatileRelnEntry);
+	volatile_relns = NULL;
+	volatile_relns_lock = NULL;
+
+	/*
+	 * Create or attach to the shared memory state, including hash table
+	 */
+	LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
+
+	volatile_relns = ShmemInitHash("fsync_checker volatile relns",
+								   1024, 1024, &ctl, HASH_BLOBS | HASH_ELEM);
+	volatile_relns_lock = &GetNamedLWLockTranche("fsync_checker volatile relns lock")->lock;
+
+	LWLockRelease(AddinShmemInitLock);
+}
+
+static void
+add_reln(SMgrRelation reln, ForkNumber forknum)
+{
+	bool		found;
+	XLogRecPtr	lsn;
+	volatileRelnKey key;
+	volatileRelnEntry *entry;
+
+	key.locator = reln->smgr_rlocator.locator;
+	key.forknum = forknum;
+
+	lsn = GetXLogWriteRecPtr();
+
+	LWLockAcquire(volatile_relns_lock, LW_EXCLUSIVE);
+
+	entry = hash_search(volatile_relns, &key, HASH_ENTER, &found);
+	if (!found)
+		entry->lsn = lsn;
+
+	LWLockRelease(volatile_relns_lock);
+}
+
+static void
+remove_reln(SMgrRelation reln, ForkNumber forknum)
+{
+	volatileRelnKey key;
+
+	key.locator = reln->smgr_rlocator.locator;
+	key.forknum = forknum;
+
+	LWLockAcquire(volatile_relns_lock, LW_EXCLUSIVE);
+
+	hash_search(volatile_relns, &key, HASH_REMOVE, NULL);
+
+	LWLockRelease(volatile_relns_lock);
+}
+
+static void
+fsync_checker_extend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+					 const void *buffer, bool skipFsync)
+{
+	if (!SmgrIsTemp(reln) && !skipFsync)
+		add_reln(reln, forknum);
+
+	mdextend(reln, forknum, blocknum, buffer, skipFsync);
+}
+
+static void
+fsync_checker_immedsync(SMgrRelation reln, ForkNumber forknum)
+{
+	if (!SmgrIsTemp(reln))
+		remove_reln(reln, forknum);
+
+	mdimmedsync(reln, forknum);
+}
+
+static void
+fsync_checker_writev(SMgrRelation reln, ForkNumber forknum,
+					 BlockNumber blocknum, const void **buffers,
+					 BlockNumber nblocks, bool skipFsync)
+{
+	if (!SmgrIsTemp(reln) && !skipFsync)
+		add_reln(reln, forknum);
+
+	mdwritev(reln, forknum, blocknum, buffers, nblocks, skipFsync);
+}
+
+static void
+fsync_checker_writeback(SMgrRelation reln, ForkNumber forknum,
+						BlockNumber blocknum, BlockNumber nblocks)
+{
+	if (!SmgrIsTemp(reln))
+		remove_reln(reln, forknum);
+
+	mdwriteback(reln, forknum, blocknum, nblocks);
+}
+
+static void
+fsync_checker_zeroextend(SMgrRelation reln, ForkNumber forknum,
+						 BlockNumber blocknum, int nblocks, bool skipFsync)
+{
+	if (!SmgrIsTemp(reln) && !skipFsync)
+		add_reln(reln, forknum);
+
+	mdzeroextend(reln, forknum, blocknum, nblocks, skipFsync);
+}
diff --git a/contrib/fsync_checker/meson.build b/contrib/fsync_checker/meson.build
new file mode 100644
index 0000000000..ce6ed7fe90
--- /dev/null
+++ b/contrib/fsync_checker/meson.build
@@ -0,0 +1,22 @@
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+fsync_checker_sources = files(
+  'fsync_checker_smgr.c',
+)
+
+if host_system == 'windows'
+  fsync_checker_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'fsync_checker',
+    '--FILEDESC', 'fsync_checker - SMGR extension for checking volatile relations',])
+endif
+
+fsync_checker = shared_module('fsync_checker',
+  fsync_checker_sources,
+  kwargs: contrib_mod_args,
+)
+contrib_targets += fsync_checker
+
+install_data(
+  'fsync_checker.control',
+  kwargs: contrib_data_args,
+)
diff --git a/contrib/meson.build b/contrib/meson.build
index c12dc906ca..e5d872494a 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -29,6 +29,7 @@ subdir('dict_int')
 subdir('dict_xsyn')
 subdir('earthdistance')
 subdir('file_fdw')
+subdir('fsync_checker')
 subdir('fuzzystrmatch')
 subdir('hstore')
 subdir('hstore_plperl')
-- 
Tristan Partin
Neon (https://neon.tech)

#10Aleksander Alekseev
aleksander@timescale.com
In reply to: Tristan Partin (#9)
Re: Extensible storage manager API - SMGR hook Redux

Hi,

Thought I would show off what is possible with this patchset.

[...]

Just wanted to let you know that cfbot doesn't seem to be entirely
happy with the patch [1]. Please consider submitting an updated
version.

Best regards,
Aleksander Alekseev (wearing co-CFM hat)

[1]: http://cfbot.cputube.org/

#11Nitin Jadhav
nitinjadhavpostgres@gmail.com
In reply to: Tristan Partin (#9)
Re: Extensible storage manager API - SMGR hook Redux

Hi,

I reviewed the discussion and took a look at the patch sets. It seems
like many things are combined here. Based on the subject, I initially
thought it aimed to provide the infrastructure to easily extend
storage managers. This would allow anyone to create their own storage
managers using this infrastructure. While it addresses this, it also
includes additional features like fsync_checker, which I believe
should be a separate feature. Even though it might use the same
infrastructure, it appears to be a different functionality. I think we
should focus solely on providing the infrastructure here.

We need to decide on our approach—whether to use a hook-based method
or a registration-based method—and I believe this requires further
discussion.

The hook-based approach is simple and works well for anyone writing
their own storage manager. However, it has its limitations: we can
either use the default storage manager or a custom-built one for the
whole workload, but we cannot choose between multiple storage managers.
On the other hand, the registration-based approach allows choosing
between multiple storage managers based on the workload, though it
requires a lot of changes.
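
To make the trade-off above concrete, here is a minimal, self-contained sketch in plain C. It is not code from any of the patches: demo_smgr, demo_smgr_hook, demo_register() and the other demo_* names are hypothetical stand-ins. The hook style keeps a single global pointer that applies to every relation, while the registration style keeps a table of managers from which one can be picked by id.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a storage manager callback table. */
typedef struct demo_smgr
{
	const char *name;
	int			(*nblocks) (void);
} demo_smgr;

int			demo_md_nblocks(void) { return 8; }
int			demo_custom_nblocks(void) { return 16; }

const demo_smgr demo_md = {"md", demo_md_nblocks};
const demo_smgr demo_custom = {"custom", demo_custom_nblocks};

/* Hook style: one global pointer; it replaces the default for everything. */
const demo_smgr *demo_smgr_hook = NULL;

const demo_smgr *
demo_hook_lookup(void)
{
	return demo_smgr_hook ? demo_smgr_hook : &demo_md;
}

/* Registration style: a table of managers, selected by id. */
#define DEMO_MAX_SMGRS 8
const demo_smgr *demo_registry[DEMO_MAX_SMGRS];
int			demo_nregistered = 0;

int
demo_register(const demo_smgr *smgr)
{
	assert(demo_nregistered < DEMO_MAX_SMGRS);
	demo_registry[demo_nregistered] = smgr;
	return demo_nregistered++;	/* the id a relation or tablespace would store */
}

const demo_smgr *
demo_registry_lookup(int id)
{
	assert(id >= 0 && id < demo_nregistered);
	return demo_registry[id];
}
```

With the hook, setting demo_smgr_hook changes the manager for all callers at once; with the registry, two relations can hold different ids and so be served by different managers side by side.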

Are we planning to support other storage managers in PostgreSQL in the
near future? If not, it is better to go with the hook-based approach.
Otherwise, the registration-based approach is preferable as it offers
more flexibility to users and enhances PostgreSQL’s functionality.

Could you please share your thoughts on this? Also, let me know if
this topic has already been discussed and if any conclusions were
reached.

Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft

On Sat, Jan 13, 2024 at 1:27 AM Tristan Partin <tristan@neon.tech> wrote:

Thought I would show off what is possible with this patchset.

Heikki, a couple of months ago in our internal Slack, said:

[I would like] a debugging tool that checks that we're not missing any
fsyncs. I bumped into a few missing fsync bugs with unlogged tables
lately and a tool like that would've been very helpful.

My task was to create such a tool, and off I went. I started with the
storage manager extension patch that Matthias sent to the list last
year[0].

Andres, in another thread[1], said:

I've been thinking that we need a validation layer for fsyncs, it's too hard
to get right without testing, and crash testing is not likely enough to catch
problems quickly / resource intensive.

My thought was that we could keep a shared hash table of all files created /
dirtied at the fd layer, with the filename as key and the value the current
LSN. We'd delete files from it when they're fsynced. At checkpoints we go
through the list and see if there's any files from before the redo that aren't
yet fsynced. All of this in assert builds only, of course.

I took this idea and ran with it. I call it the fsync_checker™️. It is an
extension that prints relations that haven't been fsynced prior to
a CHECKPOINT. Note that this idea doesn't work in practice because
relations might not be fsynced, but they might be WAL-logged, like in
the case of createdb. See log_smgrcreate(). I can't think of an easy way
to solve this problem looking at the codebase as it stands.
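
The tracking idea described above can be sketched without any PostgreSQL infrastructure. The following is only an illustration under stated assumptions: a fixed-size array stands in for the shared hash table, there is no locking, and all demo_* names are hypothetical. A write records the LSN of the oldest unsynced change for a file, a sync forgets the file, and a checkpoint counts files whose recorded LSN is older than the redo pointer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_FILES 64
#define DEMO_NAME_LEN 64

/* Hypothetical entry: file name plus the LSN of its oldest unsynced write. */
typedef struct demo_dirty_file
{
	char		name[DEMO_NAME_LEN];
	uint64_t	lsn;
	int			used;
} demo_dirty_file;

demo_dirty_file demo_dirty[DEMO_MAX_FILES];

/* Track a write. Returns 0 on success, -1 if the name is too long or the table is full. */
int
demo_mark_dirty(const char *name, uint64_t lsn)
{
	int			free_slot = -1;

	assert(name != NULL);
	if (strlen(name) >= DEMO_NAME_LEN)
		return -1;

	for (int i = 0; i < DEMO_MAX_FILES; i++)
	{
		if (demo_dirty[i].used && strcmp(demo_dirty[i].name, name) == 0)
			return 0;			/* already tracked: keep the older LSN */
		if (!demo_dirty[i].used && free_slot < 0)
			free_slot = i;
	}
	if (free_slot < 0)
		return -1;

	strcpy(demo_dirty[free_slot].name, name);
	demo_dirty[free_slot].lsn = lsn;
	demo_dirty[free_slot].used = 1;
	return 0;
}

/* Forget a file once it has been synced. */
void
demo_mark_synced(const char *name)
{
	for (int i = 0; i < DEMO_MAX_FILES; i++)
		if (demo_dirty[i].used && strcmp(demo_dirty[i].name, name) == 0)
			demo_dirty[i].used = 0;
}

/* At checkpoint time: count files dirtied before the redo LSN and never synced. */
int
demo_count_unsynced_before(uint64_t redo_lsn)
{
	int			count = 0;

	for (int i = 0; i < DEMO_MAX_FILES; i++)
		if (demo_dirty[i].used && demo_dirty[i].lsn < redo_lsn)
			count++;
	return count;
}
```

Like the approach it illustrates, this sketch knows nothing about WAL-logged creation, so it would report the same false positives mentioned above.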

Here is a description of the patches:

0001:

This is essentially just the patch that Matthias posted earlier, but
rebased and adjusted a little bit so storage managers can "inherit" from
other storage managers.

0002:

This is an extension of 0001, which allows for extensions to set
a global storage manager. This is pretty hacky, and if it was going to
be pulled into mainline, it would need some better protection. For
instance, only one extension should be able to set the global storage
manager. We wouldn't want extensions stepping over each other, etc.

0003:

Adds a hook for extensions to inspect a checkpoint before it actually
occurs. The purpose for the fsync_checker is so that I can iterate over
all the relations the extension tracks to find files that haven't been
synced prior to the completion of the checkpoint.

0004:

This is the actual fsync_checker extension itself. It must be preloaded.

Hopefully this is a good illustration of how the initial patch could be
used, even though it isn't perfect.
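
As a rough illustration of the "inherit" idea mentioned for 0001, the sketch below shows the general C pattern of embedding a base struct as the first member of a manager-specific struct and allocating by a size supplied at registration time. It is a standalone toy, not the patch's code: demo_base_reln, demo_md_reln and demo_alloc_reln() are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical generic per-relation state owned by the smgr layer. */
typedef struct demo_base_reln
{
	int			smgr_id;
	unsigned	relnumber;
} demo_base_reln;

/* Manager-specific state; the base struct must be the first member. */
typedef struct demo_md_reln
{
	demo_base_reln base;
	int			open_segs;
} demo_md_reln;

/*
 * The generic layer allocates however many bytes the manager asked for at
 * registration and only touches the base fields.
 */
demo_base_reln *
demo_alloc_reln(size_t reln_size, int smgr_id, unsigned relnumber)
{
	demo_base_reln *reln;

	assert(reln_size >= sizeof(demo_base_reln));
	reln = calloc(1, reln_size);
	if (reln == NULL)
		return NULL;
	reln->smgr_id = smgr_id;
	reln->relnumber = relnumber;
	return reln;
}

/* A manager callback casts the base pointer back to its own type. */
int
demo_md_open_segment(demo_base_reln *reln)
{
	demo_md_reln *mdreln = (demo_md_reln *) reln;

	return ++mdreln->open_segs;
}
```

The cast is valid because a pointer to a struct and a pointer to its first member are interconvertible in C; a manager that needs no extra state can simply request the base size.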

[0]: /messages/by-id/CAEze2WgMySu2suO_TLvFyGY3URa4mAx22WeoEicnK=PCNWEMrA@mail.gmail.com
[1]: /messages/by-id/20220127182838.ba3434dp2pe5vcia@alap3.anarazel.de

--
Tristan Partin
Neon (https://neon.tech)

#12Xun Gong
gongxun0928@gmail.com
In reply to: Nitin Jadhav (#11)
Re: Extensible storage manager API - SMGR hook Redux

Thank you for your detailed review and insights. I share your view that a
registration-based approach for custom storage managers (smgr) is more
versatile. This method allows for the implementation of custom table access
methods, facilitating the use of various storage services (such as file
services or object storage), different file organization formats (files in
one directory or many subdirectories), and flexible deletion logic
(direct deletion or mark-and-sweep).

While I acknowledge that the registration-based approach requires more
modifications, I believe the benefits in terms of extensibility and
functionality are significant. I have seen similar explorations in the
implementation of AO tables in Greenplum, where a dedicated smgr was
created due to its distinct file organization compared to heap tables.

I look forward to further discussions on this topic.

On Tue, Dec 10, 2024 at 11:25 AM Nitin Jadhav <nitinjadhavpostgres@gmail.com> wrote:

Hi,

I reviewed the discussion and took a look at the patch sets. It seems
like many things are combined here. Based on the subject, I initially
thought it aimed to provide the infrastructure to easily extend
storage managers. This would allow anyone to create their own storage
managers using this infrastructure. While it addresses this, it also
includes additional features like fsync_checker, which I believe
should be a separate feature. Even though it might use the same
infrastructure, it appears to be a different functionality. I think we
should focus solely on providing the infrastructure here.

We need to decide on our approach—whether to use a hook-based method
or a registration-based method—and I believe this requires further
discussion.

The hook-based approach is simple and works well for anyone writing
their own storage manager. However, it has its limitations: we can
either use the default storage manager or a custom-built one for the
whole workload, but we cannot choose between multiple storage managers.
On the other hand, the registration-based approach allows choosing
between multiple storage managers based on the workload, though it
requires a lot of changes.

Are we planning to support other storage managers in PostgreSQL in the
near future? If not, it is better to go with the hook-based approach.
Otherwise, the registration-based approach is preferable as it offers
more flexibility to users and enhances PostgreSQL’s functionality.

Could you please share your thoughts on this? Also, let me know if
this topic has already been discussed and if any conclusions were
reached.

Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft

On Sat, Jan 13, 2024 at 1:27 AM Tristan Partin <tristan@neon.tech> wrote:

Thought I would show off what is possible with this patchset.

Heikki, a couple of months ago in our internal Slack, said:

[I would like] a debugging tool that checks that we're not missing any
fsyncs. I bumped into a few missing fsync bugs with unlogged tables
lately and a tool like that would've been very helpful.

My task was to create such a tool, and off I went. I started with the
storage manager extension patch that Matthias sent to the list last
year[0].

Andres, in another thread[1], said:

I've been thinking that we need a validation layer for fsyncs, it's too hard
to get right without testing, and crash testing is not likely enough to catch
problems quickly / resource intensive.

My thought was that we could keep a shared hash table of all files created /
dirtied at the fd layer, with the filename as key and the value the current
LSN. We'd delete files from it when they're fsynced. At checkpoints we go
through the list and see if there's any files from before the redo that aren't
yet fsynced. All of this in assert builds only, of course.

I took this idea and ran with it. I call it the fsync_checker™️. It is an
extension that prints relations that haven't been fsynced prior to
a CHECKPOINT. Note that this idea doesn't work in practice because
relations might not be fsynced, but they might be WAL-logged, like in
the case of createdb. See log_smgrcreate(). I can't think of an easy way
to solve this problem looking at the codebase as it stands.

Here is a description of the patches:

0001:

This is essentially just the patch that Matthias posted earlier, but
rebased and adjusted a little bit so storage managers can "inherit" from
other storage managers.

0002:

This is an extension of 0001, which allows for extensions to set
a global storage manager. This is pretty hacky, and if it was going to
be pulled into mainline, it would need some better protection. For
instance, only one extension should be able to set the global storage
manager. We wouldn't want extensions stepping over each other, etc.

0003:

Adds a hook for extensions to inspect a checkpoint before it actually
occurs. The purpose for the fsync_checker is so that I can iterate over
all the relations the extension tracks to find files that haven't been
synced prior to the completion of the checkpoint.

0004:

This is the actual fsync_checker extension itself. It must be preloaded.

Hopefully this is a good illustration of how the initial patch could be
used, even though it isn't perfect.

[0]: /messages/by-id/CAEze2WgMySu2suO_TLvFyGY3URa4mAx22WeoEicnK=PCNWEMrA@mail.gmail.com
[1]: /messages/by-id/20220127182838.ba3434dp2pe5vcia@alap3.anarazel.de

--
Tristan Partin
Neon (https://neon.tech)

#13Andreas Karlsson
andreas@proxel.se
In reply to: Tristan Partin (#9)
6 attachment(s)
Re: Extensible storage manager API - SMGR hook Redux

Hi!

We at Percona are very interested in this patch for our transparent data
encryption extension. So we would love to collaborate with you, and
anyone else interested, on making the SMGR extensible.

I have attached rebased and slightly cleaned-up versions of Tristan's
patches plus a couple of patches we have been working on in-house
(mainly my colleague Zsolt). I also have some questions which I would
like to discuss.

0001-0004

The same patches as Tristan posted but rebased and cleaned up a bit to
better follow the code style. I also removed a couple of dead variables
which seemed like leftovers.

0005

Since we support having both encrypted and unencrypted relations we use
the RelFileLocator to look up if a relation is encrypted. And to
preserve that information when smgrcreate() creates a new relfile for a
relation we pass along the old RelFileLocator.

For our use case it is possible that we could solve this in other ways.
For example if we decide to go with configuring the SMGR per schema this
will probably not be necessary at all.

0006

The patch introduces the concept of "chaining" SMGRs, where we have a tail
(e.g. md or a theoretical Ceph SMGR) and a modifier (e.g. TDE or the
fsync_checker). Something like this would be useful for our case since
it would be nice to be able to use the same encryption code for md and
for some other potential replacement for md which uses some kind of
object storage, for example.

As a bonus this allowed us to make the functions implementing md static.

It is currently controlled via a GUC, smgr_chain, but this will of
course depend on how we decide to implement configuring which SMGR to use.
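
For readers unfamiliar with the pattern, here is a minimal sketch of what chaining could look like in plain C. It does not reflect the actual interface in 0006: demo_chain_smgr, its next pointer and the XOR "modifier" (a toy stand-in for encryption) are all hypothetical. The modifier transforms the buffer and then delegates to whatever comes next in the chain; the tail does the actual storing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct demo_chain_smgr demo_chain_smgr;

/* Hypothetical chained manager: one callback plus a link to the next manager. */
struct demo_chain_smgr
{
	const char *name;
	void		(*write) (const demo_chain_smgr *self,
						  unsigned char *buf, size_t len);
	const demo_chain_smgr *next;	/* NULL for the tail */
};

/* Stand-in for the storage that the tail writes to. */
unsigned char demo_disk[16];
size_t		demo_disk_len;

/* Tail: stores the bytes it is given. */
void
demo_tail_write(const demo_chain_smgr *self, unsigned char *buf, size_t len)
{
	(void) self;
	assert(len <= sizeof(demo_disk));
	memcpy(demo_disk, buf, len);
	demo_disk_len = len;
}

/* Modifier: transforms the buffer in place, then delegates down the chain. */
void
demo_xor_write(const demo_chain_smgr *self, unsigned char *buf, size_t len)
{
	assert(self->next != NULL);
	for (size_t i = 0; i < len; i++)
		buf[i] ^= 0x5A;
	self->next->write(self->next, buf, len);
}

const demo_chain_smgr demo_tail = {"tail", demo_tail_write, NULL};
const demo_chain_smgr demo_xor = {"xor", demo_xor_write, &demo_tail};
```

Because the modifier only knows about its next pointer, the same modifier could sit in front of a different tail without changes, which is the reuse argument made above. The extra indirect call per operation is also where any overhead would come from.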

Questions

- What is up with the barrier when loading SMGRs? That does not seem
necessary or am I missing something? I believe Andres also spotted this.

- How should we configure which SMGR to use for each relation? People
have talked about doing it per tablespace or using hooks and we have a
patch which uses a GUC for this. I have personally not researched these
options enough to have an opinion yet.

- Is our idea about chaining SMGRs useful? In its current form or some
variant inspired by it?

- We need to benchmark this to make sure we do not introduce too much
overhead, especially for people who just want to use md. I saw for
example that Andres had some complaint about extra indirection which we
may have to address.

Andreas

Attachments:

v3-0001-Expose-f_smgr-to-extensions-for-manual-implementa.patch (text/x-patch; charset=UTF-8)
From 01d41e8b407fa1126358af8d5f314fa9b7dd8f4d Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Tue, 27 Jun 2023 15:59:23 +0200
Subject: [PATCH v3 1/6] Expose f_smgr to extensions for manual implementation

There are various reasons why one would want to create their own
implementation of a storage manager, among which are block-level compression,
encryption and offloading to cold storage. This patch is a first patch that
allows extensions to register their own SMgr.

Note, however, that this SMgr is not yet used - only the first SMgr to register
is used, and this is currently the md.c smgr. Future commits will include
facilities to select an SMgr for each tablespace.
---
 src/backend/postmaster/postmaster.c |   5 +
 src/backend/storage/smgr/md.c       | 187 +++++++++++++++++++---------
 src/backend/storage/smgr/smgr.c     | 137 ++++++++++----------
 src/backend/utils/init/miscinit.c   |  13 ++
 src/include/miscadmin.h             |   1 +
 src/include/storage/md.h            |   4 +
 src/include/storage/smgr.h          |  59 +++++++--
 src/tools/pgindent/typedefs.list    |   1 +
 8 files changed, 266 insertions(+), 141 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index bb22b13adef..ddf4a011411 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -916,6 +916,11 @@ PostmasterMain(int argc, char *argv[])
 	 */
 	ApplyLauncherRegister();
 
+	/*
+	 * Register built-in managers that are not part of static arrays
+	 */
+	register_builtin_dynamic_managers();
+
 	/*
 	 * process any libraries that should be preloaded at postmaster start
 	 */
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 7bf0b45e2c3..6a4dd0eb4d8 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -84,6 +84,21 @@ typedef struct _MdfdVec
 } MdfdVec;
 
 static MemoryContext MdCxt;		/* context for all MdfdVec objects */
+SMgrId		MdSMgrId;
+
+typedef struct
+{
+	SMgrRelationData reln;		/* parent data */
+
+	/*
+	 * for md.c; per-fork arrays of the number of open segments
+	 * (md_num_open_segs) and the segments themselves (md_seg_fds).
+	 */
+	int			md_num_open_segs[MAX_FORKNUM + 1];
+	MdfdVec    *md_seg_fds[MAX_FORKNUM + 1];
+} MdSMgrRelationData;
+
+typedef MdSMgrRelationData *MdSMgrRelation;
 
 
 /* Populate a file tag describing an md.c segment file. */
@@ -110,26 +125,55 @@ static MemoryContext MdCxt;		/* context for all MdfdVec objects */
 #define EXTENSION_DONT_OPEN			(1 << 5)
 
 
+void
+mdsmgr_register(void)
+{
+	/* magnetic disk */
+	f_smgr		md_smgr = (f_smgr) {
+		.name = "md",
+		.smgr_init = mdinit,
+		.smgr_shutdown = NULL,
+		.smgr_open = mdopen,
+		.smgr_close = mdclose,
+		.smgr_create = mdcreate,
+		.smgr_exists = mdexists,
+		.smgr_unlink = mdunlink,
+		.smgr_extend = mdextend,
+		.smgr_zeroextend = mdzeroextend,
+		.smgr_prefetch = mdprefetch,
+		.smgr_maxcombine = mdmaxcombine,
+		.smgr_readv = mdreadv,
+		.smgr_writev = mdwritev,
+		.smgr_writeback = mdwriteback,
+		.smgr_nblocks = mdnblocks,
+		.smgr_truncate = mdtruncate,
+		.smgr_immedsync = mdimmedsync,
+		.smgr_registersync = mdregistersync,
+	};
+
+	MdSMgrId = smgr_register(&md_smgr, sizeof(MdSMgrRelationData));
+}
+
 /* local routines */
 static void mdunlinkfork(RelFileLocatorBackend rlocator, ForkNumber forknum,
 						 bool isRedo);
-static MdfdVec *mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior);
-static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum,
+static MdfdVec *mdopenfork(MdSMgrRelation reln, ForkNumber forknum, int behavior);
+static void register_dirty_segment(MdSMgrRelation reln, ForkNumber forknum,
 								   MdfdVec *seg);
 static void register_unlink_segment(RelFileLocatorBackend rlocator, ForkNumber forknum,
 									BlockNumber segno);
 static void register_forget_request(RelFileLocatorBackend rlocator, ForkNumber forknum,
 									BlockNumber segno);
-static void _fdvec_resize(SMgrRelation reln,
+static void _fdvec_resize(MdSMgrRelation reln,
 						  ForkNumber forknum,
 						  int nseg);
-static char *_mdfd_segpath(SMgrRelation reln, ForkNumber forknum,
+static char *_mdfd_segpath(MdSMgrRelation reln, ForkNumber forknum,
 						   BlockNumber segno);
-static MdfdVec *_mdfd_openseg(SMgrRelation reln, ForkNumber forknum,
+static MdfdVec *_mdfd_openseg(MdSMgrRelation reln, ForkNumber forknum,
 							  BlockNumber segno, int oflags);
-static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forknum,
+static MdfdVec *_mdfd_getseg(MdSMgrRelation reln, ForkNumber forknum,
 							 BlockNumber blkno, bool skipFsync, int behavior);
-static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
+static BlockNumber _mdnblocks(MdSMgrRelation reln, ForkNumber forknum,
 							  MdfdVec *seg);
 
 static inline int
@@ -162,6 +206,8 @@ mdinit(void)
 bool
 mdexists(SMgrRelation reln, ForkNumber forknum)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
+
 	/*
 	 * Close it first, to ensure that we notice if the fork has been unlinked
 	 * since we opened it.  As an optimization, we can skip that in recovery,
@@ -170,7 +216,7 @@ mdexists(SMgrRelation reln, ForkNumber forknum)
 	if (!InRecovery)
 		mdclose(reln, forknum);
 
-	return (mdopenfork(reln, forknum, EXTENSION_RETURN_NULL) != NULL);
+	return (mdopenfork(mdreln, forknum, EXTENSION_RETURN_NULL) != NULL);
 }
 
 /*
@@ -181,14 +227,15 @@ mdexists(SMgrRelation reln, ForkNumber forknum)
 void
 mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	MdfdVec    *mdfd;
 	char	   *path;
 	File		fd;
 
-	if (isRedo && reln->md_num_open_segs[forknum] > 0)
+	if (isRedo && mdreln->md_num_open_segs[forknum] > 0)
 		return;					/* created and opened already... */
 
-	Assert(reln->md_num_open_segs[forknum] == 0);
+	Assert(mdreln->md_num_open_segs[forknum] == 0);
 
 	/*
 	 * We may be using the target table space for the first time in this
@@ -225,13 +272,13 @@ mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
 
 	pfree(path);
 
-	_fdvec_resize(reln, forknum, 1);
-	mdfd = &reln->md_seg_fds[forknum][0];
+	_fdvec_resize(mdreln, forknum, 1);
+	mdfd = &mdreln->md_seg_fds[forknum][0];
 	mdfd->mdfd_vfd = fd;
 	mdfd->mdfd_segno = 0;
 
 	if (!SmgrIsTemp(reln))
-		register_dirty_segment(reln, forknum, mdfd);
+		register_dirty_segment(mdreln, forknum, mdfd);
 }
 
 /*
@@ -452,6 +499,7 @@ void
 mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		 const void *buffer, bool skipFsync)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	off_t		seekpos;
 	int			nbytes;
 	MdfdVec    *v;
@@ -478,7 +526,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 						relpath(reln->smgr_rlocator, forknum),
 						InvalidBlockNumber)));
 
-	v = _mdfd_getseg(reln, forknum, blocknum, skipFsync, EXTENSION_CREATE);
+	v = _mdfd_getseg(mdreln, forknum, blocknum, skipFsync, EXTENSION_CREATE);
 
 	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
@@ -502,9 +550,9 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	}
 
 	if (!skipFsync && !SmgrIsTemp(reln))
-		register_dirty_segment(reln, forknum, v);
+		register_dirty_segment(mdreln, forknum, v);
 
-	Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
+	Assert(_mdnblocks(mdreln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
 }
 
 /*
@@ -517,6 +565,7 @@ void
 mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 			 BlockNumber blocknum, int nblocks, bool skipFsync)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	MdfdVec    *v;
 	BlockNumber curblocknum = blocknum;
 	int			remblocks = nblocks;
@@ -551,7 +600,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 		else
 			numblocks = remblocks;
 
-		v = _mdfd_getseg(reln, forknum, curblocknum, skipFsync, EXTENSION_CREATE);
+		v = _mdfd_getseg(mdreln, forknum, curblocknum, skipFsync, EXTENSION_CREATE);
 
 		Assert(segstartblock < RELSEG_SIZE);
 		Assert(segstartblock + numblocks <= RELSEG_SIZE);
@@ -606,9 +655,9 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 		}
 
 		if (!skipFsync && !SmgrIsTemp(reln))
-			register_dirty_segment(reln, forknum, v);
+			register_dirty_segment(mdreln, forknum, v);
 
-		Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
+		Assert(_mdnblocks(mdreln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
 
 		remblocks -= numblocks;
 		curblocknum += numblocks;
@@ -626,7 +675,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
  * invent one out of whole cloth.
  */
 static MdfdVec *
-mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior)
+mdopenfork(MdSMgrRelation reln, ForkNumber forknum, int behavior)
 {
 	MdfdVec    *mdfd;
 	char	   *path;
@@ -636,7 +685,7 @@ mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior)
 	if (reln->md_num_open_segs[forknum] > 0)
 		return &reln->md_seg_fds[forknum][0];
 
-	path = relpath(reln->smgr_rlocator, forknum);
+	path = relpath(reln->reln.smgr_rlocator, forknum);
 
 	fd = PathNameOpenFile(path, _mdfd_open_flags());
 
@@ -671,9 +720,11 @@ mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior)
 void
 mdopen(SMgrRelation reln)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
+
 	/* mark it not open */
 	for (int forknum = 0; forknum <= MAX_FORKNUM; forknum++)
-		reln->md_num_open_segs[forknum] = 0;
+		mdreln->md_num_open_segs[forknum] = 0;
 }
 
 /*
@@ -682,7 +733,8 @@ mdopen(SMgrRelation reln)
 void
 mdclose(SMgrRelation reln, ForkNumber forknum)
 {
-	int			nopensegs = reln->md_num_open_segs[forknum];
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
+	int			nopensegs = mdreln->md_num_open_segs[forknum];
 
 	/* No work if already closed */
 	if (nopensegs == 0)
@@ -691,10 +743,10 @@ mdclose(SMgrRelation reln, ForkNumber forknum)
 	/* close segments starting from the end */
 	while (nopensegs > 0)
 	{
-		MdfdVec    *v = &reln->md_seg_fds[forknum][nopensegs - 1];
+		MdfdVec    *v = &mdreln->md_seg_fds[forknum][nopensegs - 1];
 
 		FileClose(v->mdfd_vfd);
-		_fdvec_resize(reln, forknum, nopensegs - 1);
+		_fdvec_resize(mdreln, forknum, nopensegs - 1);
 		nopensegs--;
 	}
 }
@@ -707,6 +759,7 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		   int nblocks)
 {
 #ifdef USE_PREFETCH
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 
 	Assert((io_direct_flags & IO_DIRECT_DATA) == 0);
 
@@ -719,7 +772,7 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		MdfdVec    *v;
 		int			nblocks_this_segment;
 
-		v = _mdfd_getseg(reln, forknum, blocknum, false,
+		v = _mdfd_getseg(mdreln, forknum, blocknum, false,
 						 InRecovery ? EXTENSION_RETURN_NULL : EXTENSION_FAIL);
 		if (v == NULL)
 			return false;
@@ -817,6 +870,8 @@ void
 mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		void **buffers, BlockNumber nblocks)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
+
 	while (nblocks > 0)
 	{
 		struct iovec iov[PG_IOV_MAX];
@@ -828,7 +883,7 @@ mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		size_t		transferred_this_segment;
 		size_t		size_this_segment;
 
-		v = _mdfd_getseg(reln, forknum, blocknum, false,
+		v = _mdfd_getseg(mdreln, forknum, blocknum, false,
 						 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
 
 		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
@@ -938,6 +993,8 @@ void
 mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		 const void **buffers, BlockNumber nblocks, bool skipFsync)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
+
 	/* This assert is too expensive to have on normally ... */
 #ifdef CHECK_WRITE_VS_EXTEND
 	Assert((uint64) blocknum + (uint64) nblocks <= (uint64) mdnblocks(reln, forknum));
@@ -954,7 +1011,7 @@ mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		size_t		transferred_this_segment;
 		size_t		size_this_segment;
 
-		v = _mdfd_getseg(reln, forknum, blocknum, skipFsync,
+		v = _mdfd_getseg(mdreln, forknum, blocknum, skipFsync,
 						 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
 
 		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
@@ -1024,7 +1081,7 @@ mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		}
 
 		if (!skipFsync && !SmgrIsTemp(reln))
-			register_dirty_segment(reln, forknum, v);
+			register_dirty_segment(mdreln, forknum, v);
 
 		nblocks -= nblocks_this_segment;
 		buffers += nblocks_this_segment;
@@ -1043,6 +1100,8 @@ void
 mdwriteback(SMgrRelation reln, ForkNumber forknum,
 			BlockNumber blocknum, BlockNumber nblocks)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
+
 	Assert((io_direct_flags & IO_DIRECT_DATA) == 0);
 
 	/*
@@ -1057,7 +1116,7 @@ mdwriteback(SMgrRelation reln, ForkNumber forknum,
 		int			segnum_start,
 					segnum_end;
 
-		v = _mdfd_getseg(reln, forknum, blocknum, true /* not used */ ,
+		v = _mdfd_getseg(mdreln, forknum, blocknum, true /* not used */ ,
 						 EXTENSION_DONT_OPEN);
 
 		/*
@@ -1101,14 +1160,15 @@ mdwriteback(SMgrRelation reln, ForkNumber forknum,
 BlockNumber
 mdnblocks(SMgrRelation reln, ForkNumber forknum)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	MdfdVec    *v;
 	BlockNumber nblocks;
 	BlockNumber segno;
 
-	mdopenfork(reln, forknum, EXTENSION_FAIL);
+	mdopenfork(mdreln, forknum, EXTENSION_FAIL);
 
 	/* mdopen has opened the first segment */
-	Assert(reln->md_num_open_segs[forknum] > 0);
+	Assert(mdreln->md_num_open_segs[forknum] > 0);
 
 	/*
 	 * Start from the last open segments, to avoid redundant seeks.  We have
@@ -1123,12 +1183,12 @@ mdnblocks(SMgrRelation reln, ForkNumber forknum)
 	 * that's OK because the checkpointer never needs to compute relation
 	 * size.)
 	 */
-	segno = reln->md_num_open_segs[forknum] - 1;
-	v = &reln->md_seg_fds[forknum][segno];
+	segno = mdreln->md_num_open_segs[forknum] - 1;
+	v = &mdreln->md_seg_fds[forknum][segno];
 
 	for (;;)
 	{
-		nblocks = _mdnblocks(reln, forknum, v);
+		nblocks = _mdnblocks(mdreln, forknum, v);
 		if (nblocks > ((BlockNumber) RELSEG_SIZE))
 			elog(FATAL, "segment too big");
 		if (nblocks < ((BlockNumber) RELSEG_SIZE))
@@ -1146,7 +1206,7 @@ mdnblocks(SMgrRelation reln, ForkNumber forknum)
 		 * undermines _mdfd_getseg's attempts to notice and report an error
 		 * upon access to a missing segment.
 		 */
-		v = _mdfd_openseg(reln, forknum, segno, 0);
+		v = _mdfd_openseg(mdreln, forknum, segno, 0);
 		if (v == NULL)
 			return segno * ((BlockNumber) RELSEG_SIZE);
 	}
@@ -1166,6 +1226,7 @@ void
 mdtruncate(SMgrRelation reln, ForkNumber forknum,
 		   BlockNumber curnblk, BlockNumber nblocks)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	BlockNumber priorblocks;
 	int			curopensegs;
 
@@ -1186,14 +1247,14 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum,
 	 * Truncate segments, starting at the last one. Starting at the end makes
 	 * managing the memory for the fd array easier, should there be errors.
 	 */
-	curopensegs = reln->md_num_open_segs[forknum];
+	curopensegs = mdreln->md_num_open_segs[forknum];
 	while (curopensegs > 0)
 	{
 		MdfdVec    *v;
 
 		priorblocks = (curopensegs - 1) * RELSEG_SIZE;
 
-		v = &reln->md_seg_fds[forknum][curopensegs - 1];
+		v = &mdreln->md_seg_fds[forknum][curopensegs - 1];
 
 		if (priorblocks > nblocks)
 		{
@@ -1208,13 +1269,13 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum,
 								FilePathName(v->mdfd_vfd))));
 
 			if (!SmgrIsTemp(reln))
-				register_dirty_segment(reln, forknum, v);
+				register_dirty_segment(mdreln, forknum, v);
 
 			/* we never drop the 1st segment */
-			Assert(v != &reln->md_seg_fds[forknum][0]);
+			Assert(v != &mdreln->md_seg_fds[forknum][0]);
 
 			FileClose(v->mdfd_vfd);
-			_fdvec_resize(reln, forknum, curopensegs - 1);
+			_fdvec_resize(mdreln, forknum, curopensegs - 1);
 		}
 		else if (priorblocks + ((BlockNumber) RELSEG_SIZE) > nblocks)
 		{
@@ -1234,7 +1295,7 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum,
 								FilePathName(v->mdfd_vfd),
 								nblocks)));
 			if (!SmgrIsTemp(reln))
-				register_dirty_segment(reln, forknum, v);
+				register_dirty_segment(mdreln, forknum, v);
 		}
 		else
 		{
@@ -1254,6 +1315,7 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum,
 void
 mdregistersync(SMgrRelation reln, ForkNumber forknum)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	int			segno;
 	int			min_inactive_seg;
 
@@ -1263,7 +1325,7 @@ mdregistersync(SMgrRelation reln, ForkNumber forknum)
 	 */
 	mdnblocks(reln, forknum);
 
-	min_inactive_seg = segno = reln->md_num_open_segs[forknum];
+	min_inactive_seg = segno = mdreln->md_num_open_segs[forknum];
 
 	/*
 	 * Temporarily open inactive segments, then close them after sync.  There
@@ -1271,20 +1333,20 @@ mdregistersync(SMgrRelation reln, ForkNumber forknum)
 	 * harmless.  We don't bother to clean them up and take a risk of further
 	 * trouble.  The next mdclose() will soon close them.
 	 */
-	while (_mdfd_openseg(reln, forknum, segno, 0) != NULL)
+	while (_mdfd_openseg(mdreln, forknum, segno, 0) != NULL)
 		segno++;
 
 	while (segno > 0)
 	{
-		MdfdVec    *v = &reln->md_seg_fds[forknum][segno - 1];
+		MdfdVec    *v = &mdreln->md_seg_fds[forknum][segno - 1];
 
-		register_dirty_segment(reln, forknum, v);
+		register_dirty_segment(mdreln, forknum, v);
 
 		/* Close inactive segments immediately */
 		if (segno > min_inactive_seg)
 		{
 			FileClose(v->mdfd_vfd);
-			_fdvec_resize(reln, forknum, segno - 1);
+			_fdvec_resize(mdreln, forknum, segno - 1);
 		}
 
 		segno--;
@@ -1305,6 +1367,7 @@ mdregistersync(SMgrRelation reln, ForkNumber forknum)
 void
 mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	int			segno;
 	int			min_inactive_seg;
 
@@ -1314,7 +1377,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 	 */
 	mdnblocks(reln, forknum);
 
-	min_inactive_seg = segno = reln->md_num_open_segs[forknum];
+	min_inactive_seg = segno = mdreln->md_num_open_segs[forknum];
 
 	/*
 	 * Temporarily open inactive segments, then close them after sync.  There
@@ -1322,12 +1385,12 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 	 * is harmless.  We don't bother to clean them up and take a risk of
 	 * further trouble.  The next mdclose() will soon close them.
 	 */
-	while (_mdfd_openseg(reln, forknum, segno, 0) != NULL)
+	while (_mdfd_openseg(mdreln, forknum, segno, 0) != NULL)
 		segno++;
 
 	while (segno > 0)
 	{
-		MdfdVec    *v = &reln->md_seg_fds[forknum][segno - 1];
+		MdfdVec    *v = &mdreln->md_seg_fds[forknum][segno - 1];
 
 		/*
 		 * fsyncs done through mdimmedsync() should be tracked in a separate
@@ -1348,7 +1411,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 		if (segno > min_inactive_seg)
 		{
 			FileClose(v->mdfd_vfd);
-			_fdvec_resize(reln, forknum, segno - 1);
+			_fdvec_resize(mdreln, forknum, segno - 1);
 		}
 
 		segno--;
@@ -1365,14 +1428,14 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
  * enough to be a performance problem).
  */
 static void
-register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
+register_dirty_segment(MdSMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 {
 	FileTag		tag;
 
-	INIT_MD_FILETAG(tag, reln->smgr_rlocator.locator, forknum, seg->mdfd_segno);
+	INIT_MD_FILETAG(tag, reln->reln.smgr_rlocator.locator, forknum, seg->mdfd_segno);
 
 	/* Temp relations should never be fsync'd */
-	Assert(!SmgrIsTemp(reln));
+	Assert(!SmgrIsTemp(&reln->reln));
 
 	if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
 	{
@@ -1490,7 +1553,7 @@ DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo)
  * _fdvec_resize() -- Resize the fork's open segments array
  */
 static void
-_fdvec_resize(SMgrRelation reln,
+_fdvec_resize(MdSMgrRelation reln,
 			  ForkNumber forknum,
 			  int nseg)
 {
@@ -1538,12 +1601,12 @@ _fdvec_resize(SMgrRelation reln,
  * returned string is palloc'd.
  */
 static char *
-_mdfd_segpath(SMgrRelation reln, ForkNumber forknum, BlockNumber segno)
+_mdfd_segpath(MdSMgrRelation reln, ForkNumber forknum, BlockNumber segno)
 {
 	char	   *path,
 			   *fullpath;
 
-	path = relpath(reln->smgr_rlocator, forknum);
+	path = relpath(reln->reln.smgr_rlocator, forknum);
 
 	if (segno > 0)
 	{
@@ -1561,7 +1624,7 @@ _mdfd_segpath(SMgrRelation reln, ForkNumber forknum, BlockNumber segno)
  * and make a MdfdVec object for it.  Returns NULL on failure.
  */
 static MdfdVec *
-_mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
+_mdfd_openseg(MdSMgrRelation reln, ForkNumber forknum, BlockNumber segno,
 			  int oflags)
 {
 	MdfdVec    *v;
@@ -1606,7 +1669,7 @@ _mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
  * EXTENSION_CREATE case.
  */
 static MdfdVec *
-_mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
+_mdfd_getseg(MdSMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 			 bool skipFsync, int behavior)
 {
 	MdfdVec    *v;
@@ -1680,7 +1743,7 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 				char	   *zerobuf = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE,
 													 MCXT_ALLOC_ZERO);
 
-				mdextend(reln, forknum,
+				mdextend((SMgrRelation) reln, forknum,
 						 nextsegno * ((BlockNumber) RELSEG_SIZE) - 1,
 						 zerobuf, skipFsync);
 				pfree(zerobuf);
@@ -1735,7 +1798,7 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
  * Get number of blocks present in a single disk file
  */
 static BlockNumber
-_mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
+_mdnblocks(MdSMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 {
 	off_t		len;
 
@@ -1758,7 +1821,7 @@ _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 int
 mdsyncfiletag(const FileTag *ftag, char *path)
 {
-	SMgrRelation reln = smgropen(ftag->rlocator, INVALID_PROC_NUMBER);
+	MdSMgrRelation reln = (MdSMgrRelation) smgropen(ftag->rlocator, INVALID_PROC_NUMBER);
 	File		file;
 	instr_time	io_start;
 	bool		need_to_close;
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index ebe35c04de5..7635c231ea0 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -53,84 +53,21 @@
 
 #include "access/xlogutils.h"
 #include "lib/ilist.h"
+#include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/ipc.h"
 #include "storage/md.h"
 #include "storage/smgr.h"
+#include "port/atomics.h"
 #include "utils/hsearch.h"
 #include "utils/inval.h"
+#include "utils/memutils.h"
 
+static f_smgr *smgrsw;
 
-/*
- * This struct of function pointers defines the API between smgr.c and
- * any individual storage manager module.  Note that smgr subfunctions are
- * generally expected to report problems via elog(ERROR).  An exception is
- * that smgr_unlink should use elog(WARNING), rather than erroring out,
- * because we normally unlink relations during post-commit/abort cleanup,
- * and so it's too late to raise an error.  Also, various conditions that
- * would normally be errors should be allowed during bootstrap and/or WAL
- * recovery --- see comments in md.c for details.
- */
-typedef struct f_smgr
-{
-	void		(*smgr_init) (void);	/* may be NULL */
-	void		(*smgr_shutdown) (void);	/* may be NULL */
-	void		(*smgr_open) (SMgrRelation reln);
-	void		(*smgr_close) (SMgrRelation reln, ForkNumber forknum);
-	void		(*smgr_create) (SMgrRelation reln, ForkNumber forknum,
-								bool isRedo);
-	bool		(*smgr_exists) (SMgrRelation reln, ForkNumber forknum);
-	void		(*smgr_unlink) (RelFileLocatorBackend rlocator, ForkNumber forknum,
-								bool isRedo);
-	void		(*smgr_extend) (SMgrRelation reln, ForkNumber forknum,
-								BlockNumber blocknum, const void *buffer, bool skipFsync);
-	void		(*smgr_zeroextend) (SMgrRelation reln, ForkNumber forknum,
-									BlockNumber blocknum, int nblocks, bool skipFsync);
-	bool		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
-								  BlockNumber blocknum, int nblocks);
-	uint32		(*smgr_maxcombine) (SMgrRelation reln, ForkNumber forknum,
-									BlockNumber blocknum);
-	void		(*smgr_readv) (SMgrRelation reln, ForkNumber forknum,
-							   BlockNumber blocknum,
-							   void **buffers, BlockNumber nblocks);
-	void		(*smgr_writev) (SMgrRelation reln, ForkNumber forknum,
-								BlockNumber blocknum,
-								const void **buffers, BlockNumber nblocks,
-								bool skipFsync);
-	void		(*smgr_writeback) (SMgrRelation reln, ForkNumber forknum,
-								   BlockNumber blocknum, BlockNumber nblocks);
-	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
-	void		(*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
-								  BlockNumber old_blocks, BlockNumber nblocks);
-	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
-	void		(*smgr_registersync) (SMgrRelation reln, ForkNumber forknum);
-} f_smgr;
-
-static const f_smgr smgrsw[] = {
-	/* magnetic disk */
-	{
-		.smgr_init = mdinit,
-		.smgr_shutdown = NULL,
-		.smgr_open = mdopen,
-		.smgr_close = mdclose,
-		.smgr_create = mdcreate,
-		.smgr_exists = mdexists,
-		.smgr_unlink = mdunlink,
-		.smgr_extend = mdextend,
-		.smgr_zeroextend = mdzeroextend,
-		.smgr_prefetch = mdprefetch,
-		.smgr_maxcombine = mdmaxcombine,
-		.smgr_readv = mdreadv,
-		.smgr_writev = mdwritev,
-		.smgr_writeback = mdwriteback,
-		.smgr_nblocks = mdnblocks,
-		.smgr_truncate = mdtruncate,
-		.smgr_immedsync = mdimmedsync,
-		.smgr_registersync = mdregistersync,
-	}
-};
+static int	NSmgr = 0;
 
-static const int NSmgr = lengthof(smgrsw);
+static Size LargestSMgrRelationSize = 0;
 
 /*
  * Each backend has a hashtable that stores all extant SMgrRelation objects.
@@ -144,6 +81,60 @@ static dlist_head unpinned_relns;
 static void smgrshutdown(int code, Datum arg);
 static void smgrdestroy(SMgrRelation reln);
 
+#define MaxSMgrId UINT8_MAX
+
+SMgrId
+smgr_register(const f_smgr *smgr, Size smgrrelation_size)
+{
+	SMgrId		my_id;
+	MemoryContext old;
+
+	if (process_shared_preload_libraries_done)
+		elog(FATAL, "SMgrs must be registered in the shared_preload_libraries phase");
+	if (NSmgr == MaxSMgrId)
+		elog(FATAL, "Too many smgrs registered");
+	if (smgr->name == NULL || *smgr->name == 0)
+		elog(FATAL, "smgr registered with invalid name");
+
+	Assert(smgr->smgr_open != NULL);
+	Assert(smgr->smgr_close != NULL);
+	Assert(smgr->smgr_create != NULL);
+	Assert(smgr->smgr_exists != NULL);
+	Assert(smgr->smgr_unlink != NULL);
+	Assert(smgr->smgr_extend != NULL);
+	Assert(smgr->smgr_zeroextend != NULL);
+	Assert(smgr->smgr_prefetch != NULL);
+	Assert(smgr->smgr_readv != NULL);
+	Assert(smgr->smgr_writev != NULL);
+	Assert(smgr->smgr_writeback != NULL);
+	Assert(smgr->smgr_nblocks != NULL);
+	Assert(smgr->smgr_truncate != NULL);
+	Assert(smgr->smgr_immedsync != NULL);
+
+	old = MemoryContextSwitchTo(TopMemoryContext);
+
+	my_id = NSmgr++;
+	if (my_id == 0)
+		smgrsw = palloc_array(f_smgr, 1);
+	else
+		smgrsw = repalloc_array(smgrsw, f_smgr, NSmgr);
+
+	MemoryContextSwitchTo(old);
+
+	pg_compiler_barrier();
+
+	if (!smgrsw)
+	{
+		NSmgr--;
+		elog(FATAL, "Failed to extend smgr array");
+	}
+
+	smgrsw[my_id] = *smgr;
+
+	LargestSMgrRelationSize = Max(LargestSMgrRelationSize, smgrrelation_size);
+
+	return my_id;
+}
 
 /*
  * smgrinit(), smgrshutdown() -- Initialize or shut down storage
@@ -211,8 +202,11 @@ smgropen(RelFileLocator rlocator, ProcNumber backend)
 		/* First time through: initialize the hash table */
 		HASHCTL		ctl;
 
+		LargestSMgrRelationSize = MAXALIGN(LargestSMgrRelationSize);
+		Assert(NSmgr > 0);
+
 		ctl.keysize = sizeof(RelFileLocatorBackend);
-		ctl.entrysize = sizeof(SMgrRelationData);
+		ctl.entrysize = LargestSMgrRelationSize;
 		SMgrRelationHash = hash_create("smgr relation table", 400,
 									   &ctl, HASH_ELEM | HASH_BLOBS);
 		dlist_init(&unpinned_relns);
@@ -232,7 +226,8 @@ smgropen(RelFileLocator rlocator, ProcNumber backend)
 		reln->smgr_targblock = InvalidBlockNumber;
 		for (int i = 0; i <= MAX_FORKNUM; ++i)
 			reln->smgr_cached_nblocks[i] = InvalidBlockNumber;
-		reln->smgr_which = 0;	/* we only have md.c at present */
+
+		reln->smgr_which = MdSMgrId;	/* we only have md.c at present */
 
 		/* implementation-specific initialization */
 		smgrsw[reln->smgr_which].smgr_open(reln);
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 0347fc11092..c325d23e132 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -43,6 +43,7 @@
 #include "replication/slotsync.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
+#include "storage/md.h"
 #include "storage/latch.h"
 #include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
@@ -192,6 +193,9 @@ InitStandaloneProcess(const char *argv0)
 	InitProcessLocalLatch();
 	InitializeLatchWaitSet();
 
+	/* Initialize smgrs */
+	register_builtin_dynamic_managers();
+
 	/*
 	 * For consistency with InitPostmasterChild, initialize signal mask here.
 	 * But we don't unblock SIGQUIT or provide a default handler for it.
@@ -1920,6 +1924,15 @@ process_session_preload_libraries(void)
 				   true);
 }
 
+/*
+ * Register any internal managers.
+ */
+void
+register_builtin_dynamic_managers(void)
+{
+	mdsmgr_register();
+}
+
 /*
  * process any shared memory requests from preloaded libraries
  */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index a2b63495eec..ff4ef578a1f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -513,6 +513,7 @@ extern void TouchSocketLockFiles(void);
 extern void AddToDataDirLockFile(int target_line, const char *str);
 extern bool RecheckDataDirLockFile(void);
 extern void ValidatePgVersion(const char *path);
+extern void register_builtin_dynamic_managers(void);
 extern void process_shared_preload_libraries(void);
 extern void process_session_preload_libraries(void);
 extern void process_shmem_requests(void);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 05bf537066e..da1d1d339be 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -19,6 +19,10 @@
 #include "storage/smgr.h"
 #include "storage/sync.h"
 
+/* registration function for md storage manager */
+extern void mdsmgr_register(void);
+extern SMgrId MdSMgrId;
+
 /* md storage manager functionality */
 extern void mdinit(void);
 extern void mdopen(SMgrRelation reln);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 4016b206ad6..52f74f917b2 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,8 @@
 #include "storage/block.h"
 #include "storage/relfilelocator.h"
 
+typedef uint8 SMgrId;
+
 /*
  * smgr.c maintains a table of SMgrRelation objects, which are essentially
  * cached file handles.  An SMgrRelation is created (if not already present)
@@ -51,14 +53,7 @@ typedef struct SMgrRelationData
 	 * Fields below here are intended to be private to smgr.c and its
 	 * submodules.  Do not touch them from elsewhere.
 	 */
-	int			smgr_which;		/* storage manager selector */
-
-	/*
-	 * for md.c; per-fork arrays of the number of open segments
-	 * (md_num_open_segs) and the segments themselves (md_seg_fds).
-	 */
-	int			md_num_open_segs[MAX_FORKNUM + 1];
-	struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];
+	SMgrId		smgr_which;		/* storage manager selector */
 
 	/*
 	 * Pinning support.  If unpinned (ie. pincount == 0), 'node' is a list
@@ -73,6 +68,54 @@ typedef SMgrRelationData *SMgrRelation;
 #define SmgrIsTemp(smgr) \
 	RelFileLocatorBackendIsTemp((smgr)->smgr_rlocator)
 
+/*
+ * This struct of function pointers defines the API between smgr.c and
+ * any individual storage manager module.  Note that smgr subfunctions are
+ * generally expected to report problems via elog(ERROR).  An exception is
+ * that smgr_unlink should use elog(WARNING), rather than erroring out,
+ * because we normally unlink relations during post-commit/abort cleanup,
+ * and so it's too late to raise an error.  Also, various conditions that
+ * would normally be errors should be allowed during bootstrap and/or WAL
+ * recovery --- see comments in md.c for details.
+ */
+typedef struct f_smgr
+{
+	const char *name;
+	void		(*smgr_init) (void);	/* may be NULL */
+	void		(*smgr_shutdown) (void);	/* may be NULL */
+	void		(*smgr_open) (SMgrRelation reln);
+	void		(*smgr_close) (SMgrRelation reln, ForkNumber forknum);
+	void		(*smgr_create) (SMgrRelation reln, ForkNumber forknum,
+								bool isRedo);
+	bool		(*smgr_exists) (SMgrRelation reln, ForkNumber forknum);
+	void		(*smgr_unlink) (RelFileLocatorBackend rlocator, ForkNumber forknum,
+								bool isRedo);
+	void		(*smgr_extend) (SMgrRelation reln, ForkNumber forknum,
+								BlockNumber blocknum, const void *buffer, bool skipFsync);
+	void		(*smgr_zeroextend) (SMgrRelation reln, ForkNumber forknum,
+									BlockNumber blocknum, int nblocks, bool skipFsync);
+	bool		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
+								  BlockNumber blocknum, int nblocks);
+	uint32		(*smgr_maxcombine) (SMgrRelation reln, ForkNumber forknum,
+									BlockNumber blocknum);
+	void		(*smgr_readv) (SMgrRelation reln, ForkNumber forknum,
+							   BlockNumber blocknum,
+							   void **buffers, BlockNumber nblocks);
+	void		(*smgr_writev) (SMgrRelation reln, ForkNumber forknum,
+								BlockNumber blocknum,
+								const void **buffers, BlockNumber nblocks,
+								bool skipFsync);
+	void		(*smgr_writeback) (SMgrRelation reln, ForkNumber forknum,
+								   BlockNumber blocknum, BlockNumber nblocks);
+	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
+	void		(*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
+								  BlockNumber old_blocks, BlockNumber nblocks);
+	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+	void		(*smgr_registersync) (SMgrRelation reln, ForkNumber forknum);
+} f_smgr;
+
+extern SMgrId smgr_register(const f_smgr *smgr, Size smgrrelation_size);
+
 extern void smgrinit(void);
 extern SMgrRelation smgropen(RelFileLocator rlocator, ProcNumber backend);
 extern bool smgrexists(SMgrRelation reln, ForkNumber forknum);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a2644a2e653..ecf23f3a933 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1618,6 +1618,7 @@ ManyTestResourceKind
 Material
 MaterialPath
 MaterialState
+MdSMgrRelationData
 MdfdVec
 Memoize
 MemoizeEntry
-- 
2.47.2
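
For readers skimming the 0001 diff, the two ideas it combines can be sketched outside the tree: a registry that grows as managers register, and a manager-private relation struct that embeds the generic one as its first member, so that the (MdSMgrRelation) reln cast is valid. This is only a toy model under stated assumptions. Every Toy*/toy_* name is hypothetical, realloc stands in for repalloc_array in TopMemoryContext, and a -1 return stands in for elog(FATAL).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy stand-ins for SMgrId / SMgrRelationData / f_smgr; NOT the real types. */
typedef uint8_t ToySMgrId;
#define TOY_MAX_SMGR_ID UINT8_MAX
#define TOY_BLOCKS_PER_SEGMENT 4

typedef struct ToyRelation
{
	ToySMgrId	which;			/* index into the registry, like smgr_which */
} ToyRelation;

typedef struct ToySmgr
{
	const char *name;
	int			(*nblocks) (ToyRelation *reln);
} ToySmgr;

static ToySmgr *toy_smgrsw = NULL;	/* grows by one slot per registration */
static int	toy_nsmgr = 0;
static size_t toy_largest_relation_size = 0;

/* Register a manager; returns its id, or -1 where the patch would elog(FATAL). */
static int
toy_smgr_register(const ToySmgr *smgr, size_t relation_size)
{
	ToySmgr    *grown;

	if (smgr->name == NULL || smgr->name[0] == '\0')
		return -1;
	if (toy_nsmgr == TOY_MAX_SMGR_ID)
		return -1;

	grown = realloc(toy_smgrsw, sizeof(ToySmgr) * (toy_nsmgr + 1));
	if (grown == NULL)
		return -1;
	toy_smgrsw = grown;
	toy_smgrsw[toy_nsmgr] = *smgr;

	/* Cache entries must be large enough for the biggest private struct. */
	if (relation_size > toy_largest_relation_size)
		toy_largest_relation_size = relation_size;

	return toy_nsmgr++;
}

/* Manager-private relation: the generic part must be the first member. */
typedef struct ToyMdRelation
{
	ToyRelation reln;
	int			open_segments;
} ToyMdRelation;

static int
toy_md_nblocks(ToyRelation *reln)
{
	/* Same downcast the patch performs at the top of each md* function. */
	ToyMdRelation *mdreln = (ToyMdRelation *) reln;

	return mdreln->open_segments * TOY_BLOCKS_PER_SEGMENT;
}

/* Generic dispatch through the registry, as smgr.c does via smgrsw[]. */
static int
toy_dispatch_nblocks(ToyRelation *reln)
{
	assert(reln->which < toy_nsmgr);
	return toy_smgrsw[reln->which].nblocks(reln);
}
```

The sizing step is the part worth noticing: because every cache entry is allocated at the largest registered size, any manager can downcast the generic pointer it receives without a separate allocation.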

Attachment: v3-0002-Allow-extensions-to-override-the-global-storage-m.patch (text/x-patch; charset=UTF-8)
From 111a5ff0b20ad5810f21370be6e5ea64283b4703 Mon Sep 17 00:00:00 2001
From: Tristan Partin <tristan@neon.tech>
Date: Fri, 13 Oct 2023 14:00:44 -0500
Subject: [PATCH v3 2/6] Allow extensions to override the global storage
 manager

---
 src/backend/storage/smgr/smgr.c   | 4 +++-
 src/backend/utils/init/miscinit.c | 2 ++
 src/include/storage/smgr.h        | 2 ++
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 7635c231ea0..9b3e63aff55 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -69,6 +69,8 @@ static int	NSmgr = 0;
 
 static Size LargestSMgrRelationSize = 0;
 
+SMgrId		storage_manager_id;
+
 /*
  * Each backend has a hashtable that stores all extant SMgrRelation objects.
  * In addition, "unpinned" SMgrRelation objects are chained together in a list.
@@ -227,7 +229,7 @@ smgropen(RelFileLocator rlocator, ProcNumber backend)
 		for (int i = 0; i <= MAX_FORKNUM; ++i)
 			reln->smgr_cached_nblocks[i] = InvalidBlockNumber;
 
-		reln->smgr_which = MdSMgrId;	/* we only have md.c at present */
+		reln->smgr_which = storage_manager_id;
 
 		/* implementation-specific initialization */
 		smgrsw[reln->smgr_which].smgr_open(reln);
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index c325d23e132..184ec632dbe 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -1931,6 +1931,8 @@ void
 register_builtin_dynamic_managers(void)
 {
 	mdsmgr_register();
+
+	storage_manager_id = MdSMgrId;
 }
 
 /*
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 52f74f917b2..629c78cfdde 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -20,6 +20,8 @@
 
 typedef uint8 SMgrId;
 
+extern PGDLLIMPORT SMgrId storage_manager_id;
+
 /*
  * smgr.c maintains a table of SMgrRelation objects, which are essentially
  * cached file handles.  An SMgrRelation is created (if not already present)
-- 
2.47.2
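
One consequence of 0002 that may be worth spelling out: since there is a single global storage_manager_id, whichever module assigns it last decides which manager newly opened relations use. A toy sketch of that behaviour follows, with hypothetical toy_* names and no PostgreSQL headers; it only models the assignment order, not the real smgropen().

```c
#include <stdint.h>

typedef uint8_t ToySMgrId;

/* Stand-in for the global storage_manager_id added by the patch. */
static ToySMgrId toy_storage_manager_id = 0;
static ToySMgrId toy_next_id = 0;

/* Stand-in for smgr_register(): hands out consecutive ids. */
static ToySMgrId
toy_register(void)
{
	return toy_next_id++;
}

/* Models register_builtin_dynamic_managers(): the built-in manager is the default. */
static void
toy_builtin_init(void)
{
	toy_storage_manager_id = toy_register();
}

/* Models an extension's _PG_init() that overrides the global manager. */
static void
toy_extension_init(void)
{
	toy_storage_manager_id = toy_register();
}

/* Models the assignment in smgropen(): new relations get the current global id. */
static ToySMgrId
toy_open_relation(void)
{
	return toy_storage_manager_id;
}
```

If two preloaded libraries both override the id, the one loaded later wins and the earlier one is silently bypassed for relations opened afterwards.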

Attachment: v3-0003-Add-checkpoint_create_hook.patch (text/x-patch; charset=UTF-8)
From 1c8597e1172d89044ee58a678b70017f6328f069 Mon Sep 17 00:00:00 2001
From: Tristan Partin <tristan@neon.tech>
Date: Fri, 13 Oct 2023 13:57:18 -0500
Subject: [PATCH v3 3/6] Add checkpoint_create_hook

Allows an extension to hook into CreateCheckPoint().
---
 src/backend/access/transam/xlog.c | 5 +++++
 src/include/access/xlog.h         | 4 ++++
 2 files changed, 9 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bf3dbda901d..2fec7064636 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -208,6 +208,8 @@ const struct config_enum_entry archive_mode_options[] = {
  */
 CheckpointStatsData CheckpointStats;
 
+checkpoint_create_hook_type checkpoint_create_hook = NULL;
+
 /*
  * During recovery, lastFullPageWrites keeps track of full_page_writes that
  * the replayed WAL records indicate. It's initialized with full_page_writes
@@ -7127,6 +7129,9 @@ CreateCheckPoint(int flags)
 	 */
 	END_CRIT_SECTION();
 
+	if (checkpoint_create_hook != NULL)
+		checkpoint_create_hook(&checkPoint);
+
 	/*
 	 * In some cases there are groups of actions that must all occur on one
 	 * side or the other of a checkpoint record. Before flushing the
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 4411c1468ac..446bbb97053 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -13,6 +13,7 @@
 
 #include "access/xlogbackup.h"
 #include "access/xlogdefs.h"
+#include "catalog/pg_control.h"
 #include "datatype/timestamp.h"
 #include "lib/stringinfo.h"
 #include "nodes/pg_list.h"
@@ -59,6 +60,9 @@ extern PGDLLIMPORT int wal_decode_buffer_size;
 
 extern PGDLLIMPORT int CheckPointSegments;
 
+typedef void (*checkpoint_create_hook_type) (const CheckPoint *);
+extern PGDLLIMPORT checkpoint_create_hook_type checkpoint_create_hook;
+
 /* Archive modes */
 typedef enum ArchiveMode
 {
-- 
2.47.2
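
The hook in 0003 follows the usual save-previous-then-install convention, which fsync_checker also uses in its _PG_init(). A standalone sketch of that chaining is below, assuming two hypothetical extensions; ToyCheckPoint and all toy_*/ext_* names are illustrative stand-ins, not the real CheckPoint struct or any existing module.

```c
#include <stddef.h>

/* Toy stand-in for CheckPoint; only the redo pointer is modelled. */
typedef struct ToyCheckPoint
{
	unsigned long redo;
} ToyCheckPoint;

typedef void (*toy_checkpoint_hook_type) (const ToyCheckPoint *);

/* Stand-in for checkpoint_create_hook: NULL means no extension is interested. */
static toy_checkpoint_hook_type toy_checkpoint_hook = NULL;

/* Models the call site the patch adds in CreateCheckPoint(). */
static void
toy_create_checkpoint(unsigned long redo)
{
	ToyCheckPoint cp = {redo};

	if (toy_checkpoint_hook != NULL)
		toy_checkpoint_hook(&cp);
}

/* Hypothetical extension A: counts checkpoints. */
static toy_checkpoint_hook_type ext_a_prev = NULL;
static int	ext_a_calls = 0;

static void
ext_a_hook(const ToyCheckPoint *cp)
{
	if (ext_a_prev != NULL)
		ext_a_prev(cp);			/* keep earlier hooks working */
	ext_a_calls++;
}

static void
ext_a_init(void)
{
	ext_a_prev = toy_checkpoint_hook;
	toy_checkpoint_hook = ext_a_hook;
}

/* Hypothetical extension B: remembers the last redo pointer it saw. */
static toy_checkpoint_hook_type ext_b_prev = NULL;
static unsigned long ext_b_last_redo = 0;

static void
ext_b_hook(const ToyCheckPoint *cp)
{
	if (ext_b_prev != NULL)
		ext_b_prev(cp);
	ext_b_last_redo = cp->redo;
}

static void
ext_b_init(void)
{
	ext_b_prev = toy_checkpoint_hook;
	toy_checkpoint_hook = ext_b_hook;
}
```

Each extension calls the hook it displaced, so both observe every checkpoint regardless of load order.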

Attachment: v3-0004-Add-contrib-fsync_checker.patch (text/x-patch; charset=UTF-8)
From 0a5c7e739252abf3ae37ad090e4ec51f589fa3e9 Mon Sep 17 00:00:00 2001
From: Tristan Partin <tristan@neon.tech>
Date: Wed, 20 Sep 2023 14:23:38 -0500
Subject: [PATCH v3 4/6] Add contrib/fsync_checker

fsync_checker is an extension which overrides the global storage manager
to check for volatile relations, those which have been written but not
synced to disk.
---
 contrib/Makefile                            |   1 +
 contrib/fsync_checker/fsync_checker.control |   5 +
 contrib/fsync_checker/fsync_checker_smgr.c  | 251 ++++++++++++++++++++
 contrib/fsync_checker/meson.build           |  22 ++
 contrib/meson.build                         |   1 +
 src/tools/pgindent/typedefs.list            |   2 +
 6 files changed, 282 insertions(+)
 create mode 100644 contrib/fsync_checker/fsync_checker.control
 create mode 100644 contrib/fsync_checker/fsync_checker_smgr.c
 create mode 100644 contrib/fsync_checker/meson.build

diff --git a/contrib/Makefile b/contrib/Makefile
index 952855d9b61..1c9f22b1c86 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
 		dict_int	\
 		dict_xsyn	\
 		earthdistance	\
 		file_fdw	\
+		fsync_checker	\
 		fuzzystrmatch	\
 		hstore		\
diff --git a/contrib/fsync_checker/fsync_checker.control b/contrib/fsync_checker/fsync_checker.control
new file mode 100644
index 00000000000..7d0e36434bf
--- /dev/null
+++ b/contrib/fsync_checker/fsync_checker.control
@@ -0,0 +1,5 @@
+# fsync_checker extension
+comment = 'SMGR extension for checking volatile writes'
+default_version = '1.0'
+module_pathname = '$libdir/fsync_checker'
+relocatable = true
diff --git a/contrib/fsync_checker/fsync_checker_smgr.c b/contrib/fsync_checker/fsync_checker_smgr.c
new file mode 100644
index 00000000000..9dcee8bbd74
--- /dev/null
+++ b/contrib/fsync_checker/fsync_checker_smgr.c
@@ -0,0 +1,251 @@
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "storage/md.h"
+#include "utils/hsearch.h"
+
+PG_MODULE_MAGIC;
+
+typedef struct
+{
+	RelFileLocator locator;
+	ForkNumber	forknum;
+} VolatileRelnKey;
+
+typedef struct
+{
+	VolatileRelnKey key;
+	XLogRecPtr	lsn;
+} VolatileRelnEntry;
+
+void		_PG_init(void);
+
+static void fsync_checker_extend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+								 const void *buffer, bool skipFsync);
+static void fsync_checker_immedsync(SMgrRelation reln, ForkNumber forknum);
+static void fsync_checker_writev(SMgrRelation reln, ForkNumber forknum,
+								 BlockNumber blocknum, const void **buffers,
+								 BlockNumber nblocks, bool skipFsync);
+static void fsync_checker_writeback(SMgrRelation reln, ForkNumber forknum,
+									BlockNumber blocknum, BlockNumber nblocks);
+static void fsync_checker_zeroextend(SMgrRelation reln, ForkNumber forknum,
+									 BlockNumber blocknum, int nblocks, bool skipFsync);
+
+static void fsync_checker_checkpoint_create(const CheckPoint *checkPoint);
+static void fsync_checker_shmem_request(void);
+static void fsync_checker_shmem_startup(void);
+
+static void add_reln(SMgrRelation reln, ForkNumber forknum);
+static void remove_reln(SMgrRelation reln, ForkNumber forknum);
+
+static SMgrId fsync_checker_smgr_id;
+static const struct f_smgr fsync_checker_smgr = {
+	.name = "fsync_checker",
+	.smgr_init = mdinit,
+	.smgr_shutdown = NULL,
+	.smgr_open = mdopen,
+	.smgr_close = mdclose,
+	.smgr_create = mdcreate,
+	.smgr_exists = mdexists,
+	.smgr_unlink = mdunlink,
+	.smgr_extend = fsync_checker_extend,
+	.smgr_zeroextend = fsync_checker_zeroextend,
+	.smgr_prefetch = mdprefetch,
+	.smgr_maxcombine = mdmaxcombine,
+	.smgr_readv = mdreadv,
+	.smgr_writev = fsync_checker_writev,
+	.smgr_writeback = fsync_checker_writeback,
+	.smgr_nblocks = mdnblocks,
+	.smgr_truncate = mdtruncate,
+	.smgr_immedsync = fsync_checker_immedsync,
+	.smgr_registersync = mdregistersync,
+};
+
+static HTAB *volatile_relns;
+static LWLock *volatile_relns_lock;
+static shmem_request_hook_type prev_shmem_request_hook;
+static shmem_startup_hook_type prev_shmem_startup_hook;
+static checkpoint_create_hook_type prev_checkpoint_create_hook;
+
+void
+_PG_init(void)
+{
+	prev_checkpoint_create_hook = checkpoint_create_hook;
+	checkpoint_create_hook = fsync_checker_checkpoint_create;
+
+	prev_shmem_request_hook = shmem_request_hook;
+	shmem_request_hook = fsync_checker_shmem_request;
+
+	prev_shmem_startup_hook = shmem_startup_hook;
+	shmem_startup_hook = fsync_checker_shmem_startup;
+
+	/*
+	 * Relation size of 0 means we can just defer to md, but it would be nice
+	 * to just expose this functionality, so if I needed my own relation, I
+	 * could use MdSMgrRelation as the parent.
+	 */
+	fsync_checker_smgr_id = smgr_register(&fsync_checker_smgr, 0);
+
+	storage_manager_id = fsync_checker_smgr_id;
+}
+
+static void
+fsync_checker_checkpoint_create(const CheckPoint *checkPoint)
+{
+	long		num_entries;
+	HASH_SEQ_STATUS status;
+	VolatileRelnEntry *entry;
+
+	if (prev_checkpoint_create_hook)
+		prev_checkpoint_create_hook(checkPoint);
+
+	LWLockAcquire(volatile_relns_lock, LW_EXCLUSIVE);
+
+	hash_seq_init(&status, volatile_relns);
+
+	num_entries = hash_get_num_entries(volatile_relns);
+	elog(INFO, "Analyzing %ld volatile relations", num_entries);
+	while ((entry = hash_seq_search(&status)))
+	{
+		if (entry->lsn < checkPoint->redo)
+		{
+			char	   *path;
+
+			path = relpathperm(entry->key.locator, entry->key.forknum);
+
+			elog(WARNING, "Relation not previously synced: %s", path);
+
+			pfree(path);
+		}
+	}
+
+	LWLockRelease(volatile_relns_lock);
+}
+
+static void
+fsync_checker_shmem_request(void)
+{
+	if (prev_shmem_request_hook)
+		prev_shmem_request_hook();
+
+	RequestAddinShmemSpace(hash_estimate_size(1024, sizeof(VolatileRelnEntry)));
+	RequestNamedLWLockTranche("fsync_checker volatile relns lock", 1);
+}
+
+static void
+fsync_checker_shmem_startup(void)
+{
+	HASHCTL		ctl;
+
+	if (prev_shmem_startup_hook)
+		prev_shmem_startup_hook();
+
+	ctl.keysize = sizeof(VolatileRelnKey);
+	ctl.entrysize = sizeof(VolatileRelnEntry);
+	volatile_relns = NULL;
+	volatile_relns_lock = NULL;
+
+	/*
+	 * Create or attach to the shared memory state, including hash table
+	 */
+	LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
+
+	volatile_relns = ShmemInitHash("fsync_checker volatile relns",
+								   1024, 1024, &ctl, HASH_BLOBS | HASH_ELEM);
+	volatile_relns_lock = &GetNamedLWLockTranche("fsync_checker volatile relns lock")->lock;
+
+	LWLockRelease(AddinShmemInitLock);
+}
+
+static void
+add_reln(SMgrRelation reln, ForkNumber forknum)
+{
+	bool		found;
+	XLogRecPtr	lsn;
+	VolatileRelnKey key;
+	VolatileRelnEntry *entry;
+
+	key.locator = reln->smgr_rlocator.locator;
+	key.forknum = forknum;
+
+	lsn = GetXLogWriteRecPtr();
+
+	LWLockAcquire(volatile_relns_lock, LW_EXCLUSIVE);
+
+	entry = hash_search(volatile_relns, &key, HASH_ENTER, &found);
+	if (!found)
+		entry->lsn = lsn;
+
+	LWLockRelease(volatile_relns_lock);
+}
+
+static void
+remove_reln(SMgrRelation reln, ForkNumber forknum)
+{
+	VolatileRelnKey key;
+
+	key.locator = reln->smgr_rlocator.locator;
+	key.forknum = forknum;
+
+	LWLockAcquire(volatile_relns_lock, LW_EXCLUSIVE);
+
+	hash_search(volatile_relns, &key, HASH_REMOVE, NULL);
+
+	LWLockRelease(volatile_relns_lock);
+}
+
+static void
+fsync_checker_extend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+					 const void *buffer, bool skipFsync)
+{
+	if (!SmgrIsTemp(reln) && !skipFsync)
+		add_reln(reln, forknum);
+
+	mdextend(reln, forknum, blocknum, buffer, skipFsync);
+}
+
+static void
+fsync_checker_immedsync(SMgrRelation reln, ForkNumber forknum)
+{
+	if (!SmgrIsTemp(reln))
+		remove_reln(reln, forknum);
+
+	mdimmedsync(reln, forknum);
+}
+
+static void
+fsync_checker_writev(SMgrRelation reln, ForkNumber forknum,
+					 BlockNumber blocknum, const void **buffers,
+					 BlockNumber nblocks, bool skipFsync)
+{
+	if (!SmgrIsTemp(reln) && !skipFsync)
+		add_reln(reln, forknum);
+
+	mdwritev(reln, forknum, blocknum, buffers, nblocks, skipFsync);
+}
+
+static void
+fsync_checker_writeback(SMgrRelation reln, ForkNumber forknum,
+						BlockNumber blocknum, BlockNumber nblocks)
+{
+	if (!SmgrIsTemp(reln))
+		remove_reln(reln, forknum);
+
+	mdwriteback(reln, forknum, blocknum, nblocks);
+}
+
+static void
+fsync_checker_zeroextend(SMgrRelation reln, ForkNumber forknum,
+						 BlockNumber blocknum, int nblocks, bool skipFsync)
+{
+	if (!SmgrIsTemp(reln) && !skipFsync)
+		add_reln(reln, forknum);
+
+	mdzeroextend(reln, forknum, blocknum, nblocks, skipFsync);
+}
diff --git a/contrib/fsync_checker/meson.build b/contrib/fsync_checker/meson.build
new file mode 100644
index 00000000000..ce6ed7fe90b
--- /dev/null
+++ b/contrib/fsync_checker/meson.build
@@ -0,0 +1,22 @@
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+fsync_checker_sources = files(
+  'fsync_checker_smgr.c',
+)
+
+if host_system == 'windows'
+  fsync_checker_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'fsync_checker',
+    '--FILEDESC', 'fsync_checker - SMGR extension for checking volatile relations',])
+endif
+
+fsync_checker = shared_module('fsync_checker',
+  fsync_checker_sources,
+  kwargs: contrib_mod_args,
+)
+contrib_targets += fsync_checker
+
+install_data(
+  'fsync_checker.control',
+  kwargs: contrib_data_args,
+)
diff --git a/contrib/meson.build b/contrib/meson.build
index 1ba73ebd67a..c48fb138751 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -28,6 +28,7 @@ subdir('dict_int')
 subdir('dict_xsyn')
 subdir('earthdistance')
 subdir('file_fdw')
+subdir('fsync_checker')
 subdir('fuzzystrmatch')
 subdir('hstore')
 subdir('hstore_plperl')
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index ecf23f3a933..a117fb633d3 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3119,6 +3119,8 @@ ViewStmt
 VirtualTransactionId
 VirtualTupleTableSlot
 VolatileFunctionStatus
+VolatileRelnEntry
+VolatileRelnKey
 Vsrt
 WAIT_ORDER
 WALAvailability
-- 
2.47.2

Attachment: v3-0005-Refactor-smgr-API-mdcreate-needs-the-old-relfilel.patch (text/x-patch; charset=UTF-8)
From d13bc437857cc6d21134f2c94b44d7824a8399e7 Mon Sep 17 00:00:00 2001
From: Zsolt Parragi <zsolt.parragi@cancellar.hu>
Date: Sat, 12 Oct 2024 22:01:28 +0100
Subject: [PATCH v3 5/6] Refactor smgr API: mdcreate needs the old
 relfilelocator

With this change, mdcreate receives the old relfilelocator along
with the new one for operations that create a new file for an
existing relation.

This is required for tde_heap in pg_tde.
---
 src/backend/access/heap/heapam_handler.c | 10 ++++++----
 src/backend/access/transam/xlogutils.c   |  2 +-
 src/backend/catalog/heap.c               |  2 +-
 src/backend/catalog/index.c              |  2 +-
 src/backend/catalog/storage.c            |  8 ++++----
 src/backend/commands/sequence.c          |  2 +-
 src/backend/commands/tablecmds.c         |  4 ++--
 src/backend/storage/buffer/bufmgr.c      |  7 ++++---
 src/backend/storage/smgr/md.c            |  2 +-
 src/backend/storage/smgr/smgr.c          |  4 ++--
 src/backend/utils/cache/relcache.c       |  2 +-
 src/include/catalog/storage.h            |  3 ++-
 src/include/storage/md.h                 |  2 +-
 src/include/storage/smgr.h               |  4 ++--
 14 files changed, 29 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a4003cf59e1..cd262ee7756 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -584,6 +584,8 @@ heapam_relation_set_new_filelocator(Relation rel,
 {
 	SMgrRelation srel;
 
+	RelFileLocator oldlocator = rel->rd_locator;
+
 	/*
 	 * Initialize to the minimum XID that could put tuples in the table. We
 	 * know that no xacts older than RecentXmin are still running, so that
@@ -601,7 +603,7 @@ heapam_relation_set_new_filelocator(Relation rel,
 	 */
 	*minmulti = GetOldestMultiXactId();
 
-	srel = RelationCreateStorage(*newrlocator, persistence, true);
+	srel = RelationCreateStorage(oldlocator, *newrlocator, persistence, true);
 
 	/*
 	 * If required, set up an init fork for an unlogged table so that it can
@@ -611,7 +613,7 @@ heapam_relation_set_new_filelocator(Relation rel,
 	{
 		Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
 			   rel->rd_rel->relkind == RELKIND_TOASTVALUE);
-		smgrcreate(srel, INIT_FORKNUM, false);
+		smgrcreate(oldlocator, srel, INIT_FORKNUM, false);
 		log_smgrcreate(newrlocator, INIT_FORKNUM);
 	}
 
@@ -644,7 +646,7 @@ heapam_relation_copy_data(Relation rel, const RelFileLocator *newrlocator)
 	 * NOTE: any conflict in relfilenumber value will be caught in
 	 * RelationCreateStorage().
 	 */
-	dstrel = RelationCreateStorage(*newrlocator, rel->rd_rel->relpersistence, true);
+	dstrel = RelationCreateStorage(rel->rd_locator, *newrlocator, rel->rd_rel->relpersistence, true);
 
 	/* copy main fork */
 	RelationCopyStorage(RelationGetSmgr(rel), dstrel, MAIN_FORKNUM,
@@ -656,7 +658,7 @@ heapam_relation_copy_data(Relation rel, const RelFileLocator *newrlocator)
 	{
 		if (smgrexists(RelationGetSmgr(rel), forkNum))
 		{
-			smgrcreate(dstrel, forkNum, false);
+			smgrcreate(rel->rd_locator, dstrel, forkNum, false);
 
 			/*
 			 * WAL log creation if the relation is persistent, or this is the
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 68d53815925..e1e8ff25f7c 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -498,7 +498,7 @@ XLogReadBufferExtended(RelFileLocator rlocator, ForkNumber forknum,
 	 * filesystem loses an inode during a crash.  Better to write the data
 	 * until we are actually told to delete the file.)
 	 */
-	smgrcreate(smgr, forknum, true);
+	smgrcreate(rlocator, smgr, forknum, true);
 
 	lastblock = smgrnblocks(smgr, forknum);
 
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 57ef466acce..03ec9a3ba1f 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -385,7 +385,7 @@ heap_create(const char *relname,
 											   relpersistence,
 											   relfrozenxid, relminmxid);
 		else if (RELKIND_HAS_STORAGE(rel->rd_rel->relkind))
-			RelationCreateStorage(rel->rd_locator, relpersistence, true);
+			RelationCreateStorage(rel->rd_locator, rel->rd_locator, relpersistence, true);
 		else
 			Assert(false);
 	}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 7377912b41e..c367d7df5fe 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3060,7 +3060,7 @@ index_build(Relation heapRelation,
 	if (indexRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
 		!smgrexists(RelationGetSmgr(indexRelation), INIT_FORKNUM))
 	{
-		smgrcreate(RelationGetSmgr(indexRelation), INIT_FORKNUM, false);
+		smgrcreate(indexRelation->rd_locator, RelationGetSmgr(indexRelation), INIT_FORKNUM, false);
 		log_smgrcreate(&indexRelation->rd_locator, INIT_FORKNUM);
 		indexRelation->rd_indam->ambuildempty(indexRelation);
 	}
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index eba0e716549..440366a1c86 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -118,7 +118,7 @@ AddPendingSync(const RelFileLocator *rlocator)
  * pass register_delete = false.
  */
 SMgrRelation
-RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
+RelationCreateStorage(RelFileLocator oldlocator, RelFileLocator rlocator, char relpersistence,
 					  bool register_delete)
 {
 	SMgrRelation srel;
@@ -147,7 +147,7 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
 	}
 
 	srel = smgropen(rlocator, procNumber);
-	smgrcreate(srel, MAIN_FORKNUM, false);
+	smgrcreate(oldlocator, srel, MAIN_FORKNUM, false);
 
 	if (needs_wal)
 		log_smgrcreate(&srel->smgr_rlocator.locator, MAIN_FORKNUM);
@@ -976,7 +976,7 @@ smgr_redo(XLogReaderState *record)
 		SMgrRelation reln;
 
 		reln = smgropen(xlrec->rlocator, INVALID_PROC_NUMBER);
-		smgrcreate(reln, xlrec->forkNum, true);
+		smgrcreate(xlrec->rlocator, reln, xlrec->forkNum, true);
 	}
 	else if (info == XLOG_SMGR_TRUNCATE)
 	{
@@ -997,7 +997,7 @@ smgr_redo(XLogReaderState *record)
 		 * XLogReadBufferForRedo, we prefer to recreate the rel and replay the
 		 * log as best we can until the drop is seen.
 		 */
-		smgrcreate(reln, MAIN_FORKNUM, true);
+		smgrcreate(xlrec->rlocator, reln, MAIN_FORKNUM, true);
 
 		/*
 		 * Before we perform the truncation, update minimum recovery point to
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index b13ee2b745d..b4caa8aa1f1 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -344,7 +344,7 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 		SMgrRelation srel;
 
 		srel = smgropen(rel->rd_locator, INVALID_PROC_NUMBER);
-		smgrcreate(srel, INIT_FORKNUM, false);
+		smgrcreate(rel->rd_locator, srel, INIT_FORKNUM, false);
 		log_smgrcreate(&rel->rd_locator, INIT_FORKNUM);
 		fill_seq_fork_with_data(rel, tuple, INIT_FORKNUM);
 		FlushRelationBuffers(rel);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index d617c4bc63d..f17de553a0f 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -16147,7 +16147,7 @@ index_copy_data(Relation rel, RelFileLocator newrlocator)
 	 * NOTE: any conflict in relfilenumber value will be caught in
 	 * RelationCreateStorage().
 	 */
-	dstrel = RelationCreateStorage(newrlocator, rel->rd_rel->relpersistence, true);
+	dstrel = RelationCreateStorage(rel->rd_locator, newrlocator, rel->rd_rel->relpersistence, true);
 
 	/* copy main fork */
 	RelationCopyStorage(RelationGetSmgr(rel), dstrel, MAIN_FORKNUM,
@@ -16159,7 +16159,7 @@ index_copy_data(Relation rel, RelFileLocator newrlocator)
 	{
 		if (smgrexists(RelationGetSmgr(rel), forkNum))
 		{
-			smgrcreate(dstrel, forkNum, false);
+			smgrcreate(rel->rd_locator, dstrel, forkNum, false);
 
 			/*
 			 * WAL log creation if the relation is persistent, or this is the
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ee83669992b..c27d2fac60f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -943,7 +943,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 
 		/* recheck, fork might have been created concurrently */
 		if (!smgrexists(bmr.smgr, fork))
-			smgrcreate(bmr.smgr, fork, flags & EB_PERFORMING_RECOVERY);
+			smgrcreate(bmr.rel->rd_locator, bmr.smgr, fork, flags & EB_PERFORMING_RECOVERY);
 
 		UnlockRelationForExtension(bmr.rel, ExclusiveLock);
 	}
@@ -4757,7 +4757,7 @@ CreateAndCopyRelationData(RelFileLocator src_rlocator,
 	 * directory.  Therefore, each individual relation doesn't need to be
 	 * registered for cleanup.
 	 */
-	RelationCreateStorage(dst_rlocator, relpersistence, false);
+	RelationCreateStorage(src_rlocator, dst_rlocator, relpersistence, false);
 
 	/* copy main fork. */
 	RelationCopyStorageUsingBuffer(src_rlocator, dst_rlocator, MAIN_FORKNUM,
@@ -4769,7 +4769,8 @@ CreateAndCopyRelationData(RelFileLocator src_rlocator,
 	{
 		if (smgrexists(src_rel, forkNum))
 		{
-			smgrcreate(dst_rel, forkNum, false);
+			/* TODO: for sure? */
+			smgrcreate(src_rel->smgr_rlocator.locator, dst_rel, forkNum, false);
 
 			/*
 			 * WAL log creation if the relation is persistent, or this is the
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 6a4dd0eb4d8..6329cce2e06 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -225,7 +225,7 @@ mdexists(SMgrRelation reln, ForkNumber forknum)
  * If isRedo is true, it's okay for the relation to exist already.
  */
 void
-mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+mdcreate(RelFileLocator /* reln */, SMgrRelation reln, ForkNumber forknum, bool isRedo)
 {
 	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	MdfdVec    *mdfd;
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 9b3e63aff55..0498fd6c317 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -408,9 +408,9 @@ smgrexists(SMgrRelation reln, ForkNumber forknum)
  * to be created.
  */
 void
-smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+smgrcreate(RelFileLocator relold, SMgrRelation reln, ForkNumber forknum, bool isRedo)
 {
-	smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo);
+	smgrsw[reln->smgr_which].smgr_create(relold, reln, forknum, isRedo);
 }
 
 /*
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 43219a9629c..6567bc60313 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -3825,7 +3825,7 @@ RelationSetNewRelfilenumber(Relation relation, char persistence)
 		/* handle these directly, at least for now */
 		SMgrRelation srel;
 
-		srel = RelationCreateStorage(newrlocator, persistence, true);
+		srel = RelationCreateStorage(relation->rd_locator, newrlocator, persistence, true);
 		smgrclose(srel);
 	}
 	else
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index ba99225b0a3..ecc3b792f4f 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -22,7 +22,8 @@
 /* GUC variables */
 extern PGDLLIMPORT int wal_skip_threshold;
 
-extern SMgrRelation RelationCreateStorage(RelFileLocator rlocator,
+extern SMgrRelation RelationCreateStorage(RelFileLocator oldlocator,
+										  RelFileLocator rlocator,
 										  char relpersistence,
 										  bool register_delete);
 extern void RelationDropStorage(Relation rel);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index da1d1d339be..61c0e85dd74 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -27,7 +27,7 @@ extern SMgrId MdSMgrId;
 extern void mdinit(void);
 extern void mdopen(SMgrRelation reln);
 extern void mdclose(SMgrRelation reln, ForkNumber forknum);
-extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void mdcreate(RelFileLocator relold, SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
 extern void mdunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo);
 extern void mdextend(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 629c78cfdde..5b2b6de91c4 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -87,7 +87,7 @@ typedef struct f_smgr
 	void		(*smgr_shutdown) (void);	/* may be NULL */
 	void		(*smgr_open) (SMgrRelation reln);
 	void		(*smgr_close) (SMgrRelation reln, ForkNumber forknum);
-	void		(*smgr_create) (SMgrRelation reln, ForkNumber forknum,
+	void		(*smgr_create) (RelFileLocator relold, SMgrRelation reln, ForkNumber forknum,
 								bool isRedo);
 	bool		(*smgr_exists) (SMgrRelation reln, ForkNumber forknum);
 	void		(*smgr_unlink) (RelFileLocatorBackend rlocator, ForkNumber forknum,
@@ -128,7 +128,7 @@ extern void smgrdestroyall(void);
 extern void smgrrelease(SMgrRelation reln);
 extern void smgrreleaseall(void);
 extern void smgrreleaserellocator(RelFileLocatorBackend rlocator);
-extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrcreate(RelFileLocator relold, SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdosyncall(SMgrRelation *rels, int nrels);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
 extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
-- 
2.47.2

Attachment: v3-0006-SMGR-GUC-variable-and-chaining.patch (text/x-patch; charset=UTF-8)
From 05c26b4c733950a971bc687209832d51f69f686f Mon Sep 17 00:00:00 2001
From: Zsolt Parragi <zsolt.parragi@cancellar.hu>
Date: Mon, 2 Dec 2024 07:47:07 +0000
Subject: [PATCH v3 6/6] SMGR GUC variable and chaining

The overall goal of this commit is to introduce a user interface to
the previous SMGR patch.

The idea is to allow a simple configuration for multiple "modifier"
SMGRs similar to the fsync_checker in the original proposal.

* Extensions should be able to declare a named list of SMGR implementations,
also specifying if the given SMGR is an "end" implementation for
actual storage, or if it is a modifier implementation for some other
purpose.
* Users should be able to specify a list of SMGRs: possibly multiple modifiers,
and one storage implementation at the end to configure how the
storage manager is constructed.

This commit introduces a new GUC variable, `smgr_chain`, which allows
users to configure multiple SMGR implementations: it is a comma-separated list,
where the last entry must be a storage implementation and all others must be
modifiers. The default value of this variable is "md".

The internal storage manager API is also refactored to include an easy
way for SMGR implementations to support proper chaining. Modifier SMGR
implementations also only have to implement the functions they actually
change, and can leave everything else as empty (NULL). And with this
change we can make the functions of the md smgr static.

The fsync example extension is also modified to match the new API.
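
The chaining rules described above can be modelled in a small standalone
program. This is a simplified sketch, not the code in this patch: the
types and functions (SketchSmgr, build_chain, extend_next) are invented
for illustration, and only a single callback is modelled. It shows a
chain of modifiers ending in one storage implementation, a NULL callback
being skipped, and a modifier forwarding with chain_index + 1:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the patch's real definitions. */
typedef enum
{
	CHAIN_MODIFIER,
	CHAIN_TAIL
} ChainPosition;

typedef struct SketchSmgr
{
	const char *name;
	ChainPosition position;
	/* NULL means "not overridden": dispatch moves on to the next entry */
	void		(*extend) (int blocknum, int chain_index);
} SketchSmgr;

#define MAX_CHAIN 8
static const SketchSmgr *chain[MAX_CHAIN];
static int	chain_len = 0;

static int	tracked_calls = 0;	/* side effect of the modifier */
static int	stored_block = -1;	/* side effect of the storage tail */

/* Call the first entry at or after chain_index that implements extend. */
static void
extend_next(int blocknum, int chain_index)
{
	while (chain_index < chain_len && chain[chain_index]->extend == NULL)
		chain_index++;
	assert(chain_index < chain_len);	/* the tail implements everything */
	chain[chain_index]->extend(blocknum, chain_index);
}

/* A modifier: do its own bookkeeping, then forward down the chain. */
static void
checker_extend(int blocknum, int chain_index)
{
	tracked_calls++;
	extend_next(blocknum, chain_index + 1);
}

/* The tail: performs the "real" work and forwards nothing. */
static void
storage_extend(int blocknum, int chain_index)
{
	(void) chain_index;
	stored_block = blocknum;
}

static const SketchSmgr checker = {"checker", CHAIN_MODIFIER, checker_extend};
static const SketchSmgr noop = {"noop", CHAIN_MODIFIER, NULL};
static const SketchSmgr storage = {"storage", CHAIN_TAIL, storage_extend};

/*
 * Install a chain. Returns 1 on success, 0 if the list is not
 * "zero or more modifiers followed by exactly one tail".
 */
static int
build_chain(const SketchSmgr *const *entries, int n)
{
	if (n < 1 || n > MAX_CHAIN)
		return 0;
	for (int i = 0; i < n; i++)
	{
		ChainPosition want = (i == n - 1) ? CHAIN_TAIL : CHAIN_MODIFIER;

		if (entries[i]->position != want)
			return 0;
	}
	for (int i = 0; i < n; i++)
		chain[i] = entries[i];
	chain_len = n;
	return 1;
}
```

With the chain {checker, noop, storage}, a call to extend_next(42, 0)
runs checker_extend, skips noop because its callback is NULL, and ends in
storage_extend. A chain such as {storage, checker} is rejected by
build_chain because the tail is not last.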
---
 contrib/fsync_checker/fsync_checker_smgr.c    |  57 ++--
 src/backend/postmaster/postmaster.c           |   2 +
 src/backend/storage/smgr/md.c                 | 102 +++++---
 src/backend/storage/smgr/smgr.c               | 247 ++++++++++++++----
 src/backend/tcop/postgres.c                   |   2 +
 src/backend/utils/init/miscinit.c             |  63 ++++-
 src/backend/utils/misc/guc_tables.c           |  11 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/miscadmin.h                       |   2 +
 src/include/storage/md.h                      |  28 --
 src/include/storage/smgr.h                    |  93 +++++--
 src/tools/pgindent/typedefs.list              |   1 +
 12 files changed, 448 insertions(+), 161 deletions(-)

diff --git a/contrib/fsync_checker/fsync_checker_smgr.c b/contrib/fsync_checker/fsync_checker_smgr.c
index 9dcee8bbd74..587db5b8b22 100644
--- a/contrib/fsync_checker/fsync_checker_smgr.c
+++ b/contrib/fsync_checker/fsync_checker_smgr.c
@@ -27,15 +27,15 @@ typedef struct
 void		_PG_init(void);
 
 static void fsync_checker_extend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-								 const void *buffer, bool skipFsync);
-static void fsync_checker_immedsync(SMgrRelation reln, ForkNumber forknum);
+								 const void *buffer, bool skipFsync, SmgrChainIndex chain_index);
+static void fsync_checker_immedsync(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index);
 static void fsync_checker_writev(SMgrRelation reln, ForkNumber forknum,
 								 BlockNumber blocknum, const void **buffers,
-								 BlockNumber nblocks, bool skipFsync);
+								 BlockNumber nblocks, bool skipFsync, SmgrChainIndex chain_index);
 static void fsync_checker_writeback(SMgrRelation reln, ForkNumber forknum,
-									BlockNumber blocknum, BlockNumber nblocks);
+									BlockNumber blocknum, BlockNumber nblocks, SmgrChainIndex chain_index);
 static void fsync_checker_zeroextend(SMgrRelation reln, ForkNumber forknum,
-									 BlockNumber blocknum, int nblocks, bool skipFsync);
+									 BlockNumber blocknum, int nblocks, bool skipFsync, SmgrChainIndex chain_index);
 
 static void fsync_checker_checkpoint_create(const CheckPoint *checkPoint);
 static void fsync_checker_shmem_request(void);
@@ -47,24 +47,25 @@ static void remove_reln(SMgrRelation reln, ForkNumber forknum);
 static SMgrId fsync_checker_smgr_id;
 static const struct f_smgr fsync_checker_smgr = {
 	.name = "fsync_checker",
-	.smgr_init = mdinit,
+	.chain_position = SMGR_CHAIN_MODIFIER,
+	.smgr_init = NULL,
 	.smgr_shutdown = NULL,
-	.smgr_open = mdopen,
-	.smgr_close = mdclose,
-	.smgr_create = mdcreate,
-	.smgr_exists = mdexists,
-	.smgr_unlink = mdunlink,
+	.smgr_open = NULL,
+	.smgr_close = NULL,
+	.smgr_create = NULL,
+	.smgr_exists = NULL,
+	.smgr_unlink = NULL,
 	.smgr_extend = fsync_checker_extend,
 	.smgr_zeroextend = fsync_checker_zeroextend,
-	.smgr_prefetch = mdprefetch,
-	.smgr_maxcombine = mdmaxcombine,
-	.smgr_readv = mdreadv,
+	.smgr_prefetch = NULL,
+	.smgr_maxcombine = NULL,
+	.smgr_readv = NULL,
 	.smgr_writev = fsync_checker_writev,
 	.smgr_writeback = fsync_checker_writeback,
-	.smgr_nblocks = mdnblocks,
-	.smgr_truncate = mdtruncate,
+	.smgr_nblocks = NULL,
+	.smgr_truncate = NULL,
 	.smgr_immedsync = fsync_checker_immedsync,
-	.smgr_registersync = mdregistersync,
+	.smgr_registersync = NULL,
 };
 
 static HTAB *volatile_relns;
@@ -91,8 +92,6 @@ _PG_init(void)
 	 * could use MdSmgrRelation as the parent.
 	 */
 	fsync_checker_smgr_id = smgr_register(&fsync_checker_smgr, 0);
-
-	storage_manager_id = fsync_checker_smgr_id;
 }
 
 static void
@@ -202,50 +201,50 @@ remove_reln(SMgrRelation reln, ForkNumber forknum)
 
 static void
 fsync_checker_extend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-					 const void *buffer, bool skipFsync)
+					 const void *buffer, bool skipFsync, SmgrChainIndex chain_index)
 {
 	if (!SmgrIsTemp(reln) && !skipFsync)
 		add_reln(reln, forknum);
 
-	mdextend(reln, forknum, blocknum, buffer, skipFsync);
+	smgr_extend_next(reln, forknum, blocknum, buffer, skipFsync, chain_index + 1);
 }
 
 static void
-fsync_checker_immedsync(SMgrRelation reln, ForkNumber forknum)
+fsync_checker_immedsync(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index)
 {
 	if (!SmgrIsTemp(reln))
 		remove_reln(reln, forknum);
 
-	mdimmedsync(reln, forknum);
+	smgr_immedsync_next(reln, forknum, chain_index + 1);
 }
 
 static void
 fsync_checker_writev(SMgrRelation reln, ForkNumber forknum,
 					 BlockNumber blocknum, const void **buffers,
-					 BlockNumber nblocks, bool skipFsync)
+					 BlockNumber nblocks, bool skipFsync, SmgrChainIndex chain_index)
 {
 	if (!SmgrIsTemp(reln) && !skipFsync)
 		add_reln(reln, forknum);
 
-	mdwritev(reln, forknum, blocknum, buffers, nblocks, skipFsync);
+	smgr_writev_next(reln, forknum, blocknum, buffers, nblocks, skipFsync, chain_index + 1);
 }
 
 static void
 fsync_checker_writeback(SMgrRelation reln, ForkNumber forknum,
-						BlockNumber blocknum, BlockNumber nblocks)
+						BlockNumber blocknum, BlockNumber nblocks, SmgrChainIndex chain_index)
 {
 	if (!SmgrIsTemp(reln))
 		remove_reln(reln, forknum);
 
-	mdwriteback(reln, forknum, blocknum, nblocks);
+	smgr_writeback_next(reln, forknum, blocknum, nblocks, chain_index + 1);
 }
 
 static void
 fsync_checker_zeroextend(SMgrRelation reln, ForkNumber forknum,
-						 BlockNumber blocknum, int nblocks, bool skipFsync)
+						 BlockNumber blocknum, int nblocks, bool skipFsync, SmgrChainIndex chain_index)
 {
 	if (!SmgrIsTemp(reln) && !skipFsync)
 		add_reln(reln, forknum);
 
-	mdzeroextend(reln, forknum, blocknum, nblocks, skipFsync);
+	smgr_zeroextend_next(reln, forknum, blocknum, nblocks, skipFsync, chain_index + 1);
 }
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index ddf4a011411..d5caca3684a 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -926,6 +926,8 @@ PostmasterMain(int argc, char *argv[])
 	 */
 	process_shared_preload_libraries();
 
+	process_smgr_chain();
+
 	/*
 	 * Initialize SSL library, if specified.
 	 */
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 6329cce2e06..4534876aac4 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -124,6 +124,33 @@ typedef MdSMgrRelationData *MdSMgrRelation;
 /* don't try to open a segment, if not already open */
 #define EXTENSION_DONT_OPEN			(1 << 5)
 
+/* md storage manager functionality */
+static void mdinit(void);
+static void mdopen(SMgrRelation reln, SmgrChainIndex chain_index);
+static void mdclose(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index);
+static void mdcreate(RelFileLocator relold, SMgrRelation reln, ForkNumber forknum, bool isRedo, SmgrChainIndex chain_index);
+static bool mdexists(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index);
+static void mdunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo, SmgrChainIndex chain_index);
+static void mdextend(SMgrRelation reln, ForkNumber forknum,
+					 BlockNumber blocknum, const void *buffer, bool skipFsync, SmgrChainIndex chain_index);
+static void mdzeroextend(SMgrRelation reln, ForkNumber forknum,
+						 BlockNumber blocknum, int nblocks, bool skipFsync, SmgrChainIndex chain_index);
+static bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
+					   BlockNumber blocknum, int nblocks, SmgrChainIndex chain_index);
+static uint32 mdmaxcombine(SMgrRelation reln, ForkNumber forknum,
+						   BlockNumber blocknum, SmgrChainIndex chain_index);
+static void mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+					void **buffers, BlockNumber nblocks, SmgrChainIndex chain_index);
+static void mdwritev(SMgrRelation reln, ForkNumber forknum,
+					 BlockNumber blocknum,
+					 const void **buffers, BlockNumber nblocks, bool skipFsync, SmgrChainIndex chain_index);
+static void mdwriteback(SMgrRelation reln, ForkNumber forknum,
+						BlockNumber blocknum, BlockNumber nblocks, SmgrChainIndex chain_index);
+static BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index);
+static void mdtruncate(SMgrRelation reln, ForkNumber forknum,
+					   BlockNumber old_blocks, BlockNumber nblocks, SmgrChainIndex chain_index);
+static void mdimmedsync(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index);
+static void mdregistersync(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index);
 
 void
 mdsmgr_register(void)
@@ -131,6 +158,7 @@ mdsmgr_register(void)
 	/* magnetic disk */
 	f_smgr		md_smgr = (f_smgr) {
 		.name = "md",
+		.chain_position = SMGR_CHAIN_TAIL,
 		.smgr_init = mdinit,
 		.smgr_shutdown = NULL,
 		.smgr_open = mdopen,
@@ -190,7 +218,7 @@ _mdfd_open_flags(void)
 /*
  * mdinit() -- Initialize private state for magnetic disk storage manager.
  */
-void
+static void
 mdinit(void)
 {
 	MdCxt = AllocSetContextCreate(TopMemoryContext,
@@ -203,8 +231,8 @@ mdinit(void)
  *
  * Note: this will return true for lingering files, with pending deletions
  */
-bool
-mdexists(SMgrRelation reln, ForkNumber forknum)
+static bool
+mdexists(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index)
 {
 	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 
@@ -214,7 +242,7 @@ mdexists(SMgrRelation reln, ForkNumber forknum)
 	 * which already closes relations when dropping them.
 	 */
 	if (!InRecovery)
-		mdclose(reln, forknum);
+		mdclose(reln, forknum, 0);
 
 	return (mdopenfork(mdreln, forknum, EXTENSION_RETURN_NULL) != NULL);
 }
@@ -224,8 +252,8 @@ mdexists(SMgrRelation reln, ForkNumber forknum)
  *
  * If isRedo is true, it's okay for the relation to exist already.
  */
-void
-mdcreate(RelFileLocator /* reln */, SMgrRelation reln, ForkNumber forknum, bool isRedo)
+static void
+mdcreate(RelFileLocator relold, SMgrRelation reln, ForkNumber forknum, bool isRedo, SmgrChainIndex chain_index)
 {
 	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	MdfdVec    *mdfd;
@@ -342,8 +370,8 @@ mdcreate(RelFileLocator /* reln */, SMgrRelation reln, ForkNumber forknum, bool
  * Note: any failure should be reported as WARNING not ERROR, because
  * we are usually not in a transaction anymore when this is called.
  */
-void
-mdunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo)
+static void
+mdunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo, SmgrChainIndex chain_index)
 {
 	/* Now do the per-fork work */
 	if (forknum == InvalidForkNumber)
@@ -495,9 +523,9 @@ mdunlinkfork(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo)
  * EOF).  Note that we assume writing a block beyond current EOF
  * causes intervening file space to become filled with zeroes.
  */
-void
+static void
 mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		 const void *buffer, bool skipFsync)
+		 const void *buffer, bool skipFsync, SmgrChainIndex chain_index)
 {
 	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	off_t		seekpos;
@@ -561,9 +589,9 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
  * Similar to mdextend(), except the relation can be extended by multiple
  * blocks at once and the added blocks will be filled with zeroes.
  */
-void
+static void
 mdzeroextend(SMgrRelation reln, ForkNumber forknum,
-			 BlockNumber blocknum, int nblocks, bool skipFsync)
+			 BlockNumber blocknum, int nblocks, bool skipFsync, SmgrChainIndex chain_index)
 {
 	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	MdfdVec    *v;
@@ -717,8 +745,8 @@ mdopenfork(MdSMgrRelation reln, ForkNumber forknum, int behavior)
 /*
  * mdopen() -- Initialize newly-opened relation.
  */
-void
-mdopen(SMgrRelation reln)
+static void
+mdopen(SMgrRelation reln, SmgrChainIndex chain_index)
 {
 	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 
@@ -730,8 +758,8 @@ mdopen(SMgrRelation reln)
 /*
  * mdclose() -- Close the specified relation, if it isn't closed already.
  */
-void
-mdclose(SMgrRelation reln, ForkNumber forknum)
+static void
+mdclose(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index)
 {
 	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	int			nopensegs = mdreln->md_num_open_segs[forknum];
@@ -754,9 +782,9 @@ mdclose(SMgrRelation reln, ForkNumber forknum)
 /*
  * mdprefetch() -- Initiate asynchronous read of the specified blocks of a relation
  */
-bool
+static bool
 mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		   int nblocks)
+		   int nblocks, SmgrChainIndex chain_index)
 {
 #ifdef USE_PREFETCH
 	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
@@ -852,9 +880,9 @@ buffers_to_iovec(struct iovec *iov, void **buffers, int nblocks)
  * mdmaxcombine() -- Return the maximum number of total blocks that can be
  *				 combined with an IO starting at blocknum.
  */
-uint32
+static uint32
 mdmaxcombine(SMgrRelation reln, ForkNumber forknum,
-			 BlockNumber blocknum)
+			 BlockNumber blocknum, SmgrChainIndex chain_index)
 {
 	BlockNumber segoff;
 
@@ -866,9 +894,9 @@ mdmaxcombine(SMgrRelation reln, ForkNumber forknum,
 /*
  * mdreadv() -- Read the specified blocks from a relation.
  */
-void
+static void
 mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		void **buffers, BlockNumber nblocks)
+		void **buffers, BlockNumber nblocks, SmgrChainIndex chain_index)
 {
 	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 
@@ -989,9 +1017,9 @@ mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
  * relation (ie, those before the current EOF).  To extend a relation,
  * use mdextend().
  */
-void
+static void
 mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		 const void **buffers, BlockNumber nblocks, bool skipFsync)
+		 const void **buffers, BlockNumber nblocks, bool skipFsync, SmgrChainIndex chain_index)
 {
 	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 
@@ -1096,9 +1124,9 @@ mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
  * This accepts a range of blocks because flushing several pages at once is
  * considerably more efficient than doing so individually.
  */
-void
+static void
 mdwriteback(SMgrRelation reln, ForkNumber forknum,
-			BlockNumber blocknum, BlockNumber nblocks)
+			BlockNumber blocknum, BlockNumber nblocks, SmgrChainIndex chain_index)
 {
 	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 
@@ -1157,8 +1185,8 @@ mdwriteback(SMgrRelation reln, ForkNumber forknum,
  * called, then only segments up to the last one actually touched
  * are present in the array.
  */
-BlockNumber
-mdnblocks(SMgrRelation reln, ForkNumber forknum)
+static BlockNumber
+mdnblocks(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index)
 {
 	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	MdfdVec    *v;
@@ -1222,9 +1250,9 @@ mdnblocks(SMgrRelation reln, ForkNumber forknum)
  * sure we have opened all active segments, so that truncate loop will get
  * them all!
  */
-void
+static void
 mdtruncate(SMgrRelation reln, ForkNumber forknum,
-		   BlockNumber curnblk, BlockNumber nblocks)
+		   BlockNumber curnblk, BlockNumber nblocks, SmgrChainIndex chain_index)
 {
 	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	BlockNumber priorblocks;
@@ -1312,8 +1340,8 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum,
 /*
  * mdregistersync() -- Mark whole relation as needing fsync
  */
-void
-mdregistersync(SMgrRelation reln, ForkNumber forknum)
+static void
+mdregistersync(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index)
 {
 	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	int			segno;
@@ -1323,7 +1351,7 @@ mdregistersync(SMgrRelation reln, ForkNumber forknum)
 	 * NOTE: mdnblocks makes sure we have opened all active segments, so that
 	 * the loop below will get them all!
 	 */
-	mdnblocks(reln, forknum);
+	mdnblocks(reln, forknum, 0);
 
 	min_inactive_seg = segno = mdreln->md_num_open_segs[forknum];
 
@@ -1364,8 +1392,8 @@ mdregistersync(SMgrRelation reln, ForkNumber forknum)
  * crash before the next checkpoint syncs the newly-inactive segment, that
  * segment may survive recovery, reintroducing unwanted data into the table.
  */
-void
-mdimmedsync(SMgrRelation reln, ForkNumber forknum)
+static void
+mdimmedsync(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index)
 {
 	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	int			segno;
@@ -1375,7 +1403,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 	 * NOTE: mdnblocks makes sure we have opened all active segments, so that
 	 * the loop below will get them all!
 	 */
-	mdnblocks(reln, forknum);
+	mdnblocks(reln, forknum, 0);
 
 	min_inactive_seg = segno = mdreln->md_num_open_segs[forknum];
 
@@ -1745,7 +1773,7 @@ _mdfd_getseg(MdSMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 
 				mdextend((SMgrRelation) reln, forknum,
 						 nextsegno * ((BlockNumber) RELSEG_SIZE) - 1,
-						 zerobuf, skipFsync);
+						 zerobuf, skipFsync, 0);
 				pfree(zerobuf);
 			}
 			flags = O_CREAT;
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 0498fd6c317..08892563768 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -63,13 +63,13 @@
 #include "utils/inval.h"
 #include "utils/memutils.h"
 
-static f_smgr *smgrsw;
+f_smgr	   *smgrsw;
 
 static int	NSmgr = 0;
 
 static Size LargestSMgrRelationSize = 0;
 
-SMgrId		storage_manager_id;
+SMgrChain	storage_manager_chain;
 
 /*
  * Each backend has a hashtable that stores all extant SMgrRelation objects.
@@ -98,20 +98,23 @@ smgr_register(const f_smgr *smgr, Size smgrrelation_size)
 	if (smgr->name == NULL || *smgr->name == 0)
 		elog(FATAL, "smgr registered with invalid name");
 
-	Assert(smgr->smgr_open != NULL);
-	Assert(smgr->smgr_close != NULL);
-	Assert(smgr->smgr_create != NULL);
-	Assert(smgr->smgr_exists != NULL);
-	Assert(smgr->smgr_unlink != NULL);
-	Assert(smgr->smgr_extend != NULL);
-	Assert(smgr->smgr_zeroextend != NULL);
-	Assert(smgr->smgr_prefetch != NULL);
-	Assert(smgr->smgr_readv != NULL);
-	Assert(smgr->smgr_writev != NULL);
-	Assert(smgr->smgr_writeback != NULL);
-	Assert(smgr->smgr_nblocks != NULL);
-	Assert(smgr->smgr_truncate != NULL);
-	Assert(smgr->smgr_immedsync != NULL);
+	if (smgr->chain_position == SMGR_CHAIN_TAIL)
+	{
+		Assert(smgr->smgr_open != NULL);
+		Assert(smgr->smgr_close != NULL);
+		Assert(smgr->smgr_create != NULL);
+		Assert(smgr->smgr_exists != NULL);
+		Assert(smgr->smgr_unlink != NULL);
+		Assert(smgr->smgr_extend != NULL);
+		Assert(smgr->smgr_zeroextend != NULL);
+		Assert(smgr->smgr_prefetch != NULL);
+		Assert(smgr->smgr_readv != NULL);
+		Assert(smgr->smgr_writev != NULL);
+		Assert(smgr->smgr_writeback != NULL);
+		Assert(smgr->smgr_nblocks != NULL);
+		Assert(smgr->smgr_truncate != NULL);
+		Assert(smgr->smgr_immedsync != NULL);
+	}
 
 	old = MemoryContextSwitchTo(TopMemoryContext);
 
@@ -138,6 +141,17 @@ smgr_register(const f_smgr *smgr, Size smgrrelation_size)
 	return my_id;
 }
 
+SMgrId
+smgr_lookup(const char *name)
+{
+	for (int i = 0; i < NSmgr; i++)
+	{
+		if (strcmp(smgrsw[i].name, name) == 0)
+			return i;
+	}
+	elog(FATAL, "storage manager \"%s\" not found", name);
+}
+
 /*
  * smgrinit(), smgrshutdown() -- Initialize or shut down storage
  *								 managers.
@@ -176,6 +190,22 @@ smgrshutdown(int code, Datum arg)
 	}
 }
 
+#define SMGR_CHAIN_LOOKUP(SMGR_METHOD) \
+	do \
+	{ \
+		while (chain_index < reln->smgr_chain.size && smgrsw[reln->smgr_chain.chain[chain_index]].SMGR_METHOD == NULL) \
+			chain_index++; \
+		Assert(chain_index < reln->smgr_chain.size); \
+	} while (0)
+
+void
+smgr_open_next(SMgrRelation reln, SmgrChainIndex chain_index)
+{
+	SMGR_CHAIN_LOOKUP(smgr_open);
+
+	smgrsw[reln->smgr_chain.chain[chain_index]].smgr_open(reln, chain_index);
+}
+
 /*
  * smgropen() -- Return an SMgrRelation object, creating it if need be.
  *
@@ -229,10 +259,10 @@ smgropen(RelFileLocator rlocator, ProcNumber backend)
 		for (int i = 0; i <= MAX_FORKNUM; ++i)
 			reln->smgr_cached_nblocks[i] = InvalidBlockNumber;
 
-		reln->smgr_which = storage_manager_id;
+		memcpy(&reln->smgr_chain, &storage_manager_chain, sizeof(SMgrChain));
 
 		/* implementation-specific initialization */
-		smgrsw[reln->smgr_which].smgr_open(reln);
+		smgr_open_next(reln, 0);
 
 		/* it is not pinned yet */
 		reln->pincount = 0;
@@ -270,6 +300,14 @@ smgrunpin(SMgrRelation reln)
 		dlist_push_tail(&unpinned_relns, &reln->node);
 }
 
+void
+smgr_close_next(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index)
+{
+	SMGR_CHAIN_LOOKUP(smgr_close);
+
+	smgrsw[reln->smgr_chain.chain[chain_index]].smgr_close(reln, forknum, chain_index);
+}
+
 /*
  * smgrdestroy() -- Delete an SMgrRelation object.
  */
@@ -281,7 +319,7 @@ smgrdestroy(SMgrRelation reln)
 	Assert(reln->pincount == 0);
 
 	for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
-		smgrsw[reln->smgr_which].smgr_close(reln, forknum);
+		smgr_close_next(reln, forknum, 0);
 
 	dlist_delete(&reln->node);
 
@@ -301,7 +339,7 @@ smgrrelease(SMgrRelation reln)
 {
 	for (ForkNumber forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 	{
-		smgrsw[reln->smgr_which].smgr_close(reln, forknum);
+		smgr_close_next(reln, forknum, 0);
 		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
 	}
 	reln->smgr_targblock = InvalidBlockNumber;
@@ -391,13 +429,29 @@ smgrreleaserellocator(RelFileLocatorBackend rlocator)
 		smgrrelease(reln);
 }
 
+bool
+smgr_exists_next(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index)
+{
+	SMGR_CHAIN_LOOKUP(smgr_exists);
+
+	return smgrsw[reln->smgr_chain.chain[chain_index]].smgr_exists(reln, forknum, chain_index);
+}
+
 /*
  * smgrexists() -- Does the underlying file for a fork exist?
  */
 bool
 smgrexists(SMgrRelation reln, ForkNumber forknum)
 {
-	return smgrsw[reln->smgr_which].smgr_exists(reln, forknum);
+	return smgr_exists_next(reln, forknum, 0);
+}
+
+void
+smgr_create_next(RelFileLocator relold, SMgrRelation reln, ForkNumber forknum, bool isRedo, SmgrChainIndex chain_index)
+{
+	SMGR_CHAIN_LOOKUP(smgr_create);
+
+	smgrsw[reln->smgr_chain.chain[chain_index]].smgr_create(relold, reln, forknum, isRedo, chain_index);
 }
 
 /*
@@ -410,7 +464,15 @@ smgrexists(SMgrRelation reln, ForkNumber forknum)
 void
 smgrcreate(RelFileLocator relold, SMgrRelation reln, ForkNumber forknum, bool isRedo)
 {
-	smgrsw[reln->smgr_which].smgr_create(relold, reln, forknum, isRedo);
+	smgr_create_next(relold, reln, forknum, isRedo, 0);
+}
+
+void
+smgr_immedsync_next(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index)
+{
+	SMGR_CHAIN_LOOKUP(smgr_immedsync);
+
+	smgrsw[reln->smgr_chain.chain[chain_index]].smgr_immedsync(reln, forknum, chain_index);
 }
 
 /*
@@ -438,16 +500,22 @@ smgrdosyncall(SMgrRelation *rels, int nrels)
 	 */
 	for (i = 0; i < nrels; i++)
 	{
-		int			which = rels[i]->smgr_which;
-
 		for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 		{
-			if (smgrsw[which].smgr_exists(rels[i], forknum))
-				smgrsw[which].smgr_immedsync(rels[i], forknum);
+			if (smgr_exists_next(rels[i], forknum, 0))
+				smgr_immedsync_next(rels[i], forknum, 0);
 		}
 	}
 }
 
+void
+smgr_unlink_next(SMgrRelation reln, RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo, SmgrChainIndex chain_index)
+{
+	SMGR_CHAIN_LOOKUP(smgr_unlink);
+
+	smgrsw[reln->smgr_chain.chain[chain_index]].smgr_unlink(rlocator, forknum, isRedo, chain_index);
+}
+
 /*
  * smgrdounlinkall() -- Immediately unlink all forks of all given relations
  *
@@ -482,13 +550,12 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
 	for (i = 0; i < nrels; i++)
 	{
 		RelFileLocatorBackend rlocator = rels[i]->smgr_rlocator;
-		int			which = rels[i]->smgr_which;
 
 		rlocators[i] = rlocator;
 
 		/* Close the forks at smgr level */
 		for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
-			smgrsw[which].smgr_close(rels[i], forknum);
+			smgr_close_next(rels[i], forknum, 0);
 	}
 
 	/*
@@ -512,15 +579,22 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
 
 	for (i = 0; i < nrels; i++)
 	{
-		int			which = rels[i]->smgr_which;
-
 		for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
-			smgrsw[which].smgr_unlink(rlocators[i], forknum, isRedo);
+			smgr_unlink_next(rels[i], rlocators[i], forknum, isRedo, 0);
 	}
 
 	pfree(rlocators);
 }
 
+void
+smgr_extend_next(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+				 const void *buffer, bool skipFsync, SmgrChainIndex chain_index)
+{
+	SMGR_CHAIN_LOOKUP(smgr_extend);
+
+	smgrsw[reln->smgr_chain.chain[chain_index]].smgr_extend(reln, forknum, blocknum,
+															buffer, skipFsync, chain_index);
+}
 
 /*
  * smgrextend() -- Add a new block to a file.
@@ -535,8 +609,7 @@ void
 smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		   const void *buffer, bool skipFsync)
 {
-	smgrsw[reln->smgr_which].smgr_extend(reln, forknum, blocknum,
-										 buffer, skipFsync);
+	smgr_extend_next(reln, forknum, blocknum, buffer, skipFsync, 0);
 
 	/*
 	 * Normally we expect this to increase nblocks by one, but if the cached
@@ -549,6 +622,16 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
 }
 
+void
+smgr_zeroextend_next(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+					 int nblocks, bool skipFsync, SmgrChainIndex chain_index)
+{
+	SMGR_CHAIN_LOOKUP(smgr_zeroextend);
+
+	smgrsw[reln->smgr_chain.chain[chain_index]].smgr_zeroextend(reln, forknum, blocknum,
+																nblocks, skipFsync, chain_index);
+}
+
 /*
  * smgrzeroextend() -- Add new zeroed out blocks to a file.
  *
@@ -560,8 +643,7 @@ void
 smgrzeroextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 			   int nblocks, bool skipFsync)
 {
-	smgrsw[reln->smgr_which].smgr_zeroextend(reln, forknum, blocknum,
-											 nblocks, skipFsync);
+	smgr_zeroextend_next(reln, forknum, blocknum, nblocks, skipFsync, 0);
 
 	/*
 	 * Normally we expect this to increase the fork size by nblocks, but if
@@ -574,6 +656,16 @@ smgrzeroextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
 }
 
+bool
+smgr_prefetch_next(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+				   int nblocks, SmgrChainIndex chain_index)
+{
+	SMGR_CHAIN_LOOKUP(smgr_prefetch);
+
+	return smgrsw[reln->smgr_chain.chain[chain_index]].smgr_prefetch(reln, forknum, blocknum,
+																	 nblocks, chain_index);
+}
+
 /*
  * smgrprefetch() -- Initiate asynchronous read of the specified block of a relation.
  *
@@ -585,7 +677,16 @@ bool
 smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 			 int nblocks)
 {
-	return smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum, nblocks);
+	return smgr_prefetch_next(reln, forknum, blocknum, nblocks, 0);
+}
+
+uint32
+smgr_maxcombine_next(SMgrRelation reln, ForkNumber forknum,
+					 BlockNumber blocknum, SmgrChainIndex chain_index)
+{
+	SMGR_CHAIN_LOOKUP(smgr_maxcombine);
+
+	return smgrsw[reln->smgr_chain.chain[chain_index]].smgr_maxcombine(reln, forknum, blocknum, chain_index);
 }
 
 /*
@@ -598,7 +699,17 @@ uint32
 smgrmaxcombine(SMgrRelation reln, ForkNumber forknum,
 			   BlockNumber blocknum)
 {
-	return smgrsw[reln->smgr_which].smgr_maxcombine(reln, forknum, blocknum);
+	return smgr_maxcombine_next(reln, forknum, blocknum, 0);
+}
+
+void
+smgr_readv_next(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+				void **buffers, BlockNumber nblocks, SmgrChainIndex chain_index)
+{
+	SMGR_CHAIN_LOOKUP(smgr_readv);
+
+	smgrsw[reln->smgr_chain.chain[chain_index]].smgr_readv(reln, forknum, blocknum,
+														   buffers, nblocks, chain_index);
 }
 
 /*
@@ -616,8 +727,17 @@ void
 smgrreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		  void **buffers, BlockNumber nblocks)
 {
-	smgrsw[reln->smgr_which].smgr_readv(reln, forknum, blocknum, buffers,
-										nblocks);
+	smgr_readv_next(reln, forknum, blocknum, buffers, nblocks, 0);
+}
+
+void
+smgr_writev_next(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+				 const void **buffers, BlockNumber nblocks, bool skipFsync, SmgrChainIndex chain_index)
+{
+	SMGR_CHAIN_LOOKUP(smgr_writev);
+
+	smgrsw[reln->smgr_chain.chain[chain_index]].smgr_writev(reln, forknum, blocknum,
+															buffers, nblocks, skipFsync, chain_index);
 }
 
 /*
@@ -650,8 +770,17 @@ void
 smgrwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		   const void **buffers, BlockNumber nblocks, bool skipFsync)
 {
-	smgrsw[reln->smgr_which].smgr_writev(reln, forknum, blocknum,
-										 buffers, nblocks, skipFsync);
+	smgr_writev_next(reln, forknum, blocknum,
+					 buffers, nblocks, skipFsync, 0);
+}
+
+void
+smgr_writeback_next(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+					BlockNumber nblocks, SmgrChainIndex chain_index)
+{
+	SMGR_CHAIN_LOOKUP(smgr_writeback);
+
+	smgrsw[reln->smgr_chain.chain[chain_index]].smgr_writeback(reln, forknum, blocknum, nblocks, chain_index);
 }
 
 /*
@@ -662,8 +791,15 @@ void
 smgrwriteback(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 			  BlockNumber nblocks)
 {
-	smgrsw[reln->smgr_which].smgr_writeback(reln, forknum, blocknum,
-											nblocks);
+	smgr_writeback_next(reln, forknum, blocknum, nblocks, 0);
+}
+
+BlockNumber
+smgr_nblocks_next(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index)
+{
+	SMGR_CHAIN_LOOKUP(smgr_nblocks);
+
+	return smgrsw[reln->smgr_chain.chain[chain_index]].smgr_nblocks(reln, forknum, chain_index);
 }
 
 /*
@@ -680,7 +816,7 @@ smgrnblocks(SMgrRelation reln, ForkNumber forknum)
 	if (result != InvalidBlockNumber)
 		return result;
 
-	result = smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum);
+	result = smgr_nblocks_next(reln, forknum, 0);
 
 	reln->smgr_cached_nblocks[forknum] = result;
 
@@ -708,6 +844,14 @@ smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum)
 	return InvalidBlockNumber;
 }
 
+void
+smgr_truncate_next(SMgrRelation reln, ForkNumber forknum, BlockNumber curnblk, BlockNumber nblocks, SmgrChainIndex chain_index)
+{
+	SMGR_CHAIN_LOOKUP(smgr_truncate);
+
+	smgrsw[reln->smgr_chain.chain[chain_index]].smgr_truncate(reln, forknum, curnblk, nblocks, chain_index);
+}
+
 /*
  * smgrtruncate() -- Truncate the given forks of supplied relation to
  *					 each specified numbers of blocks
@@ -752,8 +896,7 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks,
 		/* Make the cached size is invalid if we encounter an error. */
 		reln->smgr_cached_nblocks[forknum[i]] = InvalidBlockNumber;
 
-		smgrsw[reln->smgr_which].smgr_truncate(reln, forknum[i],
-											   old_nblocks[i], nblocks[i]);
+		smgr_truncate_next(reln, forknum[i], old_nblocks[i], nblocks[i], 0);
 
 		/*
 		 * We might as well update the local smgr_cached_nblocks values. The
@@ -766,6 +909,14 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks,
 	}
 }
 
+void
+smgr_registersync_next(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index)
+{
+	SMGR_CHAIN_LOOKUP(smgr_registersync);
+
+	smgrsw[reln->smgr_chain.chain[chain_index]].smgr_registersync(reln, forknum, chain_index);
+}
+
 /*
  * smgrregistersync() -- Request a relation to be sync'd at next checkpoint
  *
@@ -781,7 +932,7 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks,
 void
 smgrregistersync(SMgrRelation reln, ForkNumber forknum)
 {
-	smgrsw[reln->smgr_which].smgr_registersync(reln, forknum);
+	smgr_registersync_next(reln, forknum, 0);
 }
 
 /*
@@ -813,7 +964,7 @@ smgrregistersync(SMgrRelation reln, ForkNumber forknum)
 void
 smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
 {
-	smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
+	smgr_immedsync_next(reln, forknum, 0);
 }
 
 /*
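The `smgr_*_next()` wrappers above all follow one pattern: starting at `chain_index`, skip every chain entry whose callback is NULL, then call the first entry that implements the method. The following is a minimal standalone sketch of that dispatch, not code from the patch. The `Toy*`/`toy_*` names, the single-method callback table, and the `opens_seen` counter are hypothetical. It also assumes that a modifier forwards to the rest of the chain by calling the `_next` function with `chain_index + 1`, which this part of the diff does not show.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SMGR_CHAIN 15

typedef uint8_t SMgrId;
typedef uint8_t SmgrChainIndex;

typedef struct
{
	SMgrId		chain[MAX_SMGR_CHAIN];
	uint8_t		size;
} SMgrChain;

/* Hypothetical stand-in for SMgrRelationData: a chain plus a counter. */
typedef struct
{
	SMgrChain	smgr_chain;
	int			opens_seen;
} ToyRelation;

/* Hypothetical one-method callback table; NULL means "not overridden". */
typedef struct
{
	const char *name;
	void		(*smgr_open) (ToyRelation *reln, SmgrChainIndex chain_index);
} toy_f_smgr;

static void toy_open_next(ToyRelation *reln, SmgrChainIndex chain_index);

/* Modifier: does its own work, then forwards (assumed: chain_index + 1). */
static void
counting_open(ToyRelation *reln, SmgrChainIndex chain_index)
{
	reln->opens_seen++;
	toy_open_next(reln, (SmgrChainIndex) (chain_index + 1));
}

/* Tail: terminates the chain and never forwards. */
static void
tail_open(ToyRelation *reln, SmgrChainIndex chain_index)
{
	(void) chain_index;
	reln->opens_seen += 100;
}

static const toy_f_smgr toy_smgrsw[] = {
	{"tail", tail_open},			/* id 0 */
	{"counting", counting_open},	/* id 1 */
	{"passthrough", NULL},			/* id 2: does not override open */
};

/* Same loop shape as SMGR_CHAIN_LOOKUP: skip entries with a NULL callback. */
static void
toy_open_next(ToyRelation *reln, SmgrChainIndex chain_index)
{
	while (chain_index < reln->smgr_chain.size &&
		   toy_smgrsw[reln->smgr_chain.chain[chain_index]].smgr_open == NULL)
		chain_index++;
	assert(chain_index < reln->smgr_chain.size);

	toy_smgrsw[reln->smgr_chain.chain[chain_index]].smgr_open(reln, chain_index);
}

/* Open through the chain passthrough -> counting -> tail. */
int
toy_demo(void)
{
	ToyRelation reln = {.smgr_chain = {.chain = {2, 1, 0}, .size = 3}, .opens_seen = 0};

	toy_open_next(&reln, 0);
	return reln.opens_seen;
}
```

In this sketch the lookup always ends at an implemented callback, because the tail implements every method. That is the property the Asserts in `smgr_register()` enforce for `SMGR_CHAIN_TAIL` entries.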
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 5655348a2e2..06b8aa2adbe 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -4069,6 +4069,8 @@ PostgresSingleUserMain(int argc, char *argv[],
 	 */
 	process_shared_preload_libraries();
 
+	process_smgr_chain();
+
 	/* Initialize MaxBackends */
 	InitializeMaxBackends();
 
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 184ec632dbe..39936a731f6 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -56,6 +56,7 @@
 #include "utils/pidfile.h"
 #include "utils/syscache.h"
 #include "utils/varlena.h"
+#include "storage/smgr.h"
 
 
 #define DIRECTORY_LOCK_FILE		"postmaster.pid"
@@ -1834,6 +1835,8 @@ char	   *session_preload_libraries_string = NULL;
 char	   *shared_preload_libraries_string = NULL;
 char	   *local_preload_libraries_string = NULL;
 
+char	   *smgr_chain_string = NULL;
+
 /* Flag telling that we are loading shared_preload_libraries */
 bool		process_shared_preload_libraries_in_progress = false;
 bool		process_shared_preload_libraries_done = false;
@@ -1910,6 +1913,62 @@ process_shared_preload_libraries(void)
 	process_shared_preload_libraries_done = true;
 }
 
+void
+process_smgr_chain(void)
+{
+	char	   *rawstring;
+	List	   *elemlist;
+	ListCell   *l;
+	uint8		idx = 0;
+
+	if (smgr_chain_string == NULL || smgr_chain_string[0] == '\0')
+		return;					/* nothing to do */
+
+	/* Need a modifiable copy of string */
+	rawstring = pstrdup(smgr_chain_string);
+
+	/* Parse string into list of storage manager names */
+	if (!SplitIdentifierString(rawstring, ',', &elemlist))
+	{
+		/* syntax error in list */
+		pfree(rawstring);
+		ereport(LOG,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("invalid list syntax in parameter \"%s\"",
+						"smgr_chain")));
+		return;
+	}
+
+	foreach(l, elemlist)
+	{
+		char	   *smgrname = (char *) lfirst(l);
+		SMgrId		id = smgr_lookup(smgrname);
+
+		storage_manager_chain.chain[idx++] = id;
+
+		ereport(DEBUG1,
+				(errmsg_internal("using storage manager in chain \"%s\"", smgrname)));
+	}
+
+	for (int i = 0; i < idx; ++i)
+	{
+		int			chain_position = smgrsw[storage_manager_chain.chain[i]].chain_position;
+
+		if (i == idx - 1 && chain_position != SMGR_CHAIN_TAIL)
+			ereport(FATAL,
+					(errmsg_internal("smgr_chain: the last element must be a tail implementation, not a modifier")));
+
+		if (i != idx - 1 && chain_position != SMGR_CHAIN_MODIFIER)
+			ereport(FATAL,
+					(errmsg_internal("smgr_chain: element %d of %d (\"%s\") is not a modifier", i + 1, idx, smgrsw[storage_manager_chain.chain[i]].name)));
+	}
+
+	storage_manager_chain.size = idx;
+
+	list_free(elemlist);
+	pfree(rawstring);
+}
+
 /*
  * process any libraries that should be preloaded at backend start
  */
@@ -1932,7 +1991,9 @@ register_builtin_dynamic_managers(void)
 {
 	mdsmgr_register();
 
-	storage_manager_id = MdSMgrId;
+	/* set up a default chain containing only md, for tools */
+	storage_manager_chain.chain[0] = MdSMgrId;
+	storage_manager_chain.size = 1;
 }
 
 /*
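`process_smgr_chain()` above accepts a chain only if every element except the last is a modifier and the last element is a tail implementation. The following minimal standalone sketch restates that rule as a predicate. `toy_chain_is_valid` is a hypothetical helper, not part of the patch. As an assumption on top of the posted code, it also rejects empty chains and chains longer than `MAX_SMGR_CHAIN`. The parsing loop as posted stores into `chain[idx++]` without such a bound check, so an over-long `smgr_chain` setting would presumably write past the end of the array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_SMGR_CHAIN 15
#define SMGR_CHAIN_TAIL 1
#define SMGR_CHAIN_MODIFIER 2

/*
 * Hypothetical helper (not in the patch): returns true when
 * positions[0..n-1] describe a valid chain, i.e. zero or more
 * modifiers followed by exactly one tail.
 */
bool
toy_chain_is_valid(const int *positions, int n)
{
	/* assumed extra check: the chain must fit into SMgrChain.chain[] */
	if (n <= 0 || n > MAX_SMGR_CHAIN)
		return false;

	/* every element before the last must be a modifier */
	for (int i = 0; i < n - 1; i++)
	{
		if (positions[i] != SMGR_CHAIN_MODIFIER)
			return false;
	}

	/* the last element must be a tail implementation */
	return positions[n - 1] == SMGR_CHAIN_TAIL;
}
```

With the default `smgr_chain = 'md'` the chain has a single tail element, which this predicate accepts.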
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 38cb9e970d5..735e36838b0 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -4356,6 +4356,17 @@ struct config_string ConfigureNamesString[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"smgr_chain", PGC_POSTMASTER, CLIENT_CONN_PRELOAD,
+			gettext_noop("Lists storage managers used by the server, in order."),
+			NULL,
+			GUC_LIST_INPUT | GUC_LIST_QUOTE | GUC_SUPERUSER_ONLY
+		},
+		&smgr_chain_string,
+		"md",
+		NULL, NULL, NULL
+	},
+
 	{
 		{"search_path", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Sets the schema search order for names that are not schema-qualified."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 079efa1baa7..1fa746708b6 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -775,6 +775,7 @@ autovacuum_worker_slots = 16	# autovacuum worker slots to allocate
 #session_preload_libraries = ''
 #shared_preload_libraries = ''		# (change requires restart)
 #jit_provider = 'llvmjit'		# JIT library to use
+#smgr_chain = 'md'			# storage managers to use, in order (change requires restart)
 
 # - Other Defaults -
 
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index ff4ef578a1f..4e218941a4b 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -505,6 +505,7 @@ extern PGDLLIMPORT bool process_shmem_requests_in_progress;
 extern PGDLLIMPORT char *session_preload_libraries_string;
 extern PGDLLIMPORT char *shared_preload_libraries_string;
 extern PGDLLIMPORT char *local_preload_libraries_string;
+extern PGDLLIMPORT char *smgr_chain_string;
 
 extern void CreateDataDirLockFile(bool amPostmaster);
 extern void CreateSocketLockFile(const char *socketfile, bool amPostmaster,
@@ -515,6 +516,7 @@ extern bool RecheckDataDirLockFile(void);
 extern void ValidatePgVersion(const char *path);
 extern void register_builtin_dynamic_managers(void);
 extern void process_shared_preload_libraries(void);
+extern void process_smgr_chain(void);
 extern void process_session_preload_libraries(void);
 extern void process_shmem_requests(void);
 extern void pg_bindtextdomain(const char *domain);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 61c0e85dd74..5b4992c0855 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -23,34 +23,6 @@
 extern void mdsmgr_register(void);
 extern SMgrId MdSMgrId;
 
-/* md storage manager functionality */
-extern void mdinit(void);
-extern void mdopen(SMgrRelation reln);
-extern void mdclose(SMgrRelation reln, ForkNumber forknum);
-extern void mdcreate(RelFileLocator relold, SMgrRelation reln, ForkNumber forknum, bool isRedo);
-extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
-extern void mdunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo);
-extern void mdextend(SMgrRelation reln, ForkNumber forknum,
-					 BlockNumber blocknum, const void *buffer, bool skipFsync);
-extern void mdzeroextend(SMgrRelation reln, ForkNumber forknum,
-						 BlockNumber blocknum, int nblocks, bool skipFsync);
-extern bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
-					   BlockNumber blocknum, int nblocks);
-extern uint32 mdmaxcombine(SMgrRelation reln, ForkNumber forknum,
-						   BlockNumber blocknum);
-extern void mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-					void **buffers, BlockNumber nblocks);
-extern void mdwritev(SMgrRelation reln, ForkNumber forknum,
-					 BlockNumber blocknum,
-					 const void **buffers, BlockNumber nblocks, bool skipFsync);
-extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
-						BlockNumber blocknum, BlockNumber nblocks);
-extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
-extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
-					   BlockNumber old_blocks, BlockNumber nblocks);
-extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void mdregistersync(SMgrRelation reln, ForkNumber forknum);
-
 extern void ForgetDatabaseSyncRequests(Oid dbid);
 extern void DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo);
 
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 5b2b6de91c4..8f789cb7f80 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -20,7 +20,17 @@
 
 typedef uint8 SMgrId;
 
-extern PGDLLIMPORT SMgrId storage_manager_id;
+typedef uint8 SmgrChainIndex;
+
+#define MAX_SMGR_CHAIN 15
+
+typedef struct
+{
+	SMgrId		chain[MAX_SMGR_CHAIN];	/* storage manager selector */
+	uint8		size;
+} SMgrChain;
+
+extern PGDLLIMPORT SMgrChain storage_manager_chain;
 
 /*
  * smgr.c maintains a table of SMgrRelation objects, which are essentially
@@ -55,7 +65,7 @@ typedef struct SMgrRelationData
 	 * Fields below here are intended to be private to smgr.c and its
 	 * submodules.  Do not touch them from elsewhere.
 	 */
-	SMgrId		smgr_which;		/* storage manager selector */
+	SMgrChain	smgr_chain;		/* selected storage manager chain */
 
 	/*
 	 * Pinning support.  If unpinned (ie. pincount == 0), 'node' is a list
@@ -70,6 +80,9 @@ typedef SMgrRelationData *SMgrRelation;
 #define SmgrIsTemp(smgr) \
 	RelFileLocatorBackendIsTemp((smgr)->smgr_rlocator)
 
+#define SMGR_CHAIN_TAIL 1
+#define SMGR_CHAIN_MODIFIER 2
+
 /*
  * This struct of function pointers defines the API between smgr.c and
  * any individual storage manager module.  Note that smgr subfunctions are
@@ -83,40 +96,44 @@ typedef SMgrRelationData *SMgrRelation;
 typedef struct f_smgr
 {
 	const char *name;
+	int			chain_position;
 	void		(*smgr_init) (void);	/* may be NULL */
 	void		(*smgr_shutdown) (void);	/* may be NULL */
-	void		(*smgr_open) (SMgrRelation reln);
-	void		(*smgr_close) (SMgrRelation reln, ForkNumber forknum);
+	void		(*smgr_open) (SMgrRelation reln, SmgrChainIndex chain_index);
+	void		(*smgr_close) (SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index);
 	void		(*smgr_create) (RelFileLocator relold, SMgrRelation reln, ForkNumber forknum,
-								bool isRedo);
-	bool		(*smgr_exists) (SMgrRelation reln, ForkNumber forknum);
+								bool isRedo, SmgrChainIndex chain_index);
+	bool		(*smgr_exists) (SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index);
 	void		(*smgr_unlink) (RelFileLocatorBackend rlocator, ForkNumber forknum,
-								bool isRedo);
+								bool isRedo, SmgrChainIndex chain_index);
 	void		(*smgr_extend) (SMgrRelation reln, ForkNumber forknum,
-								BlockNumber blocknum, const void *buffer, bool skipFsync);
+								BlockNumber blocknum, const void *buffer, bool skipFsync, SmgrChainIndex chain_index);
 	void		(*smgr_zeroextend) (SMgrRelation reln, ForkNumber forknum,
-									BlockNumber blocknum, int nblocks, bool skipFsync);
+									BlockNumber blocknum, int nblocks, bool skipFsync, SmgrChainIndex chain_index);
 	bool		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
-								  BlockNumber blocknum, int nblocks);
+								  BlockNumber blocknum, int nblocks, SmgrChainIndex chain_index);
 	uint32		(*smgr_maxcombine) (SMgrRelation reln, ForkNumber forknum,
-									BlockNumber blocknum);
+									BlockNumber blocknum, SmgrChainIndex chain_index);
 	void		(*smgr_readv) (SMgrRelation reln, ForkNumber forknum,
 							   BlockNumber blocknum,
-							   void **buffers, BlockNumber nblocks);
+							   void **buffers, BlockNumber nblocks, SmgrChainIndex chain_index);
 	void		(*smgr_writev) (SMgrRelation reln, ForkNumber forknum,
 								BlockNumber blocknum,
 								const void **buffers, BlockNumber nblocks,
-								bool skipFsync);
+								bool skipFsync, SmgrChainIndex chain_index);
 	void		(*smgr_writeback) (SMgrRelation reln, ForkNumber forknum,
-								   BlockNumber blocknum, BlockNumber nblocks);
-	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
+								   BlockNumber blocknum, BlockNumber nblocks, SmgrChainIndex chain_index);
+	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index);
 	void		(*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
-								  BlockNumber old_blocks, BlockNumber nblocks);
-	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
-	void		(*smgr_registersync) (SMgrRelation reln, ForkNumber forknum);
+								  BlockNumber old_blocks, BlockNumber nblocks, SmgrChainIndex chain_index);
+	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index);
+	void		(*smgr_registersync) (SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index);
 } f_smgr;
 
 extern SMgrId smgr_register(const f_smgr *smgr, Size smgrrelation_size);
+extern SMgrId smgr_lookup(const char *name);
+
+extern f_smgr *smgrsw;
 
 extern void smgrinit(void);
 extern SMgrRelation smgropen(RelFileLocator rlocator, ProcNumber backend);
@@ -158,6 +175,46 @@ extern void smgrregistersync(SMgrRelation reln, ForkNumber forknum);
 extern void AtEOXact_SMgr(void);
 extern bool ProcessBarrierSmgrRelease(void);
 
+extern void
+			smgr_open_next(SMgrRelation reln, SmgrChainIndex chain_index);
+extern void
+			smgr_close_next(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index);
+extern bool
+			smgr_exists_next(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index);
+extern void
+			smgr_create_next(RelFileLocator relold, SMgrRelation reln, ForkNumber forknum, bool isRedo, SmgrChainIndex chain_index);
+extern void
+			smgr_immedsync_next(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index);
+extern void
+			smgr_unlink_next(SMgrRelation reln, RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo, SmgrChainIndex chain_index);
+extern void
+			smgr_extend_next(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+							 const void *buffer, bool skipFsync, SmgrChainIndex chain_index);
+extern void
+			smgr_zeroextend_next(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+								 int nblocks, bool skipFsync, SmgrChainIndex chain_index);
+extern bool
+			smgr_prefetch_next(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+							   int nblocks, SmgrChainIndex chain_index);
+extern uint32
+			smgr_maxcombine_next(SMgrRelation reln, ForkNumber forknum,
+								 BlockNumber blocknum, SmgrChainIndex chain_index);
+extern void
+			smgr_readv_next(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+							void **buffers, BlockNumber nblocks, SmgrChainIndex chain_index);
+extern void
+			smgr_writev_next(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+							 const void **buffers, BlockNumber nblocks, bool skipFsync, SmgrChainIndex chain_index);
+extern void
+			smgr_writeback_next(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+								BlockNumber nblocks, SmgrChainIndex chain_index);
+extern BlockNumber
+			smgr_nblocks_next(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index);
+extern void
+			smgr_truncate_next(SMgrRelation reln, ForkNumber forknum, BlockNumber curnblk, BlockNumber nblocks, SmgrChainIndex chain_index);
+extern void
+			smgr_registersync_next(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index);
+
 static inline void
 smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		 void *buffer)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a117fb633d3..37c82c1c2d7 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2532,6 +2532,7 @@ SID_IDENTIFIER_AUTHORITY
 SID_NAME_USE
 SISeg
 SIZE_T
+SMgrChain
 SMgrRelation
 SMgrRelationData
 SMgrSortArray
-- 
2.47.2

#14Andreas Karlsson
andreas@proxel.se
In reply to: Nitin Jadhav (#11)
Re: Extensible storage manager API - SMGR hook Redux

On 9/21/24 8:24 PM, Nitin Jadhav wrote:

I reviewed the discussion and took a look at the patch sets. It seems
like many things are combined here. Based on the subject, I initially
thought it aimed to provide the infrastructure to easily extend
storage managers. This would allow anyone to create their own storage
managers using this infrastructure. While it addresses this, it also
includes additional features like fsync_checker, which I believe
should be a separate feature. Even though it might use the same
infrastructure, it appears to be a different functionality. I think we
should focus solely on providing the infrastructure here.

I personally think that it is fine that there are patches which provide
a PoC implementation of the new API. It is hard to verify if an API is
correct if there are zero alternative implementations. And there is also
a case to be made for having one of them in contrib just to make it hard
for us to break the API for external users.

That said, I have not yet made up my mind whether this is a good extension
for contrib.

We need to decide on our approach—whether to use a hook-based method
or a registration-based method—and I believe this requires further
discussion.

100% agreed.

The hook-based approach is simple and works well for anyone writing
their own storage manager. However, it has its limitations as we can
either use the default storage manager or a custom-built one for all
the workload, but we cannot choose between multiple storage managers.
On the other hand, the registration-based approach allows choosing
between multiple storage managers based on the workload, though it
requires a lot of changes.
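
The difference between the two approaches can be sketched in a few lines of
C. This is only an illustration under simplified assumptions, not code from
any of the patches: DemoSmgr, demo_smgr_hook, demo_register, demo_lookup and
the two toy implementations are hypothetical names, and a single callback
stands in for the full f_smgr table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily simplified storage manager: a name and one callback. */
typedef struct DemoSmgr
{
	const char *name;
	int			(*nblocks) (int relid);
} DemoSmgr;

static int	md_nblocks(int relid) { return relid * 10; }
static int	remote_nblocks(int relid) { return relid * 1000; }

static const DemoSmgr demo_md = {"md", md_nblocks};
static const DemoSmgr demo_remote = {"remote", remote_nblocks};

/* Hook-based: one global pointer, so one implementation serves every relation. */
static const DemoSmgr *demo_smgr_hook = &demo_md;

static int
hook_nblocks(int relid)
{
	return demo_smgr_hook->nblocks(relid);
}

/* Registration-based: a table of implementations, selectable per call. */
#define DEMO_MAX_SMGR 8
static const DemoSmgr *demo_registry[DEMO_MAX_SMGR];
static int	demo_nregistered = 0;

/* Returns the id assigned to the implementation, or -1 if the table is full. */
static int
demo_register(const DemoSmgr *smgr)
{
	if (demo_nregistered >= DEMO_MAX_SMGR)
		return -1;
	demo_registry[demo_nregistered] = smgr;
	return demo_nregistered++;
}

/* Returns the id registered under the given name, or -1 if there is none. */
static int
demo_lookup(const char *name)
{
	for (int i = 0; i < demo_nregistered; i++)
	{
		if (strcmp(demo_registry[i]->name, name) == 0)
			return i;
	}
	return -1;
}

static int
registry_nblocks(int smgr_id, int relid)
{
	assert(smgr_id >= 0 && smgr_id < demo_nregistered);
	return demo_registry[smgr_id]->nblocks(relid);
}
```

With the hook, switching implementations means overwriting demo_smgr_hook
for the whole process. With the registry, the caller passes an id, so two
relations can be served by different implementations at the same time.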

Are we planning to support other storage managers in PostgreSQL in the
near future? If not, it is better to go with the hook-based approach.
Otherwise, the registration-based approach is preferable as it offers
more flexibility to users and enhances PostgreSQL’s functionality.

Could you please share your thoughts on this? Also, let me know if
this topic has already been discussed and if any conclusions were
reached.

I do not think there is any plan for core to support multiple storage
managers, but there are open source third-party extensions which plan to
implement this API once it has been merged.

Andreas

#15Andreas Karlsson
andreas@proxel.se
In reply to: Andreas Karlsson (#14)
6 attachment(s)
Re: Extensible storage manager API - SMGR hook Redux

Hi,

Here is a rebased version of it to make the CI happy. I plan to work
more on this next week but am happy with any feedback on what is already
there.

Andreas

Attachments:

v4-0006-SMGR-GUC-variable-and-chaining.patch (text/x-patch; charset=UTF-8)
From 1dd58c606c5e026c96ed93a71d85d3605078a429 Mon Sep 17 00:00:00 2001
From: Zsolt Parragi <zsolt.parragi@cancellar.hu>
Date: Mon, 2 Dec 2024 07:47:07 +0000
Subject: [PATCH v4 6/6] SMGR GUC variable and chaining

The overall goal of this commit is to introduce a user interface to
the previous SMGR patch.

The idea is to allow a simple configuration for multiple "modifier"
SMGRs similar to the fsync_checker in the original proposal.

* Extensions should be able to declare named lists of SMGR implementations,
also specifying if the given SMGR is an "end" implementation for
actual storage, or if it is a modifier implementation for some other
purpose.
* Users should be able to specify a list of SMGRs: possibly multiple modifiers,
and one storage implementation at the end to configure how the
storage manager is constructed.

This commit introduces a new GUC variable, `smgr_chain`, which allows
users to configure multiple SMGR implementations: it is a comma-separated list,
where the last entry must be a storage implementation and all others must be
modifiers. The default value of this variable is "md".
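
The rule for `smgr_chain` can be sketched as a small standalone validator.
This is a hypothetical illustration, not the code in this commit: the names
DemoChainPosition, DemoSmgrEntry, demo_known and demo_chain_is_valid are
invented, the table of known managers is hard-coded, and whitespace around
entries is not handled. Under the rule above, a value such as
"fsync_checker,md" would be accepted, while "md,fsync_checker" would not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for SMGR_CHAIN_MODIFIER / SMGR_CHAIN_TAIL. */
typedef enum
{
	DEMO_CHAIN_MODIFIER,
	DEMO_CHAIN_TAIL
} DemoChainPosition;

typedef struct
{
	const char *name;
	DemoChainPosition position;
} DemoSmgrEntry;

/* Hard-coded for the sketch; the real list comes from smgr_register(). */
static const DemoSmgrEntry demo_known[] = {
	{"md", DEMO_CHAIN_TAIL},
	{"fsync_checker", DEMO_CHAIN_MODIFIER},
};

/* Find a known entry whose name equals the first len bytes of name. */
static const DemoSmgrEntry *
demo_find(const char *name, size_t len)
{
	for (size_t i = 0; i < sizeof(demo_known) / sizeof(demo_known[0]); i++)
	{
		if (strlen(demo_known[i].name) == len &&
			strncmp(demo_known[i].name, name, len) == 0)
			return &demo_known[i];
	}
	return NULL;
}

/*
 * A chain is valid if every entry is known, every entry except the last
 * is a modifier, and the last entry is a storage (tail) implementation.
 */
static bool
demo_chain_is_valid(const char *chain)
{
	const char *p = chain;

	assert(chain != NULL);

	for (;;)
	{
		const char *comma = strchr(p, ',');
		size_t		len = comma ? (size_t) (comma - p) : strlen(p);
		const DemoSmgrEntry *entry = demo_find(p, len);

		if (entry == NULL)
			return false;		/* unknown or empty name */
		if (comma == NULL)
			return entry->position == DEMO_CHAIN_TAIL;
		if (entry->position != DEMO_CHAIN_MODIFIER)
			return false;		/* storage implementation before the end */
		p = comma + 1;
	}
}
```

Whether the same modifier may appear twice in a chain is not stated here; the
sketch accepts it.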

The internal storage manager API is also refactored to include an easy
way for SMGR implementations to support proper chaining. Modifier SMGR
implementations also only have to implement the functions they actually
change, and can leave everything else as empty (NULL). And with this
change we can make the functions of the md smgr static.
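
The dispatch idea can be sketched outside PostgreSQL as follows. The sketch
mirrors the approach of skipping chain entries whose callback is NULL, but
it is a simplified illustration: DemoSmgr, DemoChain, demo_write_next and
demo_nblocks_next are hypothetical names, and two toy callbacks stand in
for the full f_smgr table.

```c
#include <assert.h>
#include <stddef.h>

struct DemoChain;
typedef int (*DemoMethod) (const struct DemoChain *chain, int chain_index);

/* A callback left as NULL means "not implemented here, ask the next entry". */
typedef struct DemoSmgr
{
	const char *name;
	DemoMethod	write;
	DemoMethod	nblocks;
} DemoSmgr;

typedef struct DemoChain
{
	const DemoSmgr *entries[4];
	int			size;
} DemoChain;

/* Call write on the first entry at or after chain_index that implements it. */
static int
demo_write_next(const DemoChain *chain, int chain_index)
{
	while (chain_index < chain->size &&
		   chain->entries[chain_index]->write == NULL)
		chain_index++;
	assert(chain_index < chain->size);	/* the tail implements everything */
	return chain->entries[chain_index]->write(chain, chain_index);
}

/* Same lookup for nblocks. */
static int
demo_nblocks_next(const DemoChain *chain, int chain_index)
{
	while (chain_index < chain->size &&
		   chain->entries[chain_index]->nblocks == NULL)
		chain_index++;
	assert(chain_index < chain->size);
	return chain->entries[chain_index]->nblocks(chain, chain_index);
}

static int	demo_writes_seen = 0;

/* Modifier: observes the write, then forwards it down the chain. */
static int
checker_write(const DemoChain *chain, int chain_index)
{
	demo_writes_seen++;
	return demo_write_next(chain, chain_index + 1);
}

/* Tail: the toy "storage" implementation. */
static int
tail_write(const DemoChain *chain, int chain_index)
{
	(void) chain;
	(void) chain_index;
	return 1;
}

static int
tail_nblocks(const DemoChain *chain, int chain_index)
{
	(void) chain;
	(void) chain_index;
	return 42;
}

static const DemoSmgr demo_checker = {"checker", checker_write, NULL};
static const DemoSmgr demo_tail = {"tail", tail_write, tail_nblocks};
static const DemoChain demo_chain = {{&demo_checker, &demo_tail}, 2};
```

Calling demo_write_next(&demo_chain, 0) passes through the modifier before
reaching the tail, while demo_nblocks_next(&demo_chain, 0) skips the
modifier because its nblocks callback is NULL. Note that the bounds check is
only an assertion in this sketch, so a chain without a complete tail would
go unchecked in a build without assertions.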

The fsync example extension is also modified to match the new API.
---
 contrib/fsync_checker/fsync_checker_smgr.c    |  57 ++--
 src/backend/postmaster/postmaster.c           |   2 +
 src/backend/storage/smgr/md.c                 | 102 +++++---
 src/backend/storage/smgr/smgr.c               | 247 ++++++++++++++----
 src/backend/tcop/postgres.c                   |   2 +
 src/backend/utils/init/miscinit.c             |  63 ++++-
 src/backend/utils/misc/guc_tables.c           |  11 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/miscadmin.h                       |   2 +
 src/include/storage/md.h                      |  28 --
 src/include/storage/smgr.h                    |  93 +++++--
 src/tools/pgindent/typedefs.list              |   1 +
 12 files changed, 448 insertions(+), 161 deletions(-)

diff --git a/contrib/fsync_checker/fsync_checker_smgr.c b/contrib/fsync_checker/fsync_checker_smgr.c
index 97ad0f78da8..626ff058764 100644
--- a/contrib/fsync_checker/fsync_checker_smgr.c
+++ b/contrib/fsync_checker/fsync_checker_smgr.c
@@ -27,15 +27,15 @@ typedef struct
 void		_PG_init(void);
 
 static void fsync_checker_extend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-								 const void *buffer, bool skipFsync);
-static void fsync_checker_immedsync(SMgrRelation reln, ForkNumber forknum);
+								 const void *buffer, bool skipFsync, SmgrChainIndex chain_index);
+static void fsync_checker_immedsync(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index);
 static void fsync_checker_writev(SMgrRelation reln, ForkNumber forknum,
 								 BlockNumber blocknum, const void **buffers,
-								 BlockNumber nblocks, bool skipFsync);
+								 BlockNumber nblocks, bool skipFsync, SmgrChainIndex chain_index);
 static void fsync_checker_writeback(SMgrRelation reln, ForkNumber forknum,
-									BlockNumber blocknum, BlockNumber nblocks);
+									BlockNumber blocknum, BlockNumber nblocks, SmgrChainIndex chain_index);
 static void fsync_checker_zeroextend(SMgrRelation reln, ForkNumber forknum,
-									 BlockNumber blocknum, int nblocks, bool skipFsync);
+									 BlockNumber blocknum, int nblocks, bool skipFsync, SmgrChainIndex chain_index);
 
 static void fsync_checker_checkpoint_create(const CheckPoint *checkPoint);
 static void fsync_checker_shmem_request(void);
@@ -47,24 +47,25 @@ static void remove_reln(SMgrRelation reln, ForkNumber forknum);
 static SMgrId fsync_checker_smgr_id;
 static const struct f_smgr fsync_checker_smgr = {
 	.name = "fsync_checker",
-	.smgr_init = mdinit,
+	.chain_position = SMGR_CHAIN_MODIFIER,
+	.smgr_init = NULL,
 	.smgr_shutdown = NULL,
-	.smgr_open = mdopen,
-	.smgr_close = mdclose,
-	.smgr_create = mdcreate,
-	.smgr_exists = mdexists,
-	.smgr_unlink = mdunlink,
+	.smgr_open = NULL,
+	.smgr_close = NULL,
+	.smgr_create = NULL,
+	.smgr_exists = NULL,
+	.smgr_unlink = NULL,
 	.smgr_extend = fsync_checker_extend,
 	.smgr_zeroextend = fsync_checker_zeroextend,
-	.smgr_prefetch = mdprefetch,
-	.smgr_maxcombine = mdmaxcombine,
-	.smgr_readv = mdreadv,
+	.smgr_prefetch = NULL,
+	.smgr_maxcombine = NULL,
+	.smgr_readv = NULL,
 	.smgr_writev = fsync_checker_writev,
 	.smgr_writeback = fsync_checker_writeback,
-	.smgr_nblocks = mdnblocks,
-	.smgr_truncate = mdtruncate,
+	.smgr_nblocks = NULL,
+	.smgr_truncate = NULL,
 	.smgr_immedsync = fsync_checker_immedsync,
-	.smgr_registersync = mdregistersync,
+	.smgr_registersync = NULL,
 };
 
 static HTAB *volatile_relns;
@@ -91,8 +92,6 @@ _PG_init(void)
 	 * could use MdSmgrRelation as the parent.
 	 */
 	fsync_checker_smgr_id = smgr_register(&fsync_checker_smgr, 0);
-
-	storage_manager_id = fsync_checker_smgr_id;
 }
 
 static void
@@ -200,50 +199,50 @@ remove_reln(SMgrRelation reln, ForkNumber forknum)
 
 static void
 fsync_checker_extend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-					 const void *buffer, bool skipFsync)
+					 const void *buffer, bool skipFsync, SmgrChainIndex chain_index)
 {
 	if (!SmgrIsTemp(reln) && !skipFsync)
 		add_reln(reln, forknum);
 
-	mdextend(reln, forknum, blocknum, buffer, skipFsync);
+	smgr_extend_next(reln, forknum, blocknum, buffer, skipFsync, chain_index + 1);
 }
 
 static void
-fsync_checker_immedsync(SMgrRelation reln, ForkNumber forknum)
+fsync_checker_immedsync(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index)
 {
 	if (!SmgrIsTemp(reln))
 		remove_reln(reln, forknum);
 
-	mdimmedsync(reln, forknum);
+	smgr_immedsync_next(reln, forknum, chain_index + 1);
 }
 
 static void
 fsync_checker_writev(SMgrRelation reln, ForkNumber forknum,
 					 BlockNumber blocknum, const void **buffers,
-					 BlockNumber nblocks, bool skipFsync)
+					 BlockNumber nblocks, bool skipFsync, SmgrChainIndex chain_index)
 {
 	if (!SmgrIsTemp(reln) && !skipFsync)
 		add_reln(reln, forknum);
 
-	mdwritev(reln, forknum, blocknum, buffers, nblocks, skipFsync);
+	smgr_writev_next(reln, forknum, blocknum, buffers, nblocks, skipFsync, chain_index + 1);
 }
 
 static void
 fsync_checker_writeback(SMgrRelation reln, ForkNumber forknum,
-						BlockNumber blocknum, BlockNumber nblocks)
+						BlockNumber blocknum, BlockNumber nblocks, SmgrChainIndex chain_index)
 {
 	if (!SmgrIsTemp(reln))
 		remove_reln(reln, forknum);
 
-	mdwriteback(reln, forknum, blocknum, nblocks);
+	smgr_writeback_next(reln, forknum, blocknum, nblocks, chain_index + 1);
 }
 
 static void
 fsync_checker_zeroextend(SMgrRelation reln, ForkNumber forknum,
-						 BlockNumber blocknum, int nblocks, bool skipFsync)
+						 BlockNumber blocknum, int nblocks, bool skipFsync, SmgrChainIndex chain_index)
 {
 	if (!SmgrIsTemp(reln) && !skipFsync)
 		add_reln(reln, forknum);
 
-	mdzeroextend(reln, forknum, blocknum, nblocks, skipFsync);
+	smgr_zeroextend_next(reln, forknum, blocknum, nblocks, skipFsync, chain_index + 1);
 }
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 88ea821573d..3a9fba2fbc5 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -926,6 +926,8 @@ PostmasterMain(int argc, char *argv[])
 	 */
 	process_shared_preload_libraries();
 
+	process_smgr_chain();
+
 	/*
 	 * Initialize SSL library, if specified.
 	 */
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 1766bbe1e57..963bd0e9cde 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -124,6 +124,33 @@ typedef MdSMgrRelationData *MdSMgrRelation;
 /* don't try to open a segment, if not already open */
 #define EXTENSION_DONT_OPEN			(1 << 5)
 
+/* md storage manager functionality */
+static void mdinit(void);
+static void mdopen(SMgrRelation reln, SmgrChainIndex chain_index);
+static void mdclose(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index);
+static void mdcreate(RelFileLocator relold, SMgrRelation reln, ForkNumber forknum, bool isRedo, SmgrChainIndex chain_index);
+static bool mdexists(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index);
+static void mdunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo, SmgrChainIndex chain_index);
+static void mdextend(SMgrRelation reln, ForkNumber forknum,
+					 BlockNumber blocknum, const void *buffer, bool skipFsync, SmgrChainIndex chain_index);
+static void mdzeroextend(SMgrRelation reln, ForkNumber forknum,
+						 BlockNumber blocknum, int nblocks, bool skipFsync, SmgrChainIndex chain_index);
+static bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
+					   BlockNumber blocknum, int nblocks, SmgrChainIndex chain_index);
+static uint32 mdmaxcombine(SMgrRelation reln, ForkNumber forknum,
+						   BlockNumber blocknum, SmgrChainIndex chain_index);
+static void mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+					void **buffers, BlockNumber nblocks, SmgrChainIndex chain_index);
+static void mdwritev(SMgrRelation reln, ForkNumber forknum,
+					 BlockNumber blocknum,
+					 const void **buffers, BlockNumber nblocks, bool skipFsync, SmgrChainIndex chain_index);
+static void mdwriteback(SMgrRelation reln, ForkNumber forknum,
+						BlockNumber blocknum, BlockNumber nblocks, SmgrChainIndex chain_index);
+static BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index);
+static void mdtruncate(SMgrRelation reln, ForkNumber forknum,
+					   BlockNumber old_blocks, BlockNumber nblocks, SmgrChainIndex chain_index);
+static void mdimmedsync(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index);
+static void mdregistersync(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index);
 
 /*
  * Fixed-length string to represent paths to files that need to be built by
@@ -151,6 +178,7 @@ mdsmgr_register(void)
 	/* magnetic disk */
 	f_smgr		md_smgr = (f_smgr) {
 		.name = "md",
+		.chain_position = SMGR_CHAIN_TAIL,
 		.smgr_init = mdinit,
 		.smgr_shutdown = NULL,
 		.smgr_open = mdopen,
@@ -210,7 +238,7 @@ _mdfd_open_flags(void)
 /*
  * mdinit() -- Initialize private state for magnetic disk storage manager.
  */
-void
+static void
 mdinit(void)
 {
 	MdCxt = AllocSetContextCreate(TopMemoryContext,
@@ -223,8 +251,8 @@ mdinit(void)
  *
  * Note: this will return true for lingering files, with pending deletions
  */
-bool
-mdexists(SMgrRelation reln, ForkNumber forknum)
+static bool
+mdexists(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index)
 {
 	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 
@@ -234,7 +262,7 @@ mdexists(SMgrRelation reln, ForkNumber forknum)
 	 * which already closes relations when dropping them.
 	 */
 	if (!InRecovery)
-		mdclose(reln, forknum);
+		mdclose(reln, forknum, 0);
 
 	return (mdopenfork(mdreln, forknum, EXTENSION_RETURN_NULL) != NULL);
 }
@@ -244,8 +272,8 @@ mdexists(SMgrRelation reln, ForkNumber forknum)
  *
  * If isRedo is true, it's okay for the relation to exist already.
  */
-void
-mdcreate(RelFileLocator /* reln */, SMgrRelation reln, ForkNumber forknum, bool isRedo)
+static void
+mdcreate(RelFileLocator relold, SMgrRelation reln, ForkNumber forknum, bool isRedo, SmgrChainIndex chain_index)
 {
 	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	MdfdVec    *mdfd;
@@ -360,8 +388,8 @@ mdcreate(RelFileLocator /* reln */, SMgrRelation reln, ForkNumber forknum, bool
  * Note: any failure should be reported as WARNING not ERROR, because
  * we are usually not in a transaction anymore when this is called.
  */
-void
-mdunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo)
+static void
+mdunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo, SmgrChainIndex chain_index)
 {
 	/* Now do the per-fork work */
 	if (forknum == InvalidForkNumber)
@@ -510,9 +538,9 @@ mdunlinkfork(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo)
  * EOF).  Note that we assume writing a block beyond current EOF
  * causes intervening file space to become filled with zeroes.
  */
-void
+static void
 mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		 const void *buffer, bool skipFsync)
+		 const void *buffer, bool skipFsync, SmgrChainIndex chain_index)
 {
 	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	off_t		seekpos;
@@ -576,9 +604,9 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
  * Similar to mdextend(), except the relation can be extended by multiple
  * blocks at once and the added blocks will be filled with zeroes.
  */
-void
+static void
 mdzeroextend(SMgrRelation reln, ForkNumber forknum,
-			 BlockNumber blocknum, int nblocks, bool skipFsync)
+			 BlockNumber blocknum, int nblocks, bool skipFsync, SmgrChainIndex chain_index)
 {
 	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	MdfdVec    *v;
@@ -727,8 +755,8 @@ mdopenfork(MdSMgrRelation reln, ForkNumber forknum, int behavior)
 /*
  * mdopen() -- Initialize newly-opened relation.
  */
-void
-mdopen(SMgrRelation reln)
+static void
+mdopen(SMgrRelation reln, SmgrChainIndex chain_index)
 {
 	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 
@@ -740,8 +768,8 @@ mdopen(SMgrRelation reln)
 /*
  * mdclose() -- Close the specified relation, if it isn't closed already.
  */
-void
-mdclose(SMgrRelation reln, ForkNumber forknum)
+static void
+mdclose(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index)
 {
 	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	int			nopensegs = mdreln->md_num_open_segs[forknum];
@@ -764,9 +792,9 @@ mdclose(SMgrRelation reln, ForkNumber forknum)
 /*
  * mdprefetch() -- Initiate asynchronous read of the specified blocks of a relation
  */
-bool
+static bool
 mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		   int nblocks)
+		   int nblocks, SmgrChainIndex chain_index)
 {
 #ifdef USE_PREFETCH
 	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
@@ -862,9 +890,9 @@ buffers_to_iovec(struct iovec *iov, void **buffers, int nblocks)
  * mdmaxcombine() -- Return the maximum number of total blocks that can be
  *				 combined with an IO starting at blocknum.
  */
-uint32
+static uint32
 mdmaxcombine(SMgrRelation reln, ForkNumber forknum,
-			 BlockNumber blocknum)
+			 BlockNumber blocknum, SmgrChainIndex index)
 {
 	BlockNumber segoff;
 
@@ -876,9 +904,9 @@ mdmaxcombine(SMgrRelation reln, ForkNumber forknum,
 /*
  * mdreadv() -- Read the specified blocks from a relation.
  */
-void
+static void
 mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		void **buffers, BlockNumber nblocks)
+		void **buffers, BlockNumber nblocks, SmgrChainIndex chain_index)
 {
 	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 
@@ -999,9 +1027,9 @@ mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
  * relation (ie, those before the current EOF).  To extend a relation,
  * use mdextend().
  */
-void
+static void
 mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		 const void **buffers, BlockNumber nblocks, bool skipFsync)
+		 const void **buffers, BlockNumber nblocks, bool skipFsync, SmgrChainIndex chain_index)
 {
 	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 
@@ -1106,9 +1134,9 @@ mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
  * This accepts a range of blocks because flushing several pages at once is
  * considerably more efficient than doing so individually.
  */
-void
+static void
 mdwriteback(SMgrRelation reln, ForkNumber forknum,
-			BlockNumber blocknum, BlockNumber nblocks)
+			BlockNumber blocknum, BlockNumber nblocks, SmgrChainIndex chain_index)
 {
 	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 
@@ -1167,8 +1195,8 @@ mdwriteback(SMgrRelation reln, ForkNumber forknum,
  * called, then only segments up to the last one actually touched
  * are present in the array.
  */
-BlockNumber
-mdnblocks(SMgrRelation reln, ForkNumber forknum)
+static BlockNumber
+mdnblocks(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index)
 {
 	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	MdfdVec    *v;
@@ -1232,9 +1260,9 @@ mdnblocks(SMgrRelation reln, ForkNumber forknum)
  * sure we have opened all active segments, so that truncate loop will get
  * them all!
  */
-void
+static void
 mdtruncate(SMgrRelation reln, ForkNumber forknum,
-		   BlockNumber curnblk, BlockNumber nblocks)
+		   BlockNumber curnblk, BlockNumber nblocks, SmgrChainIndex chain_index)
 {
 	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	BlockNumber priorblocks;
@@ -1322,8 +1350,8 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum,
 /*
  * mdregistersync() -- Mark whole relation as needing fsync
  */
-void
-mdregistersync(SMgrRelation reln, ForkNumber forknum)
+static void
+mdregistersync(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index)
 {
 	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	int			segno;
@@ -1333,7 +1361,7 @@ mdregistersync(SMgrRelation reln, ForkNumber forknum)
 	 * NOTE: mdnblocks makes sure we have opened all active segments, so that
 	 * the loop below will get them all!
 	 */
-	mdnblocks(reln, forknum);
+	mdnblocks(reln, forknum, 0);
 
 	min_inactive_seg = segno = mdreln->md_num_open_segs[forknum];
 
@@ -1374,8 +1402,8 @@ mdregistersync(SMgrRelation reln, ForkNumber forknum)
  * crash before the next checkpoint syncs the newly-inactive segment, that
  * segment may survive recovery, reintroducing unwanted data into the table.
  */
-void
-mdimmedsync(SMgrRelation reln, ForkNumber forknum)
+static void
+mdimmedsync(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index)
 {
 	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	int			segno;
@@ -1385,7 +1413,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 	 * NOTE: mdnblocks makes sure we have opened all active segments, so that
 	 * the loop below will get them all!
 	 */
-	mdnblocks(reln, forknum);
+	mdnblocks(reln, forknum, 0);
 
 	min_inactive_seg = segno = mdreln->md_num_open_segs[forknum];
 
@@ -1750,7 +1778,7 @@ _mdfd_getseg(MdSMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 
 				mdextend((SMgrRelation) reln, forknum,
 						 nextsegno * ((BlockNumber) RELSEG_SIZE) - 1,
-						 zerobuf, skipFsync);
+						 zerobuf, skipFsync, 0);
 				pfree(zerobuf);
 			}
 			flags = O_CREAT;
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 0498fd6c317..08892563768 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -63,13 +63,13 @@
 #include "utils/inval.h"
 #include "utils/memutils.h"
 
-static f_smgr *smgrsw;
+f_smgr	   *smgrsw;
 
 static int	NSmgr = 0;
 
 static Size LargestSMgrRelationSize = 0;
 
-SMgrId		storage_manager_id;
+SMgrChain	storage_manager_chain;
 
 /*
  * Each backend has a hashtable that stores all extant SMgrRelation objects.
@@ -98,20 +98,23 @@ smgr_register(const f_smgr *smgr, Size smgrrelation_size)
 	if (smgr->name == NULL || *smgr->name == 0)
 		elog(FATAL, "smgr registered with invalid name");
 
-	Assert(smgr->smgr_open != NULL);
-	Assert(smgr->smgr_close != NULL);
-	Assert(smgr->smgr_create != NULL);
-	Assert(smgr->smgr_exists != NULL);
-	Assert(smgr->smgr_unlink != NULL);
-	Assert(smgr->smgr_extend != NULL);
-	Assert(smgr->smgr_zeroextend != NULL);
-	Assert(smgr->smgr_prefetch != NULL);
-	Assert(smgr->smgr_readv != NULL);
-	Assert(smgr->smgr_writev != NULL);
-	Assert(smgr->smgr_writeback != NULL);
-	Assert(smgr->smgr_nblocks != NULL);
-	Assert(smgr->smgr_truncate != NULL);
-	Assert(smgr->smgr_immedsync != NULL);
+	if (smgr->chain_position == SMGR_CHAIN_TAIL)
+	{
+		Assert(smgr->smgr_open != NULL);
+		Assert(smgr->smgr_close != NULL);
+		Assert(smgr->smgr_create != NULL);
+		Assert(smgr->smgr_exists != NULL);
+		Assert(smgr->smgr_unlink != NULL);
+		Assert(smgr->smgr_extend != NULL);
+		Assert(smgr->smgr_zeroextend != NULL);
+		Assert(smgr->smgr_prefetch != NULL);
+		Assert(smgr->smgr_readv != NULL);
+		Assert(smgr->smgr_writev != NULL);
+		Assert(smgr->smgr_writeback != NULL);
+		Assert(smgr->smgr_nblocks != NULL);
+		Assert(smgr->smgr_truncate != NULL);
+		Assert(smgr->smgr_immedsync != NULL);
+	}
 
 	old = MemoryContextSwitchTo(TopMemoryContext);
 
@@ -138,6 +141,17 @@ smgr_register(const f_smgr *smgr, Size smgrrelation_size)
 	return my_id;
 }
 
+SMgrId
+smgr_lookup(const char *name)
+{
+	for (int i = 0; i < NSmgr; i++)
+	{
+		if (strcmp(smgrsw[i].name, name) == 0)
+			return i;
+	}
+	elog(FATAL, "Storage manager not found with name: %s", name);
+}
+
 /*
  * smgrinit(), smgrshutdown() -- Initialize or shut down storage
  *								 managers.
@@ -176,6 +190,22 @@ smgrshutdown(int code, Datum arg)
 	}
 }
 
+#define SMGR_CHAIN_LOOKUP(SMGR_METHOD) \
+	do \
+	{ \
+		while (chain_index < reln->smgr_chain.size && smgrsw[reln->smgr_chain.chain[chain_index]].SMGR_METHOD == NULL) \
+			chain_index++; \
+		Assert(chain_index < reln->smgr_chain.size); \
+	} while (0)
+
+void
+smgr_open_next(SMgrRelation reln, SmgrChainIndex chain_index)
+{
+	SMGR_CHAIN_LOOKUP(smgr_open);
+
+	smgrsw[reln->smgr_chain.chain[chain_index]].smgr_open(reln, chain_index);
+}
+
 /*
  * smgropen() -- Return an SMgrRelation object, creating it if need be.
  *
@@ -229,10 +259,10 @@ smgropen(RelFileLocator rlocator, ProcNumber backend)
 		for (int i = 0; i <= MAX_FORKNUM; ++i)
 			reln->smgr_cached_nblocks[i] = InvalidBlockNumber;
 
-		reln->smgr_which = storage_manager_id;
+		memcpy(&reln->smgr_chain, &storage_manager_chain, sizeof(SMgrChain));
 
 		/* implementation-specific initialization */
-		smgrsw[reln->smgr_which].smgr_open(reln);
+		smgr_open_next(reln, 0);
 
 		/* it is not pinned yet */
 		reln->pincount = 0;
@@ -270,6 +300,14 @@ smgrunpin(SMgrRelation reln)
 		dlist_push_tail(&unpinned_relns, &reln->node);
 }
 
+void
+smgr_close_next(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index)
+{
+	SMGR_CHAIN_LOOKUP(smgr_close);
+
+	smgrsw[reln->smgr_chain.chain[chain_index]].smgr_close(reln, forknum, chain_index);
+}
+
 /*
  * smgrdestroy() -- Delete an SMgrRelation object.
  */
@@ -281,7 +319,7 @@ smgrdestroy(SMgrRelation reln)
 	Assert(reln->pincount == 0);
 
 	for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
-		smgrsw[reln->smgr_which].smgr_close(reln, forknum);
+		smgr_close_next(reln, forknum, 0);
 
 	dlist_delete(&reln->node);
 
@@ -301,7 +339,7 @@ smgrrelease(SMgrRelation reln)
 {
 	for (ForkNumber forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 	{
-		smgrsw[reln->smgr_which].smgr_close(reln, forknum);
+		smgr_close_next(reln, forknum, 0);
 		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
 	}
 	reln->smgr_targblock = InvalidBlockNumber;
@@ -391,13 +429,29 @@ smgrreleaserellocator(RelFileLocatorBackend rlocator)
 		smgrrelease(reln);
 }
 
+bool
+smgr_exists_next(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index)
+{
+	SMGR_CHAIN_LOOKUP(smgr_exists);
+
+	return smgrsw[reln->smgr_chain.chain[chain_index]].smgr_exists(reln, forknum, chain_index);
+}
+
 /*
  * smgrexists() -- Does the underlying file for a fork exist?
  */
 bool
 smgrexists(SMgrRelation reln, ForkNumber forknum)
 {
-	return smgrsw[reln->smgr_which].smgr_exists(reln, forknum);
+	return smgr_exists_next(reln, forknum, 0);
+}
+
+void
+smgr_create_next(RelFileLocator relold, SMgrRelation reln, ForkNumber forknum, bool isRedo, SmgrChainIndex chain_index)
+{
+	SMGR_CHAIN_LOOKUP(smgr_create);
+
+	smgrsw[reln->smgr_chain.chain[chain_index]].smgr_create(relold, reln, forknum, isRedo, chain_index);
 }
 
 /*
@@ -410,7 +464,15 @@ smgrexists(SMgrRelation reln, ForkNumber forknum)
 void
 smgrcreate(RelFileLocator relold, SMgrRelation reln, ForkNumber forknum, bool isRedo)
 {
-	smgrsw[reln->smgr_which].smgr_create(relold, reln, forknum, isRedo);
+	smgr_create_next(relold, reln, forknum, isRedo, 0);
+}
+
+void
+smgr_immedsync_next(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index)
+{
+	SMGR_CHAIN_LOOKUP(smgr_immedsync);
+
+	smgrsw[reln->smgr_chain.chain[chain_index]].smgr_immedsync(reln, forknum, chain_index);
 }
 
 /*
@@ -438,16 +500,22 @@ smgrdosyncall(SMgrRelation *rels, int nrels)
 	 */
 	for (i = 0; i < nrels; i++)
 	{
-		int			which = rels[i]->smgr_which;
-
 		for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 		{
-			if (smgrsw[which].smgr_exists(rels[i], forknum))
-				smgrsw[which].smgr_immedsync(rels[i], forknum);
+			if (smgr_exists_next(rels[i], forknum, 0))
+				smgr_immedsync_next(rels[i], forknum, 0);
 		}
 	}
 }
 
+void
+smgr_unlink_next(SMgrRelation reln, RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo, SmgrChainIndex chain_index)
+{
+	SMGR_CHAIN_LOOKUP(smgr_unlink);
+
+	smgrsw[reln->smgr_chain.chain[chain_index]].smgr_unlink(rlocator, forknum, isRedo, chain_index);
+}
+
 /*
  * smgrdounlinkall() -- Immediately unlink all forks of all given relations
  *
@@ -482,13 +550,12 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
 	for (i = 0; i < nrels; i++)
 	{
 		RelFileLocatorBackend rlocator = rels[i]->smgr_rlocator;
-		int			which = rels[i]->smgr_which;
 
 		rlocators[i] = rlocator;
 
 		/* Close the forks at smgr level */
 		for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
-			smgrsw[which].smgr_close(rels[i], forknum);
+			smgr_close_next(rels[i], forknum, 0);
 	}
 
 	/*
@@ -512,15 +579,22 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
 
 	for (i = 0; i < nrels; i++)
 	{
-		int			which = rels[i]->smgr_which;
-
 		for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
-			smgrsw[which].smgr_unlink(rlocators[i], forknum, isRedo);
+			smgr_unlink_next(rels[i], rlocators[i], forknum, isRedo, 0);
 	}
 
 	pfree(rlocators);
 }
 
+void
+smgr_extend_next(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+				 const void *buffer, bool skipFsync, SmgrChainIndex chain_index)
+{
+	SMGR_CHAIN_LOOKUP(smgr_extend);
+
+	smgrsw[reln->smgr_chain.chain[chain_index]].smgr_extend(reln, forknum, blocknum,
+															buffer, skipFsync, chain_index);
+}
 
 /*
  * smgrextend() -- Add a new block to a file.
@@ -535,8 +609,7 @@ void
 smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		   const void *buffer, bool skipFsync)
 {
-	smgrsw[reln->smgr_which].smgr_extend(reln, forknum, blocknum,
-										 buffer, skipFsync);
+	smgr_extend_next(reln, forknum, blocknum, buffer, skipFsync, 0);
 
 	/*
 	 * Normally we expect this to increase nblocks by one, but if the cached
@@ -549,6 +622,16 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
 }
 
+void
+smgr_zeroextend_next(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+					 int nblocks, bool skipFsync, SmgrChainIndex chain_index)
+{
+	SMGR_CHAIN_LOOKUP(smgr_zeroextend);
+
+	smgrsw[reln->smgr_chain.chain[chain_index]].smgr_zeroextend(reln, forknum, blocknum,
+																nblocks, skipFsync, chain_index);
+}
+
 /*
  * smgrzeroextend() -- Add new zeroed out blocks to a file.
  *
@@ -560,8 +643,7 @@ void
 smgrzeroextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 			   int nblocks, bool skipFsync)
 {
-	smgrsw[reln->smgr_which].smgr_zeroextend(reln, forknum, blocknum,
-											 nblocks, skipFsync);
+	smgr_zeroextend_next(reln, forknum, blocknum, nblocks, skipFsync, 0);
 
 	/*
 	 * Normally we expect this to increase the fork size by nblocks, but if
@@ -574,6 +656,16 @@ smgrzeroextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
 }
 
+bool
+smgr_prefetch_next(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+				   int nblocks, SmgrChainIndex chain_index)
+{
+	SMGR_CHAIN_LOOKUP(smgr_prefetch);
+
+	return smgrsw[reln->smgr_chain.chain[chain_index]].smgr_prefetch(reln, forknum, blocknum,
+																	 nblocks, chain_index);
+}
+
 /*
  * smgrprefetch() -- Initiate asynchronous read of the specified block of a relation.
  *
@@ -585,7 +677,16 @@ bool
 smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 			 int nblocks)
 {
-	return smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum, nblocks);
+	return smgr_prefetch_next(reln, forknum, blocknum, nblocks, 0);
+}
+
+uint32
+smgr_maxcombine_next(SMgrRelation reln, ForkNumber forknum,
+					 BlockNumber blocknum, SmgrChainIndex chain_index)
+{
+	SMGR_CHAIN_LOOKUP(smgr_maxcombine);
+
+	return smgrsw[reln->smgr_chain.chain[chain_index]].smgr_maxcombine(reln, forknum, blocknum, chain_index);
 }
 
 /*
@@ -598,7 +699,17 @@ uint32
 smgrmaxcombine(SMgrRelation reln, ForkNumber forknum,
 			   BlockNumber blocknum)
 {
-	return smgrsw[reln->smgr_which].smgr_maxcombine(reln, forknum, blocknum);
+	return smgr_maxcombine_next(reln, forknum, blocknum, 0);
+}
+
+void
+smgr_readv_next(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+				void **buffers, BlockNumber nblocks, SmgrChainIndex chain_index)
+{
+	SMGR_CHAIN_LOOKUP(smgr_readv);
+
+	smgrsw[reln->smgr_chain.chain[chain_index]].smgr_readv(reln, forknum, blocknum,
+														   buffers, nblocks, chain_index);
 }
 
 /*
@@ -616,8 +727,17 @@ void
 smgrreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		  void **buffers, BlockNumber nblocks)
 {
-	smgrsw[reln->smgr_which].smgr_readv(reln, forknum, blocknum, buffers,
-										nblocks);
+	smgr_readv_next(reln, forknum, blocknum, buffers, nblocks, 0);
+}
+
+void
+smgr_writev_next(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+				 const void **buffers, BlockNumber nblocks, bool skipFsync, SmgrChainIndex chain_index)
+{
+	SMGR_CHAIN_LOOKUP(smgr_writev);
+
+	smgrsw[reln->smgr_chain.chain[chain_index]].smgr_writev(reln, forknum, blocknum,
+															buffers, nblocks, skipFsync, chain_index);
 }
 
 /*
@@ -650,8 +770,17 @@ void
 smgrwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		   const void **buffers, BlockNumber nblocks, bool skipFsync)
 {
-	smgrsw[reln->smgr_which].smgr_writev(reln, forknum, blocknum,
-										 buffers, nblocks, skipFsync);
+	smgr_writev_next(reln, forknum, blocknum,
+					 buffers, nblocks, skipFsync, 0);
+}
+
+void
+smgr_writeback_next(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+					BlockNumber nblocks, SmgrChainIndex chain_index)
+{
+	SMGR_CHAIN_LOOKUP(smgr_writeback);
+
+	smgrsw[reln->smgr_chain.chain[chain_index]].smgr_writeback(reln, forknum, blocknum, nblocks, chain_index);
 }
 
 /*
@@ -662,8 +791,15 @@ void
 smgrwriteback(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 			  BlockNumber nblocks)
 {
-	smgrsw[reln->smgr_which].smgr_writeback(reln, forknum, blocknum,
-											nblocks);
+	smgr_writeback_next(reln, forknum, blocknum, nblocks, 0);
+}
+
+BlockNumber
+smgr_nblocks_next(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index)
+{
+	SMGR_CHAIN_LOOKUP(smgr_nblocks);
+
+	return smgrsw[reln->smgr_chain.chain[chain_index]].smgr_nblocks(reln, forknum, chain_index);
 }
 
 /*
@@ -680,7 +816,7 @@ smgrnblocks(SMgrRelation reln, ForkNumber forknum)
 	if (result != InvalidBlockNumber)
 		return result;
 
-	result = smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum);
+	result = smgr_nblocks_next(reln, forknum, 0);
 
 	reln->smgr_cached_nblocks[forknum] = result;
 
@@ -708,6 +844,14 @@ smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum)
 	return InvalidBlockNumber;
 }
 
+void
+smgr_truncate_next(SMgrRelation reln, ForkNumber forknum, BlockNumber curnblk, BlockNumber nblocks, SmgrChainIndex chain_index)
+{
+	SMGR_CHAIN_LOOKUP(smgr_truncate);
+
+	smgrsw[reln->smgr_chain.chain[chain_index]].smgr_truncate(reln, forknum, curnblk, nblocks, chain_index);
+}
+
 /*
  * smgrtruncate() -- Truncate the given forks of supplied relation to
  *					 each specified numbers of blocks
@@ -752,8 +896,7 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks,
 		/* Make the cached size is invalid if we encounter an error. */
 		reln->smgr_cached_nblocks[forknum[i]] = InvalidBlockNumber;
 
-		smgrsw[reln->smgr_which].smgr_truncate(reln, forknum[i],
-											   old_nblocks[i], nblocks[i]);
+		smgr_truncate_next(reln, forknum[i], old_nblocks[i], nblocks[i], 0);
 
 		/*
 		 * We might as well update the local smgr_cached_nblocks values. The
@@ -766,6 +909,14 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks,
 	}
 }
 
+void
+smgr_registersync_next(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index)
+{
+	SMGR_CHAIN_LOOKUP(smgr_registersync);
+
+	smgrsw[reln->smgr_chain.chain[chain_index]].smgr_registersync(reln, forknum, chain_index);
+}
+
 /*
  * smgrregistersync() -- Request a relation to be sync'd at next checkpoint
  *
@@ -781,7 +932,7 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks,
 void
 smgrregistersync(SMgrRelation reln, ForkNumber forknum)
 {
-	smgrsw[reln->smgr_which].smgr_registersync(reln, forknum);
+	smgr_registersync_next(reln, forknum, 0);
 }
 
 /*
@@ -813,7 +964,7 @@ smgrregistersync(SMgrRelation reln, ForkNumber forknum)
 void
 smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
 {
-	smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
+	smgr_immedsync_next(reln, forknum, 0);
 }
 
 /*
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 947ffb40421..d523f306ab8 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -4071,6 +4071,8 @@ PostgresSingleUserMain(int argc, char *argv[],
 	 */
 	process_shared_preload_libraries();
 
+	process_smgr_chain();
+
 	/* Initialize MaxBackends */
 	InitializeMaxBackends();
 
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 1b3ce51cfce..32d99e1244a 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -56,6 +56,7 @@
 #include "utils/pidfile.h"
 #include "utils/syscache.h"
 #include "utils/varlena.h"
+#include "storage/smgr.h"
 
 
 #define DIRECTORY_LOCK_FILE		"postmaster.pid"
@@ -1834,6 +1835,8 @@ char	   *session_preload_libraries_string = NULL;
 char	   *shared_preload_libraries_string = NULL;
 char	   *local_preload_libraries_string = NULL;
 
+char	   *smgr_chain_string = NULL;
+
 /* Flag telling that we are loading shared_preload_libraries */
 bool		process_shared_preload_libraries_in_progress = false;
 bool		process_shared_preload_libraries_done = false;
@@ -1910,6 +1913,62 @@ process_shared_preload_libraries(void)
 	process_shared_preload_libraries_done = true;
 }
 
+void
+process_smgr_chain(void)
+{
+	char	   *rawstring;
+	List	   *elemlist;
+	ListCell   *l;
+	uint8		idx = 0;
+
+	if (smgr_chain_string == NULL || smgr_chain_string[0] == '\0')
+		return;					/* nothing to do */
+
+	/* Need a modifiable copy of string */
+	rawstring = pstrdup(smgr_chain_string);
+
+	/* Parse string into list of storage manager names */
+	if (!SplitIdentifierString(rawstring, ',', &elemlist))
+	{
+		/* syntax error in list */
+		pfree(rawstring);
+		ereport(LOG,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("invalid list syntax in parameter \"%s\"",
+						"smgr_chain")));
+		return;
+	}
+
+	foreach(l, elemlist)
+	{
+		char	   *smgrname = (char *) lfirst(l);
+		SMgrId		id = smgr_lookup(smgrname);
+
+		storage_manager_chain.chain[idx++] = id;
+
+		ereport(DEBUG1,
+				(errmsg_internal("using storage manager in chain \"%s\"", smgrname)));
+	}
+
+	for (int i = 0; i < idx; ++i)
+	{
+		int			chain_position = smgrsw[storage_manager_chain.chain[i]].chain_position;
+
+		if (i == idx - 1 && chain_position != SMGR_CHAIN_TAIL)
+			ereport(FATAL,
+					(errmsg_internal("smgr_chain: the last element must be a \"tail\" implementation, not a modifier")));
+
+		if (i != idx - 1 && chain_position != SMGR_CHAIN_MODIFIER)
+			ereport(FATAL,
+					(errmsg_internal("smgr_chain: element %d of %d (\"%s\") is not a modifier", i + 1, idx, smgrsw[storage_manager_chain.chain[i]].name)));
+	}
+
+	storage_manager_chain.size = idx;
+
+	list_free(elemlist);
+	pfree(rawstring);
+}
+
 /*
  * process any libraries that should be preloaded at backend start
  */
@@ -1932,7 +1991,9 @@ register_builtin_dynamic_managers(void)
 {
 	mdsmgr_register();
 
-	storage_manager_id = MdSMgrId;
+	/* set up a default chain containing only md, for tools */
+	storage_manager_chain.chain[0] = MdSMgrId;
+	storage_manager_chain.size = 1;
 }
 
 /*
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index ad25cbb39c5..ea43aadc96a 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -4408,6 +4408,17 @@ struct config_string ConfigureNamesString[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"smgr_chain", PGC_POSTMASTER, CLIENT_CONN_PRELOAD,
+			gettext_noop("Lists storage managers used by the server, in order."),
+			NULL,
+			GUC_LIST_INPUT | GUC_LIST_QUOTE | GUC_SUPERUSER_ONLY
+		},
+		&smgr_chain_string,
+		"md",
+		NULL, NULL, NULL
+	},
+
 	{
 		{"search_path", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Sets the schema search order for names that are not schema-qualified."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 2d1de9c37bd..84d1159c4d7 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -787,6 +787,7 @@ autovacuum_worker_slots = 16	# autovacuum worker slots to allocate
 #session_preload_libraries = ''
 #shared_preload_libraries = ''		# (change requires restart)
 #jit_provider = 'llvmjit'		# JIT library to use
+#smgr_chain = 'md'			# storage managers to use, in order (change requires restart)
 
 # - Other Defaults -
 
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index ff4ef578a1f..4e218941a4b 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -505,6 +505,7 @@ extern PGDLLIMPORT bool process_shmem_requests_in_progress;
 extern PGDLLIMPORT char *session_preload_libraries_string;
 extern PGDLLIMPORT char *shared_preload_libraries_string;
 extern PGDLLIMPORT char *local_preload_libraries_string;
+extern PGDLLIMPORT char *smgr_chain_string;
 
 extern void CreateDataDirLockFile(bool amPostmaster);
 extern void CreateSocketLockFile(const char *socketfile, bool amPostmaster,
@@ -515,6 +516,7 @@ extern bool RecheckDataDirLockFile(void);
 extern void ValidatePgVersion(const char *path);
 extern void register_builtin_dynamic_managers(void);
 extern void process_shared_preload_libraries(void);
+extern void process_smgr_chain(void);
 extern void process_session_preload_libraries(void);
 extern void process_shmem_requests(void);
 extern void pg_bindtextdomain(const char *domain);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 61c0e85dd74..5b4992c0855 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -23,34 +23,6 @@
 extern void mdsmgr_register(void);
 extern SMgrId MdSMgrId;
 
-/* md storage manager functionality */
-extern void mdinit(void);
-extern void mdopen(SMgrRelation reln);
-extern void mdclose(SMgrRelation reln, ForkNumber forknum);
-extern void mdcreate(RelFileLocator relold, SMgrRelation reln, ForkNumber forknum, bool isRedo);
-extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
-extern void mdunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo);
-extern void mdextend(SMgrRelation reln, ForkNumber forknum,
-					 BlockNumber blocknum, const void *buffer, bool skipFsync);
-extern void mdzeroextend(SMgrRelation reln, ForkNumber forknum,
-						 BlockNumber blocknum, int nblocks, bool skipFsync);
-extern bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
-					   BlockNumber blocknum, int nblocks);
-extern uint32 mdmaxcombine(SMgrRelation reln, ForkNumber forknum,
-						   BlockNumber blocknum);
-extern void mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-					void **buffers, BlockNumber nblocks);
-extern void mdwritev(SMgrRelation reln, ForkNumber forknum,
-					 BlockNumber blocknum,
-					 const void **buffers, BlockNumber nblocks, bool skipFsync);
-extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
-						BlockNumber blocknum, BlockNumber nblocks);
-extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
-extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
-					   BlockNumber old_blocks, BlockNumber nblocks);
-extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void mdregistersync(SMgrRelation reln, ForkNumber forknum);
-
 extern void ForgetDatabaseSyncRequests(Oid dbid);
 extern void DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo);
 
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 5b2b6de91c4..8f789cb7f80 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -20,7 +20,17 @@
 
 typedef uint8 SMgrId;
 
-extern PGDLLIMPORT SMgrId storage_manager_id;
+typedef uint8 SmgrChainIndex;
+
+#define MAX_SMGR_CHAIN 15
+
+typedef struct
+{
+	SMgrId		chain[MAX_SMGR_CHAIN];	/* storage manager selector */
+	uint8		size;
+} SMgrChain;
+
+extern PGDLLIMPORT SMgrChain storage_manager_chain;
 
 /*
  * smgr.c maintains a table of SMgrRelation objects, which are essentially
@@ -55,7 +65,7 @@ typedef struct SMgrRelationData
 	 * Fields below here are intended to be private to smgr.c and its
 	 * submodules.  Do not touch them from elsewhere.
 	 */
-	SMgrId		smgr_which;		/* storage manager selector */
+	SMgrChain	smgr_chain;		/* selected storage manager chain */
 
 	/*
 	 * Pinning support.  If unpinned (ie. pincount == 0), 'node' is a list
@@ -70,6 +80,9 @@ typedef SMgrRelationData *SMgrRelation;
 #define SmgrIsTemp(smgr) \
 	RelFileLocatorBackendIsTemp((smgr)->smgr_rlocator)
 
+#define SMGR_CHAIN_TAIL 1
+#define SMGR_CHAIN_MODIFIER 2
+
 /*
  * This struct of function pointers defines the API between smgr.c and
  * any individual storage manager module.  Note that smgr subfunctions are
@@ -83,40 +96,44 @@ typedef SMgrRelationData *SMgrRelation;
 typedef struct f_smgr
 {
 	const char *name;
+	int			chain_position;
 	void		(*smgr_init) (void);	/* may be NULL */
 	void		(*smgr_shutdown) (void);	/* may be NULL */
-	void		(*smgr_open) (SMgrRelation reln);
-	void		(*smgr_close) (SMgrRelation reln, ForkNumber forknum);
+	void		(*smgr_open) (SMgrRelation reln, SmgrChainIndex chain_index);
+	void		(*smgr_close) (SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index);
 	void		(*smgr_create) (RelFileLocator relold, SMgrRelation reln, ForkNumber forknum,
-								bool isRedo);
-	bool		(*smgr_exists) (SMgrRelation reln, ForkNumber forknum);
+								bool isRedo, SmgrChainIndex chain_index);
+	bool		(*smgr_exists) (SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index);
 	void		(*smgr_unlink) (RelFileLocatorBackend rlocator, ForkNumber forknum,
-								bool isRedo);
+								bool isRedo, SmgrChainIndex chain_index);
 	void		(*smgr_extend) (SMgrRelation reln, ForkNumber forknum,
-								BlockNumber blocknum, const void *buffer, bool skipFsync);
+								BlockNumber blocknum, const void *buffer, bool skipFsync, SmgrChainIndex chain_index);
 	void		(*smgr_zeroextend) (SMgrRelation reln, ForkNumber forknum,
-									BlockNumber blocknum, int nblocks, bool skipFsync);
+									BlockNumber blocknum, int nblocks, bool skipFsync, SmgrChainIndex chain_index);
 	bool		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
-								  BlockNumber blocknum, int nblocks);
+								  BlockNumber blocknum, int nblocks, SmgrChainIndex chain_index);
 	uint32		(*smgr_maxcombine) (SMgrRelation reln, ForkNumber forknum,
-									BlockNumber blocknum);
+									BlockNumber blocknum, SmgrChainIndex chain_index);
 	void		(*smgr_readv) (SMgrRelation reln, ForkNumber forknum,
 							   BlockNumber blocknum,
-							   void **buffers, BlockNumber nblocks);
+							   void **buffers, BlockNumber nblocks, SmgrChainIndex chain_index);
 	void		(*smgr_writev) (SMgrRelation reln, ForkNumber forknum,
 								BlockNumber blocknum,
 								const void **buffers, BlockNumber nblocks,
-								bool skipFsync);
+								bool skipFsync, SmgrChainIndex chain_index);
 	void		(*smgr_writeback) (SMgrRelation reln, ForkNumber forknum,
-								   BlockNumber blocknum, BlockNumber nblocks);
-	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
+								   BlockNumber blocknum, BlockNumber nblocks, SmgrChainIndex chain_index);
+	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index);
 	void		(*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
-								  BlockNumber old_blocks, BlockNumber nblocks);
-	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
-	void		(*smgr_registersync) (SMgrRelation reln, ForkNumber forknum);
+								  BlockNumber old_blocks, BlockNumber nblocks, SmgrChainIndex chain_index);
+	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index);
+	void		(*smgr_registersync) (SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index);
 } f_smgr;
 
 extern SMgrId smgr_register(const f_smgr *smgr, Size smgrrelation_size);
+extern SMgrId smgr_lookup(const char *name);
+
+extern f_smgr *smgrsw;
 
 extern void smgrinit(void);
 extern SMgrRelation smgropen(RelFileLocator rlocator, ProcNumber backend);
@@ -158,6 +175,46 @@ extern void smgrregistersync(SMgrRelation reln, ForkNumber forknum);
 extern void AtEOXact_SMgr(void);
 extern bool ProcessBarrierSmgrRelease(void);
 
+extern void
+			smgr_open_next(SMgrRelation reln, SmgrChainIndex chain_index);
+extern void
+			smgr_close_next(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index);
+extern bool
+			smgr_exists_next(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index);
+extern void
+			smgr_create_next(RelFileLocator relold, SMgrRelation reln, ForkNumber forknum, bool isRedo, SmgrChainIndex chain_index);
+extern void
+			smgr_immedsync_next(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index);
+extern void
+			smgr_unlink_next(SMgrRelation reln, RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo, SmgrChainIndex chain_index);
+extern void
+			smgr_extend_next(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+							 const void *buffer, bool skipFsync, SmgrChainIndex chain_index);
+extern void
+			smgr_zeroextend_next(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+								 int nblocks, bool skipFsync, SmgrChainIndex chain_index);
+extern bool
+			smgr_prefetch_next(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+							   int nblocks, SmgrChainIndex chain_index);
+extern uint32
+			smgr_maxcombine_next(SMgrRelation reln, ForkNumber forknum,
+								 BlockNumber blocknum, SmgrChainIndex chain_index);
+extern void
+			smgr_readv_next(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+							void **buffers, BlockNumber nblocks, SmgrChainIndex chain_index);
+extern void
+			smgr_writev_next(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+							 const void **buffers, BlockNumber nblocks, bool skipFsync, SmgrChainIndex chain_index);
+extern void
+			smgr_writeback_next(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+								BlockNumber nblocks, SmgrChainIndex chain_index);
+extern BlockNumber
+			smgr_nblocks_next(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index);
+extern void
+			smgr_truncate_next(SMgrRelation reln, ForkNumber forknum, BlockNumber curnblk, BlockNumber nblocks, SmgrChainIndex chain_index);
+extern void
+			smgr_registersync_next(SMgrRelation reln, ForkNumber forknum, SmgrChainIndex chain_index);
+
 static inline void
 smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		 void *buffer)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index bc260e713ae..b1b485d5445 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2553,6 +2553,7 @@ SID_IDENTIFIER_AUTHORITY
 SID_NAME_USE
 SISeg
 SIZE_T
+SMgrChain
 SMgrRelation
 SMgrRelationData
 SMgrSortArray
-- 
2.47.2
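
For readers skimming the smgr.c part of the patch above: every public smgr function now enters the chain at index 0 through its `*_next` wrapper, and each callback receives its own `chain_index`. The following is a minimal standalone sketch of that dispatch pattern, not the patch's code. All `demo_*`/`Demo*` names are hypothetical, and the "skip entries whose callback is NULL" step is only an assumption about what `SMGR_CHAIN_LOOKUP` does, since the macro is not shown in this part of the diff.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for SMgrChain / f_smgr. */
#define DEMO_MAX_CHAIN 15

typedef uint8_t DemoChainIndex;

typedef struct DemoChain
{
	uint8_t		chain[DEMO_MAX_CHAIN];	/* indexes into demo_smgrsw */
	uint8_t		size;
} DemoChain;

typedef uint32_t (*demo_nblocks_fn) (const DemoChain *chain,
									 DemoChainIndex chain_index);

typedef struct DemoManager
{
	const char *name;
	demo_nblocks_fn nblocks;	/* NULL: this manager does not intercept */
} DemoManager;

/* Counts how often the modifier below was entered (for illustration). */
static int	demo_modifier_calls = 0;

static uint32_t demo_nblocks_next(const DemoChain *chain,
								  DemoChainIndex chain_index);

/* Tail implementation: answers the request itself. */
static uint32_t
tail_nblocks(const DemoChain *chain, DemoChainIndex chain_index)
{
	(void) chain;
	(void) chain_index;
	return 42;
}

/* Modifier: does its own work, then forwards to the next chain entry. */
static uint32_t
counting_nblocks(const DemoChain *chain, DemoChainIndex chain_index)
{
	demo_modifier_calls++;
	return demo_nblocks_next(chain, (DemoChainIndex) (chain_index + 1));
}

static const DemoManager demo_smgrsw[] = {
	{"tail", tail_nblocks},
	{"counting_modifier", counting_nblocks},
	{"passthrough_modifier", NULL},
};

static uint32_t
demo_nblocks_next(const DemoChain *chain, DemoChainIndex chain_index)
{
	/* Assumed lookup step: skip entries that leave this callback NULL. */
	while (chain_index < chain->size &&
		   demo_smgrsw[chain->chain[chain_index]].nblocks == NULL)
		chain_index++;

	/* Nothing left in the chain; real code would presumably raise an error. */
	if (chain_index >= chain->size)
		return UINT32_MAX;

	return demo_smgrsw[chain->chain[chain_index]].nblocks(chain, chain_index);
}

/* Public entry point: always starts at the head of the chain. */
uint32_t
demo_nblocks(const DemoChain *chain)
{
	return demo_nblocks_next(chain, 0);
}
```

With a chain of `{passthrough_modifier, counting_modifier, tail}`, a call to `demo_nblocks()` skips the first entry, enters the counting modifier once, and returns the tail's answer. Passing the index explicitly lets a modifier forward a call without knowing which manager follows it.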

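The ordering rule that `process_smgr_chain()` enforces in the patch above (every element except the last must be a modifier, the last must be a tail) can be written as a small pure function. This is a hedged sketch with hypothetical `demo_*`/`DEMO_*` names, not code from the patch. It also adds a length check against the chain capacity, which the posted loop does not do before writing `chain[idx++]`, and it reports an empty chain as an error, whereas the patch simply returns early and keeps the default `md` chain.

```c
#include <stddef.h>

/* Hypothetical mirrors of MAX_SMGR_CHAIN and the SMGR_CHAIN_* constants. */
#define DEMO_MAX_CHAIN 15
#define DEMO_CHAIN_TAIL 1
#define DEMO_CHAIN_MODIFIER 2

typedef enum DemoChainStatus
{
	DEMO_CHAIN_OK = 0,
	DEMO_CHAIN_EMPTY,
	DEMO_CHAIN_TOO_LONG,
	DEMO_CHAIN_BAD_TAIL,
	DEMO_CHAIN_BAD_MODIFIER
} DemoChainStatus;

/*
 * positions[i] is the chain_position of the i-th configured manager.
 * Returns DEMO_CHAIN_OK, or a status describing the first problem found.
 */
DemoChainStatus
demo_validate_chain(const int *positions, size_t n)
{
	if (n == 0)
		return DEMO_CHAIN_EMPTY;

	/* Extra guard, not in the posted patch: never exceed the array size. */
	if (n > DEMO_MAX_CHAIN)
		return DEMO_CHAIN_TOO_LONG;

	for (size_t i = 0; i < n; i++)
	{
		if (i == n - 1)
		{
			/* The last element must terminate the chain. */
			if (positions[i] != DEMO_CHAIN_TAIL)
				return DEMO_CHAIN_BAD_TAIL;
		}
		else if (positions[i] != DEMO_CHAIN_MODIFIER)
		{
			/* Everything before the last element must forward calls. */
			return DEMO_CHAIN_BAD_MODIFIER;
		}
	}

	return DEMO_CHAIN_OK;
}
```

Returning a status code instead of calling `ereport(FATAL, ...)` keeps the rule testable outside a backend; the caller would still decide how to report each failure.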
Attachment: v4-0005-Refactor-smgr-API-mdcreate-needs-the-old-relfilel.patch (text/x-patch; charset=UTF-8)
From 96523c547e90de64b116acf470c07307541795b1 Mon Sep 17 00:00:00 2001
From: Zsolt Parragi <zsolt.parragi@cancellar.hu>
Date: Sat, 12 Oct 2024 22:01:28 +0100
Subject: [PATCH v4 5/6] Refactor smgr API: mdcreate needs the old
 relfilelocator

With this change, mdcreate receives the old relfilelocator along
with the new one for operations that create a new file for an
existing relation.

This is required for tde_heap in pg_tde.
---
 src/backend/access/heap/heapam_handler.c | 10 ++++++----
 src/backend/access/transam/xlogutils.c   |  2 +-
 src/backend/catalog/heap.c               |  2 +-
 src/backend/catalog/index.c              |  2 +-
 src/backend/catalog/storage.c            |  8 ++++----
 src/backend/commands/sequence.c          |  2 +-
 src/backend/commands/tablecmds.c         |  4 ++--
 src/backend/storage/buffer/bufmgr.c      |  7 ++++---
 src/backend/storage/smgr/md.c            |  2 +-
 src/backend/storage/smgr/smgr.c          |  4 ++--
 src/backend/utils/cache/relcache.c       |  2 +-
 src/include/catalog/storage.h            |  3 ++-
 src/include/storage/md.h                 |  2 +-
 src/include/storage/smgr.h               |  4 ++--
 14 files changed, 29 insertions(+), 25 deletions(-)
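
In the diff below, call sites that have no previous storage (for example heap_create and the redo paths) pass the same locator as both the old and the new one, while rewrite and copy paths pass two different locators. One thing a storage manager could do with the extra argument is tell these cases apart. The sketch below only illustrates that idea: all `demo_*`/`Demo*` names are hypothetical, `DemoLocator` is a simplified stand-in for RelFileLocator, and nothing here is taken from pg_tde.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for RelFileLocator: tablespace, database, relfilenumber. */
typedef struct DemoLocator
{
	uint32_t	spcOid;
	uint32_t	dbOid;
	uint32_t	relNumber;
} DemoLocator;

typedef enum DemoCreateKind
{
	DEMO_CREATE_FRESH,			/* no earlier storage for this relation */
	DEMO_CREATE_REWRITE			/* new file replaces or copies an existing one */
} DemoCreateKind;

static bool
demo_locator_equals(DemoLocator a, DemoLocator b)
{
	return a.spcOid == b.spcOid &&
		a.dbOid == b.dbOid &&
		a.relNumber == b.relNumber;
}

/*
 * Classify a create request from the (old, new) locator pair, following
 * the convention that "old == new" means there is no previous file.
 */
DemoCreateKind
demo_classify_create(DemoLocator relold, DemoLocator relnew)
{
	return demo_locator_equals(relold, relnew) ?
		DEMO_CREATE_FRESH : DEMO_CREATE_REWRITE;
}
```

A create callback could then branch on the result, for instance to look up per-relation state under the old locator before the new file is written.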

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index e78682c3cef..96463d1bb14 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -584,6 +584,8 @@ heapam_relation_set_new_filelocator(Relation rel,
 {
 	SMgrRelation srel;
 
+	RelFileLocator oldlocator = rel->rd_locator;
+
 	/*
 	 * Initialize to the minimum XID that could put tuples in the table. We
 	 * know that no xacts older than RecentXmin are still running, so that
@@ -601,7 +603,7 @@ heapam_relation_set_new_filelocator(Relation rel,
 	 */
 	*minmulti = GetOldestMultiXactId();
 
-	srel = RelationCreateStorage(*newrlocator, persistence, true);
+	srel = RelationCreateStorage(oldlocator, *newrlocator, persistence, true);
 
 	/*
 	 * If required, set up an init fork for an unlogged table so that it can
@@ -611,7 +613,7 @@ heapam_relation_set_new_filelocator(Relation rel,
 	{
 		Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
 			   rel->rd_rel->relkind == RELKIND_TOASTVALUE);
-		smgrcreate(srel, INIT_FORKNUM, false);
+		smgrcreate(oldlocator, srel, INIT_FORKNUM, false);
 		log_smgrcreate(newrlocator, INIT_FORKNUM);
 	}
 
@@ -644,7 +646,7 @@ heapam_relation_copy_data(Relation rel, const RelFileLocator *newrlocator)
 	 * NOTE: any conflict in relfilenumber value will be caught in
 	 * RelationCreateStorage().
 	 */
-	dstrel = RelationCreateStorage(*newrlocator, rel->rd_rel->relpersistence, true);
+	dstrel = RelationCreateStorage(rel->rd_locator, *newrlocator, rel->rd_rel->relpersistence, true);
 
 	/* copy main fork */
 	RelationCopyStorage(RelationGetSmgr(rel), dstrel, MAIN_FORKNUM,
@@ -656,7 +658,7 @@ heapam_relation_copy_data(Relation rel, const RelFileLocator *newrlocator)
 	{
 		if (smgrexists(RelationGetSmgr(rel), forkNum))
 		{
-			smgrcreate(dstrel, forkNum, false);
+			smgrcreate(rel->rd_locator, dstrel, forkNum, false);
 
 			/*
 			 * WAL log creation if the relation is persistent, or this is the
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index c389b27f77d..2179d2f73fa 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -487,7 +487,7 @@ XLogReadBufferExtended(RelFileLocator rlocator, ForkNumber forknum,
 	 * filesystem loses an inode during a crash.  Better to write the data
 	 * until we are actually told to delete the file.)
 	 */
-	smgrcreate(smgr, forknum, true);
+	smgrcreate(rlocator, smgr, forknum, true);
 
 	lastblock = smgrnblocks(smgr, forknum);
 
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index bd3554c0bfd..251d22f50b2 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -386,7 +386,7 @@ heap_create(const char *relname,
 											   relpersistence,
 											   relfrozenxid, relminmxid);
 		else if (RELKIND_HAS_STORAGE(rel->rd_rel->relkind))
-			RelationCreateStorage(rel->rd_locator, relpersistence, true);
+			RelationCreateStorage(rel->rd_locator, rel->rd_locator, relpersistence, true);
 		else
 			Assert(false);
 	}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 022b9b99b13..cc99d45f2ff 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3089,7 +3089,7 @@ index_build(Relation heapRelation,
 	if (indexRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
 		!smgrexists(RelationGetSmgr(indexRelation), INIT_FORKNUM))
 	{
-		smgrcreate(RelationGetSmgr(indexRelation), INIT_FORKNUM, false);
+		smgrcreate(indexRelation->rd_locator, RelationGetSmgr(indexRelation), INIT_FORKNUM, false);
 		log_smgrcreate(&indexRelation->rd_locator, INIT_FORKNUM);
 		indexRelation->rd_indam->ambuildempty(indexRelation);
 	}
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 624ed41bbf3..59fa01decc5 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -118,7 +118,7 @@ AddPendingSync(const RelFileLocator *rlocator)
  * pass register_delete = false.
  */
 SMgrRelation
-RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
+RelationCreateStorage(RelFileLocator oldlocator, RelFileLocator rlocator, char relpersistence,
 					  bool register_delete)
 {
 	SMgrRelation srel;
@@ -147,7 +147,7 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
 	}
 
 	srel = smgropen(rlocator, procNumber);
-	smgrcreate(srel, MAIN_FORKNUM, false);
+	smgrcreate(oldlocator, srel, MAIN_FORKNUM, false);
 
 	if (needs_wal)
 		log_smgrcreate(&srel->smgr_rlocator.locator, MAIN_FORKNUM);
@@ -976,7 +976,7 @@ smgr_redo(XLogReaderState *record)
 		SMgrRelation reln;
 
 		reln = smgropen(xlrec->rlocator, INVALID_PROC_NUMBER);
-		smgrcreate(reln, xlrec->forkNum, true);
+		smgrcreate(xlrec->rlocator, reln, xlrec->forkNum, true);
 	}
 	else if (info == XLOG_SMGR_TRUNCATE)
 	{
@@ -997,7 +997,7 @@ smgr_redo(XLogReaderState *record)
 		 * XLogReadBufferForRedo, we prefer to recreate the rel and replay the
 		 * log as best we can until the drop is seen.
 		 */
-		smgrcreate(reln, MAIN_FORKNUM, true);
+		smgrcreate(xlrec->rlocator, reln, MAIN_FORKNUM, true);
 
 		/*
 		 * Before we perform the truncation, update minimum recovery point to
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 4b7c5113aab..d8c560a11b2 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -344,7 +344,7 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
 		SMgrRelation srel;
 
 		srel = smgropen(rel->rd_locator, INVALID_PROC_NUMBER);
-		smgrcreate(srel, INIT_FORKNUM, false);
+		smgrcreate(rel->rd_locator, srel, INIT_FORKNUM, false);
 		log_smgrcreate(&rel->rd_locator, INIT_FORKNUM);
 		fill_seq_fork_with_data(rel, tuple, INIT_FORKNUM);
 		FlushRelationBuffers(rel);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 59156a1c1f6..7d0d9d3efa9 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -16313,7 +16313,7 @@ index_copy_data(Relation rel, RelFileLocator newrlocator)
 	 * NOTE: any conflict in relfilenumber value will be caught in
 	 * RelationCreateStorage().
 	 */
-	dstrel = RelationCreateStorage(newrlocator, rel->rd_rel->relpersistence, true);
+	dstrel = RelationCreateStorage(rel->rd_locator, newrlocator, rel->rd_rel->relpersistence, true);
 
 	/* copy main fork */
 	RelationCopyStorage(RelationGetSmgr(rel), dstrel, MAIN_FORKNUM,
@@ -16325,7 +16325,7 @@ index_copy_data(Relation rel, RelFileLocator newrlocator)
 	{
 		if (smgrexists(RelationGetSmgr(rel), forkNum))
 		{
-			smgrcreate(dstrel, forkNum, false);
+			smgrcreate(rel->rd_locator, dstrel, forkNum, false);
 
 			/*
 			 * WAL log creation if the relation is persistent, or this is the
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7915ed624c1..ecacb5fb50a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -943,7 +943,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 
 		/* recheck, fork might have been created concurrently */
 		if (!smgrexists(bmr.smgr, fork))
-			smgrcreate(bmr.smgr, fork, flags & EB_PERFORMING_RECOVERY);
+			smgrcreate(bmr.rel->rd_locator, bmr.smgr, fork, flags & EB_PERFORMING_RECOVERY);
 
 		UnlockRelationForExtension(bmr.rel, ExclusiveLock);
 	}
@@ -4754,7 +4754,7 @@ CreateAndCopyRelationData(RelFileLocator src_rlocator,
 	 * directory.  Therefore, each individual relation doesn't need to be
 	 * registered for cleanup.
 	 */
-	RelationCreateStorage(dst_rlocator, relpersistence, false);
+	RelationCreateStorage(src_rlocator, dst_rlocator, relpersistence, false);
 
 	/* copy main fork. */
 	RelationCopyStorageUsingBuffer(src_rlocator, dst_rlocator, MAIN_FORKNUM,
@@ -4766,7 +4766,8 @@ CreateAndCopyRelationData(RelFileLocator src_rlocator,
 	{
 		if (smgrexists(src_rel, forkNum))
 		{
-			smgrcreate(dst_rel, forkNum, false);
+			/* TODO: confirm that src_rel's locator is the right "old" locator here */
+			smgrcreate(src_rel->smgr_rlocator.locator, dst_rel, forkNum, false);
 
 			/*
 			 * WAL log creation if the relation is persistent, or this is the
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 5a2072e0816..1766bbe1e57 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -245,7 +245,7 @@ mdexists(SMgrRelation reln, ForkNumber forknum)
  * If isRedo is true, it's okay for the relation to exist already.
  */
 void
-mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+mdcreate(RelFileLocator relold, SMgrRelation reln, ForkNumber forknum, bool isRedo)
 {
 	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	MdfdVec    *mdfd;
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 9b3e63aff55..0498fd6c317 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -408,9 +408,9 @@ smgrexists(SMgrRelation reln, ForkNumber forknum)
  * to be created.
  */
 void
-smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+smgrcreate(RelFileLocator relold, SMgrRelation reln, ForkNumber forknum, bool isRedo)
 {
-	smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo);
+	smgrsw[reln->smgr_which].smgr_create(relold, reln, forknum, isRedo);
 }
 
 /*
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index d1ae761b3f6..db3e241404d 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -3829,7 +3829,7 @@ RelationSetNewRelfilenumber(Relation relation, char persistence)
 		/* handle these directly, at least for now */
 		SMgrRelation srel;
 
-		srel = RelationCreateStorage(newrlocator, persistence, true);
+		srel = RelationCreateStorage(relation->rd_locator, newrlocator, persistence, true);
 		smgrclose(srel);
 	}
 	else
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index ba99225b0a3..ecc3b792f4f 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -22,7 +22,8 @@
 /* GUC variables */
 extern PGDLLIMPORT int wal_skip_threshold;
 
-extern SMgrRelation RelationCreateStorage(RelFileLocator rlocator,
+extern SMgrRelation RelationCreateStorage(RelFileLocator oldlocator,
+										  RelFileLocator rlocator,
 										  char relpersistence,
 										  bool register_delete);
 extern void RelationDropStorage(Relation rel);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index da1d1d339be..61c0e85dd74 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -27,7 +27,7 @@ extern SMgrId MdSMgrId;
 extern void mdinit(void);
 extern void mdopen(SMgrRelation reln);
 extern void mdclose(SMgrRelation reln, ForkNumber forknum);
-extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void mdcreate(RelFileLocator relold, SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
 extern void mdunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo);
 extern void mdextend(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 629c78cfdde..5b2b6de91c4 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -87,7 +87,7 @@ typedef struct f_smgr
 	void		(*smgr_shutdown) (void);	/* may be NULL */
 	void		(*smgr_open) (SMgrRelation reln);
 	void		(*smgr_close) (SMgrRelation reln, ForkNumber forknum);
-	void		(*smgr_create) (SMgrRelation reln, ForkNumber forknum,
+	void		(*smgr_create) (RelFileLocator relold, SMgrRelation reln, ForkNumber forknum,
 								bool isRedo);
 	bool		(*smgr_exists) (SMgrRelation reln, ForkNumber forknum);
 	void		(*smgr_unlink) (RelFileLocatorBackend rlocator, ForkNumber forknum,
@@ -128,7 +128,7 @@ extern void smgrdestroyall(void);
 extern void smgrrelease(SMgrRelation reln);
 extern void smgrreleaseall(void);
 extern void smgrreleaserellocator(RelFileLocatorBackend rlocator);
-extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrcreate(RelFileLocator relold, SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdosyncall(SMgrRelation *rels, int nrels);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
 extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
-- 
2.47.2

Attachment: v4-0004-Add-contrib-fsync_checker.patch (text/x-patch; charset=UTF-8)
From 608fc0377f7823d4f21ed8af60586f0d7c301cd0 Mon Sep 17 00:00:00 2001
From: Tristan Partin <tristan@neon.tech>
Date: Wed, 20 Sep 2023 14:23:38 -0500
Subject: [PATCH v4 4/6] Add contrib/fsync_checker

fsync_checker is an extension which overrides the global storage manager
to check for volatile relations, those which have been written but not
synced to disk.
---
 contrib/Makefile                            |   1 +
 contrib/fsync_checker/fsync_checker.control |   5 +
 contrib/fsync_checker/fsync_checker_smgr.c  | 249 ++++++++++++++++++++
 contrib/fsync_checker/meson.build           |  22 ++
 contrib/meson.build                         |   1 +
 src/tools/pgindent/typedefs.list            |   2 +
 6 files changed, 280 insertions(+)
 create mode 100644 contrib/fsync_checker/fsync_checker.control
 create mode 100644 contrib/fsync_checker/fsync_checker_smgr.c
 create mode 100644 contrib/fsync_checker/meson.build

diff --git a/contrib/Makefile b/contrib/Makefile
index 952855d9b61..1c9f22b1c86 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
 		dict_int	\
 		dict_xsyn	\
 		earthdistance	\
+		fsync_checker	\
 		file_fdw	\
 		fuzzystrmatch	\
 		hstore		\
diff --git a/contrib/fsync_checker/fsync_checker.control b/contrib/fsync_checker/fsync_checker.control
new file mode 100644
index 00000000000..7d0e36434bf
--- /dev/null
+++ b/contrib/fsync_checker/fsync_checker.control
@@ -0,0 +1,5 @@
+# fsync_checker extension
+comment = 'SMGR extension for checking volatile writes'
+default_version = '1.0'
+module_pathname = '$libdir/fsync_checker'
+relocatable = true
diff --git a/contrib/fsync_checker/fsync_checker_smgr.c b/contrib/fsync_checker/fsync_checker_smgr.c
new file mode 100644
index 00000000000..97ad0f78da8
--- /dev/null
+++ b/contrib/fsync_checker/fsync_checker_smgr.c
@@ -0,0 +1,249 @@
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "storage/md.h"
+#include "utils/hsearch.h"
+
+PG_MODULE_MAGIC;
+
+typedef struct
+{
+	RelFileLocator locator;
+	ForkNumber	forknum;
+} VolatileRelnKey;
+
+typedef struct
+{
+	VolatileRelnKey key;
+	XLogRecPtr	lsn;
+} VolatileRelnEntry;
+
+void		_PG_init(void);
+
+static void fsync_checker_extend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+								 const void *buffer, bool skipFsync);
+static void fsync_checker_immedsync(SMgrRelation reln, ForkNumber forknum);
+static void fsync_checker_writev(SMgrRelation reln, ForkNumber forknum,
+								 BlockNumber blocknum, const void **buffers,
+								 BlockNumber nblocks, bool skipFsync);
+static void fsync_checker_writeback(SMgrRelation reln, ForkNumber forknum,
+									BlockNumber blocknum, BlockNumber nblocks);
+static void fsync_checker_zeroextend(SMgrRelation reln, ForkNumber forknum,
+									 BlockNumber blocknum, int nblocks, bool skipFsync);
+
+static void fsync_checker_checkpoint_create(const CheckPoint *checkPoint);
+static void fsync_checker_shmem_request(void);
+static void fsync_checker_shmem_startup(void);
+
+static void add_reln(SMgrRelation reln, ForkNumber forknum);
+static void remove_reln(SMgrRelation reln, ForkNumber forknum);
+
+static SMgrId fsync_checker_smgr_id;
+static const struct f_smgr fsync_checker_smgr = {
+	.name = "fsync_checker",
+	.smgr_init = mdinit,
+	.smgr_shutdown = NULL,
+	.smgr_open = mdopen,
+	.smgr_close = mdclose,
+	.smgr_create = mdcreate,
+	.smgr_exists = mdexists,
+	.smgr_unlink = mdunlink,
+	.smgr_extend = fsync_checker_extend,
+	.smgr_zeroextend = fsync_checker_zeroextend,
+	.smgr_prefetch = mdprefetch,
+	.smgr_maxcombine = mdmaxcombine,
+	.smgr_readv = mdreadv,
+	.smgr_writev = fsync_checker_writev,
+	.smgr_writeback = fsync_checker_writeback,
+	.smgr_nblocks = mdnblocks,
+	.smgr_truncate = mdtruncate,
+	.smgr_immedsync = fsync_checker_immedsync,
+	.smgr_registersync = mdregistersync,
+};
+
+static HTAB *volatile_relns;
+static LWLock *volatile_relns_lock;
+static shmem_request_hook_type prev_shmem_request_hook;
+static shmem_startup_hook_type prev_shmem_startup_hook;
+static checkpoint_create_hook_type prev_checkpoint_create_hook;
+
+void
+_PG_init(void)
+{
+	prev_checkpoint_create_hook = checkpoint_create_hook;
+	checkpoint_create_hook = fsync_checker_checkpoint_create;
+
+	prev_shmem_request_hook = shmem_request_hook;
+	shmem_request_hook = fsync_checker_shmem_request;
+
+	prev_shmem_startup_hook = shmem_startup_hook;
+	shmem_startup_hook = fsync_checker_shmem_startup;
+
+	/*
+	 * A relation size of 0 means we can just defer to md.  It would be nice
+	 * to expose md's relation struct, though, so that an extension needing
+	 * its own relation data could use MdSMgrRelation as the parent.
+	 */
+	fsync_checker_smgr_id = smgr_register(&fsync_checker_smgr, 0);
+
+	storage_manager_id = fsync_checker_smgr_id;
+}
+
+static void
+fsync_checker_checkpoint_create(const CheckPoint *checkPoint)
+{
+	long		num_entries;
+	HASH_SEQ_STATUS status;
+	VolatileRelnEntry *entry;
+
+	if (prev_checkpoint_create_hook)
+		prev_checkpoint_create_hook(checkPoint);
+
+	LWLockAcquire(volatile_relns_lock, LW_EXCLUSIVE);
+
+	hash_seq_init(&status, volatile_relns);
+
+	num_entries = hash_get_num_entries(volatile_relns);
+	elog(INFO, "Analyzing %ld volatile relations", num_entries);
+	while ((entry = hash_seq_search(&status)))
+	{
+		if (entry->lsn < checkPoint->redo)
+		{
+			RelPathStr	path;
+
+			path = relpathperm(entry->key.locator, entry->key.forknum);
+
+			elog(WARNING, "Relation not previously synced: %s", path.str);
+		}
+	}
+
+	LWLockRelease(volatile_relns_lock);
+}
+
+static void
+fsync_checker_shmem_request(void)
+{
+	if (prev_shmem_request_hook)
+		prev_shmem_request_hook();
+
+	RequestAddinShmemSpace(hash_estimate_size(1024, sizeof(VolatileRelnEntry)));
+	RequestNamedLWLockTranche("fsync_checker volatile relns lock", 1);
+}
+
+static void
+fsync_checker_shmem_startup(void)
+{
+	HASHCTL		ctl;
+
+	if (prev_shmem_startup_hook)
+		prev_shmem_startup_hook();
+
+	ctl.keysize = sizeof(VolatileRelnKey);
+	ctl.entrysize = sizeof(VolatileRelnEntry);
+	volatile_relns = NULL;
+	volatile_relns_lock = NULL;
+
+	/*
+	 * Create or attach to the shared memory state, including hash table
+	 */
+	LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
+
+	volatile_relns = ShmemInitHash("fsync_checker volatile relns",
+								   1024, 1024, &ctl, HASH_BLOBS | HASH_ELEM);
+	volatile_relns_lock = &GetNamedLWLockTranche("fsync_checker volatile relns lock")->lock;
+
+	LWLockRelease(AddinShmemInitLock);
+}
+
+static void
+add_reln(SMgrRelation reln, ForkNumber forknum)
+{
+	bool		found;
+	XLogRecPtr	lsn;
+	VolatileRelnKey key;
+	VolatileRelnEntry *entry;
+
+	key.locator = reln->smgr_rlocator.locator;
+	key.forknum = forknum;
+
+	lsn = GetXLogWriteRecPtr();
+
+	LWLockAcquire(volatile_relns_lock, LW_EXCLUSIVE);
+
+	entry = hash_search(volatile_relns, &key, HASH_ENTER, &found);
+	if (!found)
+		entry->lsn = lsn;
+
+	LWLockRelease(volatile_relns_lock);
+}
+
+static void
+remove_reln(SMgrRelation reln, ForkNumber forknum)
+{
+	VolatileRelnKey key;
+
+	key.locator = reln->smgr_rlocator.locator;
+	key.forknum = forknum;
+
+	LWLockAcquire(volatile_relns_lock, LW_EXCLUSIVE);
+
+	hash_search(volatile_relns, &key, HASH_REMOVE, NULL);
+
+	LWLockRelease(volatile_relns_lock);
+}
+
+static void
+fsync_checker_extend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+					 const void *buffer, bool skipFsync)
+{
+	if (!SmgrIsTemp(reln) && !skipFsync)
+		add_reln(reln, forknum);
+
+	mdextend(reln, forknum, blocknum, buffer, skipFsync);
+}
+
+static void
+fsync_checker_immedsync(SMgrRelation reln, ForkNumber forknum)
+{
+	if (!SmgrIsTemp(reln))
+		remove_reln(reln, forknum);
+
+	mdimmedsync(reln, forknum);
+}
+
+static void
+fsync_checker_writev(SMgrRelation reln, ForkNumber forknum,
+					 BlockNumber blocknum, const void **buffers,
+					 BlockNumber nblocks, bool skipFsync)
+{
+	if (!SmgrIsTemp(reln) && !skipFsync)
+		add_reln(reln, forknum);
+
+	mdwritev(reln, forknum, blocknum, buffers, nblocks, skipFsync);
+}
+
+static void
+fsync_checker_writeback(SMgrRelation reln, ForkNumber forknum,
+						BlockNumber blocknum, BlockNumber nblocks)
+{
+	if (!SmgrIsTemp(reln))
+		remove_reln(reln, forknum);
+
+	mdwriteback(reln, forknum, blocknum, nblocks);
+}
+
+static void
+fsync_checker_zeroextend(SMgrRelation reln, ForkNumber forknum,
+						 BlockNumber blocknum, int nblocks, bool skipFsync)
+{
+	if (!SmgrIsTemp(reln) && !skipFsync)
+		add_reln(reln, forknum);
+
+	mdzeroextend(reln, forknum, blocknum, nblocks, skipFsync);
+}
diff --git a/contrib/fsync_checker/meson.build b/contrib/fsync_checker/meson.build
new file mode 100644
index 00000000000..ce6ed7fe90b
--- /dev/null
+++ b/contrib/fsync_checker/meson.build
@@ -0,0 +1,22 @@
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+fsync_checker_sources = files(
+  'fsync_checker_smgr.c',
+)
+
+if host_system == 'windows'
+  fsync_checker_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'fsync_checker',
+    '--FILEDESC', 'fsync_checker - SMGR extension for checking volatile relations',])
+endif
+
+fsync_checker = shared_module('fsync_checker',
+  fsync_checker_sources,
+  kwargs: contrib_mod_args,
+)
+contrib_targets += fsync_checker
+
+install_data(
+  'fsync_checker.control',
+  kwargs: contrib_data_args,
+)
diff --git a/contrib/meson.build b/contrib/meson.build
index 1ba73ebd67a..c48fb138751 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -28,6 +28,7 @@ subdir('dict_int')
 subdir('dict_xsyn')
 subdir('earthdistance')
 subdir('file_fdw')
+subdir('fsync_checker')
 subdir('fuzzystrmatch')
 subdir('hstore')
 subdir('hstore_plperl')
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 4b971b81ae5..bc260e713ae 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3145,6 +3145,8 @@ ViewStmt
 VirtualTransactionId
 VirtualTupleTableSlot
 VolatileFunctionStatus
+VolatileRelnEntry
+VolatileRelnKey
 Vsrt
 WAIT_ORDER
 WALAvailability
-- 
2.47.2
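
For readers skimming the patch above, the bookkeeping in fsync_checker can be modelled outside the server. The block below is a minimal standalone sketch, not the extension's code: it swaps the shared-memory hash table and LWLock for a small fixed array, uses plain integers in place of RelFileLocator and XLogRecPtr, and every name in it (TrackedFork, track_write, track_sync, count_unsynced_before) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TRACKED 16

/* Hypothetical stand-in for VolatileRelnEntry; integers replace the real key. */
typedef struct
{
	bool		used;
	uint32_t	relnumber;		/* stands in for RelFileLocator */
	int			forknum;
	uint64_t	lsn;			/* WAL position at the first unsynced write */
} TrackedFork;

static TrackedFork tracked[MAX_TRACKED];

static TrackedFork *
find_fork(uint32_t relnumber, int forknum)
{
	for (size_t i = 0; i < MAX_TRACKED; i++)
		if (tracked[i].used &&
			tracked[i].relnumber == relnumber &&
			tracked[i].forknum == forknum)
			return &tracked[i];
	return NULL;
}

/* Like add_reln(): only the first write records an LSN. */
static bool
track_write(uint32_t relnumber, int forknum, uint64_t lsn)
{
	assert(forknum >= 0);

	if (find_fork(relnumber, forknum) != NULL)
		return true;

	for (size_t i = 0; i < MAX_TRACKED; i++)
	{
		if (!tracked[i].used)
		{
			tracked[i] = (TrackedFork) {true, relnumber, forknum, lsn};
			return true;
		}
	}
	return false;				/* table full */
}

/* Like remove_reln(): a sync clears the entry. */
static void
track_sync(uint32_t relnumber, int forknum)
{
	TrackedFork *entry = find_fork(relnumber, forknum);

	if (entry != NULL)
		entry->used = false;
}

/* Like the checkpoint hook: count forks written before redo and not yet synced. */
static int
count_unsynced_before(uint64_t redo)
{
	int			n = 0;

	for (size_t i = 0; i < MAX_TRACKED; i++)
		if (tracked[i].used && tracked[i].lsn < redo)
			n++;
	return n;
}
```

In this model a fork written at LSN 100 and never synced is reported by a checkpoint whose redo pointer is 150, and stops being reported once track_sync() runs for it. The real extension emits a WARNING per such fork rather than returning a count.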

Attachment: v4-0003-Add-checkpoint_create_hook.patch (text/x-patch; charset=UTF-8)
From 01d76e69d1e0b2018f72160ef8873c4997f42030 Mon Sep 17 00:00:00 2001
From: Tristan Partin <tristan@neon.tech>
Date: Fri, 13 Oct 2023 13:57:18 -0500
Subject: [PATCH v4 3/6] Add checkpoint_create_hook

Allows an extension to hook into CreateCheckPoint().
---
 src/backend/access/transam/xlog.c | 5 +++++
 src/include/access/xlog.h         | 4 ++++
 2 files changed, 9 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 799fc739e18..cdd6eff2fd5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -208,6 +208,8 @@ const struct config_enum_entry archive_mode_options[] = {
  */
 CheckpointStatsData CheckpointStats;
 
+checkpoint_create_hook_type checkpoint_create_hook = NULL;
+
 /*
  * During recovery, lastFullPageWrites keeps track of full_page_writes that
  * the replayed WAL records indicate. It's initialized with full_page_writes
@@ -7173,6 +7175,9 @@ CreateCheckPoint(int flags)
 	 */
 	END_CRIT_SECTION();
 
+	if (checkpoint_create_hook != NULL)
+		checkpoint_create_hook(&checkPoint);
+
 	/*
 	 * In some cases there are groups of actions that must all occur on one
 	 * side or the other of a checkpoint record. Before flushing the
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index d313099c027..8aab37ef52d 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -13,6 +13,7 @@
 
 #include "access/xlogbackup.h"
 #include "access/xlogdefs.h"
+#include "catalog/pg_control.h"
 #include "datatype/timestamp.h"
 #include "lib/stringinfo.h"
 #include "nodes/pg_list.h"
@@ -59,6 +60,9 @@ extern PGDLLIMPORT int wal_decode_buffer_size;
 
 extern PGDLLIMPORT int CheckPointSegments;
 
+typedef void (*checkpoint_create_hook_type) (const CheckPoint *);
+extern PGDLLIMPORT checkpoint_create_hook_type checkpoint_create_hook;
+
 /* Archive modes */
 typedef enum ArchiveMode
 {
-- 
2.47.2
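
The hook added above follows the usual save-previous-then-install convention that fsync_checker's _PG_init() relies on. As a rough standalone illustration of that chaining (no PostgreSQL headers; DemoCheckPoint, demo_hook, install_a and the other names are hypothetical stand-ins, not part of the patch):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the CheckPoint struct; only one field is modelled. */
typedef struct
{
	unsigned long redo;
} DemoCheckPoint;

typedef void (*demo_hook_type) (const DemoCheckPoint *);

/* The core exports a hook pointer that stays NULL unless an extension sets it. */
static demo_hook_type demo_hook = NULL;

static demo_hook_type prev_hook_a = NULL;
static demo_hook_type prev_hook_b = NULL;
static int	calls_a = 0;
static int	calls_b = 0;

static void
hook_a(const DemoCheckPoint *cp)
{
	if (prev_hook_a)
		prev_hook_a(cp);		/* keep earlier extensions working */
	calls_a++;
}

static void
hook_b(const DemoCheckPoint *cp)
{
	if (prev_hook_b)
		prev_hook_b(cp);
	calls_b++;
}

/* What an extension's _PG_init() would do: save the old value, then install. */
static void
install_a(void)
{
	assert(demo_hook != hook_a);	/* installing twice would recurse forever */
	prev_hook_a = demo_hook;
	demo_hook = hook_a;
}

static void
install_b(void)
{
	assert(demo_hook != hook_b);
	prev_hook_b = demo_hook;
	demo_hook = hook_b;
}

/* What the core call site does: call the hook only if one is installed. */
static void
demo_create_checkpoint(unsigned long redo)
{
	DemoCheckPoint cp = {redo};

	if (demo_hook != NULL)
		demo_hook(&cp);
}
```

With both extensions installed, one call to demo_create_checkpoint() reaches hook_b first, which forwards to hook_a, so each runs exactly once per checkpoint.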

Attachment: v4-0002-Allow-extensions-to-override-the-global-storage-m.patch (text/x-patch; charset=UTF-8)
From f1a1dc0f8146c724589ae2578aa470faf68fdb9d Mon Sep 17 00:00:00 2001
From: Tristan Partin <tristan@neon.tech>
Date: Fri, 13 Oct 2023 14:00:44 -0500
Subject: [PATCH v4 2/6] Allow extensions to override the global storage
 manager

---
 src/backend/storage/smgr/smgr.c   | 4 +++-
 src/backend/utils/init/miscinit.c | 2 ++
 src/include/storage/smgr.h        | 2 ++
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 7635c231ea0..9b3e63aff55 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -69,6 +69,8 @@ static int	NSmgr = 0;
 
 static Size LargestSMgrRelationSize = 0;
 
+SMgrId		storage_manager_id;
+
 /*
  * Each backend has a hashtable that stores all extant SMgrRelation objects.
  * In addition, "unpinned" SMgrRelation objects are chained together in a list.
@@ -227,7 +229,7 @@ smgropen(RelFileLocator rlocator, ProcNumber backend)
 		for (int i = 0; i <= MAX_FORKNUM; ++i)
 			reln->smgr_cached_nblocks[i] = InvalidBlockNumber;
 
-		reln->smgr_which = MdSMgrId;	/* we only have md.c at present */
+		reln->smgr_which = storage_manager_id;
 
 		/* implementation-specific initialization */
 		smgrsw[reln->smgr_which].smgr_open(reln);
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 3176cdce6d7..1b3ce51cfce 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -1931,6 +1931,8 @@ void
 register_builtin_dynamic_managers(void)
 {
 	mdsmgr_register();
+
+	storage_manager_id = MdSMgrId;
 }
 
 /*
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 52f74f917b2..629c78cfdde 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -20,6 +20,8 @@
 
 typedef uint8 SMgrId;
 
+extern PGDLLIMPORT SMgrId storage_manager_id;
+
 /*
  * smgr.c maintains a table of SMgrRelation objects, which are essentially
  * cached file handles.  An SMgrRelation is created (if not already present)
-- 
2.47.2
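
The effect of the storage_manager_id override above can be sketched with a toy dispatch table. This is a simplified model under stated assumptions, not the patch's implementation: the table is a fixed array rather than dynamically grown, a single create callback stands in for the whole f_smgr struct, the manager is looked up per call rather than cached in smgr_which at open time, and all Demo*/demo_* names are hypothetical.

```c
#include <assert.h>

#define DEMO_MAX_SMGR 8

typedef unsigned char DemoSMgrId;

/* One callback stands in for the full f_smgr function table. */
typedef struct
{
	const char *name;
	void		(*create) (int relnumber);
} DemoSMgr;

static DemoSMgr demo_smgrsw[DEMO_MAX_SMGR];
static int	demo_nsmgr = 0;

/* Which registered manager newly opened relations use. */
static DemoSMgrId demo_storage_manager_id = 0;

static int	md_creates = 0;
static int	checker_creates = 0;

/* Copy the caller's table into the next free slot and return its id. */
static DemoSMgrId
demo_smgr_register(const DemoSMgr *smgr)
{
	assert(demo_nsmgr < DEMO_MAX_SMGR);
	demo_smgrsw[demo_nsmgr] = *smgr;
	return (DemoSMgrId) demo_nsmgr++;
}

static void
md_create(int relnumber)
{
	(void) relnumber;
	md_creates++;
}

/* A wrapping manager: do its own bookkeeping, then defer to md. */
static void
checker_create(int relnumber)
{
	checker_creates++;
	md_create(relnumber);
}

/* Dispatch through whichever manager is currently selected. */
static void
demo_create(int relnumber)
{
	demo_smgrsw[demo_storage_manager_id].create(relnumber);
}
```

Registering md first and pointing demo_storage_manager_id at it mirrors register_builtin_dynamic_managers(); an extension that registers later and overwrites the id, as fsync_checker does in _PG_init(), takes over all subsequent dispatch while still delegating to md.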

Attachment: v4-0001-Expose-f_smgr-to-extensions-for-manual-implementa.patch (text/x-patch; charset=UTF-8)
From 772790b5a5b4ab215d1243722f1b31303dc976f5 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Tue, 27 Jun 2023 15:59:23 +0200
Subject: [PATCH v4 1/6] Expose f_smgr to extensions for manual implementation

There are various reasons why one would want to create their own
implementation of a storage manager, among which are block-level compression,
encryption and offloading to cold storage. This is the first patch in the
series; it allows extensions to register their own SMgr.

Note, however, that a registered SMgr is not yet used: only the first SMgr to
register is used, and that is currently the md.c smgr. Future commits will
include facilities to select an SMgr for each tablespace.
---
 src/backend/postmaster/postmaster.c |   5 +
 src/backend/storage/smgr/md.c       | 187 +++++++++++++++++++---------
 src/backend/storage/smgr/smgr.c     | 137 ++++++++++----------
 src/backend/utils/init/miscinit.c   |  13 ++
 src/include/miscadmin.h             |   1 +
 src/include/storage/md.h            |   4 +
 src/include/storage/smgr.h          |  59 +++++++--
 src/tools/pgindent/typedefs.list    |   1 +
 8 files changed, 266 insertions(+), 141 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index d2a7a7add6f..88ea821573d 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -916,6 +916,11 @@ PostmasterMain(int argc, char *argv[])
 	 */
 	ApplyLauncherRegister();
 
+	/*
+	 * Register built-in managers that are not part of static arrays
+	 */
+	register_builtin_dynamic_managers();
+
 	/*
 	 * process any libraries that should be preloaded at postmaster start
 	 */
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index f3220f98dc4..5a2072e0816 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -84,6 +84,21 @@ typedef struct _MdfdVec
 } MdfdVec;
 
 static MemoryContext MdCxt;		/* context for all MdfdVec objects */
+SMgrId		MdSMgrId;
+
+typedef struct
+{
+	SMgrRelationData reln;		/* parent data */
+
+	/*
+	 * for md.c; per-fork arrays of the number of open segments
+	 * (md_num_open_segs) and the segments themselves (md_seg_fds).
+	 */
+	int			md_num_open_segs[MAX_FORKNUM + 1];
+	MdfdVec    *md_seg_fds[MAX_FORKNUM + 1];
+} MdSMgrRelationData;
+
+typedef MdSMgrRelationData *MdSMgrRelation;
 
 
 /* Populate a file tag describing an md.c segment file. */
@@ -130,26 +145,55 @@ typedef struct MdPathStr
 } MdPathStr;
 
 
+void
+mdsmgr_register(void)
+{
+	/* magnetic disk */
+	f_smgr		md_smgr = (f_smgr) {
+		.name = "md",
+		.smgr_init = mdinit,
+		.smgr_shutdown = NULL,
+		.smgr_open = mdopen,
+		.smgr_close = mdclose,
+		.smgr_create = mdcreate,
+		.smgr_exists = mdexists,
+		.smgr_unlink = mdunlink,
+		.smgr_extend = mdextend,
+		.smgr_zeroextend = mdzeroextend,
+		.smgr_prefetch = mdprefetch,
+		.smgr_maxcombine = mdmaxcombine,
+		.smgr_readv = mdreadv,
+		.smgr_writev = mdwritev,
+		.smgr_writeback = mdwriteback,
+		.smgr_nblocks = mdnblocks,
+		.smgr_truncate = mdtruncate,
+		.smgr_immedsync = mdimmedsync,
+		.smgr_registersync = mdregistersync,
+	};
+
+	MdSMgrId = smgr_register(&md_smgr, sizeof(MdSMgrRelationData));
+}
+
 /* local routines */
 static void mdunlinkfork(RelFileLocatorBackend rlocator, ForkNumber forknum,
 						 bool isRedo);
-static MdfdVec *mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior);
-static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum,
+static MdfdVec *mdopenfork(MdSMgrRelation reln, ForkNumber forknum, int behavior);
+static void register_dirty_segment(MdSMgrRelation reln, ForkNumber forknum,
 								   MdfdVec *seg);
 static void register_unlink_segment(RelFileLocatorBackend rlocator, ForkNumber forknum,
 									BlockNumber segno);
 static void register_forget_request(RelFileLocatorBackend rlocator, ForkNumber forknum,
 									BlockNumber segno);
-static void _fdvec_resize(SMgrRelation reln,
+static void _fdvec_resize(MdSMgrRelation reln,
 						  ForkNumber forknum,
 						  int nseg);
-static MdPathStr _mdfd_segpath(SMgrRelation reln, ForkNumber forknum,
+static MdPathStr _mdfd_segpath(MdSMgrRelation reln, ForkNumber forknum,
 							   BlockNumber segno);
-static MdfdVec *_mdfd_openseg(SMgrRelation reln, ForkNumber forknum,
+static MdfdVec *_mdfd_openseg(MdSMgrRelation reln, ForkNumber forknum,
 							  BlockNumber segno, int oflags);
-static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forknum,
+static MdfdVec *_mdfd_getseg(MdSMgrRelation reln, ForkNumber forknum,
 							 BlockNumber blkno, bool skipFsync, int behavior);
-static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
+static BlockNumber _mdnblocks(MdSMgrRelation reln, ForkNumber forknum,
 							  MdfdVec *seg);
 
 static inline int
@@ -182,6 +226,8 @@ mdinit(void)
 bool
 mdexists(SMgrRelation reln, ForkNumber forknum)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
+
 	/*
 	 * Close it first, to ensure that we notice if the fork has been unlinked
 	 * since we opened it.  As an optimization, we can skip that in recovery,
@@ -190,7 +236,7 @@ mdexists(SMgrRelation reln, ForkNumber forknum)
 	if (!InRecovery)
 		mdclose(reln, forknum);
 
-	return (mdopenfork(reln, forknum, EXTENSION_RETURN_NULL) != NULL);
+	return (mdopenfork(mdreln, forknum, EXTENSION_RETURN_NULL) != NULL);
 }
 
 /*
@@ -201,14 +247,15 @@ mdexists(SMgrRelation reln, ForkNumber forknum)
 void
 mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	MdfdVec    *mdfd;
 	RelPathStr	path;
 	File		fd;
 
-	if (isRedo && reln->md_num_open_segs[forknum] > 0)
+	if (isRedo && mdreln->md_num_open_segs[forknum] > 0)
 		return;					/* created and opened already... */
 
-	Assert(reln->md_num_open_segs[forknum] == 0);
+	Assert(mdreln->md_num_open_segs[forknum] == 0);
 
 	/*
 	 * We may be using the target table space for the first time in this
@@ -243,13 +290,13 @@ mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
 		}
 	}
 
-	_fdvec_resize(reln, forknum, 1);
-	mdfd = &reln->md_seg_fds[forknum][0];
+	_fdvec_resize(mdreln, forknum, 1);
+	mdfd = &mdreln->md_seg_fds[forknum][0];
 	mdfd->mdfd_vfd = fd;
 	mdfd->mdfd_segno = 0;
 
 	if (!SmgrIsTemp(reln))
-		register_dirty_segment(reln, forknum, mdfd);
+		register_dirty_segment(mdreln, forknum, mdfd);
 }
 
 /*
@@ -467,6 +514,7 @@ void
 mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		 const void *buffer, bool skipFsync)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	off_t		seekpos;
 	int			nbytes;
 	MdfdVec    *v;
@@ -493,7 +541,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 						relpath(reln->smgr_rlocator, forknum).str,
 						InvalidBlockNumber)));
 
-	v = _mdfd_getseg(reln, forknum, blocknum, skipFsync, EXTENSION_CREATE);
+	v = _mdfd_getseg(mdreln, forknum, blocknum, skipFsync, EXTENSION_CREATE);
 
 	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
@@ -517,9 +565,9 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	}
 
 	if (!skipFsync && !SmgrIsTemp(reln))
-		register_dirty_segment(reln, forknum, v);
+		register_dirty_segment(mdreln, forknum, v);
 
-	Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
+	Assert(_mdnblocks(mdreln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
 }
 
 /*
@@ -532,6 +580,7 @@ void
 mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 			 BlockNumber blocknum, int nblocks, bool skipFsync)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	MdfdVec    *v;
 	BlockNumber curblocknum = blocknum;
 	int			remblocks = nblocks;
@@ -566,7 +615,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 		else
 			numblocks = remblocks;
 
-		v = _mdfd_getseg(reln, forknum, curblocknum, skipFsync, EXTENSION_CREATE);
+		v = _mdfd_getseg(mdreln, forknum, curblocknum, skipFsync, EXTENSION_CREATE);
 
 		Assert(segstartblock < RELSEG_SIZE);
 		Assert(segstartblock + numblocks <= RELSEG_SIZE);
@@ -621,9 +670,9 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 		}
 
 		if (!skipFsync && !SmgrIsTemp(reln))
-			register_dirty_segment(reln, forknum, v);
+			register_dirty_segment(mdreln, forknum, v);
 
-		Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
+		Assert(_mdnblocks(mdreln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
 
 		remblocks -= numblocks;
 		curblocknum += numblocks;
@@ -641,7 +690,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
  * invent one out of whole cloth.
  */
 static MdfdVec *
-mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior)
+mdopenfork(MdSMgrRelation reln, ForkNumber forknum, int behavior)
 {
 	MdfdVec    *mdfd;
 	RelPathStr	path;
@@ -651,7 +700,7 @@ mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior)
 	if (reln->md_num_open_segs[forknum] > 0)
 		return &reln->md_seg_fds[forknum][0];
 
-	path = relpath(reln->smgr_rlocator, forknum);
+	path = relpath(reln->reln.smgr_rlocator, forknum);
 
 	fd = PathNameOpenFile(path.str, _mdfd_open_flags());
 
@@ -681,9 +730,11 @@ mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior)
 void
 mdopen(SMgrRelation reln)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
+
 	/* mark it not open */
 	for (int forknum = 0; forknum <= MAX_FORKNUM; forknum++)
-		reln->md_num_open_segs[forknum] = 0;
+		mdreln->md_num_open_segs[forknum] = 0;
 }
 
 /*
@@ -692,7 +743,8 @@ mdopen(SMgrRelation reln)
 void
 mdclose(SMgrRelation reln, ForkNumber forknum)
 {
-	int			nopensegs = reln->md_num_open_segs[forknum];
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
+	int			nopensegs = mdreln->md_num_open_segs[forknum];
 
 	/* No work if already closed */
 	if (nopensegs == 0)
@@ -701,10 +753,10 @@ mdclose(SMgrRelation reln, ForkNumber forknum)
 	/* close segments starting from the end */
 	while (nopensegs > 0)
 	{
-		MdfdVec    *v = &reln->md_seg_fds[forknum][nopensegs - 1];
+		MdfdVec    *v = &mdreln->md_seg_fds[forknum][nopensegs - 1];
 
 		FileClose(v->mdfd_vfd);
-		_fdvec_resize(reln, forknum, nopensegs - 1);
+		_fdvec_resize(mdreln, forknum, nopensegs - 1);
 		nopensegs--;
 	}
 }
@@ -717,6 +769,7 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		   int nblocks)
 {
 #ifdef USE_PREFETCH
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 
 	Assert((io_direct_flags & IO_DIRECT_DATA) == 0);
 
@@ -729,7 +782,7 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		MdfdVec    *v;
 		int			nblocks_this_segment;
 
-		v = _mdfd_getseg(reln, forknum, blocknum, false,
+		v = _mdfd_getseg(mdreln, forknum, blocknum, false,
 						 InRecovery ? EXTENSION_RETURN_NULL : EXTENSION_FAIL);
 		if (v == NULL)
 			return false;
@@ -827,6 +880,8 @@ void
 mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		void **buffers, BlockNumber nblocks)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
+
 	while (nblocks > 0)
 	{
 		struct iovec iov[PG_IOV_MAX];
@@ -838,7 +893,7 @@ mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		size_t		transferred_this_segment;
 		size_t		size_this_segment;
 
-		v = _mdfd_getseg(reln, forknum, blocknum, false,
+		v = _mdfd_getseg(mdreln, forknum, blocknum, false,
 						 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
 
 		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
@@ -948,6 +1003,8 @@ void
 mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		 const void **buffers, BlockNumber nblocks, bool skipFsync)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
+
 	/* This assert is too expensive to have on normally ... */
 #ifdef CHECK_WRITE_VS_EXTEND
 	Assert((uint64) blocknum + (uint64) nblocks <= (uint64) mdnblocks(reln, forknum));
@@ -964,7 +1021,7 @@ mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		size_t		transferred_this_segment;
 		size_t		size_this_segment;
 
-		v = _mdfd_getseg(reln, forknum, blocknum, skipFsync,
+		v = _mdfd_getseg(mdreln, forknum, blocknum, skipFsync,
 						 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
 
 		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
@@ -1034,7 +1091,7 @@ mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		}
 
 		if (!skipFsync && !SmgrIsTemp(reln))
-			register_dirty_segment(reln, forknum, v);
+			register_dirty_segment(mdreln, forknum, v);
 
 		nblocks -= nblocks_this_segment;
 		buffers += nblocks_this_segment;
@@ -1053,6 +1110,8 @@ void
 mdwriteback(SMgrRelation reln, ForkNumber forknum,
 			BlockNumber blocknum, BlockNumber nblocks)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
+
 	Assert((io_direct_flags & IO_DIRECT_DATA) == 0);
 
 	/*
@@ -1067,7 +1126,7 @@ mdwriteback(SMgrRelation reln, ForkNumber forknum,
 		int			segnum_start,
 					segnum_end;
 
-		v = _mdfd_getseg(reln, forknum, blocknum, true /* not used */ ,
+		v = _mdfd_getseg(mdreln, forknum, blocknum, true /* not used */ ,
 						 EXTENSION_DONT_OPEN);
 
 		/*
@@ -1111,14 +1170,15 @@ mdwriteback(SMgrRelation reln, ForkNumber forknum,
 BlockNumber
 mdnblocks(SMgrRelation reln, ForkNumber forknum)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	MdfdVec    *v;
 	BlockNumber nblocks;
 	BlockNumber segno;
 
-	mdopenfork(reln, forknum, EXTENSION_FAIL);
+	mdopenfork(mdreln, forknum, EXTENSION_FAIL);
 
 	/* mdopen has opened the first segment */
-	Assert(reln->md_num_open_segs[forknum] > 0);
+	Assert(mdreln->md_num_open_segs[forknum] > 0);
 
 	/*
 	 * Start from the last open segments, to avoid redundant seeks.  We have
@@ -1133,12 +1193,12 @@ mdnblocks(SMgrRelation reln, ForkNumber forknum)
 	 * that's OK because the checkpointer never needs to compute relation
 	 * size.)
 	 */
-	segno = reln->md_num_open_segs[forknum] - 1;
-	v = &reln->md_seg_fds[forknum][segno];
+	segno = mdreln->md_num_open_segs[forknum] - 1;
+	v = &mdreln->md_seg_fds[forknum][segno];
 
 	for (;;)
 	{
-		nblocks = _mdnblocks(reln, forknum, v);
+		nblocks = _mdnblocks(mdreln, forknum, v);
 		if (nblocks > ((BlockNumber) RELSEG_SIZE))
 			elog(FATAL, "segment too big");
 		if (nblocks < ((BlockNumber) RELSEG_SIZE))
@@ -1156,7 +1216,7 @@ mdnblocks(SMgrRelation reln, ForkNumber forknum)
 		 * undermines _mdfd_getseg's attempts to notice and report an error
 		 * upon access to a missing segment.
 		 */
-		v = _mdfd_openseg(reln, forknum, segno, 0);
+		v = _mdfd_openseg(mdreln, forknum, segno, 0);
 		if (v == NULL)
 			return segno * ((BlockNumber) RELSEG_SIZE);
 	}
@@ -1176,6 +1236,7 @@ void
 mdtruncate(SMgrRelation reln, ForkNumber forknum,
 		   BlockNumber curnblk, BlockNumber nblocks)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	BlockNumber priorblocks;
 	int			curopensegs;
 
@@ -1196,14 +1257,14 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum,
 	 * Truncate segments, starting at the last one. Starting at the end makes
 	 * managing the memory for the fd array easier, should there be errors.
 	 */
-	curopensegs = reln->md_num_open_segs[forknum];
+	curopensegs = mdreln->md_num_open_segs[forknum];
 	while (curopensegs > 0)
 	{
 		MdfdVec    *v;
 
 		priorblocks = (curopensegs - 1) * RELSEG_SIZE;
 
-		v = &reln->md_seg_fds[forknum][curopensegs - 1];
+		v = &mdreln->md_seg_fds[forknum][curopensegs - 1];
 
 		if (priorblocks > nblocks)
 		{
@@ -1218,13 +1279,13 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum,
 								FilePathName(v->mdfd_vfd))));
 
 			if (!SmgrIsTemp(reln))
-				register_dirty_segment(reln, forknum, v);
+				register_dirty_segment(mdreln, forknum, v);
 
 			/* we never drop the 1st segment */
-			Assert(v != &reln->md_seg_fds[forknum][0]);
+			Assert(v != &mdreln->md_seg_fds[forknum][0]);
 
 			FileClose(v->mdfd_vfd);
-			_fdvec_resize(reln, forknum, curopensegs - 1);
+			_fdvec_resize(mdreln, forknum, curopensegs - 1);
 		}
 		else if (priorblocks + ((BlockNumber) RELSEG_SIZE) > nblocks)
 		{
@@ -1244,7 +1305,7 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum,
 								FilePathName(v->mdfd_vfd),
 								nblocks)));
 			if (!SmgrIsTemp(reln))
-				register_dirty_segment(reln, forknum, v);
+				register_dirty_segment(mdreln, forknum, v);
 		}
 		else
 		{
@@ -1264,6 +1325,7 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum,
 void
 mdregistersync(SMgrRelation reln, ForkNumber forknum)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	int			segno;
 	int			min_inactive_seg;
 
@@ -1273,7 +1335,7 @@ mdregistersync(SMgrRelation reln, ForkNumber forknum)
 	 */
 	mdnblocks(reln, forknum);
 
-	min_inactive_seg = segno = reln->md_num_open_segs[forknum];
+	min_inactive_seg = segno = mdreln->md_num_open_segs[forknum];
 
 	/*
 	 * Temporarily open inactive segments, then close them after sync.  There
@@ -1281,20 +1343,20 @@ mdregistersync(SMgrRelation reln, ForkNumber forknum)
 	 * harmless.  We don't bother to clean them up and take a risk of further
 	 * trouble.  The next mdclose() will soon close them.
 	 */
-	while (_mdfd_openseg(reln, forknum, segno, 0) != NULL)
+	while (_mdfd_openseg(mdreln, forknum, segno, 0) != NULL)
 		segno++;
 
 	while (segno > 0)
 	{
-		MdfdVec    *v = &reln->md_seg_fds[forknum][segno - 1];
+		MdfdVec    *v = &mdreln->md_seg_fds[forknum][segno - 1];
 
-		register_dirty_segment(reln, forknum, v);
+		register_dirty_segment(mdreln, forknum, v);
 
 		/* Close inactive segments immediately */
 		if (segno > min_inactive_seg)
 		{
 			FileClose(v->mdfd_vfd);
-			_fdvec_resize(reln, forknum, segno - 1);
+			_fdvec_resize(mdreln, forknum, segno - 1);
 		}
 
 		segno--;
@@ -1315,6 +1377,7 @@ mdregistersync(SMgrRelation reln, ForkNumber forknum)
 void
 mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
 	int			segno;
 	int			min_inactive_seg;
 
@@ -1324,7 +1387,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 	 */
 	mdnblocks(reln, forknum);
 
-	min_inactive_seg = segno = reln->md_num_open_segs[forknum];
+	min_inactive_seg = segno = mdreln->md_num_open_segs[forknum];
 
 	/*
 	 * Temporarily open inactive segments, then close them after sync.  There
@@ -1332,12 +1395,12 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 	 * is harmless.  We don't bother to clean them up and take a risk of
 	 * further trouble.  The next mdclose() will soon close them.
 	 */
-	while (_mdfd_openseg(reln, forknum, segno, 0) != NULL)
+	while (_mdfd_openseg(mdreln, forknum, segno, 0) != NULL)
 		segno++;
 
 	while (segno > 0)
 	{
-		MdfdVec    *v = &reln->md_seg_fds[forknum][segno - 1];
+		MdfdVec    *v = &mdreln->md_seg_fds[forknum][segno - 1];
 
 		/*
 		 * fsyncs done through mdimmedsync() should be tracked in a separate
@@ -1358,7 +1421,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 		if (segno > min_inactive_seg)
 		{
 			FileClose(v->mdfd_vfd);
-			_fdvec_resize(reln, forknum, segno - 1);
+			_fdvec_resize(mdreln, forknum, segno - 1);
 		}
 
 		segno--;
@@ -1375,14 +1438,14 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
  * enough to be a performance problem).
  */
 static void
-register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
+register_dirty_segment(MdSMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 {
 	FileTag		tag;
 
-	INIT_MD_FILETAG(tag, reln->smgr_rlocator.locator, forknum, seg->mdfd_segno);
+	INIT_MD_FILETAG(tag, reln->reln.smgr_rlocator.locator, forknum, seg->mdfd_segno);
 
 	/* Temp relations should never be fsync'd */
-	Assert(!SmgrIsTemp(reln));
+	Assert(!SmgrIsTemp(&reln->reln));
 
 	if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
 	{
@@ -1500,7 +1563,7 @@ DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo)
  * _fdvec_resize() -- Resize the fork's open segments array
  */
 static void
-_fdvec_resize(SMgrRelation reln,
+_fdvec_resize(MdSMgrRelation reln,
 			  ForkNumber forknum,
 			  int nseg)
 {
@@ -1548,12 +1611,12 @@ _fdvec_resize(SMgrRelation reln,
  * returned string is palloc'd.
  */
 static MdPathStr
-_mdfd_segpath(SMgrRelation reln, ForkNumber forknum, BlockNumber segno)
+_mdfd_segpath(MdSMgrRelation reln, ForkNumber forknum, BlockNumber segno)
 {
 	RelPathStr	path;
 	MdPathStr	fullpath;
 
-	path = relpath(reln->smgr_rlocator, forknum);
+	path = relpath(reln->reln.smgr_rlocator, forknum);
 
 	if (segno > 0)
 		sprintf(fullpath.str, "%s.%u", path.str, segno);
@@ -1568,7 +1631,7 @@ _mdfd_segpath(SMgrRelation reln, ForkNumber forknum, BlockNumber segno)
  * and make a MdfdVec object for it.  Returns NULL on failure.
  */
 static MdfdVec *
-_mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
+_mdfd_openseg(MdSMgrRelation reln, ForkNumber forknum, BlockNumber segno,
 			  int oflags)
 {
 	MdfdVec    *v;
@@ -1611,7 +1674,7 @@ _mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
  * EXTENSION_CREATE case.
  */
 static MdfdVec *
-_mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
+_mdfd_getseg(MdSMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 			 bool skipFsync, int behavior)
 {
 	MdfdVec    *v;
@@ -1685,7 +1748,7 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 				char	   *zerobuf = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE,
 													 MCXT_ALLOC_ZERO);
 
-				mdextend(reln, forknum,
+				mdextend((SMgrRelation) reln, forknum,
 						 nextsegno * ((BlockNumber) RELSEG_SIZE) - 1,
 						 zerobuf, skipFsync);
 				pfree(zerobuf);
@@ -1740,7 +1803,7 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
  * Get number of blocks present in a single disk file
  */
 static BlockNumber
-_mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
+_mdnblocks(MdSMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 {
 	off_t		len;
 
@@ -1763,7 +1826,7 @@ _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 int
 mdsyncfiletag(const FileTag *ftag, char *path)
 {
-	SMgrRelation reln = smgropen(ftag->rlocator, INVALID_PROC_NUMBER);
+	MdSMgrRelation reln = (MdSMgrRelation) smgropen(ftag->rlocator, INVALID_PROC_NUMBER);
 	File		file;
 	instr_time	io_start;
 	bool		need_to_close;
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index ebe35c04de5..7635c231ea0 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -53,84 +53,21 @@
 
 #include "access/xlogutils.h"
 #include "lib/ilist.h"
+#include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/ipc.h"
 #include "storage/md.h"
 #include "storage/smgr.h"
+#include "port/atomics.h"
 #include "utils/hsearch.h"
 #include "utils/inval.h"
+#include "utils/memutils.h"
 
+static f_smgr *smgrsw;
 
-/*
- * This struct of function pointers defines the API between smgr.c and
- * any individual storage manager module.  Note that smgr subfunctions are
- * generally expected to report problems via elog(ERROR).  An exception is
- * that smgr_unlink should use elog(WARNING), rather than erroring out,
- * because we normally unlink relations during post-commit/abort cleanup,
- * and so it's too late to raise an error.  Also, various conditions that
- * would normally be errors should be allowed during bootstrap and/or WAL
- * recovery --- see comments in md.c for details.
- */
-typedef struct f_smgr
-{
-	void		(*smgr_init) (void);	/* may be NULL */
-	void		(*smgr_shutdown) (void);	/* may be NULL */
-	void		(*smgr_open) (SMgrRelation reln);
-	void		(*smgr_close) (SMgrRelation reln, ForkNumber forknum);
-	void		(*smgr_create) (SMgrRelation reln, ForkNumber forknum,
-								bool isRedo);
-	bool		(*smgr_exists) (SMgrRelation reln, ForkNumber forknum);
-	void		(*smgr_unlink) (RelFileLocatorBackend rlocator, ForkNumber forknum,
-								bool isRedo);
-	void		(*smgr_extend) (SMgrRelation reln, ForkNumber forknum,
-								BlockNumber blocknum, const void *buffer, bool skipFsync);
-	void		(*smgr_zeroextend) (SMgrRelation reln, ForkNumber forknum,
-									BlockNumber blocknum, int nblocks, bool skipFsync);
-	bool		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
-								  BlockNumber blocknum, int nblocks);
-	uint32		(*smgr_maxcombine) (SMgrRelation reln, ForkNumber forknum,
-									BlockNumber blocknum);
-	void		(*smgr_readv) (SMgrRelation reln, ForkNumber forknum,
-							   BlockNumber blocknum,
-							   void **buffers, BlockNumber nblocks);
-	void		(*smgr_writev) (SMgrRelation reln, ForkNumber forknum,
-								BlockNumber blocknum,
-								const void **buffers, BlockNumber nblocks,
-								bool skipFsync);
-	void		(*smgr_writeback) (SMgrRelation reln, ForkNumber forknum,
-								   BlockNumber blocknum, BlockNumber nblocks);
-	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
-	void		(*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
-								  BlockNumber old_blocks, BlockNumber nblocks);
-	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
-	void		(*smgr_registersync) (SMgrRelation reln, ForkNumber forknum);
-} f_smgr;
-
-static const f_smgr smgrsw[] = {
-	/* magnetic disk */
-	{
-		.smgr_init = mdinit,
-		.smgr_shutdown = NULL,
-		.smgr_open = mdopen,
-		.smgr_close = mdclose,
-		.smgr_create = mdcreate,
-		.smgr_exists = mdexists,
-		.smgr_unlink = mdunlink,
-		.smgr_extend = mdextend,
-		.smgr_zeroextend = mdzeroextend,
-		.smgr_prefetch = mdprefetch,
-		.smgr_maxcombine = mdmaxcombine,
-		.smgr_readv = mdreadv,
-		.smgr_writev = mdwritev,
-		.smgr_writeback = mdwriteback,
-		.smgr_nblocks = mdnblocks,
-		.smgr_truncate = mdtruncate,
-		.smgr_immedsync = mdimmedsync,
-		.smgr_registersync = mdregistersync,
-	}
-};
+static int	NSmgr = 0;
 
-static const int NSmgr = lengthof(smgrsw);
+static Size LargestSMgrRelationSize = 0;
 
 /*
  * Each backend has a hashtable that stores all extant SMgrRelation objects.
@@ -144,6 +81,60 @@ static dlist_head unpinned_relns;
 static void smgrshutdown(int code, Datum arg);
 static void smgrdestroy(SMgrRelation reln);
 
+#define MaxSMgrId UINT8_MAX
+
+SMgrId
+smgr_register(const f_smgr *smgr, Size smgrrelation_size)
+{
+	SMgrId		my_id;
+	MemoryContext old;
+
+	if (process_shared_preload_libraries_done)
+		elog(FATAL, "SMgrs must be registered in the shared_preload_libraries phase");
+	if (NSmgr == MaxSMgrId)
+		elog(FATAL, "Too many smgrs registered");
+	if (smgr->name == NULL || *smgr->name == 0)
+		elog(FATAL, "smgr registered with invalid name");
+
+	Assert(smgr->smgr_open != NULL);
+	Assert(smgr->smgr_close != NULL);
+	Assert(smgr->smgr_create != NULL);
+	Assert(smgr->smgr_exists != NULL);
+	Assert(smgr->smgr_unlink != NULL);
+	Assert(smgr->smgr_extend != NULL);
+	Assert(smgr->smgr_zeroextend != NULL);
+	Assert(smgr->smgr_prefetch != NULL);
+	Assert(smgr->smgr_readv != NULL);
+	Assert(smgr->smgr_writev != NULL);
+	Assert(smgr->smgr_writeback != NULL);
+	Assert(smgr->smgr_nblocks != NULL);
+	Assert(smgr->smgr_truncate != NULL);
+	Assert(smgr->smgr_immedsync != NULL);
+
+	old = MemoryContextSwitchTo(TopMemoryContext);
+
+	my_id = NSmgr++;
+	if (my_id == 0)
+		smgrsw = palloc_array(f_smgr, 1);
+	else
+		smgrsw = repalloc_array(smgrsw, f_smgr, NSmgr);
+
+	MemoryContextSwitchTo(old);
+
+	pg_compiler_barrier();
+
+	if (!smgrsw)
+	{
+		NSmgr--;
+		elog(FATAL, "Failed to extend smgr array");
+	}
+
+	smgrsw[my_id] = *smgr;
+
+	LargestSMgrRelationSize = Max(LargestSMgrRelationSize, smgrrelation_size);
+
+	return my_id;
+}
 
 /*
  * smgrinit(), smgrshutdown() -- Initialize or shut down storage
@@ -211,8 +202,11 @@ smgropen(RelFileLocator rlocator, ProcNumber backend)
 		/* First time through: initialize the hash table */
 		HASHCTL		ctl;
 
+		LargestSMgrRelationSize = MAXALIGN(LargestSMgrRelationSize);
+		Assert(NSmgr > 0);
+
 		ctl.keysize = sizeof(RelFileLocatorBackend);
-		ctl.entrysize = sizeof(SMgrRelationData);
+		ctl.entrysize = LargestSMgrRelationSize;
 		SMgrRelationHash = hash_create("smgr relation table", 400,
 									   &ctl, HASH_ELEM | HASH_BLOBS);
 		dlist_init(&unpinned_relns);
@@ -232,7 +226,8 @@ smgropen(RelFileLocator rlocator, ProcNumber backend)
 		reln->smgr_targblock = InvalidBlockNumber;
 		for (int i = 0; i <= MAX_FORKNUM; ++i)
 			reln->smgr_cached_nblocks[i] = InvalidBlockNumber;
-		reln->smgr_which = 0;	/* we only have md.c at present */
+
+		reln->smgr_which = MdSMgrId;	/* we only have md.c at present */
 
 		/* implementation-specific initialization */
 		smgrsw[reln->smgr_which].smgr_open(reln);
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index dc3521457c7..3176cdce6d7 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -43,6 +43,7 @@
 #include "replication/slotsync.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
+#include "storage/md.h"
 #include "storage/latch.h"
 #include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
@@ -192,6 +193,9 @@ InitStandaloneProcess(const char *argv0)
 	InitProcessLocalLatch();
 	InitializeLatchWaitSet();
 
+	/* Initialize smgrs */
+	register_builtin_dynamic_managers();
+
 	/*
 	 * For consistency with InitPostmasterChild, initialize signal mask here.
 	 * But we don't unblock SIGQUIT or provide a default handler for it.
@@ -1920,6 +1924,15 @@ process_session_preload_libraries(void)
 				   true);
 }
 
+/*
+ * Register any internal managers.
+ */
+void
+register_builtin_dynamic_managers(void)
+{
+	mdsmgr_register();
+}
+
 /*
  * process any shared memory requests from preloaded libraries
  */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index a2b63495eec..ff4ef578a1f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -513,6 +513,7 @@ extern void TouchSocketLockFiles(void);
 extern void AddToDataDirLockFile(int target_line, const char *str);
 extern bool RecheckDataDirLockFile(void);
 extern void ValidatePgVersion(const char *path);
+extern void register_builtin_dynamic_managers(void);
 extern void process_shared_preload_libraries(void);
 extern void process_session_preload_libraries(void);
 extern void process_shmem_requests(void);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 05bf537066e..da1d1d339be 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -19,6 +19,10 @@
 #include "storage/smgr.h"
 #include "storage/sync.h"
 
+/* registration function for md storage manager */
+extern void mdsmgr_register(void);
+extern SMgrId MdSMgrId;
+
 /* md storage manager functionality */
 extern void mdinit(void);
 extern void mdopen(SMgrRelation reln);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 4016b206ad6..52f74f917b2 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,8 @@
 #include "storage/block.h"
 #include "storage/relfilelocator.h"
 
+typedef uint8 SMgrId;
+
 /*
  * smgr.c maintains a table of SMgrRelation objects, which are essentially
  * cached file handles.  An SMgrRelation is created (if not already present)
@@ -51,14 +53,7 @@ typedef struct SMgrRelationData
 	 * Fields below here are intended to be private to smgr.c and its
 	 * submodules.  Do not touch them from elsewhere.
 	 */
-	int			smgr_which;		/* storage manager selector */
-
-	/*
-	 * for md.c; per-fork arrays of the number of open segments
-	 * (md_num_open_segs) and the segments themselves (md_seg_fds).
-	 */
-	int			md_num_open_segs[MAX_FORKNUM + 1];
-	struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];
+	SMgrId		smgr_which;		/* storage manager selector */
 
 	/*
 	 * Pinning support.  If unpinned (ie. pincount == 0), 'node' is a list
@@ -73,6 +68,54 @@ typedef SMgrRelationData *SMgrRelation;
 #define SmgrIsTemp(smgr) \
 	RelFileLocatorBackendIsTemp((smgr)->smgr_rlocator)
 
+/*
+ * This struct of function pointers defines the API between smgr.c and
+ * any individual storage manager module.  Note that smgr subfunctions are
+ * generally expected to report problems via elog(ERROR).  An exception is
+ * that smgr_unlink should use elog(WARNING), rather than erroring out,
+ * because we normally unlink relations during post-commit/abort cleanup,
+ * and so it's too late to raise an error.  Also, various conditions that
+ * would normally be errors should be allowed during bootstrap and/or WAL
+ * recovery --- see comments in md.c for details.
+ */
+typedef struct f_smgr
+{
+	const char *name;
+	void		(*smgr_init) (void);	/* may be NULL */
+	void		(*smgr_shutdown) (void);	/* may be NULL */
+	void		(*smgr_open) (SMgrRelation reln);
+	void		(*smgr_close) (SMgrRelation reln, ForkNumber forknum);
+	void		(*smgr_create) (SMgrRelation reln, ForkNumber forknum,
+								bool isRedo);
+	bool		(*smgr_exists) (SMgrRelation reln, ForkNumber forknum);
+	void		(*smgr_unlink) (RelFileLocatorBackend rlocator, ForkNumber forknum,
+								bool isRedo);
+	void		(*smgr_extend) (SMgrRelation reln, ForkNumber forknum,
+								BlockNumber blocknum, const void *buffer, bool skipFsync);
+	void		(*smgr_zeroextend) (SMgrRelation reln, ForkNumber forknum,
+									BlockNumber blocknum, int nblocks, bool skipFsync);
+	bool		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
+								  BlockNumber blocknum, int nblocks);
+	uint32		(*smgr_maxcombine) (SMgrRelation reln, ForkNumber forknum,
+									BlockNumber blocknum);
+	void		(*smgr_readv) (SMgrRelation reln, ForkNumber forknum,
+							   BlockNumber blocknum,
+							   void **buffers, BlockNumber nblocks);
+	void		(*smgr_writev) (SMgrRelation reln, ForkNumber forknum,
+								BlockNumber blocknum,
+								const void **buffers, BlockNumber nblocks,
+								bool skipFsync);
+	void		(*smgr_writeback) (SMgrRelation reln, ForkNumber forknum,
+								   BlockNumber blocknum, BlockNumber nblocks);
+	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
+	void		(*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
+								  BlockNumber old_blocks, BlockNumber nblocks);
+	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+	void		(*smgr_registersync) (SMgrRelation reln, ForkNumber forknum);
+} f_smgr;
+
+extern SMgrId smgr_register(const f_smgr *smgr, Size smgrrelation_size);
+
 extern void smgrinit(void);
 extern SMgrRelation smgropen(RelFileLocator rlocator, ProcNumber backend);
 extern bool smgrexists(SMgrRelation reln, ForkNumber forknum);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9840060997f..4b971b81ae5 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1629,6 +1629,7 @@ ManyTestResourceKind
 Material
 MaterialPath
 MaterialState
+MdSMgrRelationData
 MdfdVec
 MdPathStr
 Memoize
-- 
2.47.2
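
For readers skimming the md.c and smgr.c hunks above: the patch casts SMgrRelation to MdSMgrRelation and reaches the common fields through reln->reln, while smgropen() sizes each hash entry by the largest size passed to smgr_register(). The standalone sketch below illustrates that pattern only. All names in it (BaseRelData, DemoMdRelData, demo_register_size, demo_alloc_entry) are hypothetical stand-ins, and it assumes the common struct is the first member of the manager-private struct, which the casts in the patch imply but this excerpt does not show.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the common SMgrRelationData part. */
typedef struct BaseRelData
{
	unsigned char which;		/* storage manager selector */
	int			pincount;
} BaseRelData;

/*
 * Hypothetical manager-private entry.  The common struct is assumed to be
 * the first member, so a pointer to either type addresses the same entry.
 */
typedef struct DemoMdRelData
{
	BaseRelData reln;
	int			num_open_segs;
} DemoMdRelData;

static size_t largest_entry_size = 0;

/* Record the largest per-manager entry size seen so far. */
static void
demo_register_size(size_t entry_size)
{
	if (entry_size > largest_entry_size)
		largest_entry_size = entry_size;
}

/* Allocate a zeroed entry large enough for any registered manager. */
static BaseRelData *
demo_alloc_entry(void)
{
	return calloc(1, largest_entry_size);
}
```

A manager would call demo_register_size(sizeof(DemoMdRelData)) once at load time, and later cast the BaseRelData pointer it receives back to DemoMdRelData to reach its private fields. Sizing every entry by the maximum wastes some memory for smaller managers but keeps a single hash table.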

#16Kirill Reshke
reshkekirill@gmail.com
In reply to: Andreas Karlsson (#15)
Re: Extensible storage manager API - SMGR hook Redux

On Fri, 7 Mar 2025 at 16:52, Andreas Karlsson <andreas@proxel.se> wrote:

Hi,

Hi!

Here is a rebased version of it to make the CI happy.

Looks like CI is still unhappy with this change [0].

0001:

+
+SMgrId
+smgr_register(const f_smgr *smgr, Size smgrrelation_size)

...

+ Assert(smgr->smgr_open != NULL);
+ Assert(smgr->smgr_close != NULL);
+ Assert(smgr->smgr_create != NULL);
+ Assert(smgr->smgr_exists != NULL);
+ Assert(smgr->smgr_unlink != NULL);
+ Assert(smgr->smgr_extend != NULL);
+ Assert(smgr->smgr_zeroextend != NULL);
+ Assert(smgr->smgr_prefetch != NULL);
+ Assert(smgr->smgr_readv != NULL);
+ Assert(smgr->smgr_writev != NULL);
+ Assert(smgr->smgr_writeback != NULL);
+ Assert(smgr->smgr_nblocks != NULL);
+ Assert(smgr->smgr_truncate != NULL);
+ Assert(smgr->smgr_immedsync != NULL);

Are we sure we need to force extension authors to implement prefetch?
Also, do we intentionally skip Assert on smgr_registersync and
smgr_init here? I am not questioning smgr_shutdown here, as I can see
it is NULL for md implementation.
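
To make the question concrete, here is a standalone sketch of how an optional callback could be handled: registration validates only the required callbacks, and the dispatch wrapper tolerates a NULL prefetch. This is not code from the patch. The demo_smgr struct and the demo_* functions are hypothetical, and returning false when no prefetch callback exists is an assumption; what the fallback should actually return is a design decision for the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for f_smgr; not the real struct. */
typedef struct demo_smgr
{
	const char *name;
	int			(*nblocks) (void);			/* required */
	bool		(*prefetch) (int blocknum);	/* optional, may be NULL */
} demo_smgr;

/* Registration-time check: only required callbacks must be present. */
static bool
demo_smgr_is_valid(const demo_smgr *smgr)
{
	return smgr != NULL &&
		smgr->name != NULL && smgr->name[0] != '\0' &&
		smgr->nblocks != NULL;
}

/* Dispatch wrapper: a missing callback means no prefetch was issued. */
static bool
demo_smgr_prefetch(const demo_smgr *smgr, int blocknum)
{
	if (smgr->prefetch == NULL)
		return false;			/* assumed fallback value */
	return smgr->prefetch(blocknum);
}

static int	demo_nblocks(void) { return 8; }
static bool demo_prefetch(int blocknum) { return blocknum < 8; }

static const demo_smgr with_prefetch = {"full", demo_nblocks, demo_prefetch};
static const demo_smgr without_prefetch = {"minimal", demo_nblocks, NULL};
```

With this shape, both tables pass validation, and callers of demo_smgr_prefetch() never need to know whether the manager implements prefetching.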

0002:
Should we merge this with 0001?

0003: Looks mature, no comments.

0004:
It's a bit strange to place fsync_checker under contrib, huh? Like,
you will never use it in production. Maybe src/test/modules is a
better place?

0005:
We are missing a rationale for this change in the commit message.

I didn't look at the 0006 modifications. Later, I'll try to take another look.

[0]: https://cirrus-ci.com/task/6466113875214336
--
Best regards,
Kirill Reshke

#17vignesh C
vignesh21@gmail.com
In reply to: Andreas Karlsson (#15)
Re: Extensible storage manager API - SMGR hook Redux

On Fri, 7 Mar 2025 at 17:22, Andreas Karlsson <andreas@proxel.se> wrote:

Hi,

Here is a rebased version of it to make the CI happy. I plan to work
more on this next week but am happy with any feedback on what is already
there.

I noticed that Kirill's comments from [1] are not yet addressed, so I
have changed the commitfest entry status to "Waiting on Author". Please
address them and change the status to "Needs Review".
[1]: /messages/by-id/CALdSSPimrJWeex1RbvVXoGCROLiC6VgKUdEE0pUcib=GNYo58g@mail.gmail.com

Regards,
Vignesh