[Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Started by Dilip Kumar over 4 years ago, 280 messages

dilipbalaut@gmail.com

over 4 years ago

1 attachment(s)

Currently, CREATE DATABASE forces a checkpoint, then copies all the
files, then forces another checkpoint. The comments in the createdb()
function explain the reasons for this. The attached patch removes both
checkpoints by making CREATE DATABASE completely WAL-logged. The patch
modifies both CREATE DATABASE and ALTER DATABASE ... SET TABLESPACE to
be fully WAL-logged.

One main advantage of this change is that CREATE DATABASE becomes cheaper. Forcing
checkpoints on an idle system is no big deal, but when the system is
under heavy write load, it's very expensive. Another advantage is that
it makes things better for features like TDE, which might want the
pages in the source database to be encrypted using a different key or
nonce than the pages in the target database.

Design Idea:
-----------------
1. Create the target database directory along with the PG_VERSION file,
and WAL-log this operation.

2. Create the "relation map file" in the target database and copy the
content from the source database. For this, we can use a modified
version of write_relmap_file() and WAL-log the relmap create operation
along with the file content.

3. Read the relmap file to find the relfilenode for pg_class, then read
pg_class block by block and decode the tuples. For reading the pg_class
blocks, we can use ReadBufferWithoutRelcache() so that we don't need
the relcache. Nothing prevents us from checking visibility for tuples
in another database because CLOG is global to the cluster. And nothing
prevents us from deforming those tuples because the column definitions
for pg_class have to be the same in every database. This gives us the
relfilenode of every file we need to copy, so we can prepare a list of
all such relfilenodes.

4. For each relfilenode in the source database, create the
corresponding relfilenode in the target database (for all forks) using
smgrcreate(). WAL logging for relation creation already exists (the
patch uses RelationCreateStorage() and log_smgrcreate()).

5. Read each source relfilenode block by block using
ReadBufferWithoutRelcache(), copy each block to the target relfilenode
using smgrextend(), and WAL-log it using log_newpage(). For the source
database, we cannot use smgrread() directly, because there could be
dirty buffers; we have to read the blocks through the buffer manager
interface, otherwise we would have to flush all the dirty buffers
first.
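
The point about dirty buffers can be illustrated with a toy model. The
sketch below is not PostgreSQL code and does not use any PostgreSQL API:
ToyFork, ToyWal, toy_read_buffer() and toy_copy_fork() are hypothetical
names invented for this illustration, and the 16-byte page is an
assumption chosen to keep it small. It only shows why the copy has to
read through the cache (a dirty cached page is newer than the on-disk
page) and that one full-page record is logged per copied block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE	16		/* toy value, not PostgreSQL's block size */
#define TOY_MAX_BLOCKS	8

/* Toy relation fork: on-disk pages plus possibly newer cached copies. */
typedef struct ToyFork
{
	size_t		nblocks;
	char		disk[TOY_MAX_BLOCKS][TOY_PAGE_SIZE];
	char		cache[TOY_MAX_BLOCKS][TOY_PAGE_SIZE];
	bool		cached_dirty[TOY_MAX_BLOCKS];	/* cache[] newer than disk[]? */
} ToyFork;

/* Toy WAL: only counts full-page records (stand-in for log_newpage()). */
typedef struct ToyWal
{
	size_t		full_page_records;
} ToyWal;

/* Read "through the buffer manager": prefer the dirty cached copy. */
static const char *
toy_read_buffer(const ToyFork *src, size_t blkno)
{
	assert(blkno < src->nblocks);
	return src->cached_dirty[blkno] ? src->cache[blkno] : src->disk[blkno];
}

/* Copy one fork block by block, logging each page before writing it. */
static size_t
toy_copy_fork(const ToyFork *src, ToyFork *dst, ToyWal *wal)
{
	size_t		blkno;

	assert(src->nblocks <= TOY_MAX_BLOCKS);
	memset(dst, 0, sizeof(*dst));

	for (blkno = 0; blkno < src->nblocks; blkno++)
	{
		const char *page = toy_read_buffer(src, blkno);

		wal->full_page_records++;		/* "WAL-log the page" */
		memcpy(dst->disk[blkno], page, TOY_PAGE_SIZE);	/* "extend target" */
		dst->nblocks = blkno + 1;
	}

	return dst->nblocks;
}
```

If toy_read_buffer() returned src->disk[blkno] unconditionally, the copy
would silently pick up the stale page for any block whose cached copy is
dirty, which is the situation the old code avoided by forcing a
checkpoint first.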

WAL sequence using pg_waldump
----------------------------------------------------
1. (new wal to create db dir and write PG_VERSION file)
rmgr: Database desc: CREATE create dir 1663/16394

2. (new wal to create and write relmap file)
rmgr: RelMap desc: CREATE database 16394 tablespace 1663 size 512

3. (create relfilenode)
rmgr: Storage desc: CREATE base/16394/16384
rmgr: Storage desc: CREATE base/16394/2619

4. (write page data)
rmgr: XLOG desc: FPI , blkref #0: rel 1663/16394/2619 blk 0 FPW
rmgr: XLOG desc: FPI , blkref #0: rel 1663/16394/2619 blk 1 FPW
............

5. (create other forks)
rmgr: Storage desc: CREATE base/16394/2619_fsm
rmgr: Storage desc: CREATE base/16394/2619_vm
.............
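
For readers matching the records above against each other: the block
reference "rel 1663/16394/2619" and the path "base/16394/2619" name the
same file, since 1663 is the OID of the default tablespace (pg_default),
whose files live under base/<database OID>/. The following standalone
sketch shows that mapping. It is an illustration only: toy_relpath() and
ToyForkKind are hypothetical names, not PostgreSQL functions, and the
sketch deliberately handles only the default tablespace.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define TOY_DEFAULT_TABLESPACE_OID 1663u	/* pg_default */

/* Fork kinds, in the order of the suffix table below. */
typedef enum ToyForkKind
{
	TOY_MAIN,
	TOY_FSM,
	TOY_VM,
	TOY_INIT
} ToyForkKind;

/*
 * Format the file path for a relation fork in the default tablespace.
 * Returns 0 on success, -1 for any other tablespace or if buf is too small.
 */
static int
toy_relpath(char *buf, size_t buflen, unsigned spc, unsigned db,
			unsigned rel, ToyForkKind forknum)
{
	static const char *const suffix[] = {"", "_fsm", "_vm", "_init"};
	int			n;

	if (spc != TOY_DEFAULT_TABLESPACE_OID)
		return -1;				/* other tablespaces are out of scope here */

	assert((unsigned) forknum <= (unsigned) TOY_INIT);

	n = snprintf(buf, buflen, "base/%u/%u%s", db, rel, suffix[forknum]);
	return (n < 0 || (size_t) n >= buflen) ? -1 : 0;
}
```

Called with (1663, 16394, 2619, TOY_FSM), the sketch produces the same
"base/16394/2619_fsm" path that appears in the Storage CREATE record for
the free space map fork.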

I have attached a POC patch that demonstrates this idea. With this
patch, all basic sanity testing and "check-world" pass.

Open points:
-------------------
- This is a POC patch, so it needs more refactoring/cleanup and testing.
- The SMGR-level API usage might need another look.

Credits:
-----------
Thanks to Robert Haas for suggesting this idea and the high-level design.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

POC-0001-WAL-logged-CREATE-DATABASE.patch (text/x-patch; charset=US-ASCII)

From e472d3cb744dc45641d36e919098f9570f80a8fd Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Sat, 5 Jun 2021 17:08:13 +0530
Subject: [PATCH v1] WAL logged CREATE DATABASE

Currently, CREATE DATABASE forces a checkpoint, then copies all the files,
then forces another checkpoint. The comments in the createdb() function
explain the reasons for this. The attached patch fixes this problem by making
CREATE DATABASE completely WAL-logged so that we can avoid the checkpoints.
---
 src/backend/access/rmgrdesc/dbasedesc.c  |   3 +-
 src/backend/access/rmgrdesc/relmapdesc.c |  10 +
 src/backend/access/transam/xlogutils.c   |  12 +-
 src/backend/commands/dbcommands.c        | 653 ++++++++++++++++++++-----------
 src/backend/storage/buffer/bufmgr.c      |  13 +-
 src/backend/utils/cache/relmapper.c      | 222 +++++++----
 src/bin/pg_rewind/parsexlog.c            |   5 +
 src/include/commands/dbcommands_xlog.h   |   7 +-
 src/include/storage/bufmgr.h             |   3 +-
 src/include/utils/relmapper.h            |   6 +-
 10 files changed, 613 insertions(+), 321 deletions(-)

diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index 2660984..5010f72 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -28,8 +28,7 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;
 
-		appendStringInfo(buf, "copy dir %u/%u to %u/%u",
-						 xlrec->src_tablespace_id, xlrec->src_db_id,
+		appendStringInfo(buf, "create dir %u/%u",
 						 xlrec->tablespace_id, xlrec->db_id);
 	}
 	else if (info == XLOG_DBASE_DROP)
diff --git a/src/backend/access/rmgrdesc/relmapdesc.c b/src/backend/access/rmgrdesc/relmapdesc.c
index 2f9d4f5..9ff1aae 100644
--- a/src/backend/access/rmgrdesc/relmapdesc.c
+++ b/src/backend/access/rmgrdesc/relmapdesc.c
@@ -29,6 +29,13 @@ relmap_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "database %u tablespace %u size %u",
 						 xlrec->dbid, xlrec->tsid, xlrec->nbytes);
 	}
+	else if (info == XLOG_RELMAP_CREATE)
+	{
+		xl_relmap_update *xlrec = (xl_relmap_update *) rec;
+
+		appendStringInfo(buf, "database %u tablespace %u size %u",
+						 xlrec->dbid, xlrec->tsid, xlrec->nbytes);
+	}
 }
 
 const char *
@@ -41,6 +48,9 @@ relmap_identify(uint8 info)
 		case XLOG_RELMAP_UPDATE:
 			id = "UPDATE";
 			break;
+		case XLOG_RELMAP_CREATE:
+			id = "CREATE";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index d17d660..45bbba7 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -463,8 +463,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	if (blkno < lastblock)
 	{
 		/* page exists in file */
-		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-										   mode, NULL);
+		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno, mode, NULL,
+										   RELPERSISTENCE_PERMANENT);
 	}
 	else
 	{
@@ -488,8 +488,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 					LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 				ReleaseBuffer(buffer);
 			}
-			buffer = ReadBufferWithoutRelcache(rnode, forknum,
-											   P_NEW, mode, NULL);
+			buffer = ReadBufferWithoutRelcache(rnode, forknum, P_NEW, mode,
+											   NULL, RELPERSISTENCE_PERMANENT);
 		}
 		while (BufferGetBlockNumber(buffer) < blkno);
 		/* Handle the corner case that P_NEW returns non-consecutive pages */
@@ -498,8 +498,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 			if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			ReleaseBuffer(buffer);
-			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-											   mode, NULL);
+			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno, mode,
+											   NULL, RELPERSISTENCE_PERMANENT);
 		}
 	}
 
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 2b159b6..53f3b6e 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -36,10 +36,14 @@
 #include "catalog/indexing.h"
 #include "catalog/objectaccess.h"
 #include "catalog/pg_authid.h"
+#include "catalog/pg_auth_members.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_db_role_setting.h"
+#include "catalog/pg_proc.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_tablespace.h"
+#include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
 #include "commands/comment.h"
 #include "commands/dbcommands.h"
 #include "commands/dbcommands_xlog.h"
@@ -62,6 +66,7 @@
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/pg_locale.h"
+#include "utils/relmapper.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
@@ -77,6 +82,13 @@ typedef struct
 	Oid			dest_tsoid;		/* tablespace we are trying to move to */
 } movedb_failure_params;
 
+typedef struct RelationInfo
+{
+	RelFileNode		rnode;
+	char			relpersistence;
+} RelationInfo;
+
+
 /* non-export function prototypes */
 static void createdb_failure_callback(int code, Datum arg);
 static void movedb(const char *dbname, const char *tblspcname);
@@ -91,6 +103,387 @@ static bool have_createdb_privilege(void);
 static void remove_dbtablespaces(Oid db_id);
 static bool check_db_file_conflict(Oid db_id);
 static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
+static void CreateDatabaseDirectory(char *dbpath, Oid dbid, Oid tsid,
+									bool isRedo);
+static List *GetDatabaseValidRelList(Oid srctbid, Oid srcdbid,
+									 Oid relfilenode);
+static void RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+										   ForkNumber forkNum, char relpersistence);
+static void CopyDatabase(Oid src_dboid, Oid dboid, Oid src_tsid, Oid dst_tsid);
+
+/*
+ * CreateDatabaseDirectory - Create empty database directory and write out the
+ *							 PG_VERSION file in the database path.
+ * If isRedo is true, it's okay for the database directory to exist already.
+ */
+static void
+CreateDatabaseDirectory(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
+{
+	int		fd;
+	int		nbytes;
+	char	versionfile[MAXPGPATH];
+
+	/* Create an empty db directory */
+	if (MakePGDirectory(dbpath) < 0)
+	{
+		/* Failure other than "already exists", or not in WAL replay? */
+		if (errno != EEXIST || !isRedo)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create directory \"%s\": %m", dbpath)));
+	}
+
+	/* Create PG_VERSION file in the database path */
+	snprintf(versionfile, sizeof(versionfile), "%s/%s",
+			 dbpath, "PG_VERSION");
+
+	fd = OpenTransientFile(versionfile, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+
+	/*
+	 * If the file already exists and we are in WAL replay, then just retry
+	 * opening it in write mode.
+	 */
+	if (fd < 0 && errno == EEXIST && isRedo)
+		fd = OpenTransientFile(versionfile, O_RDWR | PG_BINARY);
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m", versionfile)));
+
+	nbytes = strlen(PG_MAJORVERSION);
+
+	/* If we are not in WAL replay then write the WAL */
+	if (!isRedo)
+	{
+		xl_dbase_create_rec xlrec;
+		XLogRecPtr	lsn;
+
+		/* now errors are fatal ... */
+		START_CRIT_SECTION();
+
+		xlrec.db_id = dbid;
+		xlrec.tablespace_id = tsid;
+		xlrec.nbytes = nbytes;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfDbaseCreateRec);
+		XLogRegisterData((char *) PG_MAJORVERSION, nbytes);
+
+		lsn = XLogInsert(RM_DBASE_ID, XLOG_DBASE_CREATE);
+
+		/* As always, WAL must hit the disk before the data update does */
+		XLogFlush(lsn);
+	}
+
+	/* Write version in the PG_VERSION file */
+	pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_WRITE);
+	errno = 0;
+	if ((int) write(fd, (char *) PG_MAJORVERSION, nbytes) != nbytes)
+	{
+		/* if write didn't set errno, assume problem is no disk space */
+		if (errno == 0)
+			errno = ENOSPC;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", versionfile)));
+	}
+	pgstat_report_wait_end();
+
+	/* Close the version file */
+	CloseTransientFile(fd);
+
+	/* Critical section done */
+	if (!isRedo)
+		END_CRIT_SECTION();
+}
+
+/*
+ * GetDatabaseValidRelList - Get list of all valid relnodes of the source db
+ *
+ * Scan the given pg_class relfilenode block by block and prepare a list of
+ * all the valid relnodes.
+ */
+static List *
+GetDatabaseValidRelList(Oid srctbid, Oid srcdbid, Oid relfilenode)
+{
+	SMgrRelation	rd_smgr;
+	RelFileNode		rnode;
+	BlockNumber		nblocks;
+	BlockNumber		blkno;
+	OffsetNumber	offnum;
+	OffsetNumber	maxoff;
+	Buffer			buf;
+	Page			page;
+	List		   *rnodelist = NIL;
+	HeapTupleData	tuple;
+	Form_pg_class	classForm;
+	BufferAccessStrategy bstrategy;
+
+	rnode.spcNode = srctbid;
+	rnode.dbNode = srcdbid;
+	rnode.relNode = relfilenode;
+
+	rd_smgr = smgropen(rnode, InvalidBackendId);
+	nblocks = smgrnblocks(rd_smgr, MAIN_FORKNUM);
+	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	/*
+	 * Process each block for the pg_class relfilenode and check for the
+	 * visible tuple.  Store the relnode of the visible tuple in the list.
+	 * Later in the caller, these relnode files will be processed and copied
+	 * to the destination block by block.
+	 */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
+										RBM_NORMAL, bstrategy,
+										RELPERSISTENCE_PERMANENT);
+
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buf);
+		if (PageIsNew(page) || PageIsEmpty(page))
+			continue;
+
+		/* Scan the page and collect the relfilenodes of visible tuples */
+		maxoff = PageGetMaxOffsetNumber(page);
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid;
+
+			itemid = PageGetItemId(page, offnum);
+
+			/* Nothing to do if slot is empty or already dead */
+			if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid) ||
+				ItemIdIsRedirected(itemid))
+				continue;
+
+			Assert(ItemIdIsNormal(itemid));
+			ItemPointerSet(&(tuple.t_self), blkno, offnum);
+			tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+			tuple.t_len = ItemIdGetLength(itemid);
+			tuple.t_tableOid = RelationRelationId;
+
+			/* Check whether the tuple is visible */
+			if (HeapTupleSatisfiesVisibility(&tuple, GetActiveSnapshot(), buf))
+			{
+				Oid				relfilenode = InvalidOid;
+				RelationInfo   *relinfo;
+
+				classForm = (Form_pg_class) GETSTRUCT(&tuple);
+
+				/* We only want to scan the object which has storage. */
+				if (!RELKIND_HAS_STORAGE(classForm->relkind))
+					continue;
+
+				/* Ignore the global objects. */
+				if (classForm->reltablespace == GLOBALTABLESPACE_OID)
+					continue;
+
+				/* Built-in oids are mapped directly */
+				if (classForm->oid < FirstGenbkiObjectId)
+					relfilenode = classForm->oid;
+				else if (OidIsValid(classForm->relfilenode))
+					relfilenode = classForm->relfilenode;
+				else
+					continue;
+
+				Assert(OidIsValid(relfilenode));
+
+				/* Prepare a rel info element and add to the list */
+				relinfo = (RelationInfo *) palloc(sizeof(RelationInfo));
+				if (OidIsValid(classForm->reltablespace))
+					relinfo->rnode.spcNode = classForm->reltablespace;
+				else
+					relinfo->rnode.spcNode = srctbid;
+
+				relinfo->rnode.dbNode = srcdbid;
+				relinfo->rnode.relNode = relfilenode;
+				relinfo->relpersistence = classForm->relpersistence;
+
+				if (rnodelist == NULL)
+					rnodelist = list_make1(relinfo);
+				else
+					rnodelist = lappend(rnodelist, relinfo);
+			}
+		}
+		UnlockReleaseBuffer(buf);
+	}
+
+	return rnodelist;
+}
+
+/*
+ * Copy a fork's data, block by block using buffers.
+ */
+void
+RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+							   ForkNumber forkNum, char relpersistence)
+{
+	Buffer		buf;
+	Page		page;
+	bool		use_wal;
+	bool		copying_initfork;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	/*
+	 * The init fork for an unlogged relation in many respects has to be
+	 * treated the same as normal relation, changes need to be WAL logged and
+	 * it needs to be synced to disk.
+	 */
+	copying_initfork = relpersistence == RELPERSISTENCE_UNLOGGED &&
+		forkNum == INIT_FORKNUM;
+
+	/*
+	 * We need to log the copied data in WAL if it's a permanent relation, or
+	 * if this is the init fork of an unlogged relation.  Note that the test
+	 * below looks only at the relation's persistence and the fork number; it
+	 * does not check whether WAL archiving/streaming is enabled.
+	 */
+	use_wal = relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork;
+
+	nblocks = smgrnblocks(src, forkNum);
+
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		/* If we got a cancel signal during the copy of the data, quit */
+		CHECK_FOR_INTERRUPTS();
+
+		buf = ReadBufferWithoutRelcache(src->smgr_rnode.node, forkNum,
+										blkno, RBM_NORMAL, bstrategy,
+										relpersistence);
+		page = BufferGetPage(buf);
+		if (PageIsNew(page) || PageIsEmpty(page))
+		{
+			ReleaseBuffer(buf);
+			continue;
+		}
+
+		/*
+		 * WAL-log the copied page. Unfortunately we don't know what kind of a
+		 * page this is, so we have to log the full page including any unused
+		 * space.
+		 */
+		if (use_wal)
+			log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page, false);
+
+		PageSetChecksumInplace(page, blkno);
+
+		/*
+		 * Now write the page.  We say skipFsync = true because there's no
+		 * need for smgr to schedule an fsync for this write; we'll do it
+		 * ourselves below.
+		 */
+		smgrextend(dst, forkNum, blkno, (char *) page, true);
+		ReleaseBuffer(buf);
+	}
+
+	/*
+	 * When we WAL-logged rel pages, we must nonetheless fsync them.  The
+	 * reason is that since we're copying outside shared buffers, a CHECKPOINT
+	 * occurring during the copy has no way to flush the previously written
+	 * data to disk (indeed it won't know the new rel even exists).  A crash
+	 * later on would replay WAL from the checkpoint, therefore it wouldn't
+	 * replay our earlier WAL entries. If we do not fsync those pages here,
+	 * they might still not be on disk when the crash occurs.
+	 */
+	if (use_wal)
+		smgrimmedsync(dst, forkNum);
+}
+
+/*
+ * Copy data logically from src database to the destination database
+ */
+static void
+CopyDatabase(Oid src_dboid, Oid dboid, Oid src_tsid, Oid dst_tsid)
+{
+	char	   *srcpath;
+	char	   *dstpath;
+	Oid			relfilenode;
+	List	   *rnodelist = NULL;
+	ListCell   *cell;
+	RelationInfo   *relinfo;
+	RelFileNode	    srcrnode;
+	RelFileNode		dstrnode;
+
+	/* Create the default tablespace destination database directory */
+	dstpath = GetDatabasePath(dboid, dst_tsid);
+
+	/* Create database directory and write PG_VERSION file */
+	CreateDatabaseDirectory(dstpath, dboid, dst_tsid, false);
+
+	/* Copy the relfilenode mapping file */
+	srcpath = GetDatabasePath(src_dboid, src_tsid);
+	CreateAndCopyRelMap(dboid, dst_tsid, srcpath, dstpath);
+
+	/* Get pg_class relfilenode */
+	relfilenode = DatabaseRelationOidToFilenode(srcpath,
+												RelationRelationId);
+
+	/* get list of all valid relnode from the source database */
+	rnodelist = GetDatabaseValidRelList(src_tsid, src_dboid,
+										relfilenode);
+	Assert(rnodelist != NIL);
+
+	/*
+	 * Process relfilenode for each file and copy block by block from source
+	 * database to the destination database.
+	 */
+	foreach(cell, rnodelist)
+	{
+		SMgrRelation	src_smgr;
+		SMgrRelation	dst_smgr;
+
+		relinfo = lfirst(cell);
+		srcrnode = relinfo->rnode;
+
+		/* Use source relnode tablespace if it's not a default table space */
+		if (srcrnode.spcNode != src_tsid)
+			dstrnode.spcNode = srcrnode.spcNode;
+		else
+			dstrnode.spcNode = dst_tsid;
+
+		dstrnode.dbNode = dboid;
+		dstrnode.relNode = srcrnode.relNode;
+
+		/* Open the source and the destination relation at smgr level */
+		src_smgr = smgropen(srcrnode, InvalidBackendId);
+		dst_smgr = smgropen(dstrnode, InvalidBackendId);
+
+		RelationCreateStorage(dstrnode, relinfo->relpersistence);
+
+		/* copy main fork */
+		RelationCopyStorageUsingBuffer(src_smgr, dst_smgr, MAIN_FORKNUM,
+									   relinfo->relpersistence);
+
+		/* copy those extra forks that exist */
+		for (ForkNumber forkNum = MAIN_FORKNUM + 1;
+			forkNum <= MAX_FORKNUM; forkNum++)
+		{
+			if (smgrexists(src_smgr, forkNum))
+			{
+				smgrcreate(dst_smgr, forkNum, false);
+
+				/*
+				 * WAL log creation if the relation is persistent, or this is the
+				 * init fork of an unlogged relation.
+				 */
+				if (relinfo->relpersistence == RELPERSISTENCE_PERMANENT ||
+					(relinfo->relpersistence == RELPERSISTENCE_UNLOGGED &&
+					forkNum == INIT_FORKNUM))
+					log_smgrcreate(&dstrnode, forkNum);
+				RelationCopyStorageUsingBuffer(src_smgr, dst_smgr,
+											   forkNum,
+											   relinfo->relpersistence);
+			}
+		}
+	}
+
+	list_free_deep(rnodelist);
+}
 
 
 /*
@@ -99,8 +492,6 @@ static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
 Oid
 createdb(ParseState *pstate, const CreatedbStmt *stmt)
 {
-	TableScanDesc scan;
-	Relation	rel;
 	Oid			src_dboid;
 	Oid			src_owner;
 	int			src_encoding = -1;
@@ -592,140 +983,19 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	/* Post creation hook for new database */
 	InvokeObjectPostCreateHook(DatabaseRelationId, dboid, 0);
 
-	/*
-	 * Force a checkpoint before starting the copy. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, if a checkpoint
-	 * happened while we're copying files, a file might be deleted just when
-	 * we're about to copy it, causing the lstat() call in copydir() to fail
-	 * with ENOENT.
-	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
-
-	/*
-	 * Once we start copying subdirectories, we need to be able to clean 'em
-	 * up if we fail.  Use an ENSURE block to make sure this happens.  (This
-	 * is not a 100% solution, because of the possibility of failure during
-	 * transaction commit after we leave this routine, but it should handle
-	 * most scenarios.)
-	 */
 	fparms.src_dboid = src_dboid;
 	fparms.dest_dboid = dboid;
 	PG_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 							PointerGetDatum(&fparms));
-	{
-		/*
-		 * Iterate through all tablespaces of the template database, and copy
-		 * each one to the new database.
-		 */
-		rel = table_open(TableSpaceRelationId, AccessShareLock);
-		scan = table_beginscan_catalog(rel, 0, NULL);
-		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
-		{
-			Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
-			Oid			srctablespace = spaceform->oid;
-			Oid			dsttablespace;
-			char	   *srcpath;
-			char	   *dstpath;
-			struct stat st;
-
-			/* No need to copy global tablespace */
-			if (srctablespace == GLOBALTABLESPACE_OID)
-				continue;
-
-			srcpath = GetDatabasePath(src_dboid, srctablespace);
-
-			if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
-				directory_is_empty(srcpath))
-			{
-				/* Assume we can ignore it */
-				pfree(srcpath);
-				continue;
-			}
-
-			if (srctablespace == src_deftablespace)
-				dsttablespace = dst_deftablespace;
-			else
-				dsttablespace = srctablespace;
-
-			dstpath = GetDatabasePath(dboid, dsttablespace);
-
-			/*
-			 * Copy this subdirectory to the new location
-			 *
-			 * We don't need to copy subdirectories
-			 */
-			copydir(srcpath, dstpath, false);
-
-			/* Record the filesystem change in XLOG */
-			{
-				xl_dbase_create_rec xlrec;
-
-				xlrec.db_id = dboid;
-				xlrec.tablespace_id = dsttablespace;
-				xlrec.src_db_id = src_dboid;
-				xlrec.src_tablespace_id = srctablespace;
-
-				XLogBeginInsert();
-				XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-				(void) XLogInsert(RM_DBASE_ID,
-								  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-			}
-		}
-		table_endscan(scan);
-		table_close(rel, AccessShareLock);
-
-		/*
-		 * We force a checkpoint before committing.  This effectively means
-		 * that committed XLOG_DBASE_CREATE operations will never need to be
-		 * replayed (at least not in ordinary crash recovery; we still have to
-		 * make the XLOG entry for the benefit of PITR operations). This
-		 * avoids two nasty scenarios:
-		 *
-		 * #1: When PITR is off, we don't XLOG the contents of newly created
-		 * indexes; therefore the drop-and-recreate-whole-directory behavior
-		 * of DBASE_CREATE replay would lose such indexes.
-		 *
-		 * #2: Since we have to recopy the source database during DBASE_CREATE
-		 * replay, we run the risk of copying changes in it that were
-		 * committed after the original CREATE DATABASE command but before the
-		 * system crash that led to the replay.  This is at least unexpected
-		 * and at worst could lead to inconsistencies, eg duplicate table
-		 * names.
-		 *
-		 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
-		 *
-		 * In PITR replay, the first of these isn't an issue, and the second
-		 * is only a risk if the CREATE DATABASE and subsequent template
-		 * database change both occur while a base backup is being taken.
-		 * There doesn't seem to be much we can do about that except document
-		 * it as a limitation.
-		 *
-		 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
-		 * we can avoid this.
-		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
-
-		/*
-		 * Close pg_database, but keep lock till commit.
-		 */
-		table_close(pg_database_rel, NoLock);
-
-		/*
-		 * Force synchronous commit, thus minimizing the window between
-		 * creation of the database files and committal of the transaction. If
-		 * we crash before committing, we'll have a DB that's taking up disk
-		 * space but is not in pg_database, which is not good.
-		 */
-		ForceSyncCommit();
-	}
+	CopyDatabase(src_dboid, dboid, src_deftablespace, dst_deftablespace);
 	PG_END_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 								PointerGetDatum(&fparms));
 
+	/*
+	 * Close pg_database, but keep lock till commit.
+	 */
+	table_close(pg_database_rel, NoLock);
+
 	return dboid;
 }
 
@@ -1220,43 +1490,12 @@ movedb(const char *dbname, const char *tblspcname)
 				 errdetail_busy_db(notherbackends, npreparedxacts)));
 
 	/*
-	 * Get old and new database paths
+	 * Get new database path
 	 */
 	src_dbpath = GetDatabasePath(db_id, src_tblspcoid);
 	dst_dbpath = GetDatabasePath(db_id, dst_tblspcoid);
 
 	/*
-	 * Force a checkpoint before proceeding. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, the check for existing
-	 * files in the target directory might fail unnecessarily, not to mention
-	 * that the copy might fail due to source files getting deleted under it.
-	 * On Windows, this also ensures that background procs don't hold any open
-	 * files, which would cause rmdir() to fail.
-	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
-
-	/*
-	 * Now drop all buffers holding data of the target database; they should
-	 * no longer be dirty so DropDatabaseBuffers is safe.
-	 *
-	 * It might seem that we could just let these buffers age out of shared
-	 * buffers naturally, since they should not get referenced anymore.  The
-	 * problem with that is that if the user later moves the database back to
-	 * its original tablespace, any still-surviving buffers would appear to
-	 * contain valid data again --- but they'd be missing any changes made in
-	 * the database while it was in the new tablespace.  In any case, freeing
-	 * buffers that should never be used again seems worth the cycles.
-	 *
-	 * Note: it'd be sufficient to get rid of buffers matching db_id and
-	 * src_tblspcoid, but bufmgr.c presently provides no API for that.
-	 */
-	DropDatabaseBuffers(db_id);
-
-	/*
 	 * Check for existence of files in the target directory, i.e., objects of
 	 * this database that are already in the target tablespace.  We can't
 	 * allow the move in such a case, because we would need to change those
@@ -1301,28 +1540,7 @@ movedb(const char *dbname, const char *tblspcname)
 	PG_ENSURE_ERROR_CLEANUP(movedb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
-		/*
-		 * Copy files from the old tablespace to the new one
-		 */
-		copydir(src_dbpath, dst_dbpath, false);
-
-		/*
-		 * Record the filesystem change in XLOG
-		 */
-		{
-			xl_dbase_create_rec xlrec;
-
-			xlrec.db_id = db_id;
-			xlrec.tablespace_id = dst_tblspcoid;
-			xlrec.src_db_id = db_id;
-			xlrec.src_tablespace_id = src_tblspcoid;
-
-			XLogBeginInsert();
-			XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-			(void) XLogInsert(RM_DBASE_ID,
-							  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-		}
+		CopyDatabase(db_id, db_id, src_tblspcoid, dst_tblspcoid);
 
 		/*
 		 * Update the database's pg_database tuple
@@ -1356,22 +1574,6 @@ movedb(const char *dbname, const char *tblspcname)
 		systable_endscan(sysscan);
 
 		/*
-		 * Force another checkpoint here.  As in CREATE DATABASE, this is to
-		 * ensure that we don't have to replay a committed XLOG_DBASE_CREATE
-		 * operation, which would cause us to lose any unlogged operations
-		 * done in the new DB tablespace before the next checkpoint.
-		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
-
-		/*
-		 * Force synchronous commit, thus minimizing the window between
-		 * copying the database files and committal of the transaction. If we
-		 * crash before committing, we'll leave an orphaned set of files on
-		 * disk, which is not fatal but not good either.
-		 */
-		ForceSyncCommit();
-
-		/*
 		 * Close pg_database, but keep lock till commit.
 		 */
 		table_close(pgdbrel, NoLock);
@@ -1380,6 +1582,23 @@ movedb(const char *dbname, const char *tblspcname)
 								PointerGetDatum(&fparms));
 
 	/*
+	 * Now drop all buffers holding data of the target database; they should
+	 * no longer be dirty so DropDatabaseBuffers is safe.
+	 *
+	 * It might seem that we could just let these buffers age out of shared
+	 * buffers naturally, since they should not get referenced anymore.  The
+	 * problem with that is that if the user later moves the database back to
+	 * its original tablespace, any still-surviving buffers would appear to
+	 * contain valid data again --- but they'd be missing any changes made in
+	 * the database while it was in the new tablespace.  In any case, freeing
+	 * buffers that should never be used again seems worth the cycles.
+	 *
+	 * Note: it'd be sufficient to get rid of buffers matching db_id and
+	 * src_tblspcoid, but bufmgr.c presently provides no API for that.
+	 */
+	DropDatabaseBuffers(db_id);
+
+	/*
 	 * Commit the transaction so that the pg_database update is committed. If
 	 * we crash while removing files, the database won't be corrupt, we'll
 	 * just leave some orphaned files in the old directory.
@@ -2183,39 +2402,11 @@ dbase_redo(XLogReaderState *record)
 	if (info == XLOG_DBASE_CREATE)
 	{
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
-		char	   *src_path;
-		char	   *dst_path;
-		struct stat st;
-
-		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
-		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		char	   *dbpath;
 
-		/*
-		 * Our theory for replaying a CREATE is to forcibly drop the target
-		 * subdirectory if present, then re-copy the source data. This may be
-		 * more work than needed, but it is simple to implement.
-		 */
-		if (stat(dst_path, &st) == 0 && S_ISDIR(st.st_mode))
-		{
-			if (!rmtree(dst_path, true))
-				/* If this failed, copydir() below is going to error. */
-				ereport(WARNING,
-						(errmsg("some useless files may be left behind in old database directory \"%s\"",
-								dst_path)));
-		}
-
-		/*
-		 * Force dirty buffers out to disk, to ensure source database is
-		 * up-to-date for the copy.
-		 */
-		FlushDatabaseBuffers(xlrec->src_db_id);
-
-		/*
-		 * Copy this subdirectory to the new location
-		 *
-		 * We don't need to copy subdirectories
-		 */
-		copydir(src_path, dst_path, false);
+		dbpath = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		CreateDatabaseDirectory(dbpath, xlrec->db_id, xlrec->tablespace_id,
+								true);
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 4b296a2..e198946 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -776,24 +776,17 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 /*
  * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
  *		a relcache entry for the relation.
- *
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay.  If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
  */
 Buffer
 ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
-						  BufferAccessStrategy strategy)
+						  BufferAccessStrategy strategy, char relpersistence)
 {
 	bool		hit;
 
 	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
 
-	Assert(InRecovery);
-
-	return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
+	return ReadBuffer_common(smgr, relpersistence, forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -803,7 +796,7 @@ ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
  *
  * *hit is set to true if the request was satisfied from shared buffer cache.
  */
-static Buffer
+Buffer
 ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 				  BlockNumber blockNum, ReadBufferMode mode,
 				  BufferAccessStrategy strategy, bool *hit)
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 424624c..58ef902 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -136,7 +136,13 @@ static void apply_map_update(RelMapFile *map, Oid relationId, Oid fileNode,
 							 bool add_okay);
 static void merge_map_updates(RelMapFile *map, const RelMapFile *updates,
 							  bool add_okay);
+static void read_relmap_file(char *mapfilename, RelMapFile *map);
 static void load_relmap_file(bool shared);
+static void write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
+									   RelMapFile *realmap, bool write_wal,
+									   bool send_sinval, bool preserve_files,
+									   Oid dbid, Oid tsid, const char *dbpath,
+									   uint8 info);
 static void write_relmap_file(bool shared, RelMapFile *newmap,
 							  bool write_wal, bool send_sinval, bool preserve_files,
 							  Oid dbid, Oid tsid, const char *dbpath);
@@ -250,6 +256,32 @@ RelationMapFilenodeToOid(Oid filenode, bool shared)
 }
 
 /*
+ * DatabaseRelationOidToFilenode
+ *
+ * Find relfilenode for the given relation id in the dbpath
+ */
+Oid
+DatabaseRelationOidToFilenode(char *dbpath, Oid relationId)
+{
+	RelMapFile	map;
+	int			i;
+	char		mapfilename[MAXPGPATH];
+
+	/* read the relmapfile from the source database */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 dbpath, RELMAPPER_FILENAME);
+	read_relmap_file(mapfilename, &map);
+
+	for (i = 0; i < map.num_mappings; i++)
+	{
+		if (relationId == map.mappings[i].mapoid)
+			return map.mappings[i].mapfilenode;
+	}
+
+	return InvalidOid;
+}
+
+/*
  * RelationMapUpdateMap
  *
  * Install a new relfilenode mapping for the specified relation.
@@ -687,36 +719,37 @@ RestoreRelationMap(char *startAddress)
 }
 
 /*
- * load_relmap_file -- load data from the shared or local map file
- *
- * Because the map file is essential for access to core system catalogs,
- * failure to read it is a fatal error.
- *
- * Note that the local case requires DatabasePath to be set up.
+ * copy relmapfile from source db path to the destination db path.
  */
-static void
-load_relmap_file(bool shared)
+void
+CreateAndCopyRelMap(Oid dbid, Oid tsid, char *srcdbpath, char *dstdbpath)
 {
-	RelMapFile *map;
+	RelMapFile	map;
 	char		mapfilename[MAXPGPATH];
+
+	LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
+
+	/* read the relmapfile from the source database */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 srcdbpath, RELMAPPER_FILENAME);
+	read_relmap_file(mapfilename, &map);
+
+	LWLockRelease(RelationMappingLock);
+
+	/* write the relmapfile of the destination database */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 dstdbpath, RELMAPPER_FILENAME);
+	write_relmap_file_internal(mapfilename, &map, &map, true, false, true,
+							   dbid, tsid, dstdbpath, XLOG_RELMAP_CREATE);
+}
+
+static void
+read_relmap_file(char *mapfilename, RelMapFile *map)
+{
 	pg_crc32c	crc;
 	int			fd;
 	int			r;
 
-	if (shared)
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
-				 RELMAPPER_FILENAME);
-		map = &shared_map;
-	}
-	else
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 DatabasePath, RELMAPPER_FILENAME);
-		map = &local_map;
-	}
-
-	/* Read data ... */
 	fd = OpenTransientFile(mapfilename, O_RDONLY | PG_BINARY);
 	if (fd < 0)
 		ereport(FATAL,
@@ -773,62 +806,44 @@ load_relmap_file(bool shared)
 }
 
 /*
- * Write out a new shared or local map file with the given contents.
- *
- * The magic number and CRC are automatically updated in *newmap.  On
- * success, we copy the data to the appropriate permanent static variable.
- *
- * If write_wal is true then an appropriate WAL message is emitted.
- * (It will be false for bootstrap and WAL replay cases.)
- *
- * If send_sinval is true then a SI invalidation message is sent.
- * (This should be true except in bootstrap case.)
+ * load_relmap_file -- load data from the shared or local map file
  *
- * If preserve_files is true then the storage manager is warned not to
- * delete the files listed in the map.
+ * Because the map file is essential for access to core system catalogs,
+ * failure to read it is a fatal error.
  *
- * Because this may be called during WAL replay when MyDatabaseId,
- * DatabasePath, etc aren't valid, we require the caller to pass in suitable
- * values.  The caller is also responsible for being sure no concurrent
- * map update could be happening.
+ * Note that the local case requires DatabasePath to be set up.
  */
 static void
-write_relmap_file(bool shared, RelMapFile *newmap,
-				  bool write_wal, bool send_sinval, bool preserve_files,
-				  Oid dbid, Oid tsid, const char *dbpath)
+load_relmap_file(bool shared)
 {
-	int			fd;
-	RelMapFile *realmap;
+	RelMapFile *map;
 	char		mapfilename[MAXPGPATH];
 
-	/*
-	 * Fill in the overhead fields and update CRC.
-	 */
-	newmap->magic = RELMAPPER_FILEMAGIC;
-	if (newmap->num_mappings < 0 || newmap->num_mappings > MAX_MAPPINGS)
-		elog(ERROR, "attempt to write bogus relation mapping");
-
-	INIT_CRC32C(newmap->crc);
-	COMP_CRC32C(newmap->crc, (char *) newmap, offsetof(RelMapFile, crc));
-	FIN_CRC32C(newmap->crc);
-
-	/*
-	 * Open the target file.  We prefer to do this before entering the
-	 * critical section, so that an open() failure need not force PANIC.
-	 */
 	if (shared)
 	{
 		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
 				 RELMAPPER_FILENAME);
-		realmap = &shared_map;
+		map = &shared_map;
 	}
 	else
 	{
 		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 dbpath, RELMAPPER_FILENAME);
-		realmap = &local_map;
+				 DatabasePath, RELMAPPER_FILENAME);
+		map = &local_map;
 	}
 
+	/* Read data ... */
+	read_relmap_file(mapfilename, map);
+}
+
+static void
+write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
+						   RelMapFile *realmap, bool write_wal,
+						   bool send_sinval, bool preserve_files, Oid dbid,
+						   Oid tsid, const char *dbpath, uint8 info)
+{
+	int			fd;
+
 	fd = OpenTransientFile(mapfilename, O_WRONLY | O_CREAT | PG_BINARY);
 	if (fd < 0)
 		ereport(ERROR,
@@ -852,7 +867,7 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		XLogRegisterData((char *) (&xlrec), MinSizeOfRelmapUpdate);
 		XLogRegisterData((char *) newmap, sizeof(RelMapFile));
 
-		lsn = XLogInsert(RM_RELMAP_ID, XLOG_RELMAP_UPDATE);
+		lsn = XLogInsert(RM_RELMAP_ID, info);
 
 		/* As always, WAL must hit the disk before the data update does */
 		XLogFlush(lsn);
@@ -944,6 +959,67 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 }
 
 /*
+ * Write out a new shared or local map file with the given contents.
+ *
+ * The magic number and CRC are automatically updated in *newmap.  On
+ * success, we copy the data to the appropriate permanent static variable.
+ *
+ * If write_wal is true then an appropriate WAL message is emitted.
+ * (It will be false for bootstrap and WAL replay cases.)
+ *
+ * If send_sinval is true then a SI invalidation message is sent.
+ * (This should be true except in bootstrap case.)
+ *
+ * If preserve_files is true then the storage manager is warned not to
+ * delete the files listed in the map.
+ *
+ * Because this may be called during WAL replay when MyDatabaseId,
+ * DatabasePath, etc aren't valid, we require the caller to pass in suitable
+ * values.  The caller is also responsible for being sure no concurrent
+ * map update could be happening.
+ */
+static void
+write_relmap_file(bool shared, RelMapFile *newmap,
+				  bool write_wal, bool send_sinval, bool preserve_files,
+				  Oid dbid, Oid tsid, const char *dbpath)
+{
+	RelMapFile *realmap;
+	char		mapfilename[MAXPGPATH];
+
+	/*
+	 * Fill in the overhead fields and update CRC.
+	 */
+	newmap->magic = RELMAPPER_FILEMAGIC;
+	if (newmap->num_mappings < 0 || newmap->num_mappings > MAX_MAPPINGS)
+		elog(ERROR, "attempt to write bogus relation mapping");
+
+	INIT_CRC32C(newmap->crc);
+	COMP_CRC32C(newmap->crc, (char *) newmap, offsetof(RelMapFile, crc));
+	FIN_CRC32C(newmap->crc);
+
+	/*
+	 * Open the target file.  We prefer to do this before entering the
+	 * critical section, so that an open() failure need not force PANIC.
+	 */
+	if (shared)
+	{
+		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
+				 RELMAPPER_FILENAME);
+		realmap = &shared_map;
+	}
+	else
+	{
+		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+				 dbpath, RELMAPPER_FILENAME);
+		realmap = &local_map;
+	}
+
+	write_relmap_file_internal(mapfilename, newmap, realmap, write_wal,
+							   send_sinval, preserve_files, dbid, tsid,
+							   dbpath, XLOG_RELMAP_UPDATE);
+}
+
+/*
  * Merge the specified updates into the appropriate "real" map,
  * and write out the changes.  This function must be used for committing
  * updates during normal multiuser operation.
@@ -1004,7 +1080,7 @@ relmap_redo(XLogReaderState *record)
 	/* Backup blocks are not used in relmap records */
 	Assert(!XLogRecHasAnyBlockRefs(record));
 
-	if (info == XLOG_RELMAP_UPDATE)
+	if ((info == XLOG_RELMAP_UPDATE) || (info == XLOG_RELMAP_CREATE))
 	{
 		xl_relmap_update *xlrec = (xl_relmap_update *) XLogRecGetData(record);
 		RelMapFile	newmap;
@@ -1027,10 +1103,22 @@ relmap_redo(XLogReaderState *record)
 		 * so we don't bother to take the RelationMappingLock.  We would need
 		 * to do so if load_relmap_file needed to interlock against writers.
 		 */
-		write_relmap_file((xlrec->dbid == InvalidOid), &newmap,
-						  false, true, false,
-						  xlrec->dbid, xlrec->tsid, dbpath);
+		if (info == XLOG_RELMAP_UPDATE)
+			write_relmap_file((xlrec->dbid == InvalidOid), &newmap,
+							false, true, false,
+							xlrec->dbid, xlrec->tsid, dbpath);
+		else
+		{
+			char		mapfilename[MAXPGPATH];
 
+			/* We need to construct the pathname for this database */
+			snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+					 dbpath, RELMAPPER_FILENAME);
+
+			write_relmap_file_internal(mapfilename, &newmap, &newmap, false,
+									  false, false, xlrec->dbid, xlrec->tsid,
+									  dbpath, 0);
+		}
 		pfree(dbpath);
 	}
 	else
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 59ebac7..189123b 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -23,6 +23,7 @@
 #include "fe_utils/archive.h"
 #include "filemap.h"
 #include "pg_rewind.h"
+#include "utils/relmapper.h"
 
 /*
  * RmgrNames is an array of resource manager names, to make error messages
@@ -390,6 +391,10 @@ extractPageInfo(XLogReaderState *record)
 		 * system. No need to do anything special here.
 		 */
 	}
+	else if (rmid == RM_RELMAP_ID && info == XLOG_RELMAP_CREATE)
+	{
+		/* ignore */
+	}
 	else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_CREATE)
 	{
 		/*
diff --git a/src/include/commands/dbcommands_xlog.h b/src/include/commands/dbcommands_xlog.h
index f5ed762..9e4e382 100644
--- a/src/include/commands/dbcommands_xlog.h
+++ b/src/include/commands/dbcommands_xlog.h
@@ -23,13 +23,14 @@
 
 typedef struct xl_dbase_create_rec
 {
-	/* Records copying of a single subdirectory incl. contents */
 	Oid			db_id;
 	Oid			tablespace_id;
-	Oid			src_db_id;
-	Oid			src_tablespace_id;
+	int32       nbytes;         /* size of version data */
+	char		version[FLEXIBLE_ARRAY_MEMBER];
 } xl_dbase_create_rec;
 
+#define MinSizeOfDbaseCreateRec offsetof(xl_dbase_create_rec, version)
+
 typedef struct xl_dbase_drop_rec
 {
 	Oid			db_id;
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index aa64fb4..bef6d6a 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -184,7 +184,8 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BufferAccessStrategy strategy);
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 										ForkNumber forkNum, BlockNumber blockNum,
-										ReadBufferMode mode, BufferAccessStrategy strategy);
+										ReadBufferMode mode, BufferAccessStrategy strategy,
+										char relpersistence);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
diff --git a/src/include/utils/relmapper.h b/src/include/utils/relmapper.h
index c0d14da..6f42ace 100644
--- a/src/include/utils/relmapper.h
+++ b/src/include/utils/relmapper.h
@@ -23,6 +23,7 @@
  */
 
 #define XLOG_RELMAP_UPDATE		0x00
+#define XLOG_RELMAP_CREATE		0x10
 
 typedef struct xl_relmap_update
 {
@@ -39,6 +40,8 @@ extern Oid	RelationMapOidToFilenode(Oid relationId, bool shared);
 
 extern Oid	RelationMapFilenodeToOid(Oid relationId, bool shared);
 
+extern Oid DatabaseRelationOidToFilenode(char *dbpath, Oid relationId);
+
 extern void RelationMapUpdateMap(Oid relationId, Oid fileNode, bool shared,
 								 bool immediate);
 
@@ -62,7 +65,8 @@ extern void RelationMapInitializePhase3(void);
 extern Size EstimateRelationMapSpace(void);
 extern void SerializeRelationMap(Size maxSize, char *startAddress);
 extern void RestoreRelationMap(char *startAddress);
-
+extern void CreateAndCopyRelMap(Oid dbid, Oid tsid, char *srcdbpath,
+								char *dstdbpath);
 extern void relmap_redo(XLogReaderState *record);
 extern void relmap_desc(StringInfo buf, XLogReaderState *record);
 extern const char *relmap_identify(uint8 info);
-- 
1.8.3.1

Julien Rouhaud

rjuju123@gmail.com

over 4 years ago

In reply to: Dilip Kumar (#1)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Tue, Jun 15, 2021 at 04:50:24PM +0530, Dilip Kumar wrote:

Currently, CREATE DATABASE forces a checkpoint, then copies all the
files, then forces another checkpoint. The comments in the createdb()
function explain the reasons for this. The attached patch fixes this
problem by making CREATE DATABASE completely WAL-logged so that now we
can avoid checkpoints. The patch modifies both CREATE DATABASE and
ALTER DATABASE..SET TABLESPACE to be fully WAL-logged.

One main advantage of this change is that it will be cheaper. Forcing
checkpoints on an idle system is no big deal, but when the system is
under heavy write load, it's very expensive. Another advantage is that
it makes things better for features like TDE, which might want the
pages in the source database to be encrypted using a different key or
nonce than the pages in the target database.

I only had a quick look at the patch but AFAICS your patch makes the new
behavior mandatory. Wouldn't it make sense to have a way to use the previous
approach? People wanting to copy a somewhat big database over a slow
replication link may prefer to pay for 2 checkpoints rather than stream
everything. Same for people who have an otherwise idle system (I often use that
to make temporary backups and/or prepare multiple datasets, and most of the time
the checkpoint is basically free).

Heikki Linnakangas

hlinnaka@iki.fi

over 4 years ago

In reply to: Dilip Kumar (#1)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On 15/06/2021 14:20, Dilip Kumar wrote:

Design Idea:
-----------------
First, create the target database directory along with the version
file and WAL-log this operation. Create the "relation map file" in
the target database and copy the content from the source database. For
this, we can use some modified versions of the write_relmap_file() and
WAL-log the relmap create operation along with the file content. Now,
read the relmap file to find the relfilenode for pg_class and then we
read pg_class block by block and decode the tuples. For reading the
pg_class blocks, we can use ReadBufferWithoutRelcache() so that we
don't need the relcache. Nothing prevents us from checking visibility
for tuples in another database because CLOG is global to the cluster.
And nothing prevents us from deforming those tuples because the column
definitions for pg_class have to be the same in every database. Then
we can get the relfilenode of every file we need to copy, and prepare
a list of all such relfilenode.

I guess that would work, but you could also walk the database directory
like copydir() does. How you find the relations to copy is orthogonal to
whether you WAL-log them or use checkpoints. And whether you use the
buffer cache is also orthogonal to the rest of the proposal; you could
issue FlushDatabaseBuffers() instead of a checkpoint.
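
For illustration, a directory walk would have to tell relation files apart
from everything else in the database directory. Below is a minimal standalone
sketch of that filter, assuming the usual on-disk naming
<relfilenode>[_fsm|_vm|_init][.<segment>]. RelFileName and
parse_relfile_name() are hypothetical names made up for this example; they are
not part of the patch or of PostgreSQL.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical types for this sketch only. */
typedef enum { FORK_MAIN, FORK_FSM, FORK_VM, FORK_INIT } ForkKind;

typedef struct
{
	unsigned long relfilenode;	/* leading numeric part of the file name */
	ForkKind	fork;			/* which fork the file belongs to */
	unsigned long segno;		/* segment number, 0 if there is no ".N" suffix */
} RelFileName;

/*
 * Return true and fill *out if name looks like
 * "<digits>[_fsm|_vm|_init][.<digits>]"; return false for anything else.
 */
bool
parse_relfile_name(const char *name, RelFileName *out)
{
	const char *p = name;
	char	   *end;

	assert(name != NULL && out != NULL);

	/* Must start with a digit: rejects PG_VERSION, pg_filenode.map, t3_16450 */
	if (!isdigit((unsigned char) *p))
		return false;
	out->relfilenode = strtoul(p, &end, 10);
	p = end;
	out->fork = FORK_MAIN;
	out->segno = 0;

	/* Optional fork suffix */
	if (*p == '_')
	{
		if (strncmp(p, "_fsm", 4) == 0)
		{
			out->fork = FORK_FSM;
			p += 4;
		}
		else if (strncmp(p, "_vm", 3) == 0)
		{
			out->fork = FORK_VM;
			p += 3;
		}
		else if (strncmp(p, "_init", 5) == 0)
		{
			out->fork = FORK_INIT;
			p += 5;
		}
		else
			return false;
	}

	/* Optional segment suffix */
	if (*p == '.')
	{
		p++;
		if (!isdigit((unsigned char) *p))
			return false;
		out->segno = strtoul(p, &end, 10);
		p = end;
	}

	/* Anything left over means this is not a relation file name */
	return *p == '\0';
}
```

Each name returned by readdir() on the database directory would go through
this filter; PG_VERSION, pg_filenode.map and temporary relation files
(t<backend>_<relfilenode>) are rejected and would need separate handling.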

Next, for each relfilenode in the
source database, create a respective relfilenode in the target
database (for all forks) using smgrcreate, which is already a
WAL-logged operation. Now read the source relfilenode block by block
using ReadBufferWithoutRelcache() and copy the block to the target
relfilenode using smgrextend() and WAL-log them using log_newpage().
For the source database, we can not directly use the smgrread(),
because there could be some dirty buffers so we will have to read them
through the buffer manager interface, otherwise, we will have to flush
all the dirty buffers.

Yeah, WAL-logging the contents of the source database would certainly be
less weird than the current system. As Julien also pointed out, the
question is, are there people using "CREATE DATABASE foo TEMPLATE
bar" to copy a large source database, on the premise that it's fast
because it skips WAL-logging?

In principle, we could have both mechanisms, and use the new WAL-logged
system if the database is small, and the old system with checkpoints if
it's large. But I don't like the idea of having to maintain both.

- Heikki

Dilip Kumar

dilipbalaut@gmail.com

over 4 years ago

In reply to: Heikki Linnakangas (#3)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Tue, Jun 15, 2021 at 5:34 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 15/06/2021 14:20, Dilip Kumar wrote:

Design Idea:

[...] Then we can get the relfilenode of every file we need to copy, and
prepare a list of all such relfilenode.

I guess that would work, but you could also walk the database directory
like copydir() does. How you find the relations to copy is orthogonal to
whether you WAL-log them or use checkpoints. And whether you use the
buffer cache is also orthogonal to the rest of the proposal; you could
issue FlushDatabaseBuffers() instead of a checkpoint.

Yeah, that would also work, but I thought that since we are already
avoiding the checkpoint, we could avoid FlushDatabaseBuffers() as well and
directly use the lower-level buffer manager API, which doesn't need the
relcache. And I am using pg_class to identify the useful relfilenodes
so that we can avoid processing unwanted ones, but yeah, I
agree that this is orthogonal to whether we use checkpoints or not.

Yeah, WAL-logging the contents of the source database would certainly be
less weird than the current system. As Julien also pointed out, the
question is, are there people using "CREATE DATABASE foo TEMPLATE
bar" to copy a large source database, on the premise that it's fast
because it skips WAL-logging?

In principle, we could have both mechanisms, and use the new WAL-logged
system if the database is small, and the old system with checkpoints if
it's large. But I don't like the idea of having to maintain both.

Yeah, I agree in some cases, where we don't have many dirty buffers,
checkpointing can be faster. I think code-wise, maintaining two
approaches will not be very difficult because the old approach
just calls copydir(), but I am wondering how we can decide which
approach is better in which scenario. I don't think we can make that call
based on the database size alone. It would also depend on many other
factors, e.g. how busy the system is and how many dirty buffers there are
in the cluster overall, because a checkpoint will affect the
performance of operations going on in other databases in the
cluster.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Adam Brusselback

adambrusselback@gmail.com

over 4 years ago

In reply to: Dilip Kumar (#4)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Am I mistaken in thinking that this would allow CREATE DATABASE to run
inside a transaction block now, further reducing the DDL commands that are
non-transactional?

Andrew Dunstan

andrew@dunslane.net

over 4 years ago

In reply to: Heikki Linnakangas (#3)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On 6/15/21 8:04 AM, Heikki Linnakangas wrote:

Yeah, WAL-logging the contents of the source database would certainly
be less weird than the current system. As Julien also pointed out, the
question is, are there people using "CREATE DATABASE foo TEMPLATE
bar" to copy a large source database, on the premise that it's fast
because it skips WAL-logging?

I'm 100% certain there are. It's not even a niche case.

In principle, we could have both mechanisms, and use the new
WAL-logged system if the database is small, and the old system with
checkpoints if it's large. But I don't like the idea of having to maintain
both.

Rather than use size, I'd be inclined to say use this if the source
database is marked as a template, and use the copydir approach for
anything that isn't.
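
That rule is simple to state in code. Here is a minimal sketch, assuming the
caller already has the source database's pg_database.datistemplate flag at
hand. The enum and function names are hypothetical and only illustrate the
decision; nothing like them exists in the posted patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the two ways of populating the new database. */
typedef enum
{
	CREATEDB_WAL_LOG,			/* copy block by block, WAL-logging every page */
	CREATEDB_FILE_COPY			/* checkpoint, copydir(), checkpoint */
} CreateDbStrategy;

/*
 * Pick the copy method from the source database's template flag:
 * templates use the WAL-logged path, everything else keeps copydir().
 */
CreateDbStrategy
choose_createdb_strategy(bool source_is_template)
{
	return source_is_template ? CREATEDB_WAL_LOG : CREATEDB_FILE_COPY;
}
```

The appeal is that the choice needs no size estimate and no tuning knob, only
a flag the catalog already stores.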

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Julien Rouhaud

rjuju123@gmail.com

over 4 years ago

In reply to: Andrew Dunstan (#6)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Tue, Jun 15, 2021 at 9:31 PM Andrew Dunstan <andrew@dunslane.net> wrote:

Rather than use size, I'd be inclined to say use this if the source
database is marked as a template, and use the copydir approach for
anything that isn't.

Looks like a good approach.

Kyotaro Horiguchi

horikyota.ntt@gmail.com

over 4 years ago

In reply to: Julien Rouhaud (#7)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

At Tue, 15 Jun 2021 22:07:32 +0800, Julien Rouhaud <rjuju123@gmail.com> wrote in

On Tue, Jun 15, 2021 at 9:31 PM Andrew Dunstan <andrew@dunslane.net> wrote:

Rather than use size, I'd be inclined to say use this if the source
database is marked as a template, and use the copydir approach for
anything that isn't.

Looks like a good approach.

If we are willing to maintain the two methods.

Couldn't we just skip the checkpoints if the database is known to be
"clean", meaning no page has been loaded for the database since
startup? We can use the "template" mark to reject connections to the
database. (I'm afraid that we should also prevent vacuum from visiting the
template databases, but...)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Julien Rouhaud

rjuju123@gmail.com

over 4 years ago

In reply to: Kyotaro Horiguchi (#8)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Jun 16, 2021 at 03:27:21PM +0900, Kyotaro Horiguchi wrote:

If we are willing to maintain the two methods.
Couldn't we just skip the checkpoints if the database is known to be
"clean", meaning no page has been loaded for the database since
startup? We can use the "template" mark to reject connections to the
database. (I'm afraid that we should also prevent vacuum from visiting the
template databases, but...)

There's already a datallowconn for that purpose. Modifying template databases
is a common practice and we shouldn't prevent that.

But the fact that a database currently doesn't accept connections doesn't mean
that there is no dirty buffer and/or pending unlink, so it doesn't look like
something that could be optimized, at least for the majority of use cases.

#10

Dilip Kumar

dilipbalaut@gmail.com

over 4 years ago

In reply to: Andrew Dunstan (#6)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Tue, Jun 15, 2021 at 7:01 PM Andrew Dunstan <andrew@dunslane.net> wrote:

Rather than use size, I'd be inclined to say use this if the source
database is marked as a template, and use the copydir approach for
anything that isn't.

Yeah, that is possible. On second thought, wouldn't it be good to
give control to the user by providing two different commands, e.g.
COPY DATABASE for the existing method (copydir) and CREATE DATABASE
for the new method (fully WAL-logged)?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#11

Andres Freund

andres@anarazel.de

over 4 years ago

In reply to: Dilip Kumar (#1)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hi,

On 2021-06-15 16:50:24 +0530, Dilip Kumar wrote:

The patch modifies both CREATE DATABASE and ALTER DATABASE..SET
TABLESPACE to be fully WAL-logged.

Generally quite a bit in favor of this - the current approach is very
heavyweight, slow and I think we have a few open corner bugs related to
it.

Design Idea:
-----------------
First, create the target database directory along with the version
file and WAL-log this operation.

What happens if you crash / promote at this point?

Create the "relation map file" in the target database and copy the
content from the source database. For this, we can use some modified
versions of the write_relmap_file() and WAL-log the relmap create
operation along with the file content. Now, read the relmap file to
find the relfilenode for pg_class and then we read pg_class block by
block and decode the tuples.

This doesn't seem like a great approach - you're not going to be able to
use much of the normal infrastructure around processing tuples. So it
seems like it'd end up with quite a bit of special case code that needs
to be maintained in parallel.

Now read the source relfilenode block by block using
ReadBufferWithoutRelcache() and copy the block to the target
relfilenode using smgrextend() and WAL-log them using log_newpage().
For the source database, we can not directly use the smgrread(),
because there could be some dirty buffers so we will have to read them
through the buffer manager interface, otherwise, we will have to flush
all the dirty buffers.

I think we might need a bit more batching for the WAL logging. There are
cases of template databases considerably bigger than the default, and the
overhead of logging each write separately seems likely to be noticeable.
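
As a rough illustration of what batching buys, the sketch below counts WAL
records and estimates WAL bytes when several pages share one record. The
per-record and per-block overheads are parameters because their exact values
are assumptions here, and the estimate ignores compression and hole removal in
page images. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BLCKSZ 8192u		/* PostgreSQL's default block size */

/* Number of WAL records needed to log npages, pages_per_record at a time. */
uint64_t
wal_records_needed(uint64_t npages, uint32_t pages_per_record)
{
	assert(pages_per_record > 0);
	/* integer ceiling of npages / pages_per_record */
	return (npages + pages_per_record - 1) / pages_per_record;
}

/*
 * Estimated WAL bytes: one fixed header per record, plus one block header
 * and one full page image per page.  Both overheads are caller-supplied
 * assumptions, not values taken from the WAL format definition.
 */
uint64_t
wal_bytes_estimate(uint64_t npages, uint32_t pages_per_record,
				   uint32_t record_overhead, uint32_t block_overhead)
{
	return wal_records_needed(npages, pages_per_record) * record_overhead
		+ npages * (uint64_t) (block_overhead + SKETCH_BLCKSZ);
}
```

For 100 pages, going from 1 to 32 pages per record cuts the record count from
100 to 4, while the byte total barely moves because the page images dominate.
Under these assumptions batching mostly saves per-record insertion work, not
WAL volume.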

Greetings,

Andres Freund

#12

Andres Freund

andres@anarazel.de

over 4 years ago

In reply to: Dilip Kumar (#4)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hi,

On 2021-06-15 18:11:23 +0530, Dilip Kumar wrote:

On Tue, Jun 15, 2021 at 5:34 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 15/06/2021 14:20, Dilip Kumar wrote:

Design Idea:

[...] Then we can get the relfilenode of every file we need to copy, and
prepare a list of all such relfilenode.

I guess that would work, but you could also walk the database directory
like copydir() does. How you find the relations to copy is orthogonal to
whether you WAL-log them or use checkpoints. And whether you use the
buffer cache is also orthogonal to the rest of the proposal; you could
issue FlushDatabaseBuffers() instead of a checkpoint.

Yeah, that would also work, but I thought that since we are already
avoiding the checkpoint, we could avoid FlushDatabaseBuffers() as well and
directly use the lower-level buffer manager API, which doesn't need the
relcache. And I am using pg_class to identify the useful relfilenodes
so that we can avoid processing unwanted ones, but yeah, I
agree that this is orthogonal to whether we use checkpoints or not.

It's not entirely obvious to me that it's important to avoid
FlushDatabaseBuffers() on its own. Forcing a checkpoint is problematic because
it unnecessarily writes out dirty buffers in other databases, triggers FPWs
etc. Normally a database used as a template won't have a meaningful amount of
dirty buffers itself, so the FlushDatabaseBuffers() shouldn't trigger a lot of
writes. Of course, there is the matter of FlushDatabaseBuffers() not being
cheap with a large shared_buffers - but I suspect that's not a huge factor
compared to the rest of the database creation cost.

I think the better argument for going through shared buffers is that it might
be worth doing so for the *target* database. A common use of frequently
creating databases, in particular with a non-default template database, is to
run regression tests with pre-created schema / data - writing out all that data
just to have it then dropped a few seconds later after the regression test
completed is wasteful.

In principle, we could have both mechanisms, and use the new WAL-logged
system if the database is small, and the old system with checkpoints if
it's large. But I don't like the idea of having to maintain both.

Yeah, I agree in some cases, where we don't have many dirty buffers,
checkpointing can be faster.

I don't think the main issue is the speed of checkpointing itself? The reason
to maintain the old paths is that the "new approach" is bloating WAL volume,
no? Right now cloning a 1TB database costs a few hundred bytes of WAL and about
1TB of write IO. With the proposed approach, the write volume approximately
doubles, because there'll also be about 1TB in WAL.
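
Spelled out as arithmetic, under the simplifying assumptions that every page
is written once to the target and, in the new approach, once more as a page
image in WAL. The extra writes a checkpoint causes for other databases are
ignored, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes written to data files and to WAL by one database copy. */
typedef struct
{
	uint64_t	data_bytes;
	uint64_t	wal_bytes;
} WriteVolume;

/* Current approach: copy the files, log one small record. */
WriteVolume
copy_with_checkpoints(uint64_t db_bytes, uint64_t small_record_bytes)
{
	WriteVolume v = {db_bytes, small_record_bytes};

	return v;
}

/* Proposed approach: every copied page also goes into WAL. */
WriteVolume
copy_fully_wal_logged(uint64_t db_bytes)
{
	WriteVolume v = {db_bytes, db_bytes};

	return v;
}

/* Total write volume of a copy. */
uint64_t
total_bytes(WriteVolume v)
{
	assert(v.data_bytes <= UINT64_MAX - v.wal_bytes);	/* no overflow */
	return v.data_bytes + v.wal_bytes;
}
```

For a 1TB source this gives roughly 1TB of writes today versus roughly 2TB
with full WAL logging, before counting what archiving and replication then
have to move.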

Greetings,

Andres Freund

#13

Tomas Vondra

tomas.vondra@enterprisedb.com

over 4 years ago

In reply to: Andrew Dunstan (#6)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On 6/15/21 3:31 PM, Andrew Dunstan wrote:

On 6/15/21 8:04 AM, Heikki Linnakangas wrote:

Yeah, WAL-logging the contents of the source database would certainly
be less weird than the current system. As Julien also pointed out, the
question is, are there people using "CREATE DATABASE foo TEMPLATE
bar" to copy a large source database, on the premise that it's fast
because it skips WAL-logging?

I'm 100% certain there are. It's not even a niche case.

In principle, we could have both mechanisms, and use the new
WAL-logged system if the database is small, and the old system with
checkpoints if it's large. But I don't like the idea of having to maintain
both.

Rather than use size, I'd be inclined to say use this if the source
database is marked as a template, and use the copydir approach for
anything that isn't.

I think we should be asking what is the benefit of that use case, and
perhaps try addressing that without having to maintain two entirely
different ways to do CREATE DATABASE. It's not like we're sure the
current code is 100% reliable in various corner cases, and I doubt
having two separate approaches will improve the situation :-/

I can see three reasons why people want to skip the WAL logging:

1) it's faster, because there's no CPU and I/O for building the WAL

I wonder if some optimization / batching could help with (1), as
suggested by Andres elsewhere in this thread.

2) it reduces the amount of WAL (could matter with large template
databases and WAL archiving, etc.)

We can't really do much about this - we need to log all the data. But
the batching from (1) might help a bit too, I guess.

3) it reduces the amount of WAL that needs to be copied to the standby,
so that there's no increase in replication lag, etc., particularly when
the network link has limited bandwidth

I think this is a more general issue - some operations may generate a
lot of WAL, and we generally assume it's better to do that rather than
hold exclusive locks for a long time. But maybe we could have some
throttling, to limit the amount of WAL per second, similar to what we
have for plain vacuum.
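One way such throttling could look is sketched below. It is loosely modeled on the accumulate-then-sleep pattern of cost-based vacuum delay, but it counts WAL bytes instead of cost units. The struct, its fields, and the function name are all hypothetical; no such setting exists in PostgreSQL.

```c
#include <assert.h>
#include <stdint.h>

typedef struct WalThrottle
{
    uint64_t limit_bytes_per_sec;   /* hypothetical setting, not a real GUC */
    uint64_t balance_bytes;         /* WAL generated since the last sleep */
    uint64_t sleep_threshold_bytes; /* accumulate this much before sleeping */
} WalThrottle;

/*
 * Account for newly generated WAL.  Returns the number of microseconds the
 * caller should sleep to stay under the limit, or 0 if the accumulated
 * balance has not reached the threshold yet.
 */
uint64_t
wal_throttle_account(WalThrottle *t, uint64_t wal_bytes)
{
    uint64_t delay_us;

    assert(t->limit_bytes_per_sec > 0);

    t->balance_bytes += wal_bytes;
    if (t->balance_bytes < t->sleep_threshold_bytes)
        return 0;

    /* Time that the accumulated WAL "should" have taken at the limit. */
    delay_us = t->balance_bytes * 1000000ULL / t->limit_bytes_per_sec;
    t->balance_bytes = 0;
    return delay_us;
}
```

With a 16MB/s limit and a 1MB threshold, the caller would be told to sleep 62500 microseconds after each megabyte of WAL. Sleeping in batches rather than after every record keeps the number of sleeps small.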

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14

Dilip Kumar

dilipbalaut@gmail.com

over 4 years ago

In reply to: Andres Freund (#11)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Jun 17, 2021 at 3:28 AM Andres Freund
<andres@anarazel.de> wrote:

Hi,

On 2021-06-15 16:50:24 +0530, Dilip Kumar wrote:

The patch modifies both CREATE DATABASE and ALTER DATABASE..SET
TABLESPACE to be fully WAL-logged.

Generally quite a bit in favor of this - the current approach is very
heavyweight, slow and I think we have a few open corner bugs related to
it.

Great!

Design Idea:
-----------------
First, create the target database directory along with the version
file and WAL-log this operation.

What happens if you crash / promote at this point?

I will check this.

Create the "relation map file" in the target database and copy the
content from the source database. For this, we can use some modified
versions of the write_relmap_file() and WAL-log the relmap create
operation along with the file content. Now, read the relmap file to
find the relfilenode for pg_class and then we read pg_class block by
block and decode the tuples.

This doesn't seem like a great approach - you're not going to be able to
use much of the normal infrastructure around processing tuples. So it
seems like it'd end up with quite a bit of special case code that needs
to be maintained in parallel.

Yeah, this needs some special-purpose code, but not too much of it. I
agree that instead of scanning pg_class we can scan all the tablespaces
and identify the source database directory under each, as we do now,
and from there copy each relfilenode block by block with WAL logging.
Honestly, both approaches seem like special-purpose code. Another
problem with directly scanning the directory is: how are we supposed to
get the "relpersistence", which is stored in the pg_class tuple?

Now read the source relfilenode block by block using
ReadBufferWithoutRelCache() and copy the block to the target
relfilenode using smgrextend() and WAL-log them using log_newpage().
For the source database, we can not directly use the smgrread(),
because there could be some dirty buffers so we will have to read them
through the buffer manager interface, otherwise, we will have to flush
all the dirty buffers.

I think we might need a bit more batching for the WAL logging. There are
cases of template databases considerably bigger than the default, and the
overhead of logging each write separately seems likely to be noticeable.

Yeah, we can do that, and instead of using log_newpage() we can use
log_newpages(), to log multiple pages at once.
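The batching idea can be sketched independently of the PostgreSQL APIs. In the sketch below, the callback stands in for whatever would emit one WAL record covering several pages; the batch size of 32 and all names are assumptions for illustration, not values taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ASSUMED_BATCH_PAGES 32  /* assumption: pages covered by one record */

/* Hypothetical stand-in for a call that logs several pages in one record. */
typedef void (*LogBatchFn) (const uint32_t *blknos, size_t npages, void *arg);

/*
 * Walk blocks [0, nblocks) and hand them to log_batch in groups of at most
 * ASSUMED_BATCH_PAGES.  Returns the number of batches, i.e. WAL records.
 */
size_t
log_pages_in_batches(uint32_t nblocks, LogBatchFn log_batch, void *arg)
{
    uint32_t batch[ASSUMED_BATCH_PAGES];
    size_t   nbatch = 0;
    size_t   nrecords = 0;

    assert(log_batch != NULL);

    for (uint32_t blkno = 0; blkno < nblocks; blkno++)
    {
        batch[nbatch++] = blkno;
        if (nbatch == ASSUMED_BATCH_PAGES)
        {
            log_batch(batch, nbatch, arg);
            nrecords++;
            nbatch = 0;
        }
    }

    /* Flush the final, possibly partial, batch. */
    if (nbatch > 0)
    {
        log_batch(batch, nbatch, arg);
        nrecords++;
    }
    return nrecords;
}

/* Example callback: counts the pages it was asked to log. */
void
count_pages(const uint32_t *blknos, size_t npages, void *arg)
{
    (void) blknos;
    *(size_t *) arg += npages;
}
```

Copying a 100-block relation this way issues 4 records instead of 100, which is where the per-record overhead saving would come from.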

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#15

Dilip Kumar

dilipbalaut@gmail.com

over 4 years ago

In reply to: Andres Freund (#12)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Jun 17, 2021 at 3:43 AM Andres Freund <andres@anarazel.de> wrote:

Yeah, that would also work, but I thought that since we are already
avoiding the checkpoint, we could avoid FlushDatabaseBuffers() as well
and directly use the lower-level buffer manager API, which doesn't need
the relcache. I am using pg_class to identify the useful relfilenodes
so that we can avoid processing unwanted ones, but yeah, I agree that
this is orthogonal to whether we use a checkpoint or not.

It's not entirely obvious to me that it's important to avoid
FlushDatabaseBuffers() on its own. Forcing a checkpoint is problematic because
it unnecessarily writes out dirty buffers in other databases, triggers FPWs
etc. Normally a database used as a template won't have a meaningful amount of
dirty buffers itself, so the FlushDatabaseBuffers() shouldn't trigger a lot of
writes. Of course, there is the matter of FlushDatabaseBuffers() not being
cheap with a large shared_buffers - but I suspect that's not a huge factor
compared to the rest of the database creation cost.

Okay, so from that POV, maybe we can just call FlushDatabaseBuffers()
and then use smgrread() calls directly.

I think the better argument for going through shared buffers is that it might
be worth doing so for the *target* database. A common use of frequently
creating databases, in particular with a non-default template database, is to
run regression tests with pre-created schema / data - writing out all that data
just to have it then dropped a few seconds later after the regression test
completed is wasteful.

Okay, I am not sure how common this use case is but for this use case
it makes sense to use bufmgr for the target database.

In principle, we could have both mechanisms, and use the new WAL-logged
system if the database is small, and the old system with checkpoints if
it's large. But I don't like the idea of having to maintain both.

Yeah, I agree in some cases, where we don't have many dirty buffers,
checkpointing can be faster.

I don't think the main issue is the speed of checkpointing itself? The reason
to maintain the old paths is that the "new approach" is bloating WAL volume,
no? Right now cloning a 1TB database costs a few hundred bytes of WAL and about
1TB of write IO. With the proposed approach, the write volume approximately
doubles, because there'll also be about 1TB in WAL.

Makes sense.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#16

Heikki Linnakangas

hlinnaka@iki.fi

over 4 years ago

In reply to: Dilip Kumar (#14)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On 17/06/2021 08:45, Dilip Kumar wrote:

Another problem with directly scanning the directory is: how are we
supposed to get the "relpersistence", which is stored in the pg_class
tuple?

You only need relpersistence if you want to use the buffer cache, right?
I think that's a good argument for not using it.

- Heikki

#17

Dilip Kumar

dilipbalaut@gmail.com

over 4 years ago

In reply to: Heikki Linnakangas (#16)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Jun 17, 2021 at 2:50 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 17/06/2021 08:45, Dilip Kumar wrote:

Another problem with directly scanning the directory is: how are we
supposed to get the "relpersistence", which is stored in the pg_class
tuple?

You only need relpersistence if you want to use the buffer cache, right?
I think that's a good argument for not using it.

Yeah, that is one place. The other place I am using it is to decide
whether to WAL-log the new page while writing into the target
relfilenode: if it is an unlogged relation, then I am not WAL-logging.
But now I think that is not the right idea; while creating the database
we should WAL-log all the pages irrespective of whether the table is
logged or unlogged.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#18

Robert Haas

robertmhaas@gmail.com

over 4 years ago

In reply to: Andres Freund (#12)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Jun 16, 2021 at 6:13 PM Andres Freund <andres@anarazel.de> wrote:

I don't think the main issue is the speed of checkpointing itself? The reason
to maintain the old paths is that the "new approach" is bloating WAL volume,
no? Right now cloning a 1TB database costs a few hundred bytes of WAL and about
1TB of write IO. With the proposed approach, the write volume approximately
doubles, because there'll also be about 1TB in WAL.

This is a good point, but on the other hand, I think this smells a lot
like the wal_level=minimal optimization where we don't need to log
data being bulk-loaded into a table created in the same transaction if
wal_level=minimal. In theory, that optimization has a lot of value,
but in practice it gets a lot of bad press on this list, because (1)
sometimes doing the fsync is more expensive than writing the extra WAL
would have been and (2) most people want to run with
wal_level=replica/logical so it ends up being a code path that isn't
used much and is therefore more likely than average to have bugs
nobody's terribly interested in fixing (except Noah ... thanks Noah!).
If we add features in the future, like TDE or perhaps incremental
backup, that rely on new pages getting new LSNs instead of recycled
ones, this may turn into the same kind of wart. And as with that
optimization, you're probably not even better off unless the database
is pretty big, and you might be worse off if you have to do fsyncs or
flush buffers synchronously. I'm not severely opposed to keeping both
methods around, so if that's really what people want to do, OK, but I
guess I wonder whether we're really going to be happy with that
decision down the road.

--
Robert Haas
EDB: http://www.enterprisedb.com

#19

Robert Haas

robertmhaas@gmail.com

over 4 years ago

In reply to: Heikki Linnakangas (#16)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Jun 17, 2021 at 5:20 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

You only need relpersistence if you want to use the buffer cache, right?
I think that's a good argument for not using it.

I think the root of the problem with this feature is that it doesn't
go through shared_buffers, so in my opinion, it would be better if we
can make it all go through shared_buffers. It seems like you're
advocating a middle ground where half of the operation goes through
shared_buffers and the other half doesn't, but that sounds like
getting rid of half of the hack when we could have gotten rid of all
of it. I think things that don't go through shared_buffers are bad,
and we should be making an effort to get rid of them where we can
reasonably do so. I believe I've both introduced and fixed my share of
bugs that were caused by such cases, and I think the behavior of the
whole system would be a lot easier to reason about if we had fewer of
those, or none.

I can also think of at least one significant advantage of driving this
off the remote database's pg_class rather than the filesystem
contents. It's a known defect of PostgreSQL that if you create a table
and then crash, you leave behind a dead file that never gets removed.
If you now copy the database that contains that orphaned file, you
would ideally prefer not to copy that file, but if you do a copy based
on the filesystem contents, then you will. If you drive the copy off
of pg_class, you won't.
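The difference between the two copy strategies can be illustrated with plain sets of relfilenode numbers. This is a toy model, not PostgreSQL code: the function names are hypothetical, and the "catalog" is just an array standing in for the relfilenodes that pg_class (plus the relation map) would report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Linear membership test; fine for a small illustrative set. */
static bool
contains(const uint32_t *set, size_t n, uint32_t value)
{
    for (size_t i = 0; i < n; i++)
        if (set[i] == value)
            return true;
    return false;
}

/*
 * Count on-disk relfilenodes that no catalog entry refers to.  A copy
 * driven by the directory listing would copy these orphans too, while a
 * copy driven by the catalog would skip them.
 */
size_t
count_orphans(const uint32_t *on_disk, size_t ndisk,
              const uint32_t *in_catalog, size_t ncatalog)
{
    size_t orphans = 0;

    assert(on_disk != NULL || ndisk == 0);

    for (size_t i = 0; i < ndisk; i++)
        if (!contains(in_catalog, ncatalog, on_disk[i]))
            orphans++;
    return orphans;
}
```

If the directory holds files 16384, 16385 and 16390 but the catalog only knows about the first two, the function reports one orphan, which is the file left behind by the crashed CREATE TABLE in the scenario above.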

--
Robert Haas
EDB: http://www.enterprisedb.com

#20

Andres Freund

andres@anarazel.de

over 4 years ago

In reply to: Robert Haas (#19)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hi,

On 2021-06-17 13:53:38 -0400, Robert Haas wrote:

On Thu, Jun 17, 2021 at 5:20 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

You only need relpersistence if you want to use the buffer cache, right?
I think that's a good argument for not using it.

Do we really need pg_class to figure this out? Can't we just check if
the relation has an init fork?
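A minimal sketch of that check, assuming we already have the list of file names in the source database directory. PostgreSQL names the init fork file "<relfilenode>_init"; the function name and the simplification of ignoring temporary relations are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PERSISTENCE_PERMANENT 'p'   /* same letter as RELPERSISTENCE_PERMANENT */
#define PERSISTENCE_UNLOGGED  'u'   /* same letter as RELPERSISTENCE_UNLOGGED */

/*
 * Infer persistence from the directory listing alone: a relation that has
 * an init fork is treated as unlogged, anything else as permanent.
 * Temporary relations are deliberately not handled in this sketch.
 */
char
infer_persistence(const char *const *filenames, size_t nfiles,
                  uint32_t relfilenode)
{
    char initname[32];

    assert(filenames != NULL || nfiles == 0);

    snprintf(initname, sizeof(initname), "%u_init",
             (unsigned int) relfilenode);

    for (size_t i = 0; i < nfiles; i++)
        if (strcmp(filenames[i], initname) == 0)
            return PERSISTENCE_UNLOGGED;

    return PERSISTENCE_PERMANENT;
}
```

Given a listing of "16384", "16384_fsm", "16390" and "16390_init", relfilenode 16390 is classified as unlogged and 16384 as permanent, without reading any pg_class tuple.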

I can also think of at least one significant advantage of driving this
off the remote database's pg_class rather than the filesystem
contents. It's a known defect of PostgreSQL that if you create a table
and then crash, you leave behind a dead file that never gets removed.
If you now copy the database that contains that orphaned file, you
would ideally prefer not to copy that file, but if you do a copy based
on the filesystem contents, then you will. If you drive the copy off
of pg_class, you won't.

I'm very unconvinced this is the place to tackle the issue of orphan
relfilenodes. It'd be one thing if it were doable by existing code,
e.g. because we supported cross-database relation accesses fully, but we
don't.

Adding a hacky special case implementation for cross-database relation
accesses that violates all kinds of assumptions (like holding a lock on
a relation when accessing it / pinning pages, processing relcache
invals, ...) doesn't seem like a good plan.

I don't think this is an academic concern: You need to read from shared
buffers to read the "remote" pg_class, otherwise you'll potentially miss
changes. But it's not correct to read in pages or to pin pages without
holding a lock, and there's code that relies on that (see
e.g. InvalidateBuffer()).

Greetings,

Andres Freund

#21

Robert Haas

robertmhaas@gmail.com

over 4 years ago

In reply to: Andres Freund (#20)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Jun 17, 2021 at 2:17 PM Andres Freund <andres@anarazel.de> wrote:

Adding a hacky special case implementation for cross-database relation
accesses that violates all kinds of assumptions (like holding a lock on
a relation when accessing it / pinning pages, processing relcache
invals, ...) doesn't seem like a good plan.

I agree that we don't want hacky code that violates assumptions, but
bypassing shared_buffers is a bit hacky, too. Can't we lock the
relations as we're copying them? We know pg_class's OID a priori,
and we can find out all the other OIDs as we go.

I'm just thinking that the hackiness of going around shared_buffers
feels irreducible, but maybe the hackiness in the patch is something
that can be solved with more engineering.

--
Robert Haas
EDB: http://www.enterprisedb.com

#22

Andres Freund

andres@anarazel.de

over 4 years ago

In reply to: Robert Haas (#21)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hi,

On 2021-06-17 14:22:52 -0400, Robert Haas wrote:

On Thu, Jun 17, 2021 at 2:17 PM Andres Freund <andres@anarazel.de> wrote:

Adding a hacky special case implementation for cross-database relation
accesses that violates all kinds of assumptions (like holding a lock on
a relation when accessing it / pinning pages, processing relcache
invals, ...) doesn't seem like a good plan.

I agree that we don't want hacky code that violates assumptions, but
bypassing shared_buffers is a bit hacky, too. Can't we lock the
relations as we're copying them? We know pg_class's OID a priori,
and we can find out all the other OIDs as we go.

We possibly can - but I'm not sure that won't end up violating some
other assumptions.

I'm just thinking that the hackiness of going around shared_buffers
feels irreducible, but maybe the hackiness in the patch is something
that can be solved with more engineering.

Which bypassing of shared buffers are you talking about here? We'd still
have to solve a subset of the issues around locking (at least on the
source side), but I don't think we need to read pg_class contents to be
able to go through shared_buffers? As I suggested, we can use the init
fork presence to infer relpersistence?

Or do you mean that looking at the filesystem at all is bypassing shared
buffers?

Greetings,

Andres Freund

#23

Robert Haas

robertmhaas@gmail.com

over 4 years ago

In reply to: Andres Freund (#22)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Jun 17, 2021 at 2:48 PM Andres Freund <andres@anarazel.de> wrote:

Or do you mean that looking at the filesystem at all is bypassing shared
buffers?

This is what I mean. I think we will end up in a better spot if we can
avoid doing that without creating too much ugliness elsewhere.

--
Robert Haas
EDB: http://www.enterprisedb.com

#24

Dilip Kumar

dilipbalaut@gmail.com

over 4 years ago

In reply to: Robert Haas (#23)

1 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Fri, Jun 18, 2021 at 12:50 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jun 17, 2021 at 2:48 PM Andres Freund <andres@anarazel.de> wrote:

Or do you mean that looking at the filesystem at all is bypassing shared
buffers?

This is what I mean. I think we will end up in a better spot if we can
avoid doing that without creating too much ugliness elsewhere.

The patch no longer applied on HEAD, so I have rebased it. Along with
that, I now use the bufmgr layer for writing/logging the destination
pages as well, instead of using the smgr layer directly.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v1-0001-WAL-logged-CREATE-DATABASE.patch (text/x-patch; charset=US-ASCII)

From 38521231f23be91d1cb92ca1d78b212320208643 Mon Sep 17 00:00:00 2001
From: dilipkumar <dilipbalaut@gmail.com>
Date: Tue, 29 Jun 2021 15:43:13 +0530
Subject: [PATCH v1] WAL logged CREATE DATABASE

Currently, CREATE DATABASE forces a checkpoint, then copies all the files,
then forces another checkpoint. The comments in the createdb() function
explain the reasons for this. The attached patch fixes this problem by making
CREATE DATABASE completely WAL-logged, so that we can avoid the checkpoints.
---
 src/backend/access/rmgrdesc/dbasedesc.c  |   3 +-
 src/backend/access/rmgrdesc/relmapdesc.c |  10 +
 src/backend/access/transam/xlogutils.c   |  12 +-
 src/backend/commands/dbcommands.c        | 655 ++++++++++++++++++++-----------
 src/backend/storage/buffer/bufmgr.c      |  13 +-
 src/backend/utils/cache/relmapper.c      | 223 +++++++----
 src/bin/pg_rewind/parsexlog.c            |   5 +
 src/include/commands/dbcommands_xlog.h   |   7 +-
 src/include/storage/bufmgr.h             |   3 +-
 src/include/utils/relmapper.h            |   6 +-
 10 files changed, 617 insertions(+), 320 deletions(-)

diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index 2660984..5010f72 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -28,8 +28,7 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;
 
-		appendStringInfo(buf, "copy dir %u/%u to %u/%u",
-						 xlrec->src_tablespace_id, xlrec->src_db_id,
+		appendStringInfo(buf, "create dir %u/%u",
 						 xlrec->tablespace_id, xlrec->db_id);
 	}
 	else if (info == XLOG_DBASE_DROP)
diff --git a/src/backend/access/rmgrdesc/relmapdesc.c b/src/backend/access/rmgrdesc/relmapdesc.c
index 2f9d4f5..470de59 100644
--- a/src/backend/access/rmgrdesc/relmapdesc.c
+++ b/src/backend/access/rmgrdesc/relmapdesc.c
@@ -29,6 +29,13 @@ relmap_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "database %u tablespace %u size %u",
 						 xlrec->dbid, xlrec->tsid, xlrec->nbytes);
 	}
+	if (info == XLOG_RELMAP_CREATE)
+	{
+		xl_relmap_update *xlrec = (xl_relmap_update *) rec;
+
+		appendStringInfo(buf, "database %u tablespace %u size %u",
+						 xlrec->dbid, xlrec->tsid, xlrec->nbytes);
+	}
 }
 
 const char *
@@ -41,6 +48,9 @@ relmap_identify(uint8 info)
 		case XLOG_RELMAP_UPDATE:
 			id = "UPDATE";
 			break;
+		case XLOG_RELMAP_CREATE:
+			id = "CREATE";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index d17d660..45bbba7 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -463,8 +463,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	if (blkno < lastblock)
 	{
 		/* page exists in file */
-		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-										   mode, NULL);
+		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno, mode, NULL,
+										   RELPERSISTENCE_PERMANENT);
 	}
 	else
 	{
@@ -488,8 +488,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 					LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 				ReleaseBuffer(buffer);
 			}
-			buffer = ReadBufferWithoutRelcache(rnode, forknum,
-											   P_NEW, mode, NULL);
+			buffer = ReadBufferWithoutRelcache(rnode, forknum, P_NEW, mode,
+											   NULL, RELPERSISTENCE_PERMANENT);
 		}
 		while (BufferGetBlockNumber(buffer) < blkno);
 		/* Handle the corner case that P_NEW returns non-consecutive pages */
@@ -498,8 +498,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 			if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			ReleaseBuffer(buffer);
-			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-											   mode, NULL);
+			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno, mode,
+											   NULL, RELPERSISTENCE_PERMANENT);
 		}
 	}
 
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 2b159b6..ced0727 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -36,10 +36,14 @@
 #include "catalog/indexing.h"
 #include "catalog/objectaccess.h"
 #include "catalog/pg_authid.h"
+#include "catalog/pg_auth_members.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_db_role_setting.h"
+#include "catalog/pg_proc.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_tablespace.h"
+#include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
 #include "commands/comment.h"
 #include "commands/dbcommands.h"
 #include "commands/dbcommands_xlog.h"
@@ -62,6 +66,7 @@
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/pg_locale.h"
+#include "utils/relmapper.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
@@ -77,6 +82,13 @@ typedef struct
 	Oid			dest_tsoid;		/* tablespace we are trying to move to */
 } movedb_failure_params;
 
+typedef struct RelationInfo
+{
+	RelFileNode		rnode;
+	char			relpersistence;
+} RelationInfo;
+
+
 /* non-export function prototypes */
 static void createdb_failure_callback(int code, Datum arg);
 static void movedb(const char *dbname, const char *tblspcname);
@@ -91,6 +103,389 @@ static bool have_createdb_privilege(void);
 static void remove_dbtablespaces(Oid db_id);
 static bool check_db_file_conflict(Oid db_id);
 static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
+static void CreateDatabaseDirectory(char *dbpath, Oid dbid, Oid tsid,
+									bool isRedo);
+static List *GetDatabaseValidRelList(Oid srctbid, Oid srcdbid,
+									 Oid relfilenode);
+void RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+									ForkNumber forkNum, char relpersistence);
+static void CopyDatabase(Oid src_dboid, Oid dboid, Oid src_tsid, Oid dst_tsid);
+
+/*
+ * CreateDatabaseDirectory - Create empty database directory and write out the
+ *							 PG_VERSION file in the database path.
+ * If isRedo is true, it's okay for the database directory to exist already.
+ */
+static void
+CreateDatabaseDirectory(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
+{
+	int		fd;
+	int		nbytes;
+	char	versionfile[MAXPGPATH];
+
+	/* Create an empty db directory */
+	if (MakePGDirectory(dbpath) < 0)
+	{
+		/* Failure other than not exists or not in WAL replay? */
+		if (errno != EEXIST || !isRedo)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create directory \"%s\": %m", dbpath)));
+	}
+
+	/* Create PG_VERSION file in the database path */
+	snprintf(versionfile, sizeof(versionfile), "%s/%s",
+			 dbpath, "PG_VERSION");
+
+	fd = OpenTransientFile(versionfile, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+
+	/*
+	 * If file already exist and we are in WAL replay then just retry to open
+	 * in write mode.
+	 */
+	if (fd < 0 && errno == EEXIST && isRedo)
+		fd = OpenTransientFile(versionfile, O_RDWR | PG_BINARY);
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m", versionfile)));
+
+	nbytes = strlen(PG_MAJORVERSION);
+
+	/* If we are not in WAL replay then write the WAL */
+	if (!isRedo)
+	{
+		xl_dbase_create_rec xlrec;
+		XLogRecPtr	lsn;
+
+		/* now errors are fatal ... */
+		START_CRIT_SECTION();
+
+		xlrec.db_id = dbid;
+		xlrec.tablespace_id = tsid;
+		xlrec.nbytes = nbytes;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfDbaseCreateRec);
+		XLogRegisterData((char *) PG_MAJORVERSION, nbytes);
+
+		lsn = XLogInsert(RM_DBASE_ID, XLOG_DBASE_CREATE);
+
+		/* As always, WAL must hit the disk before the data update does */
+		XLogFlush(lsn);
+	}
+
+	/* Write version in the PG_VERSION file */
+	pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_WRITE);
+	errno = 0;
+	if ((int) write(fd, (char *) PG_MAJORVERSION, nbytes) != nbytes)
+	{
+		/* if write didn't set errno, assume problem is no disk space */
+		if (errno == 0)
+			errno = ENOSPC;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", versionfile)));
+	}
+	pgstat_report_wait_end();
+
+	/* Close the version file */
+	CloseTransientFile(fd);
+
+	/* Critical section done */
+	if (!isRedo)
+		END_CRIT_SECTION();
+}
+
+/*
+ * GetDatabaseValidRelList - Get list of all valid relnode of the source db
+ *
+ * Process the input pg_class relfilenode and process block by block
+ * and prepare a list of all the valid relnode.
+ */
+static List *
+GetDatabaseValidRelList(Oid srctbid, Oid srcdbid, Oid relfilenode)
+{
+	SMgrRelation	rd_smgr;
+	RelFileNode		rnode;
+	BlockNumber		nblocks;
+	BlockNumber		blkno;
+	OffsetNumber	offnum;
+	OffsetNumber	maxoff;
+	Buffer			buf;
+	Page			page;
+	List		   *rnodelist = NIL;
+	HeapTupleData	tuple;
+	Form_pg_class	classForm;
+	BufferAccessStrategy bstrategy;
+
+	rnode.spcNode = srctbid;
+	rnode.dbNode = srcdbid;
+	rnode.relNode = relfilenode;
+
+	rd_smgr = smgropen(rnode, InvalidBackendId);
+	nblocks = smgrnblocks(rd_smgr, MAIN_FORKNUM);
+	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	/*
+	 * Process each block for the pg_class relfilenode and check for the
+	 * visible tuple.  Store the relnode of the visible tuple in the list.
+	 * Later in the caller, these relnode files will be processed and copied
+	 * to the destination block by block.
+	 */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
+										RBM_NORMAL, bstrategy,
+										RELPERSISTENCE_PERMANENT);
+
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buf);
+		if (PageIsNew(page) || PageIsEmpty(page))
+			continue;
+
+		maxoff = PageGetMaxOffsetNumber(page);
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid;
+
+			itemid = PageGetItemId(page, offnum);
+
+			/* Nothing to do if slot is empty or already dead */
+			if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid) ||
+				ItemIdIsRedirected(itemid))
+				continue;
+
+			Assert(ItemIdIsNormal(itemid));
+			ItemPointerSet(&(tuple.t_self), blkno, offnum);
+			tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+			tuple.t_len = ItemIdGetLength(itemid);
+			tuple.t_tableOid = RelationRelationId;
+
+			/*
+			 * If the tuple is visible then add its relfilenode info to the
+			 * list.
+			 */
+			if (HeapTupleSatisfiesVisibility(&tuple, GetActiveSnapshot(), buf))
+			{
+				Oid				relfilenode = InvalidOid;
+				RelationInfo   *relinfo;
+
+				classForm = (Form_pg_class) GETSTRUCT(&tuple);
+
+				/* Ignore global objects. */
+				if (classForm->reltablespace == GLOBALTABLESPACE_OID)
+					continue;
+
+				/* We only want to scan objects which has storage. */
+				if (!RELKIND_HAS_STORAGE(classForm->relkind))
+					continue;
+
+				/* Built-in oids are mapped directly */
+				if (classForm->oid < FirstGenbkiObjectId)
+					relfilenode = classForm->oid;
+				else if (OidIsValid(classForm->relfilenode))
+					relfilenode = classForm->relfilenode;
+				else
+					continue;
+
+				Assert(OidIsValid(relfilenode));
+
+				/* Prepare a rel info element and add to the list */
+				relinfo = (RelationInfo *) palloc(sizeof(RelationInfo));
+				if (OidIsValid(classForm->reltablespace))
+					relinfo->rnode.spcNode = classForm->reltablespace;
+				else
+					relinfo->rnode.spcNode = srctbid;
+
+				relinfo->rnode.dbNode = srcdbid;
+				relinfo->rnode.relNode = relfilenode;
+				relinfo->relpersistence = classForm->relpersistence;
+
+				if (rnodelist == NULL)
+					rnodelist = list_make1(relinfo);
+				else
+					rnodelist = lappend(rnodelist, relinfo);
+			}
+		}
+
+		/* Release buffer lock */
+		UnlockReleaseBuffer(buf);
+	}
+
+	return rnodelist;
+}
+
+/*
+ * Copy a fork's data, block by block using buffers.
+ */
+void
+RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+							   ForkNumber forkNum, char relpersistence)
+{
+	Buffer		srcBuf;
+	Buffer		dstBuf;
+	Page		srcPage;
+	Page		dstPage;
+	bool		use_wal;
+	bool		copying_initfork;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	/*
+	 * The init fork for an unlogged relation in many respects has to be
+	 * treated the same as normal relation, changes need to be WAL logged and
+	 * it needs to be synced to disk.
+	 */
+	copying_initfork = relpersistence == RELPERSISTENCE_UNLOGGED &&
+		forkNum == INIT_FORKNUM;
+
+	/*
+	 * We need to log the copied data in WAL iff WAL archiving/streaming is
+	 * enabled AND it's a permanent relation.  This gives the same answer as
+	 * "RelationNeedsWAL(rel) || copying_initfork", because we know the
+	 * current operation created a new relfilenode.
+	 */
+	use_wal = relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork;
+
+	nblocks = smgrnblocks(src, forkNum);
+
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		/* If we got a cancel signal during the copy of the data, quit */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Read block from source relation */
+		srcBuf = ReadBufferWithoutRelcache(src->smgr_rnode.node, forkNum,
+										blkno, RBM_NORMAL, bstrategy,
+										relpersistence);
+		srcPage = BufferGetPage(srcBuf);
+		if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
+		{
+			ReleaseBuffer(srcBuf);
+			continue;
+		}
+
+		/* Use P_NEW to extend the relation */
+		dstBuf = ReadBufferWithoutRelcache(dst->smgr_rnode.node, forkNum,
+										   P_NEW, RBM_NORMAL, NULL,
+										   relpersistence);
+		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
+
+		START_CRIT_SECTION();
+
+		dstPage = BufferGetPage(dstBuf);
+		PageInit(dstPage, BufferGetPageSize(dstBuf), 0);
+		memcpy(dstPage, srcPage, BLCKSZ);
+		MarkBufferDirty(dstBuf);
+
+		/* WAL-log the copied page */
+		if (use_wal)
+			log_newpage_buffer(dstBuf, true);
+
+		END_CRIT_SECTION();
+
+		PageSetChecksumInplace(dstPage, blkno);
+
+		UnlockReleaseBuffer(dstBuf);
+		ReleaseBuffer(srcBuf);
+	}
+}
+
+/*
+ * Copy data logically from src database to the destination database
+ */
+static void
+CopyDatabase(Oid src_dboid, Oid dboid, Oid src_tsid, Oid dst_tsid)
+{
+	char	   *srcpath;
+	char	   *dstpath;
+	Oid			relfilenode;
+	List	   *rnodelist = NULL;
+	ListCell   *cell;
+	RelationInfo   *relinfo;
+	RelFileNode	    srcrnode;
+	RelFileNode		dstrnode;
+
+	srcpath = GetDatabasePath(src_dboid, src_tsid);
+
+	/* Create the default tablespace destination database directory */
+	dstpath = GetDatabasePath(dboid, dst_tsid);
+
+	/* Create database directory and write PG_VERSION file */
+	CreateDatabaseDirectory(dstpath, dboid, dst_tsid, false);
+
+	/* Copy the relfilenode mapping file */
+	CreateAndCopyRelMap(dboid, dst_tsid, srcpath, dstpath);
+
+	/* Get pg_class relfilenode */
+	relfilenode = DatabaseRelationOidToFilenode(srcpath,
+												RelationRelationId);
+
+	/* get list of all valid relnode from the source database */
+	rnodelist = GetDatabaseValidRelList(src_tsid, src_dboid,
+										relfilenode);
+	Assert(rnodelist != NIL);
+
+	/*
+	* Process relfilenode for each file and copy block by block from source
+	* database to the destination database.
+	*/
+	foreach(cell, rnodelist)
+	{
+		SMgrRelation	src_smgr;
+		SMgrRelation	dst_smgr;
+
+		relinfo = lfirst(cell);
+		srcrnode = relinfo->rnode;
+
+		/* Use the source relnode's tablespace if it's not the default tablespace */
+		if (srcrnode.spcNode != src_tsid)
+			dstrnode.spcNode = srcrnode.spcNode;
+		else
+			dstrnode.spcNode = dst_tsid;
+
+		dstrnode.dbNode = dboid;
+		dstrnode.relNode = srcrnode.relNode;
+
+		/* Open the source and the destination relation at smgr level */
+		src_smgr = smgropen(srcrnode, InvalidBackendId);
+		dst_smgr = smgropen(dstrnode, InvalidBackendId);
+
+		RelationCreateStorage(dstrnode, relinfo->relpersistence);
+
+		/* copy main fork */
+		RelationCopyStorageUsingBuffer(src_smgr, dst_smgr, MAIN_FORKNUM,
+									   relinfo->relpersistence);
+
+		/* copy those extra forks that exist */
+		for (ForkNumber forkNum = MAIN_FORKNUM + 1;
+			forkNum <= MAX_FORKNUM; forkNum++)
+		{
+			if (smgrexists(src_smgr, forkNum))
+			{
+				smgrcreate(dst_smgr, forkNum, false);
+
+				/*
+				* WAL log creation if the relation is persistent, or this is the
+				* init fork of an unlogged relation.
+				*/
+				if (relinfo->relpersistence == RELPERSISTENCE_PERMANENT ||
+					(relinfo->relpersistence == RELPERSISTENCE_UNLOGGED &&
+					forkNum == INIT_FORKNUM))
+					log_smgrcreate(&dstrnode, forkNum);
+				RelationCopyStorageUsingBuffer(src_smgr, dst_smgr,
+											   forkNum,
+											   relinfo->relpersistence);
+			}
+		}
+	}
+
+	list_free_deep(rnodelist);
+}
 
 
 /*
@@ -99,8 +494,6 @@ static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
 Oid
 createdb(ParseState *pstate, const CreatedbStmt *stmt)
 {
-	TableScanDesc scan;
-	Relation	rel;
 	Oid			src_dboid;
 	Oid			src_owner;
 	int			src_encoding = -1;
@@ -592,140 +985,19 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	/* Post creation hook for new database */
 	InvokeObjectPostCreateHook(DatabaseRelationId, dboid, 0);
 
-	/*
-	 * Force a checkpoint before starting the copy. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, if a checkpoint
-	 * happened while we're copying files, a file might be deleted just when
-	 * we're about to copy it, causing the lstat() call in copydir() to fail
-	 * with ENOENT.
-	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
-
-	/*
-	 * Once we start copying subdirectories, we need to be able to clean 'em
-	 * up if we fail.  Use an ENSURE block to make sure this happens.  (This
-	 * is not a 100% solution, because of the possibility of failure during
-	 * transaction commit after we leave this routine, but it should handle
-	 * most scenarios.)
-	 */
 	fparms.src_dboid = src_dboid;
 	fparms.dest_dboid = dboid;
 	PG_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 							PointerGetDatum(&fparms));
-	{
-		/*
-		 * Iterate through all tablespaces of the template database, and copy
-		 * each one to the new database.
-		 */
-		rel = table_open(TableSpaceRelationId, AccessShareLock);
-		scan = table_beginscan_catalog(rel, 0, NULL);
-		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
-		{
-			Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
-			Oid			srctablespace = spaceform->oid;
-			Oid			dsttablespace;
-			char	   *srcpath;
-			char	   *dstpath;
-			struct stat st;
-
-			/* No need to copy global tablespace */
-			if (srctablespace == GLOBALTABLESPACE_OID)
-				continue;
-
-			srcpath = GetDatabasePath(src_dboid, srctablespace);
-
-			if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
-				directory_is_empty(srcpath))
-			{
-				/* Assume we can ignore it */
-				pfree(srcpath);
-				continue;
-			}
-
-			if (srctablespace == src_deftablespace)
-				dsttablespace = dst_deftablespace;
-			else
-				dsttablespace = srctablespace;
-
-			dstpath = GetDatabasePath(dboid, dsttablespace);
-
-			/*
-			 * Copy this subdirectory to the new location
-			 *
-			 * We don't need to copy subdirectories
-			 */
-			copydir(srcpath, dstpath, false);
-
-			/* Record the filesystem change in XLOG */
-			{
-				xl_dbase_create_rec xlrec;
-
-				xlrec.db_id = dboid;
-				xlrec.tablespace_id = dsttablespace;
-				xlrec.src_db_id = src_dboid;
-				xlrec.src_tablespace_id = srctablespace;
-
-				XLogBeginInsert();
-				XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-				(void) XLogInsert(RM_DBASE_ID,
-								  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-			}
-		}
-		table_endscan(scan);
-		table_close(rel, AccessShareLock);
-
-		/*
-		 * We force a checkpoint before committing.  This effectively means
-		 * that committed XLOG_DBASE_CREATE operations will never need to be
-		 * replayed (at least not in ordinary crash recovery; we still have to
-		 * make the XLOG entry for the benefit of PITR operations). This
-		 * avoids two nasty scenarios:
-		 *
-		 * #1: When PITR is off, we don't XLOG the contents of newly created
-		 * indexes; therefore the drop-and-recreate-whole-directory behavior
-		 * of DBASE_CREATE replay would lose such indexes.
-		 *
-		 * #2: Since we have to recopy the source database during DBASE_CREATE
-		 * replay, we run the risk of copying changes in it that were
-		 * committed after the original CREATE DATABASE command but before the
-		 * system crash that led to the replay.  This is at least unexpected
-		 * and at worst could lead to inconsistencies, eg duplicate table
-		 * names.
-		 *
-		 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
-		 *
-		 * In PITR replay, the first of these isn't an issue, and the second
-		 * is only a risk if the CREATE DATABASE and subsequent template
-		 * database change both occur while a base backup is being taken.
-		 * There doesn't seem to be much we can do about that except document
-		 * it as a limitation.
-		 *
-		 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
-		 * we can avoid this.
-		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
-
-		/*
-		 * Close pg_database, but keep lock till commit.
-		 */
-		table_close(pg_database_rel, NoLock);
-
-		/*
-		 * Force synchronous commit, thus minimizing the window between
-		 * creation of the database files and committal of the transaction. If
-		 * we crash before committing, we'll have a DB that's taking up disk
-		 * space but is not in pg_database, which is not good.
-		 */
-		ForceSyncCommit();
-	}
+	CopyDatabase(src_dboid, dboid, src_deftablespace, dst_deftablespace);
 	PG_END_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 								PointerGetDatum(&fparms));
 
+	/*
+	 * Close pg_database, but keep lock till commit.
+	 */
+	table_close(pg_database_rel, NoLock);
+
 	return dboid;
 }
 
@@ -1220,43 +1492,12 @@ movedb(const char *dbname, const char *tblspcname)
 				 errdetail_busy_db(notherbackends, npreparedxacts)));
 
 	/*
-	 * Get old and new database paths
+	 * Get new database path
 	 */
 	src_dbpath = GetDatabasePath(db_id, src_tblspcoid);
 	dst_dbpath = GetDatabasePath(db_id, dst_tblspcoid);
 
 	/*
-	 * Force a checkpoint before proceeding. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, the check for existing
-	 * files in the target directory might fail unnecessarily, not to mention
-	 * that the copy might fail due to source files getting deleted under it.
-	 * On Windows, this also ensures that background procs don't hold any open
-	 * files, which would cause rmdir() to fail.
-	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
-
-	/*
-	 * Now drop all buffers holding data of the target database; they should
-	 * no longer be dirty so DropDatabaseBuffers is safe.
-	 *
-	 * It might seem that we could just let these buffers age out of shared
-	 * buffers naturally, since they should not get referenced anymore.  The
-	 * problem with that is that if the user later moves the database back to
-	 * its original tablespace, any still-surviving buffers would appear to
-	 * contain valid data again --- but they'd be missing any changes made in
-	 * the database while it was in the new tablespace.  In any case, freeing
-	 * buffers that should never be used again seems worth the cycles.
-	 *
-	 * Note: it'd be sufficient to get rid of buffers matching db_id and
-	 * src_tblspcoid, but bufmgr.c presently provides no API for that.
-	 */
-	DropDatabaseBuffers(db_id);
-
-	/*
 	 * Check for existence of files in the target directory, i.e., objects of
 	 * this database that are already in the target tablespace.  We can't
 	 * allow the move in such a case, because we would need to change those
@@ -1301,28 +1542,7 @@ movedb(const char *dbname, const char *tblspcname)
 	PG_ENSURE_ERROR_CLEANUP(movedb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
-		/*
-		 * Copy files from the old tablespace to the new one
-		 */
-		copydir(src_dbpath, dst_dbpath, false);
-
-		/*
-		 * Record the filesystem change in XLOG
-		 */
-		{
-			xl_dbase_create_rec xlrec;
-
-			xlrec.db_id = db_id;
-			xlrec.tablespace_id = dst_tblspcoid;
-			xlrec.src_db_id = db_id;
-			xlrec.src_tablespace_id = src_tblspcoid;
-
-			XLogBeginInsert();
-			XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-			(void) XLogInsert(RM_DBASE_ID,
-							  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-		}
+		CopyDatabase(db_id, db_id, src_tblspcoid, dst_tblspcoid);
 
 		/*
 		 * Update the database's pg_database tuple
@@ -1356,22 +1576,6 @@ movedb(const char *dbname, const char *tblspcname)
 		systable_endscan(sysscan);
 
 		/*
-		 * Force another checkpoint here.  As in CREATE DATABASE, this is to
-		 * ensure that we don't have to replay a committed XLOG_DBASE_CREATE
-		 * operation, which would cause us to lose any unlogged operations
-		 * done in the new DB tablespace before the next checkpoint.
-		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
-
-		/*
-		 * Force synchronous commit, thus minimizing the window between
-		 * copying the database files and committal of the transaction. If we
-		 * crash before committing, we'll leave an orphaned set of files on
-		 * disk, which is not fatal but not good either.
-		 */
-		ForceSyncCommit();
-
-		/*
 		 * Close pg_database, but keep lock till commit.
 		 */
 		table_close(pgdbrel, NoLock);
@@ -1380,6 +1584,23 @@ movedb(const char *dbname, const char *tblspcname)
 								PointerGetDatum(&fparms));
 
 	/*
+	 * Now drop all buffers holding data of the target database; they should
+	 * no longer be dirty so DropDatabaseBuffers is safe.
+	 *
+	 * It might seem that we could just let these buffers age out of shared
+	 * buffers naturally, since they should not get referenced anymore.  The
+	 * problem with that is that if the user later moves the database back to
+	 * its original tablespace, any still-surviving buffers would appear to
+	 * contain valid data again --- but they'd be missing any changes made in
+	 * the database while it was in the new tablespace.  In any case, freeing
+	 * buffers that should never be used again seems worth the cycles.
+	 *
+	 * Note: it'd be sufficient to get rid of buffers matching db_id and
+	 * src_tblspcoid, but bufmgr.c presently provides no API for that.
+	 */
+	DropDatabaseBuffers(db_id);
+
+	/*
 	 * Commit the transaction so that the pg_database update is committed. If
 	 * we crash while removing files, the database won't be corrupt, we'll
 	 * just leave some orphaned files in the old directory.
@@ -2183,39 +2404,11 @@ dbase_redo(XLogReaderState *record)
 	if (info == XLOG_DBASE_CREATE)
 	{
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
-		char	   *src_path;
-		char	   *dst_path;
-		struct stat st;
-
-		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
-		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		char	   *dbpath;
 
-		/*
-		 * Our theory for replaying a CREATE is to forcibly drop the target
-		 * subdirectory if present, then re-copy the source data. This may be
-		 * more work than needed, but it is simple to implement.
-		 */
-		if (stat(dst_path, &st) == 0 && S_ISDIR(st.st_mode))
-		{
-			if (!rmtree(dst_path, true))
-				/* If this failed, copydir() below is going to error. */
-				ereport(WARNING,
-						(errmsg("some useless files may be left behind in old database directory \"%s\"",
-								dst_path)));
-		}
-
-		/*
-		 * Force dirty buffers out to disk, to ensure source database is
-		 * up-to-date for the copy.
-		 */
-		FlushDatabaseBuffers(xlrec->src_db_id);
-
-		/*
-		 * Copy this subdirectory to the new location
-		 *
-		 * We don't need to copy subdirectories
-		 */
-		copydir(src_path, dst_path, false);
+		dbpath = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		CreateDatabaseDirectory(dbpath, xlrec->db_id, xlrec->tablespace_id,
+								true);
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 4b296a2..e198946 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -776,24 +776,17 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 /*
  * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
  *		a relcache entry for the relation.
- *
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay.  If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
  */
 Buffer
 ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
-						  BufferAccessStrategy strategy)
+						  BufferAccessStrategy strategy, char relpersistence)
 {
 	bool		hit;
 
 	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
 
-	Assert(InRecovery);
-
-	return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
+	return ReadBuffer_common(smgr, relpersistence, forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -803,7 +796,7 @@ ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
  *
  * *hit is set to true if the request was satisfied from shared buffer cache.
  */
-static Buffer
+Buffer
 ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 				  BlockNumber blockNum, ReadBufferMode mode,
 				  BufferAccessStrategy strategy, bool *hit)
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index a6e38ad..c56c087 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -136,6 +136,13 @@ static void apply_map_update(RelMapFile *map, Oid relationId, Oid fileNode,
 							 bool add_okay);
 static void merge_map_updates(RelMapFile *map, const RelMapFile *updates,
 							  bool add_okay);
+static void read_relmap_file(char *mapfilename, RelMapFile *map,
+							 bool lock_held);
+static void write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
+									   RelMapFile *realmap, bool write_wal,
+									   bool send_sinval, bool preserve_files,
+									   Oid dbid, Oid tsid, const char *dbpath,
+									   uint8 info);
 static void load_relmap_file(bool shared, bool lock_held);
 static void write_relmap_file(bool shared, RelMapFile *newmap,
 							  bool write_wal, bool send_sinval, bool preserve_files,
@@ -250,6 +257,32 @@ RelationMapFilenodeToOid(Oid filenode, bool shared)
 }
 
 /*
+ * DatabaseRelationOidToFilenode
+ *
+ * Find relfilenode for the given relation id in the dbpath
+ */
+Oid
+DatabaseRelationOidToFilenode(char *dbpath, Oid relationId)
+{
+	RelMapFile	map;
+	int			i;
+	char		mapfilename[MAXPGPATH];
+
+	/* read the relmapfile from the source database */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 dbpath, RELMAPPER_FILENAME);
+	read_relmap_file(mapfilename, &map, false);
+
+	for (i = 0; i < map.num_mappings; i++)
+	{
+		if (relationId == map.mappings[i].mapoid)
+			return map.mappings[i].mapfilenode;
+	}
+
+	return InvalidOid;
+}
+
+/*
  * RelationMapUpdateMap
  *
  * Install a new relfilenode mapping for the specified relation.
@@ -687,36 +720,38 @@ RestoreRelationMap(char *startAddress)
 }
 
 /*
- * load_relmap_file -- load data from the shared or local map file
+ * CreateAndCopyRelMap
  *
- * Because the map file is essential for access to core system catalogs,
- * failure to read it is a fatal error.
- *
- * Note that the local case requires DatabasePath to be set up.
+ * Create and copy relmapfile from source db path to the destination db path
+ * and WAL log the operation.
  */
+void
+CreateAndCopyRelMap(Oid dbid, Oid tsid, char *srcdbpath, char *dstdbpath)
+{
+	RelMapFile map;
+	char mapfilename[MAXPGPATH];
+
+	/* read the relmapfile from the source database */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 srcdbpath, RELMAPPER_FILENAME);
+	read_relmap_file(mapfilename, &map, false);
+
+	/* write the relmapfile of the destination database */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 dstdbpath, RELMAPPER_FILENAME);
+	write_relmap_file_internal(mapfilename, &map, &map, true, false, true,
+							   dbid, tsid, dstdbpath, XLOG_RELMAP_CREATE);
+}
+
+/*
+* read_relmap_file - read the relmap file data into given map */
 static void
-load_relmap_file(bool shared, bool lock_held)
+read_relmap_file(char *mapfilename, RelMapFile *map, bool lock_held)
 {
-	RelMapFile *map;
-	char		mapfilename[MAXPGPATH];
 	pg_crc32c	crc;
 	int			fd;
 	int			r;
 
-	if (shared)
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
-				 RELMAPPER_FILENAME);
-		map = &shared_map;
-	}
-	else
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 DatabasePath, RELMAPPER_FILENAME);
-		map = &local_map;
-	}
-
-	/* Read data ... */
 	fd = OpenTransientFile(mapfilename, O_RDONLY | PG_BINARY);
 	if (fd < 0)
 		ereport(FATAL,
@@ -779,62 +814,44 @@ load_relmap_file(bool shared, bool lock_held)
 }
 
 /*
- * Write out a new shared or local map file with the given contents.
- *
- * The magic number and CRC are automatically updated in *newmap.  On
- * success, we copy the data to the appropriate permanent static variable.
- *
- * If write_wal is true then an appropriate WAL message is emitted.
- * (It will be false for bootstrap and WAL replay cases.)
- *
- * If send_sinval is true then a SI invalidation message is sent.
- * (This should be true except in bootstrap case.)
+ * load_relmap_file -- load data from the shared or local map file
  *
- * If preserve_files is true then the storage manager is warned not to
- * delete the files listed in the map.
+ * Because the map file is essential for access to core system catalogs,
+ * failure to read it is a fatal error.
  *
- * Because this may be called during WAL replay when MyDatabaseId,
- * DatabasePath, etc aren't valid, we require the caller to pass in suitable
- * values.  The caller is also responsible for being sure no concurrent
- * map update could be happening.
+ * Note that the local case requires DatabasePath to be set up.
  */
 static void
-write_relmap_file(bool shared, RelMapFile *newmap,
-				  bool write_wal, bool send_sinval, bool preserve_files,
-				  Oid dbid, Oid tsid, const char *dbpath)
+load_relmap_file(bool shared, bool lock_held)
 {
-	int			fd;
-	RelMapFile *realmap;
+	RelMapFile *map;
 	char		mapfilename[MAXPGPATH];
 
-	/*
-	 * Fill in the overhead fields and update CRC.
-	 */
-	newmap->magic = RELMAPPER_FILEMAGIC;
-	if (newmap->num_mappings < 0 || newmap->num_mappings > MAX_MAPPINGS)
-		elog(ERROR, "attempt to write bogus relation mapping");
-
-	INIT_CRC32C(newmap->crc);
-	COMP_CRC32C(newmap->crc, (char *) newmap, offsetof(RelMapFile, crc));
-	FIN_CRC32C(newmap->crc);
-
-	/*
-	 * Open the target file.  We prefer to do this before entering the
-	 * critical section, so that an open() failure need not force PANIC.
-	 */
 	if (shared)
 	{
 		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
 				 RELMAPPER_FILENAME);
-		realmap = &shared_map;
+		map = &shared_map;
 	}
 	else
 	{
 		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 dbpath, RELMAPPER_FILENAME);
-		realmap = &local_map;
+				 DatabasePath, RELMAPPER_FILENAME);
+		map = &local_map;
 	}
 
+	/* Read data ... */
+	read_relmap_file(mapfilename, map, lock_held);
+}
+
+static void
+write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
+						   RelMapFile *realmap, bool write_wal,
+						   bool send_sinval, bool preserve_files, Oid dbid,
+						   Oid tsid, const char *dbpath, uint8 info)
+{
+	int			fd;
+
 	fd = OpenTransientFile(mapfilename, O_WRONLY | O_CREAT | PG_BINARY);
 	if (fd < 0)
 		ereport(ERROR,
@@ -858,7 +875,7 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		XLogRegisterData((char *) (&xlrec), MinSizeOfRelmapUpdate);
 		XLogRegisterData((char *) newmap, sizeof(RelMapFile));
 
-		lsn = XLogInsert(RM_RELMAP_ID, XLOG_RELMAP_UPDATE);
+		lsn = XLogInsert(RM_RELMAP_ID, info);
 
 		/* As always, WAL must hit the disk before the data update does */
 		XLogFlush(lsn);
@@ -950,6 +967,67 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 }
 
 /*
+ * Write out a new shared or local map file with the given contents.
+ *
+ * The magic number and CRC are automatically updated in *newmap.  On
+ * success, we copy the data to the appropriate permanent static variable.
+ *
+ * If write_wal is true then an appropriate WAL message is emitted.
+ * (It will be false for bootstrap and WAL replay cases.)
+ *
+ * If send_sinval is true then a SI invalidation message is sent.
+ * (This should be true except in bootstrap case.)
+ *
+ * If preserve_files is true then the storage manager is warned not to
+ * delete the files listed in the map.
+ *
+ * Because this may be called during WAL replay when MyDatabaseId,
+ * DatabasePath, etc aren't valid, we require the caller to pass in suitable
+ * values.  The caller is also responsible for being sure no concurrent
+ * map update could be happening.
+ */
+static void
+write_relmap_file(bool shared, RelMapFile *newmap,
+				  bool write_wal, bool send_sinval, bool preserve_files,
+				  Oid dbid, Oid tsid, const char *dbpath)
+{
+	RelMapFile *realmap;
+	char		mapfilename[MAXPGPATH];
+
+	/*
+	 * Fill in the overhead fields and update CRC.
+	 */
+	newmap->magic = RELMAPPER_FILEMAGIC;
+	if (newmap->num_mappings < 0 || newmap->num_mappings > MAX_MAPPINGS)
+		elog(ERROR, "attempt to write bogus relation mapping");
+
+	INIT_CRC32C(newmap->crc);
+	COMP_CRC32C(newmap->crc, (char *) newmap, offsetof(RelMapFile, crc));
+	FIN_CRC32C(newmap->crc);
+
+	/*
+	 * Open the target file.  We prefer to do this before entering the
+	 * critical section, so that an open() failure need not force PANIC.
+	 */
+	if (shared)
+	{
+		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
+				 RELMAPPER_FILENAME);
+		realmap = &shared_map;
+	}
+	else
+	{
+		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+				 dbpath, RELMAPPER_FILENAME);
+		realmap = &local_map;
+	}
+
+	write_relmap_file_internal(mapfilename, newmap, realmap, write_wal,
+							   send_sinval, preserve_files, dbid, tsid,
+							   dbpath, XLOG_RELMAP_UPDATE);
+}
+
+/*
  * Merge the specified updates into the appropriate "real" map,
  * and write out the changes.  This function must be used for committing
  * updates during normal multiuser operation.
@@ -1010,7 +1088,7 @@ relmap_redo(XLogReaderState *record)
 	/* Backup blocks are not used in relmap records */
 	Assert(!XLogRecHasAnyBlockRefs(record));
 
-	if (info == XLOG_RELMAP_UPDATE)
+	if ((info == XLOG_RELMAP_UPDATE) || (info == XLOG_RELMAP_CREATE))
 	{
 		xl_relmap_update *xlrec = (xl_relmap_update *) XLogRecGetData(record);
 		RelMapFile	newmap;
@@ -1033,9 +1111,22 @@ relmap_redo(XLogReaderState *record)
 		 * but grab the lock to interlock against load_relmap_file().
 		 */
 		LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
-		write_relmap_file((xlrec->dbid == InvalidOid), &newmap,
-						  false, true, false,
-						  xlrec->dbid, xlrec->tsid, dbpath);
+		if (info == XLOG_RELMAP_UPDATE)
+			write_relmap_file((xlrec->dbid == InvalidOid), &newmap,
+							false, true, false,
+							xlrec->dbid, xlrec->tsid, dbpath);
+		else if (info == XLOG_RELMAP_CREATE)
+		{
+			char		mapfilename[MAXPGPATH];
+
+			/* We need to construct the pathname for this database */
+			snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+					 dbpath, RELMAPPER_FILENAME);
+
+			write_relmap_file_internal(mapfilename, &newmap, &newmap, false,
+									  false, false, xlrec->dbid, xlrec->tsid,
+									  dbpath, 0);
+		}
 		LWLockRelease(RelationMappingLock);
 
 		pfree(dbpath);
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 59ebac7..189123b 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -23,6 +23,7 @@
 #include "fe_utils/archive.h"
 #include "filemap.h"
 #include "pg_rewind.h"
+#include "utils/relmapper.h"
 
 /*
  * RmgrNames is an array of resource manager names, to make error messages
@@ -390,6 +391,10 @@ extractPageInfo(XLogReaderState *record)
 		 * system. No need to do anything special here.
 		 */
 	}
+	else if (rmid == RM_RELMAP_ID && info == XLOG_RELMAP_CREATE)
+	{
+		/* ignore */
+	}
 	else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_CREATE)
 	{
 		/*
diff --git a/src/include/commands/dbcommands_xlog.h b/src/include/commands/dbcommands_xlog.h
index f5ed762..9e4e382 100644
--- a/src/include/commands/dbcommands_xlog.h
+++ b/src/include/commands/dbcommands_xlog.h
@@ -23,13 +23,14 @@
 
 typedef struct xl_dbase_create_rec
 {
-	/* Records copying of a single subdirectory incl. contents */
 	Oid			db_id;
 	Oid			tablespace_id;
-	Oid			src_db_id;
-	Oid			src_tablespace_id;
+	int32       nbytes;         /* size of version data */
+	char		version[FLEXIBLE_ARRAY_MEMBER];
 } xl_dbase_create_rec;
 
+#define MinSizeOfDbaseCreateRec offsetof(xl_dbase_create_rec, version)
+
 typedef struct xl_dbase_drop_rec
 {
 	Oid			db_id;
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index aa64fb4..bef6d6a 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -184,7 +184,8 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BufferAccessStrategy strategy);
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 										ForkNumber forkNum, BlockNumber blockNum,
-										ReadBufferMode mode, BufferAccessStrategy strategy);
+										ReadBufferMode mode, BufferAccessStrategy strategy,
+										char relpersistence);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
diff --git a/src/include/utils/relmapper.h b/src/include/utils/relmapper.h
index c0d14da..6f42ace 100644
--- a/src/include/utils/relmapper.h
+++ b/src/include/utils/relmapper.h
@@ -23,6 +23,7 @@
  */
 
 #define XLOG_RELMAP_UPDATE		0x00
+#define XLOG_RELMAP_CREATE		0x10
 
 typedef struct xl_relmap_update
 {
@@ -39,6 +40,8 @@ extern Oid	RelationMapOidToFilenode(Oid relationId, bool shared);
 
 extern Oid	RelationMapFilenodeToOid(Oid relationId, bool shared);
 
+extern Oid DatabaseRelationOidToFilenode(char *dbpath, Oid relationId);
+
 extern void RelationMapUpdateMap(Oid relationId, Oid fileNode, bool shared,
 								 bool immediate);
 
@@ -62,7 +65,8 @@ extern void RelationMapInitializePhase3(void);
 extern Size EstimateRelationMapSpace(void);
 extern void SerializeRelationMap(Size maxSize, char *startAddress);
 extern void RestoreRelationMap(char *startAddress);
-
+extern void CreateAndCopyRelMap(Oid dbid, Oid tsid, char *srcdbpath,
+								char *dstdbpath);
 extern void relmap_redo(XLogReaderState *record);
 extern void relmap_desc(StringInfo buf, XLogReaderState *record);
 extern const char *relmap_identify(uint8 info);
-- 
1.8.3.1

#25

Dilip Kumar

dilipbalaut@gmail.com

over 4 years ago

In reply to: Dilip Kumar (#24)

3 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Tue, Jul 6, 2021 at 3:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Jun 18, 2021 at 12:50 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jun 17, 2021 at 2:48 PM Andres Freund <andres@anarazel.de> wrote:

Or do you mean that looking at the filesystem at all is bypassing shared
buffers?

This is what I mean. I think we will end up in a better spot if we can
avoid doing that without creating too much ugliness elsewhere.

The patch no longer applied on HEAD, so I have rebased it. Along with
that, I now use the bufmgr layer for writing/logging the destination
pages as well, instead of using the smgr layer directly.

I have done further cleanup of the patch and also divided it into 3 patches.

0001 - Currently, write_relmap_file and load_relmap_file are tightly
coupled with shared_map and local_map. As part of the higher-level
patch set we need relmap read/write interfaces that do not depend on
shared_map and local_map, and that accept the map memory as an
external parameter instead.

0002 - Add new interfaces to the relmapper: 1) Copy the relmap file
from one database path to another database path. 2) Like
RelationMapOidToFilenode, look up a relation's relfilenode, but for a
given database path instead of for the database we are connected to.
These interfaces are required by the next patch, which WAL-logs
CREATE DATABASE.

0003 - The main patch, which WAL-logs the CREATE DATABASE operation.
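
To illustrate the idea behind 0001 and 0002 outside of PostgreSQL, here
is a minimal stand-alone sketch. It is only a toy model under stated
assumptions: every Toy*/toy_* name is hypothetical, and it leaves out
the CRC check, locking, and WAL logging that the real relmapper code
performs. It shows a reader and writer that take a file path and
caller-supplied map memory, a thin wrapper that keeps the old
"shared or local" behaviour, and a lookup against an arbitrary map
file.

```c
/* Toy model of the relmap refactoring; this is NOT PostgreSQL code. */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define TOY_MAX_MAPPINGS 8
#define TOY_INVALID_OID  0u

typedef unsigned int ToyOid;

typedef struct ToyMapping
{
	ToyOid		mapoid;			/* OID of a mapped catalog */
	ToyOid		mapfilenode;	/* file that currently stores it */
} ToyMapping;

typedef struct ToyMapFile
{
	int			num_mappings;
	ToyMapping	mappings[TOY_MAX_MAPPINGS];
} ToyMapFile;

/* Process-wide maps, standing in for shared_map and local_map. */
static ToyMapFile toy_shared_map;
static ToyMapFile toy_local_map;

/* Path-based writer: touches no global state. */
static bool
toy_write_map_file(const char *path, const ToyMapFile *map)
{
	FILE	   *fp;
	bool		ok;

	assert(path != NULL && map != NULL);
	fp = fopen(path, "wb");
	if (fp == NULL)
		return false;
	ok = (fwrite(map, sizeof(*map), 1, fp) == 1);
	if (fclose(fp) != 0)
		ok = false;
	return ok;
}

/* Path-based reader: fills memory supplied by the caller. */
static bool
toy_read_map_file(const char *path, ToyMapFile *map)
{
	FILE	   *fp;
	bool		ok;

	assert(path != NULL && map != NULL);
	fp = fopen(path, "rb");
	if (fp == NULL)
		return false;
	ok = (fread(map, sizeof(*map), 1, fp) == 1);
	fclose(fp);
	/* Reject a file whose mapping count is out of range. */
	if (ok && (map->num_mappings < 0 || map->num_mappings > TOY_MAX_MAPPINGS))
		ok = false;
	return ok;
}

/* Old-style entry point: only chooses which global map to fill. */
static bool
toy_load_map_file(bool shared, const char *path)
{
	return toy_read_map_file(path, shared ? &toy_shared_map : &toy_local_map);
}

/* Look up a filenode in the map file at 'path', without using the globals. */
static ToyOid
toy_oid_to_filenode_in_file(const char *path, ToyOid relid)
{
	ToyMapFile	map;
	int			i;

	if (!toy_read_map_file(path, &map))
		return TOY_INVALID_OID;
	for (i = 0; i < map.num_mappings; i++)
	{
		if (map.mappings[i].mapoid == relid)
			return map.mappings[i].mapfilenode;
	}
	return TOY_INVALID_OID;
}
```

Because the path-based functions never touch the global maps, a backend
connected to one database could inspect the map file of another
database without disturbing its own cached state, which is what the
copy in 0003 relies on.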

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v2-0001-Refactor-relmap-load-and-relmap-write-functions.patch (text/x-patch; charset=US-ASCII)

From 2f6a5341ef7f97fc258fb7d108129a824678fd75 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Wed, 1 Sep 2021 14:06:29 +0530
Subject: [PATCH v2 1/3] Refactor relmap load and relmap write functions

Currently, write_relmap_file and load_relmap_file are tightly
coupled with shared_map and local_map.  As part of the higher
level patch set we need relmap read/write interfaces that are
not dependent upon shared_map and local_map, and we should be
able to pass map memory as an external parameter instead.
---
 src/backend/utils/cache/relmapper.c | 163 ++++++++++++++++++++++--------------
 1 file changed, 102 insertions(+), 61 deletions(-)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index a6e38ad..ae62910 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -136,6 +136,12 @@ static void apply_map_update(RelMapFile *map, Oid relationId, Oid fileNode,
 							 bool add_okay);
 static void merge_map_updates(RelMapFile *map, const RelMapFile *updates,
 							  bool add_okay);
+static void read_relmap_file(char *mapfilename, RelMapFile *map,
+							 bool lock_held);
+static void write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
+									   bool write_wal, bool send_sinval,
+									   bool preserve_files, Oid dbid, Oid tsid,
+									   const char *dbpath);
 static void load_relmap_file(bool shared, bool lock_held);
 static void write_relmap_file(bool shared, RelMapFile *newmap,
 							  bool write_wal, bool send_sinval, bool preserve_files,
@@ -692,31 +698,17 @@ RestoreRelationMap(char *startAddress)
  * Because the map file is essential for access to core system catalogs,
  * failure to read it is a fatal error.
  *
- * Note that the local case requires DatabasePath to be set up.
+ * lock_held, pass true if caller already have the relation mapping or higher
+ * level lock.
  */
 static void
-load_relmap_file(bool shared, bool lock_held)
+read_relmap_file(char *mapfilename, RelMapFile *map, bool lock_held)
 {
-	RelMapFile *map;
-	char		mapfilename[MAXPGPATH];
 	pg_crc32c	crc;
 	int			fd;
 	int			r;
 
-	if (shared)
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
-				 RELMAPPER_FILENAME);
-		map = &shared_map;
-	}
-	else
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 DatabasePath, RELMAPPER_FILENAME);
-		map = &local_map;
-	}
-
-	/* Read data ... */
+	/* Open the relmap file for reading. */
 	fd = OpenTransientFile(mapfilename, O_RDONLY | PG_BINARY);
 	if (fd < 0)
 		ereport(FATAL,
@@ -779,62 +771,53 @@ load_relmap_file(bool shared, bool lock_held)
 }
 
 /*
- * Write out a new shared or local map file with the given contents.
- *
- * The magic number and CRC are automatically updated in *newmap.  On
- * success, we copy the data to the appropriate permanent static variable.
- *
- * If write_wal is true then an appropriate WAL message is emitted.
- * (It will be false for bootstrap and WAL replay cases.)
- *
- * If send_sinval is true then a SI invalidation message is sent.
- * (This should be true except in bootstrap case.)
+ * load_relmap_file -- load data from the shared or local map file
  *
- * If preserve_files is true then the storage manager is warned not to
- * delete the files listed in the map.
+ * Because the map file is essential for access to core system catalogs,
+ * failure to read it is a fatal error.
  *
- * Because this may be called during WAL replay when MyDatabaseId,
- * DatabasePath, etc aren't valid, we require the caller to pass in suitable
- * values.  The caller is also responsible for being sure no concurrent
- * map update could be happening.
+ * Note that the local case requires DatabasePath to be set up.
  */
 static void
-write_relmap_file(bool shared, RelMapFile *newmap,
-				  bool write_wal, bool send_sinval, bool preserve_files,
-				  Oid dbid, Oid tsid, const char *dbpath)
+load_relmap_file(bool shared, bool lock_held)
 {
-	int			fd;
-	RelMapFile *realmap;
+	RelMapFile *map;
 	char		mapfilename[MAXPGPATH];
 
-	/*
-	 * Fill in the overhead fields and update CRC.
-	 */
-	newmap->magic = RELMAPPER_FILEMAGIC;
-	if (newmap->num_mappings < 0 || newmap->num_mappings > MAX_MAPPINGS)
-		elog(ERROR, "attempt to write bogus relation mapping");
-
-	INIT_CRC32C(newmap->crc);
-	COMP_CRC32C(newmap->crc, (char *) newmap, offsetof(RelMapFile, crc));
-	FIN_CRC32C(newmap->crc);
-
-	/*
-	 * Open the target file.  We prefer to do this before entering the
-	 * critical section, so that an open() failure need not force PANIC.
-	 */
 	if (shared)
 	{
 		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
 				 RELMAPPER_FILENAME);
-		realmap = &shared_map;
+		map = &shared_map;
 	}
 	else
 	{
 		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 dbpath, RELMAPPER_FILENAME);
-		realmap = &local_map;
+				 DatabasePath, RELMAPPER_FILENAME);
+		map = &local_map;
 	}
 
+	/* Read data ... */
+	read_relmap_file(mapfilename, map, lock_held);
+}
+
+/*
+ * Helper function for write_relmap_file, Read comments atop write_relmap_file
+ * for more details.  The CRC should be computed by the caller and stored in
+ * the newmap.
+ */
+static void
+write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
+						   bool write_wal, bool send_sinval,
+						   bool preserve_files, Oid dbid, Oid tsid,
+						   const char *dbpath)
+{
+	int			fd;
+
+	/*
+	 * Open the target file.  We prefer to do this before entering the
+	 * critical section, so that an open() failure need not force PANIC.
+	 */
 	fd = OpenTransientFile(mapfilename, O_WRONLY | O_CREAT | PG_BINARY);
 	if (fd < 0)
 		ereport(ERROR,
@@ -934,6 +917,68 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		}
 	}
 
+	/* Critical section done */
+	if (write_wal)
+		END_CRIT_SECTION();
+}
+
+/*
+ * Write out a new shared or local map file with the given contents.
+ *
+ * The magic number and CRC are automatically updated in *newmap.  On
+ * success, we copy the data to the appropriate permanent static variable.
+ *
+ * If write_wal is true then an appropriate WAL message is emitted.
+ * (It will be false for bootstrap and WAL replay cases.)
+ *
+ * If send_sinval is true then a SI invalidation message is sent.
+ * (This should be true except in bootstrap case.)
+ *
+ * If preserve_files is true then the storage manager is warned not to
+ * delete the files listed in the map.
+ *
+ * Because this may be called during WAL replay when MyDatabaseId,
+ * DatabasePath, etc aren't valid, we require the caller to pass in suitable
+ * values.  The caller is also responsible for being sure no concurrent
+ * map update could be happening.
+ */
+static void
+write_relmap_file(bool shared, RelMapFile *newmap,
+				  bool write_wal, bool send_sinval, bool preserve_files,
+				  Oid dbid, Oid tsid, const char *dbpath)
+{
+	RelMapFile *realmap;
+	char		mapfilename[MAXPGPATH];
+
+	/*
+	 * Fill in the overhead fields and update CRC.
+	 */
+	newmap->magic = RELMAPPER_FILEMAGIC;
+	if (newmap->num_mappings < 0 || newmap->num_mappings > MAX_MAPPINGS)
+		elog(ERROR, "attempt to write bogus relation mapping");
+
+	INIT_CRC32C(newmap->crc);
+	COMP_CRC32C(newmap->crc, (char *) newmap, offsetof(RelMapFile, crc));
+	FIN_CRC32C(newmap->crc);
+
+	if (shared)
+	{
+		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
+				 RELMAPPER_FILENAME);
+		realmap = &shared_map;
+	}
+	else
+	{
+		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+				 dbpath, RELMAPPER_FILENAME);
+		realmap = &local_map;
+	}
+
+	/* Write the map to the relmap file. */
+	write_relmap_file_internal(mapfilename, newmap, write_wal,
+							   send_sinval, preserve_files, dbid, tsid,
+							   dbpath);
+
 	/*
 	 * Success, update permanent copy.  During bootstrap, we might be working
 	 * on the permanent copy itself, in which case skip the memcpy() to avoid
@@ -943,10 +988,6 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		memcpy(realmap, newmap, sizeof(RelMapFile));
 	else
 		Assert(!send_sinval);	/* must be bootstrapping */
-
-	/* Critical section done */
-	if (write_wal)
-		END_CRIT_SECTION();
 }
 
 /*
-- 
1.8.3.1

v2-0002-Extend-relmap-interfaces.patch (text/x-patch; charset=US-ASCII)

From 50761c0caf60b17be99c4ba063369d74e75c77fa Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Wed, 1 Sep 2021 14:16:35 +0530
Subject: [PATCH v2 2/3] Extend relmap interfaces

Support new interfaces in relmapper: 1) Support copying the
relmap file from one database path to another database path.
2) Like RelationMapOidToFilenode, provide another interface
which does the same lookup, but for the input database path
instead of the database we are connected to.

These interfaces are required by the next patch, which adds
WAL-logged CREATE DATABASE.
---
 src/backend/utils/cache/relmapper.c | 131 +++++++++++++++++++++++++++++++-----
 src/include/utils/relmapper.h       |   6 +-
 2 files changed, 119 insertions(+), 18 deletions(-)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index ae62910..182054e 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -141,7 +141,7 @@ static void read_relmap_file(char *mapfilename, RelMapFile *map,
 static void write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
 									   bool write_wal, bool send_sinval,
 									   bool preserve_files, Oid dbid, Oid tsid,
-									   const char *dbpath);
+									   const char *dbpath, bool create);
 static void load_relmap_file(bool shared, bool lock_held);
 static void write_relmap_file(bool shared, RelMapFile *newmap,
 							  bool write_wal, bool send_sinval, bool preserve_files,
@@ -256,6 +256,40 @@ RelationMapFilenodeToOid(Oid filenode, bool shared)
 }
 
 /*
+ * RelationMapOidToFilenodeForDatabase
+ *
+ * Find relfilenode for the given relation id in the dbpath.  Returns
+ * InvalidOid if the relationId is not found in the relmap.
+ *
+ * This function is only called during CREATE DATABASE command, so we can pass
+ * lock_held as true while reading the relmap file since we are already holding
+ * the exclusive lock on the database.
+ */
+Oid
+RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId)
+{
+	RelMapFile	map;
+	int			i;
+	char		mapfilename[MAXPGPATH];
+
+	/* Relmap file path for the given dbpath. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 dbpath, RELMAPPER_FILENAME);
+
+	/* Read the relmap file from the source database. */
+	read_relmap_file(mapfilename, &map, true);
+
+	/* Iterate over the relmap entries to find the input relation oid. */
+	for (i = 0; i < map.num_mappings; i++)
+	{
+		if (relationId == map.mappings[i].mapoid)
+			return map.mappings[i].mapfilenode;
+	}
+
+	return InvalidOid;
+}
+
+/*
  * RelationMapUpdateMap
  *
  * Install a new relfilenode mapping for the specified relation.
@@ -693,10 +727,47 @@ RestoreRelationMap(char *startAddress)
 }
 
 /*
- * load_relmap_file -- load data from the shared or local map file
+ * CopyRelationMap
  *
- * Because the map file is essential for access to core system catalogs,
- * failure to read it is a fatal error.
+ * Copy the relmap file from the source db path to the destination db path and
+ * WAL-log the operation.  This function is only called during CREATE DATABASE,
+ * so the caller must hold an exclusive lock on the source database.  The
+ * destination database is not yet created, so no one else can be accessing it.
+ */
+void
+CopyRelationMap(Oid dbid, Oid tsid, char *srcdbpath, char *dstdbpath)
+{
+	RelMapFile map;
+	char mapfilename[MAXPGPATH];
+
+	/* Relmap file path of the source database. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 srcdbpath, RELMAPPER_FILENAME);
+
+	/*
+	 * Read the relmap file from the source database.  We are not connected to
+	 * the database so we can not take the relmap lock, but we are already
+	 * holding exclusive lock on the database so pass lock_held as true.
+	 */
+	read_relmap_file(mapfilename, &map, true);
+
+	/* Relmap file path of the destination database. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 dstdbpath, RELMAPPER_FILENAME);
+
+	/*
+	 * Write the map contents into the destination database's relmap file.
+	 * write_relmap_file_internal expects the CRC to have been computed and
+	 * stored in the input map.  Since we read this map from the source
+	 * database and write it to the destination file unmodified, the stored
+	 * CRC is still valid and need not be recomputed.
+	 */
+	write_relmap_file_internal(mapfilename, &map, true, false, true, dbid,
+							   tsid, dstdbpath, true);
+}
+
+/*
+ * read_relmap_file - read the relmap file data.
  *
  * lock_held, pass true if caller already have the relation mapping or higher
  * level lock.
@@ -802,15 +873,18 @@ load_relmap_file(bool shared, bool lock_held)
 }
 
 /*
- * Helper function for write_relmap_file, Read comments atop write_relmap_file
- * for more details.  The CRC should be computed by the caller and stored in
- * the newmap.
+ * Helper function for write_relmap_file and CopyRelationMap.  See the comments
+ * atop write_relmap_file for more details.  The CRC should be computed by the
+ * caller and stored in newmap.
+ *
+ * Pass create = true if we are copying the relmap file during the CREATE
+ * DATABASE command.
  */
 static void
 write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
 						   bool write_wal, bool send_sinval,
 						   bool preserve_files, Oid dbid, Oid tsid,
-						   const char *dbpath)
+						   const char *dbpath, bool create)
 {
 	int			fd;
 
@@ -836,6 +910,7 @@ write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
 		xlrec.dbid = dbid;
 		xlrec.tsid = tsid;
 		xlrec.nbytes = sizeof(RelMapFile);
+		xlrec.create = create;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&xlrec), MinSizeOfRelmapUpdate);
@@ -977,7 +1052,7 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 	/* Write the map to the relmap file. */
 	write_relmap_file_internal(mapfilename, newmap, write_wal,
 							   send_sinval, preserve_files, dbid, tsid,
-							   dbpath);
+							   dbpath, false);
 
 	/*
 	 * Success, update permanent copy.  During bootstrap, we might be working
@@ -1069,15 +1144,37 @@ relmap_redo(XLogReaderState *record)
 		 * Write out the new map and send sinval, but of course don't write a
 		 * new WAL entry.  There's no surrounding transaction to tell to
 		 * preserve files, either.
-		 *
-		 * There shouldn't be anyone else updating relmaps during WAL replay,
-		 * but grab the lock to interlock against load_relmap_file().
 		 */
-		LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
-		write_relmap_file((xlrec->dbid == InvalidOid), &newmap,
-						  false, true, false,
-						  xlrec->dbid, xlrec->tsid, dbpath);
-		LWLockRelease(RelationMappingLock);
+		if (!xlrec->create)
+		{
+			/*
+			 * There shouldn't be anyone else updating relmaps during WAL
+			 * replay, but grab the lock to interlock against
+			 * load_relmap_file().
+			 */
+			LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
+			write_relmap_file((xlrec->dbid == InvalidOid), &newmap,
+							false, true, false,
+							xlrec->dbid, xlrec->tsid, dbpath);
+			LWLockRelease(RelationMappingLock);
+		}
+		else
+		{
+			char		mapfilename[MAXPGPATH];
+
+			/* Construct the mapfilename. */
+			snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+					 dbpath, RELMAPPER_FILENAME);
+
+			/*
+			 * We don't need to take the relmap lock because this WAL record
+			 * is logged while creating a new database, so no one else can be
+			 * reading or writing the relmap file.
+			 */
+			write_relmap_file_internal(mapfilename, &newmap, false, false,
+									   false, xlrec->dbid, xlrec->tsid, dbpath,
+									   true);
+		}
 
 		pfree(dbpath);
 	}
diff --git a/src/include/utils/relmapper.h b/src/include/utils/relmapper.h
index c0d14da..4165f09 100644
--- a/src/include/utils/relmapper.h
+++ b/src/include/utils/relmapper.h
@@ -29,6 +29,7 @@ typedef struct xl_relmap_update
 	Oid			dbid;			/* database ID, or 0 for shared map */
 	Oid			tsid;			/* database's tablespace, or pg_global */
 	int32		nbytes;			/* size of relmap data */
+	bool		create;			/* true if creating new relmap */
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } xl_relmap_update;
 
@@ -39,6 +40,8 @@ extern Oid	RelationMapOidToFilenode(Oid relationId, bool shared);
 
 extern Oid	RelationMapFilenodeToOid(Oid relationId, bool shared);
 
+extern Oid RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId);
+
 extern void RelationMapUpdateMap(Oid relationId, Oid fileNode, bool shared,
 								 bool immediate);
 
@@ -62,7 +65,8 @@ extern void RelationMapInitializePhase3(void);
 extern Size EstimateRelationMapSpace(void);
 extern void SerializeRelationMap(Size maxSize, char *startAddress);
 extern void RestoreRelationMap(char *startAddress);
-
+extern void CopyRelationMap(Oid dbid, Oid tsid, char *srcdbpath,
+							char *dstdbpath);
 extern void relmap_redo(XLogReaderState *record);
 extern void relmap_desc(StringInfo buf, XLogReaderState *record);
 extern const char *relmap_identify(uint8 info);
-- 
1.8.3.1

v2-0003-WAL-logged-CREATE-DATABASE.patch (text/x-patch; charset=US-ASCII)

From 7a734dc46fcdbb38f90f9d44e3932af5f123e154 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Tue, 31 Aug 2021 15:15:00 +0530
Subject: [PATCH v2 3/3] WAL logged CREATE DATABASE

Currently, CREATE DATABASE forces a checkpoint, then copies all the files,
then forces another checkpoint. The comments in the createdb() function
explain the reasons for this. The attached patch fixes this problem by making
create database completely WAL logged so that we can avoid the checkpoints.

This can also be useful for supporting TDE. For example, if the source and
target databases need different encryption, we cannot re-encrypt the page
data when copying the whole directory.  With this patch we copy page by
page, so we have an opportunity to re-encrypt each page before writing it
to the target database.
---
 contrib/bloom/blinsert.c                 |   2 +-
 src/backend/access/heap/heapam_handler.c |   2 +-
 src/backend/access/nbtree/nbtree.c       |   2 +-
 src/backend/access/rmgrdesc/dbasedesc.c  |   9 +-
 src/backend/access/transam/xlogutils.c   |  12 +-
 src/backend/commands/dbcommands.c        | 662 ++++++++++++++++++++-----------
 src/backend/commands/tablecmds.c         |  59 +--
 src/backend/storage/buffer/bufmgr.c      |  13 +-
 src/bin/pg_rewind/parsexlog.c            |   3 +-
 src/include/commands/dbcommands_xlog.h   |   9 +-
 src/include/commands/tablecmds.h         |   5 +
 src/include/storage/bufmgr.h             |   3 +-
 12 files changed, 495 insertions(+), 286 deletions(-)

diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index 23661d1..d7054dc 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -173,7 +173,7 @@ blbuildempty(Relation index)
 	 * Write the page and log it.  It might seem that an immediate sync would
 	 * be sufficient to guarantee that the file exists on disk, but recovery
 	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
+	 * XLOG_DBASEDIR_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
 	PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 9befe01..dd4e038 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -601,7 +601,7 @@ heapam_relation_set_new_filenode(Relation rel,
 	 * even if the page has been logged, because the write did not go through
 	 * shared_buffers and therefore a concurrent checkpoint may have moved the
 	 * redo pointer past our xlog record.  Recovery may as well remove it
-	 * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
+	 * while replaying, for example, XLOG_DBASEDIR_CREATE or XLOG_TBLSPC_CREATE
 	 * record. Therefore, logging is necessary even if wal_level=minimal.
 	 */
 	if (persistence == RELPERSISTENCE_UNLOGGED)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 30df244..5839dc7 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -159,7 +159,7 @@ btbuildempty(Relation index)
 	 * Write the page and log it.  It might seem that an immediate sync would
 	 * be sufficient to guarantee that the file exists on disk, but recovery
 	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
+	 * XLOG_DBASEDIR_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
 	PageSetChecksumInplace(metapage, BTREE_METAPAGE);
diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index 2660984..0ce38fa 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -24,12 +24,11 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 	char	   *rec = XLogRecGetData(record);
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
 
-	if (info == XLOG_DBASE_CREATE)
+	if (info == XLOG_DBASEDIR_CREATE)
 	{
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;
 
-		appendStringInfo(buf, "copy dir %u/%u to %u/%u",
-						 xlrec->src_tablespace_id, xlrec->src_db_id,
+		appendStringInfo(buf, "create dir %u/%u",
 						 xlrec->tablespace_id, xlrec->db_id);
 	}
 	else if (info == XLOG_DBASE_DROP)
@@ -51,8 +50,8 @@ dbase_identify(uint8 info)
 
 	switch (info & ~XLR_INFO_MASK)
 	{
-		case XLOG_DBASE_CREATE:
-			id = "CREATE";
+		case XLOG_DBASEDIR_CREATE:
+			id = "CREATE DIR";
 			break;
 		case XLOG_DBASE_DROP:
 			id = "DROP";
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 88a1bfd..a7a8b79 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -483,8 +483,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	if (blkno < lastblock)
 	{
 		/* page exists in file */
-		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-										   mode, NULL);
+		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno, mode, NULL,
+										   RELPERSISTENCE_PERMANENT);
 	}
 	else
 	{
@@ -508,8 +508,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 					LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 				ReleaseBuffer(buffer);
 			}
-			buffer = ReadBufferWithoutRelcache(rnode, forknum,
-											   P_NEW, mode, NULL);
+			buffer = ReadBufferWithoutRelcache(rnode, forknum, P_NEW, mode,
+											   NULL, RELPERSISTENCE_PERMANENT);
 		}
 		while (BufferGetBlockNumber(buffer) < blkno);
 		/* Handle the corner case that P_NEW returns non-consecutive pages */
@@ -518,8 +518,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 			if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			ReleaseBuffer(buffer);
-			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-											   mode, NULL);
+			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno, mode,
+											   NULL, RELPERSISTENCE_PERMANENT);
 		}
 	}
 
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 029fab4..055c3d3 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -36,15 +36,20 @@
 #include "catalog/indexing.h"
 #include "catalog/objectaccess.h"
 #include "catalog/pg_authid.h"
+#include "catalog/pg_auth_members.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_db_role_setting.h"
+#include "catalog/pg_proc.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_tablespace.h"
+#include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
 #include "commands/comment.h"
 #include "commands/dbcommands.h"
 #include "commands/dbcommands_xlog.h"
 #include "commands/defrem.h"
 #include "commands/seclabel.h"
+#include "commands/tablecmds.h"
 #include "commands/tablespace.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
@@ -62,6 +67,7 @@
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/pg_locale.h"
+#include "utils/relmapper.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
@@ -77,6 +83,19 @@ typedef struct
 	Oid			dest_tsoid;		/* tablespace we are trying to move to */
 } movedb_failure_params;
 
+/*
+ * While creating a database, we first scan the pg_class of the source database
+ * and identify all the relations to be copied to the target database.  This
+ * is used for storing one relation entry and we will create a list of these
+ * entries for each valid relation of the source database.
+ */
+typedef struct CreateDBRelInfo
+{
+	RelFileNode		rnode;				/* physical relation identifier */
+	char			relpersistence;		/* relation's persistence level */
+} CreateDBRelInfo;
+
+
 /* non-export function prototypes */
 static void createdb_failure_callback(int code, Datum arg);
 static void movedb(const char *dbname, const char *tblspcname);
@@ -91,6 +110,387 @@ static bool have_createdb_privilege(void);
 static void remove_dbtablespaces(Oid db_id);
 static bool check_db_file_conflict(Oid db_id);
 static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
+static void CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid,
+									bool isRedo);
+static List *GetDatabaseRelationList(Oid srctbid, Oid srcdbid,
+									 Oid relfilenode);
+void RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+									ForkNumber forkNum, char relpersistence);
+static void CopyDatabase(Oid src_dboid, Oid dboid, Oid src_tsid, Oid dst_tsid);
+
+/*
+ * CreateDirAndVersionFile - Create database directory and write out the
+ *							 PG_VERSION file in the database path.
+ *
+ * If isRedo is true, it's okay for the database directory to exist already.
+ *
+ * We can directly write PG_MAJORVERSION in the version file instead of copying
+ * from the source database file because these two must be the same.
+ */
+static void
+CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
+{
+	int		fd;
+	int		nbytes;
+	char	versionfile[MAXPGPATH];
+
+	/* Create the empty db directory. */
+	if (MakePGDirectory(dbpath) < 0)
+	{
+		/* Failure other than already exists or not in WAL replay? */
+		if (errno != EEXIST || !isRedo)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create directory \"%s\": %m", dbpath)));
+	}
+
+	/* Create PG_VERSION file in the database path. */
+	snprintf(versionfile, sizeof(versionfile), "%s/%s", dbpath, "PG_VERSION");
+	fd = OpenTransientFile(versionfile, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+
+	/*
+	 * If the file already exists and we are in WAL replay, just retry opening
+	 * it in write mode.
+	 */
+	if (fd < 0 && errno == EEXIST && isRedo)
+		fd = OpenTransientFile(versionfile, O_RDWR | PG_BINARY);
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m", versionfile)));
+
+	nbytes = strlen(PG_MAJORVERSION);
+
+	/* If we are not in WAL replay then write the WAL. */
+	if (!isRedo)
+	{
+		xl_dbase_create_rec xlrec;
+		XLogRecPtr	lsn;
+
+		/* now errors are fatal ... */
+		START_CRIT_SECTION();
+
+		xlrec.db_id = dbid;
+		xlrec.tablespace_id = tsid;
+		xlrec.nbytes = nbytes;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfDbaseCreateRec);
+		XLogRegisterData((char *) PG_MAJORVERSION, nbytes);
+
+		lsn = XLogInsert(RM_DBASE_ID, XLOG_DBASEDIR_CREATE);
+
+		/* As always, WAL must hit the disk before the data update does. */
+		XLogFlush(lsn);
+	}
+
+	/* Write PG_MAJORVERSION in the PG_VERSION file. */
+	pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_WRITE);
+	errno = 0;
+	if ((int) write(fd, (char *) PG_MAJORVERSION, nbytes) != nbytes)
+	{
+		/* If write didn't set errno, assume problem is no disk space. */
+		if (errno == 0)
+			errno = ENOSPC;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", versionfile)));
+	}
+	pgstat_report_wait_end();
+
+	/* Close the version file. */
+	CloseTransientFile(fd);
+
+	/* Critical section done. */
+	if (!isRedo)
+		END_CRIT_SECTION();
+}
+
+/*
+ * GetDatabaseRelationList - Get list of all valid relnodes for the given dbid.
+ *
+ * Iterate over each block of pg_class, identified by the input relfilenode.
+ * Scan each block and find all the visible tuples; from those, collect every
+ * valid relation in the source database.  Remember that information in
+ * rnodelist and return it to the caller.
+ */
+static List *
+GetDatabaseRelationList(Oid tbid, Oid dbid, Oid relfilenode)
+{
+	SMgrRelation	rd_smgr;
+	RelFileNode		rnode;
+	BlockNumber		nblocks;
+	BlockNumber		blkno;
+	OffsetNumber	offnum;
+	OffsetNumber	maxoff;
+	Buffer			buf;
+	Page			page;
+	List		   *rnodelist = NIL;
+	HeapTupleData	tuple;
+	Form_pg_class	classForm;
+	BufferAccessStrategy bstrategy;
+
+	/* Prepare a relnode for pg_class relation. */
+	rnode.spcNode = tbid;
+	rnode.dbNode = dbid;
+	rnode.relNode = relfilenode;
+
+	/*
+	 * We are not connected to the source database so open the pg_class
+	 * relation at the smgr level and get the block count.
+	 */
+	rd_smgr = smgropen(rnode, InvalidBackendId);
+	nblocks = smgrnblocks(rd_smgr, MAIN_FORKNUM);
+
+	/*
+	 * We're going to read the whole pg_class so better to use bulk-read buffer
+	 * access strategy.
+	 */
+	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	/* Iterate over each block on the pg_class relation. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		/*
+		 * We are not connected to the source database so we can not directly
+		 * use the top-level bufmgr interfaces.  So directly use the lower
+		 * level bufmgr interface which operates on the rnode.
+		 */
+		buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
+										RBM_NORMAL, bstrategy,
+										RELPERSISTENCE_PERMANENT);
+
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buf);
+		if (PageIsNew(page) || PageIsEmpty(page))
+			continue;
+
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		/* Iterate over each tuple on the page. */
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid;
+
+			itemid = PageGetItemId(page, offnum);
+
+			/* Nothing to do if slot is empty or already dead. */
+			if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid) ||
+				ItemIdIsRedirected(itemid))
+				continue;
+
+			Assert(ItemIdIsNormal(itemid));
+			ItemPointerSet(&(tuple.t_self), blkno, offnum);
+
+			tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+			tuple.t_len = ItemIdGetLength(itemid);
+			tuple.t_tableOid = RelationRelationId;
+
+			/*
+			 * If the tuple is visible then add its relfilenode info to the
+			 * list.
+			 */
+			if (HeapTupleSatisfiesVisibility(&tuple, GetActiveSnapshot(), buf))
+			{
+				Oid				relfilenode = InvalidOid;
+				CreateDBRelInfo   *relinfo;
+
+				classForm = (Form_pg_class) GETSTRUCT(&tuple);
+
+				/* We don't need to copy the shared objects to the target. */
+				if (classForm->reltablespace == GLOBALTABLESPACE_OID)
+					continue;
+
+				/*
+				 * If the object doesn't have the storage then nothing to be
+				 * done for that object so just ignore it.
+				 */
+				if (!RELKIND_HAS_STORAGE(classForm->relkind))
+					continue;
+
+				/* Built-in oids are mapped directly */
+				if (classForm->oid < FirstGenbkiObjectId)
+					relfilenode = classForm->oid;
+				else if (OidIsValid(classForm->relfilenode))
+					relfilenode = classForm->relfilenode;
+				else
+					continue;
+
+				/* We must have a valid relfilenode oid. */
+				Assert(OidIsValid(relfilenode));
+
+				/* Prepare a rel info element and add it to the list. */
+				relinfo = (CreateDBRelInfo *) palloc(sizeof(CreateDBRelInfo));
+				if (OidIsValid(classForm->reltablespace))
+					relinfo->rnode.spcNode = classForm->reltablespace;
+				else
+					relinfo->rnode.spcNode = tbid;
+
+				relinfo->rnode.dbNode = dbid;
+				relinfo->rnode.relNode = relfilenode;
+				relinfo->relpersistence = classForm->relpersistence;
+
+				if (rnodelist == NULL)
+					rnodelist = list_make1(relinfo);
+				else
+					rnodelist = lappend(rnodelist, relinfo);
+			}
+		}
+
+		/* Release the buffer lock. */
+		UnlockReleaseBuffer(buf);
+	}
+
+	return rnodelist;
+}
+
+/*
+ * Copy a fork's data, block by block using buffers.  Same as
+ * RelationCopyStorage but instead of using smgr this will copy using bufmgr.
+ */
+void
+RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+							   ForkNumber forkNum, char relpersistence)
+{
+	Buffer		srcBuf;
+	Buffer		dstBuf;
+	Page		srcPage;
+	Page		dstPage;
+	bool		use_wal;
+	bool		copying_initfork;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	/* Refer comments in RelationCopyStorage. */
+	copying_initfork = relpersistence == RELPERSISTENCE_UNLOGGED &&
+		forkNum == INIT_FORKNUM;
+	use_wal = XLogIsNeeded() &&
+		(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
+
+	/* Get the number of blocks in the source relation. */
+	nblocks = smgrnblocks(src, forkNum);
+
+	/* Iterate over each block of the source relation file. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		/* If we got a cancel signal during the copy of the data, quit */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Read block from source relation. */
+		srcBuf = ReadBufferWithoutRelcache(src->smgr_rnode.node, forkNum,
+										   blkno, RBM_NORMAL, bstrategy,
+										   relpersistence);
+		srcPage = BufferGetPage(srcBuf);
+		if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
+		{
+			ReleaseBuffer(srcBuf);
+			continue;
+		}
+
+		/* Use P_NEW to extend the relation */
+		dstBuf = ReadBufferWithoutRelcache(dst->smgr_rnode.node, MAIN_FORKNUM,
+										   P_NEW, RBM_NORMAL, NULL,
+										   relpersistence);
+		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
+
+		START_CRIT_SECTION();
+
+		/* Initialize the page and write the data. */
+		dstPage = BufferGetPage(dstBuf);
+		PageInit(dstPage, BufferGetPageSize(dstBuf), 0);
+		memcpy(dstPage, srcPage, BLCKSZ);
+		MarkBufferDirty(dstBuf);
+
+		/* WAL-log the copied page */
+		if (use_wal)
+			log_newpage_buffer(dstBuf, true);
+
+		END_CRIT_SECTION();
+
+		PageSetChecksumInplace(dstPage, blkno);
+
+		UnlockReleaseBuffer(dstBuf);
+		ReleaseBuffer(srcBuf);
+	}
+}
+
+/*
+ * Copy data block by block from the source database to the destination
+ * database.
+ */
+static void
+CopyDatabase(Oid src_dboid, Oid dst_dboid, Oid src_tsid, Oid dst_tsid)
+{
+	char	   *srcpath;
+	char	   *dstpath;
+	Oid			relfilenode;
+	List	   *rnodelist = NULL;
+	ListCell   *cell;
+	RelFileNode	srcrnode;
+	RelFileNode	dstrnode;
+	CreateDBRelInfo	*relinfo;
+
+	/* Get the source database path. */
+	srcpath = GetDatabasePath(src_dboid, src_tsid);
+
+	/* Get the destination database path. */
+	dstpath = GetDatabasePath(dst_dboid, dst_tsid);
+
+	/* Create database directory and write PG_VERSION file. */
+	CreateDirAndVersionFile(dstpath, dst_dboid, dst_tsid, false);
+
+	/* Copy relmap file from source database to the destination database. */
+	CopyRelationMap(dst_dboid, dst_tsid, srcpath, dstpath);
+
+	/* Get pg_class relfilenode. */
+	relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+													  RelationRelationId);
+
+	/* Get list of all valid relnode from the source database. */
+	rnodelist = GetDatabaseRelationList(src_tsid, src_dboid,
+										relfilenode);
+	Assert(rnodelist != NIL);
+
+	/*
+	* Iterate over each relfilenode and copy the relation data block by block
+	* from source database to the destination database.
+	*/
+	foreach(cell, rnodelist)
+	{
+		SMgrRelation	src_smgr;
+		SMgrRelation	dst_smgr;
+
+		relinfo = lfirst(cell);
+		srcrnode = relinfo->rnode;
+
+		/*
+		 * If the relation is from the default tablespace then we need to
+		 * create it in the destinations db's default tablespace.  Otherwise,
+		 * we need to create in the same tablespace as it is created in the
+		 * source database.
+		 */
+		if (srcrnode.spcNode != src_tsid)
+			dstrnode.spcNode = srcrnode.spcNode;
+		else
+			dstrnode.spcNode = dst_tsid;
+
+		dstrnode.dbNode = dst_dboid;
+		dstrnode.relNode = srcrnode.relNode;
+
+		/* Open the source and the destination relation at smgr level. */
+		src_smgr = smgropen(srcrnode, InvalidBackendId);
+		dst_smgr = smgropen(dstrnode, InvalidBackendId);
+
+		/* Copy relation storage from source to the destination. */
+		RelationCopyAllFork(src_smgr, dst_smgr, relinfo->relpersistence,
+							RelationCopyStorageUsingBuffer);
+	}
+
+	list_free_deep(rnodelist);
+}
 
 
 /*
@@ -99,8 +499,6 @@ static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
 Oid
 createdb(ParseState *pstate, const CreatedbStmt *stmt)
 {
-	TableScanDesc scan;
-	Relation	rel;
 	Oid			src_dboid;
 	Oid			src_owner;
 	int			src_encoding = -1;
@@ -562,140 +960,19 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	/* Post creation hook for new database */
 	InvokeObjectPostCreateHook(DatabaseRelationId, dboid, 0);
 
-	/*
-	 * Force a checkpoint before starting the copy. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, if a checkpoint
-	 * happened while we're copying files, a file might be deleted just when
-	 * we're about to copy it, causing the lstat() call in copydir() to fail
-	 * with ENOENT.
-	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
-
-	/*
-	 * Once we start copying subdirectories, we need to be able to clean 'em
-	 * up if we fail.  Use an ENSURE block to make sure this happens.  (This
-	 * is not a 100% solution, because of the possibility of failure during
-	 * transaction commit after we leave this routine, but it should handle
-	 * most scenarios.)
-	 */
 	fparms.src_dboid = src_dboid;
 	fparms.dest_dboid = dboid;
 	PG_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 							PointerGetDatum(&fparms));
-	{
-		/*
-		 * Iterate through all tablespaces of the template database, and copy
-		 * each one to the new database.
-		 */
-		rel = table_open(TableSpaceRelationId, AccessShareLock);
-		scan = table_beginscan_catalog(rel, 0, NULL);
-		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
-		{
-			Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
-			Oid			srctablespace = spaceform->oid;
-			Oid			dsttablespace;
-			char	   *srcpath;
-			char	   *dstpath;
-			struct stat st;
-
-			/* No need to copy global tablespace */
-			if (srctablespace == GLOBALTABLESPACE_OID)
-				continue;
-
-			srcpath = GetDatabasePath(src_dboid, srctablespace);
-
-			if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
-				directory_is_empty(srcpath))
-			{
-				/* Assume we can ignore it */
-				pfree(srcpath);
-				continue;
-			}
-
-			if (srctablespace == src_deftablespace)
-				dsttablespace = dst_deftablespace;
-			else
-				dsttablespace = srctablespace;
-
-			dstpath = GetDatabasePath(dboid, dsttablespace);
-
-			/*
-			 * Copy this subdirectory to the new location
-			 *
-			 * We don't need to copy subdirectories
-			 */
-			copydir(srcpath, dstpath, false);
-
-			/* Record the filesystem change in XLOG */
-			{
-				xl_dbase_create_rec xlrec;
-
-				xlrec.db_id = dboid;
-				xlrec.tablespace_id = dsttablespace;
-				xlrec.src_db_id = src_dboid;
-				xlrec.src_tablespace_id = srctablespace;
-
-				XLogBeginInsert();
-				XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-				(void) XLogInsert(RM_DBASE_ID,
-								  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-			}
-		}
-		table_endscan(scan);
-		table_close(rel, AccessShareLock);
-
-		/*
-		 * We force a checkpoint before committing.  This effectively means
-		 * that committed XLOG_DBASE_CREATE operations will never need to be
-		 * replayed (at least not in ordinary crash recovery; we still have to
-		 * make the XLOG entry for the benefit of PITR operations). This
-		 * avoids two nasty scenarios:
-		 *
-		 * #1: When PITR is off, we don't XLOG the contents of newly created
-		 * indexes; therefore the drop-and-recreate-whole-directory behavior
-		 * of DBASE_CREATE replay would lose such indexes.
-		 *
-		 * #2: Since we have to recopy the source database during DBASE_CREATE
-		 * replay, we run the risk of copying changes in it that were
-		 * committed after the original CREATE DATABASE command but before the
-		 * system crash that led to the replay.  This is at least unexpected
-		 * and at worst could lead to inconsistencies, eg duplicate table
-		 * names.
-		 *
-		 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
-		 *
-		 * In PITR replay, the first of these isn't an issue, and the second
-		 * is only a risk if the CREATE DATABASE and subsequent template
-		 * database change both occur while a base backup is being taken.
-		 * There doesn't seem to be much we can do about that except document
-		 * it as a limitation.
-		 *
-		 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
-		 * we can avoid this.
-		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
-
-		/*
-		 * Close pg_database, but keep lock till commit.
-		 */
-		table_close(pg_database_rel, NoLock);
-
-		/*
-		 * Force synchronous commit, thus minimizing the window between
-		 * creation of the database files and committal of the transaction. If
-		 * we crash before committing, we'll have a DB that's taking up disk
-		 * space but is not in pg_database, which is not good.
-		 */
-		ForceSyncCommit();
-	}
+	CopyDatabase(src_dboid, dboid, src_deftablespace, dst_deftablespace);
 	PG_END_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 								PointerGetDatum(&fparms));
 
+	/*
+	 * Close pg_database, but keep lock till commit.
+	 */
+	table_close(pg_database_rel, NoLock);
+
 	return dboid;
 }
 
@@ -1190,43 +1467,12 @@ movedb(const char *dbname, const char *tblspcname)
 				 errdetail_busy_db(notherbackends, npreparedxacts)));
 
 	/*
-	 * Get old and new database paths
+	 * Get new database path
 	 */
 	src_dbpath = GetDatabasePath(db_id, src_tblspcoid);
 	dst_dbpath = GetDatabasePath(db_id, dst_tblspcoid);
 
 	/*
-	 * Force a checkpoint before proceeding. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, the check for existing
-	 * files in the target directory might fail unnecessarily, not to mention
-	 * that the copy might fail due to source files getting deleted under it.
-	 * On Windows, this also ensures that background procs don't hold any open
-	 * files, which would cause rmdir() to fail.
-	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
-
-	/*
-	 * Now drop all buffers holding data of the target database; they should
-	 * no longer be dirty so DropDatabaseBuffers is safe.
-	 *
-	 * It might seem that we could just let these buffers age out of shared
-	 * buffers naturally, since they should not get referenced anymore.  The
-	 * problem with that is that if the user later moves the database back to
-	 * its original tablespace, any still-surviving buffers would appear to
-	 * contain valid data again --- but they'd be missing any changes made in
-	 * the database while it was in the new tablespace.  In any case, freeing
-	 * buffers that should never be used again seems worth the cycles.
-	 *
-	 * Note: it'd be sufficient to get rid of buffers matching db_id and
-	 * src_tblspcoid, but bufmgr.c presently provides no API for that.
-	 */
-	DropDatabaseBuffers(db_id);
-
-	/*
 	 * Check for existence of files in the target directory, i.e., objects of
 	 * this database that are already in the target tablespace.  We can't
 	 * allow the move in such a case, because we would need to change those
@@ -1271,28 +1517,7 @@ movedb(const char *dbname, const char *tblspcname)
 	PG_ENSURE_ERROR_CLEANUP(movedb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
-		/*
-		 * Copy files from the old tablespace to the new one
-		 */
-		copydir(src_dbpath, dst_dbpath, false);
-
-		/*
-		 * Record the filesystem change in XLOG
-		 */
-		{
-			xl_dbase_create_rec xlrec;
-
-			xlrec.db_id = db_id;
-			xlrec.tablespace_id = dst_tblspcoid;
-			xlrec.src_db_id = db_id;
-			xlrec.src_tablespace_id = src_tblspcoid;
-
-			XLogBeginInsert();
-			XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-			(void) XLogInsert(RM_DBASE_ID,
-							  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-		}
+		CopyDatabase(db_id, db_id, src_tblspcoid, dst_tblspcoid);
 
 		/*
 		 * Update the database's pg_database tuple
@@ -1326,22 +1551,6 @@ movedb(const char *dbname, const char *tblspcname)
 		systable_endscan(sysscan);
 
 		/*
-		 * Force another checkpoint here.  As in CREATE DATABASE, this is to
-		 * ensure that we don't have to replay a committed XLOG_DBASE_CREATE
-		 * operation, which would cause us to lose any unlogged operations
-		 * done in the new DB tablespace before the next checkpoint.
-		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
-
-		/*
-		 * Force synchronous commit, thus minimizing the window between
-		 * copying the database files and committal of the transaction. If we
-		 * crash before committing, we'll leave an orphaned set of files on
-		 * disk, which is not fatal but not good either.
-		 */
-		ForceSyncCommit();
-
-		/*
 		 * Close pg_database, but keep lock till commit.
 		 */
 		table_close(pgdbrel, NoLock);
@@ -1350,6 +1559,23 @@ movedb(const char *dbname, const char *tblspcname)
 								PointerGetDatum(&fparms));
 
 	/*
+	 * Now drop all buffers holding data of the target database; they should
+	 * no longer be dirty so DropDatabaseBuffers is safe.
+	 *
+	 * It might seem that we could just let these buffers age out of shared
+	 * buffers naturally, since they should not get referenced anymore.  The
+	 * problem with that is that if the user later moves the database back to
+	 * its original tablespace, any still-surviving buffers would appear to
+	 * contain valid data again --- but they'd be missing any changes made in
+	 * the database while it was in the new tablespace.  In any case, freeing
+	 * buffers that should never be used again seems worth the cycles.
+	 *
+	 * Note: it'd be sufficient to get rid of buffers matching db_id and
+	 * src_tblspcoid, but bufmgr.c presently provides no API for that.
+	 */
+	DropDatabaseBuffers(db_id);
+
+	/*
 	 * Commit the transaction so that the pg_database update is committed. If
 	 * we crash while removing files, the database won't be corrupt, we'll
 	 * just leave some orphaned files in the old directory.
@@ -2138,42 +2364,14 @@ dbase_redo(XLogReaderState *record)
 	/* Backup blocks are not used in dbase records */
 	Assert(!XLogRecHasAnyBlockRefs(record));
 
-	if (info == XLOG_DBASE_CREATE)
+	if (info == XLOG_DBASEDIR_CREATE)
 	{
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
-		char	   *src_path;
-		char	   *dst_path;
-		struct stat st;
-
-		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
-		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		char	   *dbpath;
 
-		/*
-		 * Our theory for replaying a CREATE is to forcibly drop the target
-		 * subdirectory if present, then re-copy the source data. This may be
-		 * more work than needed, but it is simple to implement.
-		 */
-		if (stat(dst_path, &st) == 0 && S_ISDIR(st.st_mode))
-		{
-			if (!rmtree(dst_path, true))
-				/* If this failed, copydir() below is going to error. */
-				ereport(WARNING,
-						(errmsg("some useless files may be left behind in old database directory \"%s\"",
-								dst_path)));
-		}
-
-		/*
-		 * Force dirty buffers out to disk, to ensure source database is
-		 * up-to-date for the copy.
-		 */
-		FlushDatabaseBuffers(xlrec->src_db_id);
-
-		/*
-		 * Copy this subdirectory to the new location
-		 *
-		 * We don't need to copy subdirectories
-		 */
-		copydir(src_path, dst_path, false);
+		dbpath = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		CreateDirAndVersionFile(dbpath, xlrec->db_id, xlrec->tablespace_id,
+								true);
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index dbee6ae..c3e5aee 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14189,21 +14189,13 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
 	return new_tablespaceoid;
 }
 
-static void
-index_copy_data(Relation rel, RelFileNode newrnode)
+/*
+ * Copy source smgr all fork's data to the destination smgr.
+ */
+void
+RelationCopyAllFork(SMgrRelation src_smgr, SMgrRelation	dst_smgr,
+					char relpersistence, copy_relation_storage copy_storage)
 {
-	SMgrRelation dstrel;
-
-	dstrel = smgropen(newrnode, rel->rd_backend);
-
-	/*
-	 * Since we copy the file directly without looking at the shared buffers,
-	 * we'd better first flush out any pages of the source relation that are
-	 * in shared buffers.  We assume no new changes will be made while we are
-	 * holding exclusive lock on the rel.
-	 */
-	FlushRelationBuffers(rel);
-
 	/*
 	 * Create and copy all forks of the relation, and schedule unlinking of
 	 * old physical files.
@@ -14211,32 +14203,51 @@ index_copy_data(Relation rel, RelFileNode newrnode)
 	 * NOTE: any conflict in relfilenode value will be caught in
 	 * RelationCreateStorage().
 	 */
-	RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
+	RelationCreateStorage(dst_smgr->smgr_rnode.node, relpersistence);
 
 	/* copy main fork */
-	RelationCopyStorage(RelationGetSmgr(rel), dstrel, MAIN_FORKNUM,
-						rel->rd_rel->relpersistence);
+	copy_storage(src_smgr, dst_smgr, MAIN_FORKNUM, relpersistence);
 
 	/* copy those extra forks that exist */
 	for (ForkNumber forkNum = MAIN_FORKNUM + 1;
 		 forkNum <= MAX_FORKNUM; forkNum++)
 	{
-		if (smgrexists(RelationGetSmgr(rel), forkNum))
+		if (smgrexists(src_smgr, forkNum))
 		{
-			smgrcreate(dstrel, forkNum, false);
+			smgrcreate(dst_smgr, forkNum, false);
 
 			/*
 			 * WAL log creation if the relation is persistent, or this is the
 			 * init fork of an unlogged relation.
 			 */
-			if (RelationIsPermanent(rel) ||
-				(rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
+			if (relpersistence == RELPERSISTENCE_PERMANENT ||
+				(relpersistence == RELPERSISTENCE_UNLOGGED &&
 				 forkNum == INIT_FORKNUM))
-				log_smgrcreate(&newrnode, forkNum);
-			RelationCopyStorage(RelationGetSmgr(rel), dstrel, forkNum,
-								rel->rd_rel->relpersistence);
+				log_smgrcreate(&dst_smgr->smgr_rnode.node, forkNum);
+
+			/* Copy a fork's data, block by block. */
+			copy_storage(src_smgr, dst_smgr, forkNum, relpersistence);
 		}
 	}
+}
+
+static void
+index_copy_data(Relation rel, RelFileNode newrnode)
+{
+	SMgrRelation dstrel;
+
+	dstrel = smgropen(newrnode, rel->rd_backend);
+
+	/*
+	 * Since we copy the file directly without looking at the shared buffers,
+	 * we'd better first flush out any pages of the source relation that are
+	 * in shared buffers.  We assume no new changes will be made while we are
+	 * holding exclusive lock on the rel.
+	 */
+	FlushRelationBuffers(rel);
+
+	RelationCopyAllFork(RelationGetSmgr(rel), dstrel,
+						rel->rd_rel->relpersistence, RelationCopyStorage);
 
 	/* drop old relation, and close new one */
 	RelationDropStorage(rel);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index bc1753a..8ec0448 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -770,24 +770,17 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 /*
  * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
  *		a relcache entry for the relation.
- *
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay.  If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
  */
 Buffer
 ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
-						  BufferAccessStrategy strategy)
+						  BufferAccessStrategy strategy, char relpersistence)
 {
 	bool		hit;
 
 	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
 
-	Assert(InRecovery);
-
-	return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
+	return ReadBuffer_common(smgr, relpersistence, forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -797,7 +790,7 @@ ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
  *
  * *hit is set to true if the request was satisfied from shared buffer cache.
  */
-static Buffer
+Buffer
 ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 				  BlockNumber blockNum, ReadBufferMode mode,
 				  BufferAccessStrategy strategy, bool *hit)
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 59ebac7..0390b53 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -23,6 +23,7 @@
 #include "fe_utils/archive.h"
 #include "filemap.h"
 #include "pg_rewind.h"
+#include "utils/relmapper.h"
 
 /*
  * RmgrNames is an array of resource manager names, to make error messages
@@ -370,7 +371,7 @@ extractPageInfo(XLogReaderState *record)
 
 	/* Is this a special record type that I recognize? */
 
-	if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_CREATE)
+	if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASEDIR_CREATE)
 	{
 		/*
 		 * New databases can be safely ignored. It won't be present in the
diff --git a/src/include/commands/dbcommands_xlog.h b/src/include/commands/dbcommands_xlog.h
index f5ed762..caf3676 100644
--- a/src/include/commands/dbcommands_xlog.h
+++ b/src/include/commands/dbcommands_xlog.h
@@ -18,18 +18,19 @@
 #include "lib/stringinfo.h"
 
 /* record types */
-#define XLOG_DBASE_CREATE		0x00
+#define XLOG_DBASEDIR_CREATE	0x00
 #define XLOG_DBASE_DROP			0x10
 
 typedef struct xl_dbase_create_rec
 {
-	/* Records copying of a single subdirectory incl. contents */
 	Oid			db_id;
 	Oid			tablespace_id;
-	Oid			src_db_id;
-	Oid			src_tablespace_id;
+	int32       nbytes;         /* size of version data */
+	char		version[FLEXIBLE_ARRAY_MEMBER];
 } xl_dbase_create_rec;
 
+#define MinSizeOfDbaseCreateRec offsetof(xl_dbase_create_rec, version)
+
 typedef struct xl_dbase_drop_rec
 {
 	Oid			db_id;
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 336549c..e0e0aa5 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -19,10 +19,13 @@
 #include "catalog/objectaddress.h"
 #include "nodes/parsenodes.h"
 #include "storage/lock.h"
+#include "storage/smgr.h"
 #include "utils/relcache.h"
 
 struct AlterTableUtilityContext;	/* avoid including tcop/utility.h here */
 
+typedef void (*copy_relation_storage) (SMgrRelation src, SMgrRelation dst,
+									  ForkNumber forkNum, char relpersistence);
 
 extern ObjectAddress DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 									ObjectAddress *typaddress, const char *queryString);
@@ -42,6 +45,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
 
 extern Oid	AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
 
+extern void RelationCopyAllFork(SMgrRelation src_smgr, SMgrRelation	dst_smgr,
+								char relpersistence, copy_relation_storage copy_storage);
 extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
 										 Oid *oldschema);
 
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index cfce23e..dcf42be 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -184,7 +184,8 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BufferAccessStrategy strategy);
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 										ForkNumber forkNum, BlockNumber blockNum,
-										ReadBufferMode mode, BufferAccessStrategy strategy);
+										ReadBufferMode mode, BufferAccessStrategy strategy,
+										char relpersistence);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
-- 
1.8.3.1

#26

Robert Haas

robertmhaas@gmail.com

over 4 years ago

In reply to: Dilip Kumar (#25)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Sep 2, 2021 at 2:06 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

> 0003- The main patch for WAL logging the created database operation.

Andres pointed out that this approach ends up accessing relations
without taking a lock on them. It doesn't look like you did anything
about that.

+ /* Built-in oids are mapped directly */
+ if (classForm->oid < FirstGenbkiObjectId)
+ relfilenode = classForm->oid;
+ else if (OidIsValid(classForm->relfilenode))
+ relfilenode = classForm->relfilenode;
+ else
+ continue;

Am I missing something, or is this totally busted?

[rhaas pgsql]$ createdb
[rhaas pgsql]$ psql
psql (15devel)
Type "help" for help.

rhaas=# select oid::regclass from pg_class where relfilenode not in
(0, oid) and oid < 10000;
oid
-----
(0 rows)

rhaas=# vacuum full pg_attrdef;
VACUUM
rhaas=# select oid::regclass from pg_class where relfilenode not in
(0, oid) and oid < 10000;
oid
--------------------------------
pg_attrdef_adrelid_adnum_index
pg_attrdef_oid_index
pg_toast.pg_toast_2604
pg_toast.pg_toast_2604_index
pg_attrdef
(5 rows)
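
The psql session above can be restated as a toy model. This is only a sketch, not PostgreSQL code: ToyClassRow, toy_patch_rule and toy_rule_disagrees are hypothetical names, the constant is assumed to mirror FirstGenbkiObjectId (the 10000 in the query), and 16390 is a made-up post-rewrite relfilenode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Oid;

#define InvalidOid ((Oid) 0)
/* Assumed to mirror FirstGenbkiObjectId, i.e. the 10000 used in the query. */
#define TOY_FIRST_GENBKI_OID ((Oid) 10000)

/* Hypothetical two-column stand-in for a pg_class row. */
typedef struct ToyClassRow
{
	Oid			oid;
	Oid			relfilenode;	/* InvalidOid: nothing recorded in pg_class */
} ToyClassRow;

/* The selection rule, in the same order as the quoted hunk. */
Oid
toy_patch_rule(const ToyClassRow *row)
{
	assert(row != NULL);

	if (row->oid < TOY_FIRST_GENBKI_OID)
		return row->oid;			/* "built-in oids are mapped directly" */
	if (row->relfilenode != InvalidOid)
		return row->relfilenode;
	return InvalidOid;				/* the patch skips such rows */
}

/* True when the rule picks something other than what pg_class records. */
bool
toy_rule_disagrees(const ToyClassRow *row)
{
	return row->relfilenode != InvalidOid &&
		toy_patch_rule(row) != row->relfilenode;
}
```

With a row {2604, 2604} (pg_attrdef before the rewrite) the rule agrees with pg_class; with {2604, 16390} (after VACUUM FULL, hypothetical value) it still returns 2604, which is the kind of row the second query lists.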

  /*
+ * Now drop all buffers holding data of the target database; they should
+ * no longer be dirty so DropDatabaseBuffers is safe.

The way things worked before, this was true, but now AFAICS it's
false. I'm not sure whether that means that DropDatabaseBuffers() here
is actually unsafe or whether it just means that you haven't updated
the comment to explain the reason.

+ * Since we copy the file directly without looking at the shared buffers,
+ * we'd better first flush out any pages of the source relation that are
+ * in shared buffers.  We assume no new changes will be made while we are
+ * holding exclusive lock on the rel.

Ditto.

+ /* As always, WAL must hit the disk before the data update does. */

Actually, the way it's coded now, part of the on-disk changes are done
before WAL is issued, and part are done after. I doubt that's the
right idea. There's nothing special about writing the actual payload
bytes vs. the other on-disk changes (creating directories and files).
In any case the ordering deserves a better-considered comment than
this one.

+ XLogRegisterData((char *) PG_MAJORVERSION, nbytes);

Surely this is utterly pointless.

+ CopyDatabase(src_dboid, dboid, src_deftablespace, dst_deftablespace);
PG_END_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
PointerGetDatum(&fparms));

I'd leave braces around the code for which we're ensuring error
cleanup even if it's just one line.

+ if (info == XLOG_DBASEDIR_CREATE)
{
xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);

Seems odd to rename the record but not change the name of the struct.
I think I would be inclined to keep the existing record name, but if
we're going to change it we should be more thorough.

--
Robert Haas
EDB: http://www.enterprisedb.com

#27

Dilip Kumar

dilipbalaut@gmail.com

over 4 years ago

In reply to: Andres Freund (#22)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Fri, Jun 18, 2021 at 12:18 AM Andres Freund <andres@anarazel.de> wrote:

> Hi,
>
> On 2021-06-17 14:22:52 -0400, Robert Haas wrote:
>> On Thu, Jun 17, 2021 at 2:17 PM Andres Freund <andres@anarazel.de> wrote:
>>> Adding a hacky special case implementation for cross-database relation
>>> accesses that violates all kinds of assumptions (like holding a lock on
>>> a relation when accessing it / pinning pages, processing relcache
>>> invals, ...) doesn't seem like a good plan.
>>
>> I agree that we don't want hacky code that violates assumptions, but
>> bypassing shared_buffers is a bit hacky, too. Can't we lock the
>> relations as we're copying them? We know pg_class's OID a fortiori,
>> and we can find out all the other OIDs as we go.
>
> We possibly can - but I'm not sure that won't end up violating some
> other assumptions.

Yeah, we can surely lock the relations as described by Robert, but IMHO,
while creating the database we are already holding an exclusive lock on
the database and no one else is allowed to be connected to it, so do we
actually need to bother about the lock for correctness?

>> I'm just thinking that the hackiness of going around shared_buffers
>> feels irreducible, but maybe the hackiness in the patch is something
>> that can be solved with more engineering.
>
> Which bypassing of shared buffers are you talking about here? We'd still
> have to solve a subset of the issues around locking (at least on the
> source side), but I don't think we need to read pg_class contents to be
> able to go through shared_buffers? As I suggested, we can use the init
> fork presence to infer relpersistence?

I believe we want to avoid scanning pg_class for identifying the relation
list so that we can avoid this special-purpose code? IMHO, scanning the
disk, such as going through all the tablespaces and then finding the source
database directory and identifying each relfilenode, also appears to be
special-purpose code, unless I am missing what you mean by special-purpose
code.

> Or do you mean that looking at the filesystem at all is bypassing shared
> buffers?

I think we already have such code in multiple places where we bypass the
shared buffers for copying the relation, e.g. index_copy_data() and
heapam_relation_copy_data(). So, with the current system as it stands, we
cannot claim that we are designing something that bypasses the shared
buffers for the first time. So if we think that bypassing the shared
buffers is hackish and we don't want to use it for new patches, then let's
avoid it completely, even while identifying the relfilenodes to be copied.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#28

Dilip Kumar

dilipbalaut@gmail.com

over 4 years ago

In reply to: Robert Haas (#26)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Sep 2, 2021 at 8:52 PM Robert Haas <robertmhaas@gmail.com> wrote:

> On Thu, Sep 2, 2021 at 2:06 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> 0003- The main patch for WAL logging the created database operation.
>
> Andres pointed out that this approach ends up accessing relations
> without taking a lock on them. It doesn't look like you did anything
> about that.

I missed that; I have shared my opinion about this in my last email [1].

> + /* Built-in oids are mapped directly */
> + if (classForm->oid < FirstGenbkiObjectId)
> + relfilenode = classForm->oid;
> + else if (OidIsValid(classForm->relfilenode))
> + relfilenode = classForm->relfilenode;
> + else
> + continue;
>
> Am I missing something, or is this totally busted?

Oops, I think the condition should be like below, but I will think
carefully before posting the next version in case there is something else
I am missing.

if (OidIsValid(classForm->relfilenode))
    relfilenode = classForm->relfilenode;
else if (classForm->oid < FirstGenbkiObjectId)
    relfilenode = classForm->oid;
else
    continue;

> /*
> + * Now drop all buffers holding data of the target database; they should
> + * no longer be dirty so DropDatabaseBuffers is safe.
>
> The way things worked before, this was true, but now AFAICS it's
> false. I'm not sure whether that means that DropDatabaseBuffers() here
> is actually unsafe or whether it just means that you haven't updated
> the comment to explain the reason.

I think DropDatabaseBuffers() itself is unsafe: we just copied pages using
bufmgr and dirtied the buffers, so dropping buffers is definitely unsafe
here.
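
A toy model of why the "no longer dirty" precondition matters. This is a sketch under stated assumptions, not bufmgr code: ToyPool and the toy_* functions are hypothetical, each buffer slot caches exactly one block, and "disk" is just an array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_NBUFFERS 4

typedef struct ToyBuffer
{
	bool		valid;
	bool		dirty;
	unsigned	db;
	int			data;
} ToyBuffer;

typedef struct ToyPool
{
	ToyBuffer	buf[TOY_NBUFFERS];	/* slot i caches block i */
	int			disk[TOY_NBUFFERS];	/* what is durably stored */
} ToyPool;

void
toy_pool_init(ToyPool *pool)
{
	memset(pool, 0, sizeof(*pool));
}

/* Modify a block in the cache only and mark it dirty. */
void
toy_dirty_block(ToyPool *pool, unsigned db, size_t blkno, int data)
{
	assert(blkno < TOY_NBUFFERS);
	pool->buf[blkno].valid = true;
	pool->buf[blkno].dirty = true;
	pool->buf[blkno].db = db;
	pool->buf[blkno].data = data;
}

/* Write every dirty buffer of the database out to "disk". */
void
toy_flush_database(ToyPool *pool, unsigned db)
{
	for (size_t i = 0; i < TOY_NBUFFERS; i++)
	{
		ToyBuffer  *b = &pool->buf[i];

		if (b->valid && b->dirty && b->db == db)
		{
			pool->disk[i] = b->data;
			b->dirty = false;
		}
	}
}

/*
 * Invalidate every buffer of the database without writing it.
 * Returns how many dirty (unwritten) buffers were thrown away.
 */
size_t
toy_drop_database(ToyPool *pool, unsigned db)
{
	size_t		lost = 0;

	for (size_t i = 0; i < TOY_NBUFFERS; i++)
	{
		ToyBuffer  *b = &pool->buf[i];

		if (b->valid && b->db == db)
		{
			if (b->dirty)
				lost++;
			b->valid = false;
			b->dirty = false;
		}
	}
	return lost;
}
```

In this model, dropping right after toy_dirty_block() discards the change and leaves the disk copy untouched, whereas flushing first makes the drop harmless. Whether the real patch can rely on WAL replay to cover that gap is a separate question the toy does not answer.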

+ * Since we copy the file directly without looking at the shared buffers,
+ * we'd better first flush out any pages of the source relation that are
+ * in shared buffers.  We assume no new changes will be made while we are
+ * holding exclusive lock on the rel.

Ditto.

Yeah this comment no longer makes sense now.

+ /* As always, WAL must hit the disk before the data update does. */

Actually, the way it's coded now, part of the on-disk changes are done
before WAL is issued, and part are done after. I doubt that's the
right idea. There's nothing special about writing the actual payload
bytes vs. the other on-disk changes (creating directories and files).
In any case the ordering deserves a better-considered comment than
this one.

Agreed on all points. In general, though, I think the rule that WAL must
hit the disk before the data applies mainly to shared buffers: we flush the
WAL before writing the shared buffer so that, in case of a torn page, we
can recover the page from the previous FPI. When we are creating a
directory or file there is no such requirement. Anyway, I agree that the
ordering should be more uniform and that the comment should be correct.

+ XLogRegisterData((char *) PG_MAJORVERSION, nbytes);

Surely this is utterly pointless.

Yeah it is. During the WAL replay also we know the PG_MAJORVERSION :)

+ CopyDatabase(src_dboid, dboid, src_deftablespace, dst_deftablespace);
PG_END_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
PointerGetDatum(&fparms));

I'd leave braces around the code for which we're ensuring error
cleanup even if it's just one line.

Okay

+ if (info == XLOG_DBASEDIR_CREATE)
{
xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *)
XLogRecGetData(record);

Seems odd to rename the record but not change the name of the struct.
I think I would be inclined to keep the existing record name, but if
we're going to change it we should be more thorough.

Right, I think we can leave the record name as it is.

[1]: /messages/by-id/CAFiTN-sP_6hWv5AxcwnWCgg=4hyEeeZcCgFucZsYWpr3XQbP1g@mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#29

Robert Haas

robertmhaas@gmail.com

over 4 years ago

In reply to: Dilip Kumar (#28)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Fri, Sep 3, 2021 at 6:23 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

+ /* Built-in oids are mapped directly */
+ if (classForm->oid < FirstGenbkiObjectId)
+ relfilenode = classForm->oid;
+ else if (OidIsValid(classForm->relfilenode))
+ relfilenode = classForm->relfilenode;
+ else
+ continue;
Am I missing something, or is this totally busted?
Oops, I think the condition should be like below, but I will think carefully before posting the next version if there is something else I am missing.

if (OidIsValid(classForm->relfilenode))
relfilenode = classForm->relfilenode;
else if (classForm->oid < FirstGenbkiObjectId)
relfilenode = classForm->oid;
else
continue;

What about mapped rels that have been rewritten at some point?
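
To make the question concrete, here is a minimal sketch, not code from the patch. The Demo* types and demo_* functions are hypothetical stand-ins for the relmap structures, and the OIDs and filenodes in the usage note are only illustrative. A mapped relation has relfilenode = 0 in pg_class and its current filenode is recorded in the relmap file. After a rewrite that filenode no longer equals the relation's OID, so falling back to the OID would pick the wrong file.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

/* Hypothetical, simplified stand-in for one relmap entry. */
typedef struct DemoRelMapping
{
    Oid mapoid;        /* pg_class OID of the mapped relation */
    Oid mapfilenode;   /* filenode currently assigned to it */
} DemoRelMapping;

/* Hypothetical, simplified stand-in for the relmap file contents. */
typedef struct DemoRelMap
{
    int            num_mappings;
    DemoRelMapping mappings[8];
} DemoRelMap;

/* Linear search of the map; returns InvalidOid if relid is not mapped. */
Oid
demo_map_lookup(const DemoRelMap *map, Oid relid)
{
    assert(map != NULL);

    for (int i = 0; i < map->num_mappings; i++)
    {
        if (map->mappings[i].mapoid == relid)
            return map->mappings[i].mapfilenode;
    }
    return InvalidOid;
}

/*
 * Resolve the on-disk filenode for one pg_class row.
 *
 * A nonzero pg_class.relfilenode is used as-is.  Otherwise the relation is
 * either mapped (ask the relmap) or has no storage at all, in which case
 * InvalidOid is returned and the caller should skip it.
 */
Oid
demo_resolve_filenode(const DemoRelMap *map, Oid relid, Oid relfilenode)
{
    if (relfilenode != InvalidOid)
        return relfilenode;

    return demo_map_lookup(map, relid);
}
```

With a map holding {1259 -> 1259} and {1249 -> 16500}, the second entry models a mapped catalog that has been rewritten: demo_resolve_filenode(&map, 1249, 0) yields 16500, whereas an OID-based fallback would have yielded 1249.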

Agreed on all points. In general, though, I think the rule that WAL must hit the disk before the data applies mainly to shared buffers: we flush the WAL before writing the shared buffer so that, in case of a torn page, we can recover the page from the previous FPI. When we are creating a directory or file there is no such requirement. Anyway, I agree that the ordering should be more uniform and that the comment should be correct.

There have been previous debates about whether WAL records for
filesystem operations should be issued before or after those
operations are performed. I'm not sure how easy those discussions are
to find in the archives, but it's very relevant here. I think the
short version is - if we write a WAL record first and then the
operation fails afterward, we have to PANIC. But if we perform the
operation first and then write the WAL record if it succeeds, then we
could crash before writing WAL and end up out of sync with our
standbys. If we then later do any WAL-logged operation locally that
depends on that operation having been performed, replay will fail on
the standby. There used to be, or maybe still are, comments in the
code defending the latter approach, but more recently it's been
strongly criticized. The thinking, AIUI, is basically that filesystem
operations really ought not to fail, because nobody should be doing
weird things to the data directory, and if they do, panicking is OK.
But having replay fail in strange ways on the standby later is not OK.

I'm not sure if everyone agrees with that logic; it seems somewhat
debatable. I *think* I personally agree with it but ... I'm not even
100% sure about that.
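
The two orderings can be modelled with a toy state machine. This is only a sketch under stated assumptions: DemoState, DemoResult and the demo_* functions are hypothetical, nothing here touches real WAL or the filesystem, and "crash" is just a flag passed in by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum
{
    DEMO_OK,        /* both steps completed */
    DEMO_ERROR,     /* ordinary, recoverable error */
    DEMO_PANIC,     /* WAL already promises something we could not do */
    DEMO_CRASHED    /* simulated crash between the two steps */
} DemoResult;

typedef struct DemoState
{
    bool wal_has_record;    /* record is in WAL, standbys will replay it */
    bool op_done_locally;   /* filesystem change exists on the primary */
} DemoState;

/* A filesystem operation that reports success or failure. */
typedef bool (*DemoFsOp) (void);

bool demo_op_succeeds(void) { return true; }
bool demo_op_fails(void) { return false; }

/*
 * WAL first: once the record is written it cannot be taken back, so a
 * failure of the operation afterwards can only be handled by PANIC.
 */
DemoResult
demo_wal_first(DemoState *st, DemoFsOp op)
{
    assert(st != NULL && op != NULL);

    st->wal_has_record = true;
    st->op_done_locally = op();
    return st->op_done_locally ? DEMO_OK : DEMO_PANIC;
}

/*
 * Operation first: a failure is an ordinary ERROR, but a crash after the
 * operation and before the WAL write leaves the primary ahead of its WAL.
 */
DemoResult
demo_op_first(DemoState *st, DemoFsOp op, bool crash_before_wal)
{
    assert(st != NULL && op != NULL);

    st->op_done_locally = op();
    if (!st->op_done_locally)
        return DEMO_ERROR;
    if (crash_before_wal)
        return DEMO_CRASHED;

    st->wal_has_record = true;
    return DEMO_OK;
}

/* True when the primary's filesystem and the WAL stream agree. */
bool
demo_in_sync(const DemoState *st)
{
    return st->wal_has_record == st->op_done_locally;
}
```

In this model the WAL-first path turns a failed operation into DEMO_PANIC, while the operation-first path turns a badly timed crash into a state where demo_in_sync() is false, which corresponds to the standby divergence described above.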

--
Robert Haas
EDB: http://www.enterprisedb.com

#30

Andres Freund

andres@anarazel.de

over 4 years ago

In reply to: Dilip Kumar (#27)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hi,

On 2021-09-03 14:25:10 +0530, Dilip Kumar wrote:

Yeah, we can surely lock the relation as described by Robert, but IMHO,
while creating the database we are already holding the exclusive lock on
the database and there is no one else allowed to be connected to the
database, so do we actually need to bother about the lock for the
correctness?

The problem is that checkpointer, bgwriter, buffer reclaim don't care about
the database of the buffer they're working on... The exclusive lock on the
database doesn't change anything about that. Perhaps you can justify it's safe
because there can't be any dirty buffers or such though.

I think we already have such a code in multiple places where we bypass the
shared buffers for copying the relation
e.g. index_copy_data(), heapam_relation_copy_data().

That's not at all comparable. We hold an exclusive lock on the relation at
that point, and we don't have a separate implementation of reading tuples from
the table or something like that.

Greetings,

Andres Freund

#31

Dilip Kumar

dilipbalaut@gmail.com

over 4 years ago

In reply to: Andres Freund (#30)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Sat, Sep 4, 2021 at 3:24 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2021-09-03 14:25:10 +0530, Dilip Kumar wrote:

Yeah, we can surely lock the relation as described by Robert, but IMHO,
while creating the database we are already holding the exclusive lock on
the database and there is no one else allowed to be connected to the
database, so do we actually need to bother about the lock for the
correctness?

The problem is that checkpointer, bgwriter, buffer reclaim don't care about
the database of the buffer they're working on... The exclusive lock on the
database doesn't change anything about that.

But these operate directly on the buffers, and in my patch, whether we are
reading pg_class to identify the relfilenode or copying the relation block
by block, we always hold the lock on the buffer.

Perhaps you can justify it's safe
because there can't be any dirty buffers or such though.

I think we already have such a code in multiple places where we bypass the
shared buffers for copying the relation
e.g. index_copy_data(), heapam_relation_copy_data().

That's not at all comparable. We hold an exclusive lock on the relation at
that point, and we don't have a separate implementation of reading tuples from
the table or something like that.

Okay, but my example was against the point Robert raised that he feels that
bypassing the shared buffer anywhere is hackish. But yeah, I agree his
point might be that even if we are using it in existing code we can not
justify it.

For moving forward I think the main open concerns we have as of now are

1. The special-purpose code for scanning pg_class. We could avoid it by
scanning the source database directory instead, but I think Robert doesn't
like this approach because we would be scanning the directory directly and
bypassing the shared buffers. But this is not any worse than what we have
now, right? Today we also scan the directory directly; the only change
would be that instead of copying files directly we read each file and copy
it block by block.

2. Another problem is that while copying the relation we access the
relation's buffers without holding the relation lock. We do hold the buffer
lock, so I am not sure whether we really have a problem here w.r.t. the
checkpointer and bgwriter. If we do, we can create the lock tag and acquire
the relation lock.

3. While copying the relation whether to use the bufmgr or directly use the
smgr?

If we use the bufmgr then maybe we can avoid flushing some of the buffers
to the disk and save some I/O but in general we copy from the template
database so there might not be a lot of dirty buffers and we might not save
anything, OTOH, if we directly use the smgr for copying the relation data
we can reuse some existing code RelationCopyStorage() and the patch will be
simpler. Other than just code simplicity or IO there is also a concern by
Robert that he doesn't like to bypass the bufmgr, and that will be
applicable to the point #1 as well as #3.

Thoughts?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#32

Andres Freund

andres@anarazel.de

over 4 years ago

In reply to: Dilip Kumar (#31)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On 2021-09-05 14:22:51 +0530, Dilip Kumar wrote:

On Sat, Sep 4, 2021 at 3:24 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2021-09-03 14:25:10 +0530, Dilip Kumar wrote:

Yeah, we can surely lock the relation as described by Robert, but IMHO,
while creating the database we are already holding the exclusive lock on
the database and there is no one else allowed to be connected to the
database, so do we actually need to bother about the lock for the
correctness?

The problem is that checkpointer, bgwriter, buffer reclaim don't care about
the database of the buffer they're working on... The exclusive lock on the
database doesn't change anything about that.

But these operate directly on the buffers, and in my patch, whether we are
reading pg_class to identify the relfilenode or copying the relation block
by block, we always hold the lock on the buffer.

I don't think a buffer lock is really sufficient. See e.g. code like:

static void
InvalidateBuffer(BufferDesc *buf)
{
...
/*
* We assume the only reason for it to be pinned is that someone else is
* flushing the page out. Wait for them to finish. (This could be an
* infinite loop if the refcount is messed up... it would be nice to time
* out after awhile, but there seems no way to be sure how many loops may
* be needed. Note that if the other guy has pinned the buffer but not
* yet done StartBufferIO, WaitIO will fall through and we'll effectively
* be busy-looping here.)
*/
if (BUF_STATE_GET_REFCOUNT(buf_state) != 0)
{
UnlockBufHdr(buf, buf_state);
LWLockRelease(oldPartitionLock);
/* safety check: should definitely not be our *own* pin */
if (GetPrivateRefCount(BufferDescriptorGetBuffer(buf)) > 0)
elog(ERROR, "buffer is pinned in InvalidateBuffer");
WaitIO(buf);
goto retry;
}

IOW, currently we assume that you're only allowed to pin a block in a relation
while you hold a lock on the relation. It might be a good idea to change that,
but it's not as trivial as one might think - consider e.g. dropping a
relation's buffers while holding an exclusive lock: If there's potential
concurrent reads of that buffer we'd be in trouble.

3. While copying the relation whether to use the bufmgr or directly use the
smgr?

If we use the bufmgr then maybe we can avoid flushing some of the buffers
to the disk and save some I/O but in general we copy from the template
database so there might not be a lot of dirty buffers and we might not save
anything

I would assume the big benefit would be that the *target* database does not
have to be written out / shared buffer is immediately populated.

Greetings,

Andres Freund

#33

Dilip Kumar

dilipbalaut@gmail.com

over 4 years ago

In reply to: Andres Freund (#32)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Mon, Sep 6, 2021 at 1:58 AM Andres Freund <andres@anarazel.de> wrote:

On 2021-09-05 14:22:51 +0530, Dilip Kumar wrote:

But these operate directly on the buffers, and in my patch, whether we are
reading pg_class to identify the relfilenode or copying the relation block
by block, we always hold the lock on the buffer.

I don't think a buffer lock is really sufficient. See e.g. code like:

I agree that a buffer lock alone is not sufficient, but here we are
talking about the case where we already hold the exclusive lock on the
database plus the buffer lock. So cases like the one below, which should be
called only from drop relation, must be protected by the database
exclusive lock, and the other examples like buffer reclaim/checkpointer
should be protected by the buffer pin + lock. Having said that, I am not
arguing against acquiring the relation lock in our case. I agree that if
there is an assumption that we must hold the relation lock in order to
hold a buffer pin, then it is better not to break that.

static void
InvalidateBuffer(BufferDesc *buf)
{
...
/*
* We assume the only reason for it to be pinned is that someone else is
* flushing the page out. Wait for them to finish. (This could be an
* infinite loop if the refcount is messed up... it would be nice to time
* out after awhile, but there seems no way to be sure how many loops may
* be needed. Note that if the other guy has pinned the buffer but not
* yet done StartBufferIO, WaitIO will fall through and we'll effectively
* be busy-looping here.)
*/
if (BUF_STATE_GET_REFCOUNT(buf_state) != 0)
{
UnlockBufHdr(buf, buf_state);
LWLockRelease(oldPartitionLock);
/* safety check: should definitely not be our *own* pin */
if (GetPrivateRefCount(BufferDescriptorGetBuffer(buf)) > 0)
elog(ERROR, "buffer is pinned in InvalidateBuffer");
WaitIO(buf);
goto retry;
}

IOW, currently we assume that you're only allowed to pin a block in a relation
while you hold a lock on the relation. It might be a good idea to change that,
but it's not as trivial as one might think - consider e.g. dropping a
relation's buffers while holding an exclusive lock: If there's potential
concurrent reads of that buffer we'd be in trouble.

3. While copying the relation whether to use the bufmgr or directly use the
smgr?

If we use the bufmgr then maybe we can avoid flushing some of the buffers
to the disk and save some I/O but in general we copy from the template
database so there might not be a lot of dirty buffers and we might not save
anything

I would assume the big benefit would be that the *target* database does not
have to be written out / shared buffer is immediately populated.

Okay, that makes sense. In fact, for using the shared buffers for the
destination database's relation we don't even have the locking issue,
because that database is not yet accessible to anyone right?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#34

Dilip Kumar

dilipbalaut@gmail.com

over 4 years ago

In reply to: Dilip Kumar (#33)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Mon, Sep 6, 2021 at 11:17 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Okay, that makes sense. In fact, for using the shared buffers for the
destination database's relation we don't even have the locking issue,
because that database is not yet accessible to anyone right?

Based on all these discussions I am planning to change the design as below,

- FlushDatabaseBuffers().

- Scan the database directory under each tablespace and prepare a
tablespace-wise relfilenode list. Along with this we will also remember the
persistence level, based on the presence of an init fork.

- Next, copy each relfilenode to the destination. For the source relation,
use smgrread directly, whereas for the destination relation use the bufmgr.
The main reasons for not using the bufmgr for the source relations are
a) we can avoid acquiring a special-purpose lock on the relation, and
b) we are copying from the template database, so in most cases there will
not be many dirty buffers for that database and there is no real need for
using the shared buffers.
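
As a rough illustration of the directory-scanning step, the sketch below classifies one directory entry name. It is not code from the patch and not PostgreSQL's own parser: demo_parse_relfile and DemoFork are hypothetical names, and the function only assumes the usual on-disk naming of <relfilenode>, an optional _fsm, _vm or _init fork suffix, and an optional .<segment> suffix. Anything else (PG_VERSION, pg_filenode.map, temporary-relation files starting with "t") is rejected, and overflow of very long digit strings is not checked.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef enum
{
    DEMO_FORK_MAIN,
    DEMO_FORK_FSM,
    DEMO_FORK_VM,
    DEMO_FORK_INIT
} DemoFork;

/*
 * Parse a file name found in a database directory.
 *
 * Returns true and fills *relfilenode and *fork if the name looks like a
 * relation file; returns false for anything else.  The output parameters
 * are only meaningful when true is returned.
 */
bool
demo_parse_relfile(const char *name, unsigned long *relfilenode,
                   DemoFork *fork)
{
    char *end;

    assert(name != NULL && relfilenode != NULL && fork != NULL);

    /* A relfilenode is a positive decimal number without leading zeros. */
    if (name[0] < '1' || name[0] > '9')
        return false;

    *relfilenode = strtoul(name, &end, 10);
    *fork = DEMO_FORK_MAIN;

    /* Optional fork suffix. */
    if (*end == '_')
    {
        if (strncmp(end, "_fsm", 4) == 0)
        {
            *fork = DEMO_FORK_FSM;
            end += 4;
        }
        else if (strncmp(end, "_vm", 3) == 0)
        {
            *fork = DEMO_FORK_VM;
            end += 3;
        }
        else if (strncmp(end, "_init", 5) == 0)
        {
            *fork = DEMO_FORK_INIT;
            end += 5;
        }
        else
            return false;
    }

    /* Optional segment number: a dot followed by at least one digit. */
    if (*end == '.')
    {
        end++;
        if (!isdigit((unsigned char) *end))
            return false;
        while (isdigit((unsigned char) *end))
            end++;
    }

    /* Trailing garbage means this is not a relation file. */
    return *end == '\0';
}
```

A caller building the per-tablespace list could mark a relfilenode as unlogged when it sees a name that parses with DEMO_FORK_INIT, and treat it as permanent otherwise.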

Any objections to the above design?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#35

Robert Haas

robertmhaas@gmail.com

over 4 years ago

In reply to: Andres Freund (#30)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Fri, Sep 3, 2021 at 5:54 PM Andres Freund <andres@anarazel.de> wrote:

I think we already have such a code in multiple places where we bypass the
shared buffers for copying the relation
e.g. index_copy_data(), heapam_relation_copy_data().

That's not at all comparable. We hold an exclusive lock on the relation at
that point, and we don't have a separate implementation of reading tuples from
the table or something like that.

I don't think there's a way to do this that is perfectly clean, so the
discussion here is really about finding the least unpleasant
alternative. I *really* like the idea of using pg_class to figure out
what relations to copy. As far as I'm concerned, pg_class is the
canonical list of what's in the database, and to the extent that the
filesystem happens to agree, that's good luck. From that perspective,
using the filesystem to figure out what to copy is by definition a
hack.

Now, having to use dedicated tuple-reading code is also a hack, but to
me that's largely an accident of questionable design decisions
elsewhere. You can't read a buffer with just the minimal amount of
information that you need to read a buffer; you have to have a
relcache entry, so we have things like ReadBufferWithoutRelcache and
CreateFakeRelcacheEntry. It's a little crazy to me that someone saw
that ReadBuffer() needed a thing which some callers might not have and
instead of saying "hmm, maybe we ought to change the arguments so that
anyone with enough information to call this function can do so," they
said "hmm, let's create a fake object that is not really the same as a
real one but good enough to fool the function into doing the right
thing, probably." I think the code layering here is just flat-out
broken and ought to be fixed. A layer whose job it is to read and
write blocks should not know that relations are even a thing. (The
widespread use of global variables in the relcache code, the catcache
code, and many other places in lieu of explicit parameter-passing just
makes everything a lot worse.)

So I think if we commit to the hackiness of the sort that this patch
introduces, there is some hope of things getting better in the future.
I don't think it's a real easy path forward, but maybe it's possible.
If on the other hand we commit to using the filesystem, I don't see
how it ever gets any better. Unlogged tables are a great example of a
feature that depended on the filesystem and it now seems to me to be -
by far - the worst thing about that feature. I have no idea how to get
rid of that dependency or all of the associated problems without
reverting the feature. But in this case, we seem to have another
option, and so I think we should take it.

Your (or other people's) mileage may vary ... this is just my view of it.

--
Robert Haas
EDB: http://www.enterprisedb.com

#36

Dilip Kumar

dilipbalaut@gmail.com

over 4 years ago

In reply to: Robert Haas (#35)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Sep 8, 2021 at 9:54 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Sep 3, 2021 at 5:54 PM Andres Freund <andres@anarazel.de> wrote:

I think we already have such a code in multiple places where we bypass the
shared buffers for copying the relation
e.g. index_copy_data(), heapam_relation_copy_data().

That's not at all comparable. We hold an exclusive lock on the relation at
that point, and we don't have a separate implementation of reading tuples from
the table or something like that.

I don't think there's a way to do this that is perfectly clean, so the
discussion here is really about finding the least unpleasant
alternative. I *really* like the idea of using pg_class to figure out
what relations to copy. As far as I'm concerned, pg_class is the
canonical list of what's in the database, and to the extent that the
filesystem happens to agree, that's good luck. From that perspective,
using the filesystem to figure out what to copy is by definition a
hack.

I agree with you that scanning pg_class to identify the relfilenodes looks
more sensible than scanning the file system directly. But we need to
consider one point: the current system (in CREATE DATABASE) already scans
the directory to copy the files. With the directory approach, instead of
copying the files directly, we would identify the relfilenodes and copy
them block by block. So maybe that approach will not make anyone unhappy,
because it is not any worse than the current system.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#37

Robert Haas

robertmhaas@gmail.com

over 4 years ago

In reply to: Dilip Kumar (#36)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Sat, Sep 11, 2021 at 12:17 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I agree with you that scanning pg_class to identify the relfilenodes looks more sensible than scanning the file system directly. But we need to consider one point: the current system (in CREATE DATABASE) already scans the directory to copy the files. With the directory approach, instead of copying the files directly, we would identify the relfilenodes and copy them block by block. So maybe that approach will not make anyone unhappy, because it is not any worse than the current system.

So, I agree. If we can't get agreement on this approach, then we can
do that, and as you say, it's no worse than what we are doing now. But
I am just trying to lay out my view of why I think that's not as good
as this.

--
Robert Haas
EDB: http://www.enterprisedb.com

#38

Dilip Kumar

dilipbalaut@gmail.com

over 4 years ago

In reply to: Robert Haas (#26)

6 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Sep 2, 2021 at 8:52 PM Robert Haas <robertmhaas@gmail.com> wrote:

PFA, updated version of the patch, where I have fixed the issues
reported by you and also done some more refactoring and patch split,
next I am planning to post the patch with another approach where we
scan the directory instead of scanning the pg_class for identifying
the relfilenodes. For specific comments please find my response
inline,

Andres pointed out that this approach ends up accessing relations
without taking a lock on them. It doesn't look like you did anything
about that.

Now I acquire a lock before scanning pg_class as well as the other
relfilenodes.

+ /* Built-in oids are mapped directly */
+ if (classForm->oid < FirstGenbkiObjectId)
+ relfilenode = classForm->oid;
+ else if (OidIsValid(classForm->relfilenode))
+ relfilenode = classForm->relfilenode;
+ else
+ continue;

Am I missing something, or is this totally busted?

Handled the mapped relation using relmapper.

/*
+ * Now drop all buffers holding data of the target database; they should
+ * no longer be dirty so DropDatabaseBuffers is safe.
The way things worked before, this was true, but now AFAICS it's
false. I'm not sure whether that means that DropDatabaseBuffers() here
is actually unsafe or whether it just means that you haven't updated
the comment to explain the reason.

Now we can only drop the buffers specific to the old tablespace, not the
new tablespace, so we cannot directly use the dboid. I have extended the
DropDatabaseBuffers interface to take a tablespace OID as an input and
updated the comments accordingly.

+ * Since we copy the file directly without looking at the shared buffers,
+ * we'd better first flush out any pages of the source relation that are
+ * in shared buffers.  We assume no new changes will be made while we are
+ * holding exclusive lock on the rel.

Ditto.

I think this comment is related to index_copy_data() and is still valid;
it shows up in the patch due to some refactoring, so I have separated that
refactoring out as patch 0003 to avoid confusion.

+ /* As always, WAL must hit the disk before the data update does. */

Actually, the way it's coded now, part of the on-disk changes are done
before WAL is issued, and part are done after. I doubt that's the
right idea. There's nothing special about writing the actual payload
bytes vs. the other on-disk changes (creating directories and files).
In any case the ordering deserves a better-considered comment than
this one.

Changed, now WAL first and then disk change.

Open questions:
- Scan pg_class vs scan directories
- Whether to retain the old CREATE DATABASE mechanism as an option or not.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v3-0002-Extend-relmap-interfaces.patch (application/octet-stream)

From c0c4fae8722ed0a8d5ceae7e2ef925477ee4db02 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Wed, 1 Sep 2021 14:16:35 +0530
Subject: [PATCH v3 2/6] Extend relmap interfaces

Support new interfaces in relmapper, 1) Support copying the
relmap file from one database path to the other database path.
2) Like RelationMapOidToFilenode, provide another interface
which do the same but instead of getting it for the database
we are connected to it will get it for the input database
path.

These interfaces are required for next patch for supporting the
wal logged created database.
---
 src/backend/utils/cache/relmapper.c | 131 ++++++++++++++++++++++++----
 src/include/utils/relmapper.h       |   6 +-
 2 files changed, 119 insertions(+), 18 deletions(-)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index ae6291018a..182054eec7 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -141,7 +141,7 @@ static void read_relmap_file(char *mapfilename, RelMapFile *map,
 static void write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
 									   bool write_wal, bool send_sinval,
 									   bool preserve_files, Oid dbid, Oid tsid,
-									   const char *dbpath);
+									   const char *dbpath, bool create);
 static void load_relmap_file(bool shared, bool lock_held);
 static void write_relmap_file(bool shared, RelMapFile *newmap,
 							  bool write_wal, bool send_sinval, bool preserve_files,
@@ -255,6 +255,40 @@ RelationMapFilenodeToOid(Oid filenode, bool shared)
 	return InvalidOid;
 }
 
+/*
+ * RelationMapOidToFilenodeForDatabase
+ *
+ * Find relfilenode for the given relation id in the dbpath.  Returns
+ * InvalidOid if the relationId is not found in the relmap.
+ *
+ * This function is only called during CREATE DATABASE command, so we can pass
+ * lock_held as true while reading the relmap file since we are already holding
+ * the exclusive lock on the database.
+ */
+Oid
+RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId)
+{
+	RelMapFile	map;
+	int			i;
+	char		mapfilename[MAXPGPATH];
+
+	/* Relmap file path for the given dbpath. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 dbpath, RELMAPPER_FILENAME);
+
+	/* Read the relmap file from the source database. */
+	read_relmap_file(mapfilename, &map, true);
+
+	/* Iterate over the relmap entries to find the input relation oid. */
+	for (i = 0; i < map.num_mappings; i++)
+	{
+		if (relationId == map.mappings[i].mapoid)
+			return map.mappings[i].mapfilenode;
+	}
+
+	return InvalidOid;
+}
+
 /*
  * RelationMapUpdateMap
  *
@@ -693,10 +727,47 @@ RestoreRelationMap(char *startAddress)
 }
 
 /*
- * load_relmap_file -- load data from the shared or local map file
+ * CopyRelationMap
  *
- * Because the map file is essential for access to core system catalogs,
- * failure to read it is a fatal error.
+ * Copy relmapfile from source db path to the destination db path and WAL log
+ * the operation.  This function is only called during the create database so
+ * the caller must hold the exclusive lock on the source database.  Destination
+ * database is not yet created so we don't have any issue.
+ */
+void
+CopyRelationMap(Oid dbid, Oid tsid, char *srcdbpath, char *dstdbpath)
+{
+	RelMapFile map;
+	char mapfilename[MAXPGPATH];
+
+	/* Relmap file path of the source database. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 srcdbpath, RELMAPPER_FILENAME);
+
+	/*
+	 * Read the relmap file from the source database.  We are not connected to
+	 * the database so we can not take the relmap lock, but we are already
+	 * holding exclusive lock on the database so pass lock_held as true.
+	 */
+	read_relmap_file(mapfilename, &map, true);
+
+	/* Relmap file path of the destination database. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 dstdbpath, RELMAPPER_FILENAME);
+
+	/*
+	 * Write map contents into the destination database's relmap file.
+	 * write_relmap_file_internal, expects that the CRC should have been
+	 * computed and stored in the input map.  But, since we have read this map
+	 * from the source database and directly writing to the destination file
+	 * without updating it so we don't need to recompute it.
+	 */
+	write_relmap_file_internal(mapfilename, &map, true, false, true, dbid,
+							   tsid, dstdbpath, true);
+}
+
+/*
+ * read_relmap_file - read the relmap file data.
  *
  * lock_held, pass true if caller already have the relation mapping or higher
  * level lock.
@@ -802,15 +873,18 @@ load_relmap_file(bool shared, bool lock_held)
 }
 
 /*
- * Helper function for write_relmap_file, Read comments atop write_relmap_file
- * for more details.  The CRC should be computed by the caller and stored in
- * the newmap.
+ * Helper function for write_relmap_file and CopyRelationMap, Read comments
+ * atop write_relmap_file for more details.  The CRC should be computed by the
+ * caller and stored in the newmap.
+ *
+ * Pass the create = true, if we are copying the relmap file during CREATE
+ * DATABASE command.
  */
 static void
 write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
 						   bool write_wal, bool send_sinval,
 						   bool preserve_files, Oid dbid, Oid tsid,
-						   const char *dbpath)
+						   const char *dbpath, bool create)
 {
 	int			fd;
 
@@ -836,6 +910,7 @@ write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
 		xlrec.dbid = dbid;
 		xlrec.tsid = tsid;
 		xlrec.nbytes = sizeof(RelMapFile);
+		xlrec.create = create;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&xlrec), MinSizeOfRelmapUpdate);
@@ -977,7 +1052,7 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 	/* Write the map to the relmap file. */
 	write_relmap_file_internal(mapfilename, newmap, write_wal,
 							   send_sinval, preserve_files, dbid, tsid,
-							   dbpath);
+							   dbpath, false);
 
 	/*
 	 * Success, update permanent copy.  During bootstrap, we might be working
@@ -1069,15 +1144,37 @@ relmap_redo(XLogReaderState *record)
 		 * Write out the new map and send sinval, but of course don't write a
 		 * new WAL entry.  There's no surrounding transaction to tell to
 		 * preserve files, either.
-		 *
-		 * There shouldn't be anyone else updating relmaps during WAL replay,
-		 * but grab the lock to interlock against load_relmap_file().
 		 */
-		LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
-		write_relmap_file((xlrec->dbid == InvalidOid), &newmap,
-						  false, true, false,
-						  xlrec->dbid, xlrec->tsid, dbpath);
-		LWLockRelease(RelationMappingLock);
+		if (!xlrec->create)
+		{
+			/*
+			 * There shouldn't be anyone else updating relmaps during WAL
+			 * replay, but grab the lock to interlock against
+			 * load_relmap_file().
+			 */
+			LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
+			write_relmap_file((xlrec->dbid == InvalidOid), &newmap,
+							false, true, false,
+							xlrec->dbid, xlrec->tsid, dbpath);
+			LWLockRelease(RelationMappingLock);
+		}
+		else
+		{
+			char		mapfilename[MAXPGPATH];
+
+			/* Construct the mapfilename. */
+			snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+					 dbpath, RELMAPPER_FILENAME);
+
+			/*
+			 * We don't need to take the relmap lock because this WAL record
+			 * is logged while creating a new database, so no one else can be
+			 * reading or writing the relmap file.
+			 */
+			write_relmap_file_internal(mapfilename, &newmap, false, false,
+									   false, xlrec->dbid, xlrec->tsid, dbpath,
+									   true);
+		}
 
 		pfree(dbpath);
 	}
diff --git a/src/include/utils/relmapper.h b/src/include/utils/relmapper.h
index c0d14daad9..4165f0990b 100644
--- a/src/include/utils/relmapper.h
+++ b/src/include/utils/relmapper.h
@@ -29,6 +29,7 @@ typedef struct xl_relmap_update
 	Oid			dbid;			/* database ID, or 0 for shared map */
 	Oid			tsid;			/* database's tablespace, or pg_global */
 	int32		nbytes;			/* size of relmap data */
+	bool		create;			/* true if creating new relmap */
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } xl_relmap_update;
 
@@ -39,6 +40,8 @@ extern Oid	RelationMapOidToFilenode(Oid relationId, bool shared);
 
 extern Oid	RelationMapFilenodeToOid(Oid relationId, bool shared);
 
+extern Oid RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId);
+
 extern void RelationMapUpdateMap(Oid relationId, Oid fileNode, bool shared,
 								 bool immediate);
 
@@ -62,7 +65,8 @@ extern void RelationMapInitializePhase3(void);
 extern Size EstimateRelationMapSpace(void);
 extern void SerializeRelationMap(Size maxSize, char *startAddress);
 extern void RestoreRelationMap(char *startAddress);
-
+extern void CopyRelationMap(Oid dbid, Oid tsid, char *srcdbpath,
+							char *dstdbpath);
 extern void relmap_redo(XLogReaderState *record);
 extern void relmap_desc(StringInfo buf, XLogReaderState *record);
 extern const char *relmap_identify(uint8 info);
-- 
2.23.0

Attachment: v3-0003-Refactor-index_copy_data.patch (application/octet-stream)

From 197111e1c31aff66335eaf05de0189ae8d17f48d Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 24 Sep 2021 18:13:25 +0530
Subject: [PATCH v3 3/6] Refactor index_copy_data

Add a separate interface for copying relation storage.  A later patch
will use it to copy the database relations.
---
 src/backend/commands/tablecmds.c | 59 +++++++++++++++++++-------------
 src/include/commands/tablecmds.h |  5 +++
 2 files changed, 40 insertions(+), 24 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index dbee6ae199..c3e5aee8a4 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14189,21 +14189,13 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
 	return new_tablespaceoid;
 }
 
-static void
-index_copy_data(Relation rel, RelFileNode newrnode)
+/*
+ * Copy the data of all forks from the source smgr to the destination smgr.
+ */
+void
+RelationCopyAllFork(SMgrRelation src_smgr, SMgrRelation dst_smgr,
+					char relpersistence, copy_relation_storage copy_storage)
 {
-	SMgrRelation dstrel;
-
-	dstrel = smgropen(newrnode, rel->rd_backend);
-
-	/*
-	 * Since we copy the file directly without looking at the shared buffers,
-	 * we'd better first flush out any pages of the source relation that are
-	 * in shared buffers.  We assume no new changes will be made while we are
-	 * holding exclusive lock on the rel.
-	 */
-	FlushRelationBuffers(rel);
-
 	/*
 	 * Create and copy all forks of the relation, and schedule unlinking of
 	 * old physical files.
@@ -14211,32 +14203,51 @@ index_copy_data(Relation rel, RelFileNode newrnode)
 	 * NOTE: any conflict in relfilenode value will be caught in
 	 * RelationCreateStorage().
 	 */
-	RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
+	RelationCreateStorage(dst_smgr->smgr_rnode.node, relpersistence);
 
 	/* copy main fork */
-	RelationCopyStorage(RelationGetSmgr(rel), dstrel, MAIN_FORKNUM,
-						rel->rd_rel->relpersistence);
+	copy_storage(src_smgr, dst_smgr, MAIN_FORKNUM, relpersistence);
 
 	/* copy those extra forks that exist */
 	for (ForkNumber forkNum = MAIN_FORKNUM + 1;
 		 forkNum <= MAX_FORKNUM; forkNum++)
 	{
-		if (smgrexists(RelationGetSmgr(rel), forkNum))
+		if (smgrexists(src_smgr, forkNum))
 		{
-			smgrcreate(dstrel, forkNum, false);
+			smgrcreate(dst_smgr, forkNum, false);
 
 			/*
 			 * WAL log creation if the relation is persistent, or this is the
 			 * init fork of an unlogged relation.
 			 */
-			if (RelationIsPermanent(rel) ||
-				(rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
+			if (relpersistence == RELPERSISTENCE_PERMANENT ||
+				(relpersistence == RELPERSISTENCE_UNLOGGED &&
 				 forkNum == INIT_FORKNUM))
-				log_smgrcreate(&newrnode, forkNum);
-			RelationCopyStorage(RelationGetSmgr(rel), dstrel, forkNum,
-								rel->rd_rel->relpersistence);
+				log_smgrcreate(&dst_smgr->smgr_rnode.node, forkNum);
+
+			/* Copy a fork's data, block by block. */
+			copy_storage(src_smgr, dst_smgr, forkNum, relpersistence);
 		}
 	}
+}
+
+static void
+index_copy_data(Relation rel, RelFileNode newrnode)
+{
+	SMgrRelation dstrel;
+
+	dstrel = smgropen(newrnode, rel->rd_backend);
+
+	/*
+	 * Since we copy the file directly without looking at the shared buffers,
+	 * we'd better first flush out any pages of the source relation that are
+	 * in shared buffers.  We assume no new changes will be made while we are
+	 * holding exclusive lock on the rel.
+	 */
+	FlushRelationBuffers(rel);
+
+	RelationCopyAllFork(RelationGetSmgr(rel), dstrel,
+						rel->rd_rel->relpersistence, RelationCopyStorage);
 
 	/* drop old relation, and close new one */
 	RelationDropStorage(rel);
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 336549cc5f..e0e0aa5aa0 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -19,10 +19,13 @@
 #include "catalog/objectaddress.h"
 #include "nodes/parsenodes.h"
 #include "storage/lock.h"
+#include "storage/smgr.h"
 #include "utils/relcache.h"
 
 struct AlterTableUtilityContext;	/* avoid including tcop/utility.h here */
 
+typedef void (*copy_relation_storage) (SMgrRelation src, SMgrRelation dst,
+									  ForkNumber forkNum, char relpersistence);
 
 extern ObjectAddress DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 									ObjectAddress *typaddress, const char *queryString);
@@ -42,6 +45,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
 
 extern Oid	AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
 
+extern void RelationCopyAllFork(SMgrRelation src_smgr, SMgrRelation dst_smgr,
+								char relpersistence, copy_relation_storage copy_storage);
 extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
 										 Oid *oldschema);
 
-- 
2.23.0
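
The refactoring in the patch above hands the per-fork copy routine to the
fork loop as a function pointer, so the same loop can serve more than one
copy strategy.  A minimal standalone sketch of that pattern, assuming toy
stand-ins (ToyRel, toy_copy_fork, toy_copy_all_forks and the fork enum are
hypothetical names, not PostgreSQL types or functions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for fork numbers; not PostgreSQL's ForkNumber. */
enum { MAIN_FORK = 0, FSM_FORK, VM_FORK, INIT_FORK, NUM_FORKS };

/* Toy "relation": only records which forks exist. */
typedef struct ToyRel
{
	bool		exists[NUM_FORKS];
} ToyRel;

/* Callback type, similar in spirit to copy_relation_storage. */
typedef void (*toy_copy_fork) (const ToyRel *src, ToyRel *dst, int fork);

/* Number of forks copied so far, so that callers can observe the effect. */
static int	toy_forks_copied = 0;

/* One possible copy strategy: mark the fork as present in the target. */
static void
toy_copy_by_marking(const ToyRel *src, ToyRel *dst, int fork)
{
	(void) src;
	dst->exists[fork] = true;
	toy_forks_copied++;
}

/* Copy the main fork unconditionally, then every extra fork that exists. */
static void
toy_copy_all_forks(const ToyRel *src, ToyRel *dst, toy_copy_fork copy_fork)
{
	assert(copy_fork != NULL);

	copy_fork(src, dst, MAIN_FORK);

	for (int fork = MAIN_FORK + 1; fork < NUM_FORKS; fork++)
	{
		if (src->exists[fork])
			copy_fork(src, dst, fork);
	}
}
```

A caller that wants a different copy method only has to pass a different
callback; the fork loop itself does not change.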

Attachment: v3-0004-Extend-bufmgr-interfaces.patch (application/octet-stream)

From f736deb8ccb00157866ad37f692ebf6730e441c2 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 24 Sep 2021 18:23:39 +0530
Subject: [PATCH v3 4/6] Extend bufmgr interfaces

Extend the ReadBufferWithoutRelcache interface to take relpersistence as
an input, and extend DropDatabaseBuffers to take a tablespace OID as
input.
---
 src/backend/storage/buffer/bufmgr.c | 24 +++++++++++-------------
 src/include/storage/bufmgr.h        |  5 +++--
 2 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e88e4e918b..ed54c34031 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -770,24 +770,17 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 /*
  * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
  *		a relcache entry for the relation.
- *
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay.  If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
  */
 Buffer
 ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
-						  BufferAccessStrategy strategy)
+						  BufferAccessStrategy strategy, char relpersistence)
 {
 	bool		hit;
 
 	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
 
-	Assert(InRecovery);
-
-	return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
+	return ReadBuffer_common(smgr, relpersistence, forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -797,7 +790,7 @@ ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
  *
  * *hit is set to true if the request was satisfied from shared buffer cache.
  */
-static Buffer
+Buffer
 ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 				  BlockNumber blockNum, ReadBufferMode mode,
 				  BufferAccessStrategy strategy, bool *hit)
@@ -3402,10 +3395,13 @@ FindAndDropRelFileNodeBuffers(RelFileNode rnode, ForkNumber forkNum,
  *		database, to avoid trying to flush data to disk when the directory
  *		tree no longer exists.  Implementation is pretty similar to
  *		DropRelFileNodeBuffers() which is for destroying just one relation.
+ *
+ *		If a valid tablespace oid is passed, only buffers in that tablespace
+ *		are dropped; otherwise all buffers of the database are dropped.
  * --------------------------------------------------------------------
  */
 void
-DropDatabaseBuffers(Oid dbid)
+DropDatabaseBuffers(Oid dbid, Oid tbsid)
 {
 	int			i;
 
@@ -3423,11 +3419,13 @@ DropDatabaseBuffers(Oid dbid)
 		 * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
 		 * and saves some cycles.
 		 */
-		if (bufHdr->tag.rnode.dbNode != dbid)
+		if (bufHdr->tag.rnode.dbNode != dbid ||
+			(OidIsValid(tbsid) && bufHdr->tag.rnode.spcNode != tbsid))
 			continue;
 
 		buf_state = LockBufHdr(bufHdr);
-		if (bufHdr->tag.rnode.dbNode == dbid)
+		if (bufHdr->tag.rnode.dbNode == dbid &&
+			(!OidIsValid(tbsid) || bufHdr->tag.rnode.spcNode == tbsid))
 			InvalidateBuffer(bufHdr);	/* releases spinlock */
 		else
 			UnlockBufHdr(bufHdr, buf_state);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index cfce23ecbc..237c6a9078 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -184,7 +184,8 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BufferAccessStrategy strategy);
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 										ForkNumber forkNum, BlockNumber blockNum,
-										ReadBufferMode mode, BufferAccessStrategy strategy);
+										ReadBufferMode mode, BufferAccessStrategy strategy,
+										char relpersistence);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
@@ -207,7 +208,7 @@ extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
 								   int nforks, BlockNumber *firstDelBlock);
 extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes);
-extern void DropDatabaseBuffers(Oid dbid);
+extern void DropDatabaseBuffers(Oid dbid, Oid tbsid);
 
 #define RelationGetNumberOfBlocks(reln) \
 	RelationGetNumberOfBlocksInFork(reln, MAIN_FORKNUM)
-- 
2.23.0
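
The extended DropDatabaseBuffers check in the patch above treats an invalid
tablespace OID as "any tablespace".  A small standalone sketch of that
matching rule, assuming toy types (ToyBufferTag, toy_buffer_matches and
toy_count_droppable are hypothetical names; the real code also rechecks the
tag under the buffer header lock, which is left out here):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

/* Toy buffer tag: just the database and tablespace of the cached page. */
typedef struct ToyBufferTag
{
	Oid			dbNode;
	Oid			spcNode;
} ToyBufferTag;

/*
 * Return true if the buffer belongs to dbid and, when tbsid is valid,
 * also lives in that tablespace.
 */
static bool
toy_buffer_matches(const ToyBufferTag *tag, Oid dbid, Oid tbsid)
{
	if (tag->dbNode != dbid)
		return false;
	return tbsid == InvalidOid || tag->spcNode == tbsid;
}

/* Count the matching buffers in an array, as one pass over the pool would. */
static int
toy_count_droppable(const ToyBufferTag *tags, int ntags, Oid dbid, Oid tbsid)
{
	int			count = 0;

	assert(tags != NULL || ntags == 0);

	for (int i = 0; i < ntags; i++)
	{
		if (toy_buffer_matches(&tags[i], dbid, tbsid))
			count++;
	}
	return count;
}
```

Passing InvalidOid keeps the old per-database behaviour, while passing a
real tablespace OID narrows the match to that tablespace.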

Attachment: v3-0001-Refactor-relmap-load-and-relmap-write-functions.patch (application/octet-stream)

From adce8cad1eb99b646faeb5bcbf8b91220657c862 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Wed, 1 Sep 2021 14:06:29 +0530
Subject: [PATCH v3 1/6] Refactor relmap load and relmap write functions

Currently, write_relmap_file and load_relmap_file are tightly
coupled with shared_map and local_map.  As part of the higher
level patch set we need relmap read/write interfaces that are
not dependent upon shared_map and local_map, and we should be
able to pass map memory as an external parameter instead.
---
 src/backend/utils/cache/relmapper.c | 163 +++++++++++++++++-----------
 1 file changed, 102 insertions(+), 61 deletions(-)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index a6e38adce3..ae6291018a 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -136,6 +136,12 @@ static void apply_map_update(RelMapFile *map, Oid relationId, Oid fileNode,
 							 bool add_okay);
 static void merge_map_updates(RelMapFile *map, const RelMapFile *updates,
 							  bool add_okay);
+static void read_relmap_file(char *mapfilename, RelMapFile *map,
+							 bool lock_held);
+static void write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
+									   bool write_wal, bool send_sinval,
+									   bool preserve_files, Oid dbid, Oid tsid,
+									   const char *dbpath);
 static void load_relmap_file(bool shared, bool lock_held);
 static void write_relmap_file(bool shared, RelMapFile *newmap,
 							  bool write_wal, bool send_sinval, bool preserve_files,
@@ -692,31 +698,17 @@ RestoreRelationMap(char *startAddress)
  * Because the map file is essential for access to core system catalogs,
  * failure to read it is a fatal error.
  *
- * Note that the local case requires DatabasePath to be set up.
+ * lock_held, pass true if caller already have the relation mapping or higher
+ * level lock.
  */
 static void
-load_relmap_file(bool shared, bool lock_held)
+read_relmap_file(char *mapfilename, RelMapFile *map, bool lock_held)
 {
-	RelMapFile *map;
-	char		mapfilename[MAXPGPATH];
 	pg_crc32c	crc;
 	int			fd;
 	int			r;
 
-	if (shared)
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
-				 RELMAPPER_FILENAME);
-		map = &shared_map;
-	}
-	else
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 DatabasePath, RELMAPPER_FILENAME);
-		map = &local_map;
-	}
-
-	/* Read data ... */
+	/* Open the relmap file for reading. */
 	fd = OpenTransientFile(mapfilename, O_RDONLY | PG_BINARY);
 	if (fd < 0)
 		ereport(FATAL,
@@ -779,62 +771,53 @@ load_relmap_file(bool shared, bool lock_held)
 }
 
 /*
- * Write out a new shared or local map file with the given contents.
- *
- * The magic number and CRC are automatically updated in *newmap.  On
- * success, we copy the data to the appropriate permanent static variable.
- *
- * If write_wal is true then an appropriate WAL message is emitted.
- * (It will be false for bootstrap and WAL replay cases.)
- *
- * If send_sinval is true then a SI invalidation message is sent.
- * (This should be true except in bootstrap case.)
+ * load_relmap_file -- load data from the shared or local map file
  *
- * If preserve_files is true then the storage manager is warned not to
- * delete the files listed in the map.
+ * Because the map file is essential for access to core system catalogs,
+ * failure to read it is a fatal error.
  *
- * Because this may be called during WAL replay when MyDatabaseId,
- * DatabasePath, etc aren't valid, we require the caller to pass in suitable
- * values.  The caller is also responsible for being sure no concurrent
- * map update could be happening.
+ * Note that the local case requires DatabasePath to be set up.
  */
 static void
-write_relmap_file(bool shared, RelMapFile *newmap,
-				  bool write_wal, bool send_sinval, bool preserve_files,
-				  Oid dbid, Oid tsid, const char *dbpath)
+load_relmap_file(bool shared, bool lock_held)
 {
-	int			fd;
-	RelMapFile *realmap;
+	RelMapFile *map;
 	char		mapfilename[MAXPGPATH];
 
-	/*
-	 * Fill in the overhead fields and update CRC.
-	 */
-	newmap->magic = RELMAPPER_FILEMAGIC;
-	if (newmap->num_mappings < 0 || newmap->num_mappings > MAX_MAPPINGS)
-		elog(ERROR, "attempt to write bogus relation mapping");
-
-	INIT_CRC32C(newmap->crc);
-	COMP_CRC32C(newmap->crc, (char *) newmap, offsetof(RelMapFile, crc));
-	FIN_CRC32C(newmap->crc);
-
-	/*
-	 * Open the target file.  We prefer to do this before entering the
-	 * critical section, so that an open() failure need not force PANIC.
-	 */
 	if (shared)
 	{
 		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
 				 RELMAPPER_FILENAME);
-		realmap = &shared_map;
+		map = &shared_map;
 	}
 	else
 	{
 		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 dbpath, RELMAPPER_FILENAME);
-		realmap = &local_map;
+				 DatabasePath, RELMAPPER_FILENAME);
+		map = &local_map;
 	}
 
+	/* Read data ... */
+	read_relmap_file(mapfilename, map, lock_held);
+}
+
+/*
+ * Helper function for write_relmap_file, Read comments atop write_relmap_file
+ * for more details.  The CRC should be computed by the caller and stored in
+ * the newmap.
+ */
+static void
+write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
+						   bool write_wal, bool send_sinval,
+						   bool preserve_files, Oid dbid, Oid tsid,
+						   const char *dbpath)
+{
+	int			fd;
+
+	/*
+	 * Open the target file.  We prefer to do this before entering the
+	 * critical section, so that an open() failure need not force PANIC.
+	 */
 	fd = OpenTransientFile(mapfilename, O_WRONLY | O_CREAT | PG_BINARY);
 	if (fd < 0)
 		ereport(ERROR,
@@ -934,6 +917,68 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		}
 	}
 
+	/* Critical section done */
+	if (write_wal)
+		END_CRIT_SECTION();
+}
+
+/*
+ * Write out a new shared or local map file with the given contents.
+ *
+ * The magic number and CRC are automatically updated in *newmap.  On
+ * success, we copy the data to the appropriate permanent static variable.
+ *
+ * If write_wal is true then an appropriate WAL message is emitted.
+ * (It will be false for bootstrap and WAL replay cases.)
+ *
+ * If send_sinval is true then a SI invalidation message is sent.
+ * (This should be true except in bootstrap case.)
+ *
+ * If preserve_files is true then the storage manager is warned not to
+ * delete the files listed in the map.
+ *
+ * Because this may be called during WAL replay when MyDatabaseId,
+ * DatabasePath, etc aren't valid, we require the caller to pass in suitable
+ * values.  The caller is also responsible for being sure no concurrent
+ * map update could be happening.
+ */
+static void
+write_relmap_file(bool shared, RelMapFile *newmap,
+				  bool write_wal, bool send_sinval, bool preserve_files,
+				  Oid dbid, Oid tsid, const char *dbpath)
+{
+	RelMapFile *realmap;
+	char		mapfilename[MAXPGPATH];
+
+	/*
+	 * Fill in the overhead fields and update CRC.
+	 */
+	newmap->magic = RELMAPPER_FILEMAGIC;
+	if (newmap->num_mappings < 0 || newmap->num_mappings > MAX_MAPPINGS)
+		elog(ERROR, "attempt to write bogus relation mapping");
+
+	INIT_CRC32C(newmap->crc);
+	COMP_CRC32C(newmap->crc, (char *) newmap, offsetof(RelMapFile, crc));
+	FIN_CRC32C(newmap->crc);
+
+	if (shared)
+	{
+		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
+				 RELMAPPER_FILENAME);
+		realmap = &shared_map;
+	}
+	else
+	{
+		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+				 dbpath, RELMAPPER_FILENAME);
+		realmap = &local_map;
+	}
+
+	/* Write the map to the relmap file. */
+	write_relmap_file_internal(mapfilename, newmap, write_wal,
+							   send_sinval, preserve_files, dbid, tsid,
+							   dbpath);
+
 	/*
 	 * Success, update permanent copy.  During bootstrap, we might be working
 	 * on the permanent copy itself, in which case skip the memcpy() to avoid
@@ -943,10 +988,6 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		memcpy(realmap, newmap, sizeof(RelMapFile));
 	else
 		Assert(!send_sinval);	/* must be bootstrapping */
-
-	/* Critical section done */
-	if (write_wal)
-		END_CRIT_SECTION();
 }
 
 /*
-- 
2.23.0
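
The split in the patch above separates "which file and which static map"
from "read or write this file into this map memory".  A rough standalone
sketch of that separation, assuming toy names (ToyMap, toy_resolve_map and
toy_read_map are hypothetical; copying the path into the struct merely
stands in for real file I/O):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define TOY_RELMAP_FILENAME "pg_filenode.map"

/* Toy map: remembers only which file it was "loaded" from. */
typedef struct ToyMap
{
	char		loaded_from[128];
} ToyMap;

static ToyMap toy_shared_map;
static ToyMap toy_local_map;

/* Wrapper half: choose the file path and the static in-memory map. */
static ToyMap *
toy_resolve_map(bool shared, const char *dbpath, char *path, size_t pathlen)
{
	assert(path != NULL && pathlen > 0);

	if (shared)
	{
		snprintf(path, pathlen, "global/%s", TOY_RELMAP_FILENAME);
		return &toy_shared_map;
	}

	assert(dbpath != NULL && strlen(dbpath) > 0);
	snprintf(path, pathlen, "%s/%s", dbpath, TOY_RELMAP_FILENAME);
	return &toy_local_map;
}

/* Helper half: works on whatever path and map memory the caller passes. */
static void
toy_read_map(const char *path, ToyMap *map)
{
	assert(path != NULL && map != NULL);
	snprintf(map->loaded_from, sizeof(map->loaded_from), "%s", path);
}
```

Because the helper never touches the static maps, a caller can read the
relmap file of some other database into its own scratch ToyMap.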

Attachment: v3-0005-New-interface-to-lock-relation-id.patch (application/octet-stream)

From bd093125a4f971da0ee1226c0ef12875ca7df16c Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 24 Sep 2021 18:29:17 +0530
Subject: [PATCH v3 5/6] New interface to lock relation id

Add LockRelationId, which is the same as LockRelationOid except that it
takes a LockRelId object instead of a relation OID.  Rather than using
MyDatabaseId, it uses the database OID stored in the LockRelId, which
makes it possible to lock a relation even when we are not connected to
its database.
---
 src/backend/storage/lmgr/lmgr.c | 28 ++++++++++++++++++++++++++++
 src/include/storage/lmgr.h      |  1 +
 2 files changed, 29 insertions(+)

diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index cdf2266d6d..4a321aa4b2 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -175,6 +175,34 @@ ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode)
 	return true;
 }
 
+/*
+ *		LockRelationId
+ *
+ * Lock, given a LockRelId.  Same as LockRelationOid, but takes a LockRelId
+ * as input.
+ */
+void
+LockRelationId(LockRelId *relid, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+	LOCALLOCK  *locallock;
+	LockAcquireResult res;
+
+	SET_LOCKTAG_RELATION(tag, relid->dbId, relid->relId);
+
+	res = LockAcquireExtended(&tag, lockmode, false, false, true, &locallock);
+
+	/*
+	 * Now that we have the lock, check for invalidation messages; see notes
+	 * in LockRelationOid.
+	 */
+	if (res != LOCKACQUIRE_ALREADY_CLEAR)
+	{
+		AcceptInvalidationMessages();
+		MarkLockClear(locallock);
+	}
+}
+
 /*
  *		UnlockRelationId
  *
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index b009559229..092ee934b4 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -38,6 +38,7 @@ extern void RelationInitLockInfo(Relation relation);
 
 /* Lock a relation */
 extern void LockRelationOid(Oid relid, LOCKMODE lockmode);
+extern void LockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern bool ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode);
 extern void UnlockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern void UnlockRelationOid(Oid relid, LOCKMODE lockmode);
-- 
2.23.0
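
The point of the patch above is that the lock tag can carry an explicit
database OID instead of the current one.  A toy standalone illustration,
assuming hypothetical names (ToyLockRelId, ToyLockTag and the toy_*
functions are not PostgreSQL APIs, and the special handling of shared
relations is ignored):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int Oid;

/* Toy version of a relation identifier that names its database. */
typedef struct ToyLockRelId
{
	Oid			relId;
	Oid			dbId;
} ToyLockRelId;

/* Toy lock tag: the pair that identifies what is being locked. */
typedef struct ToyLockTag
{
	Oid			dbOid;
	Oid			relOid;
} ToyLockTag;

/* Stand-in for the database the backend is connected to. */
static Oid	toy_my_database_id = 5;

/* Build a tag from a bare relation OID: implicitly the current database. */
static ToyLockTag
toy_tag_from_oid(Oid relid)
{
	ToyLockTag	tag = {toy_my_database_id, relid};

	return tag;
}

/* Build a tag from a ToyLockRelId: the database is stated explicitly. */
static ToyLockTag
toy_tag_from_relid(const ToyLockRelId *relid)
{
	ToyLockTag	tag;

	assert(relid != NULL);
	tag.dbOid = relid->dbId;
	tag.relOid = relid->relId;
	return tag;
}

/* Two tags name the same lock only if both components agree. */
static bool
toy_same_lock(ToyLockTag a, ToyLockTag b)
{
	return a.dbOid == b.dbOid && a.relOid == b.relOid;
}
```

With the same relation OID but a different database OID the two tags do
not collide, which is what allows locking pg_class of a database we are
not connected to.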

Attachment: v3-0006-WAL-logged-CREATE-DATABASE.patch (application/octet-stream)

From fcd0b07ea3b2bfed0d5eb243496bc114703643d0 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 24 Sep 2021 18:33:01 +0530
Subject: [PATCH v3 6/6] WAL logged CREATE DATABASE

Currently, CREATE DATABASE forces a checkpoint, copies all the files, and
then forces another checkpoint.  The comments in the createdb() function
explain the reasons for this.  This patch makes CREATE DATABASE completely
WAL-logged so that both checkpoints can be avoided.

This can also be useful for supporting TDE.  For example, if the source and
target databases need different encryption, we cannot re-encrypt the page
data when copying the whole directory.  With this patch we copy page by page,
so each page can be re-encrypted before it is copied to the target database.
---
 src/backend/access/rmgrdesc/dbasedesc.c |   3 +-
 src/backend/access/transam/xlogutils.c  |  12 +-
 src/backend/commands/dbcommands.c       | 704 ++++++++++++++++--------
 src/bin/pg_rewind/parsexlog.c           |   1 +
 src/include/commands/dbcommands_xlog.h  |   3 -
 5 files changed, 479 insertions(+), 244 deletions(-)

diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index 26609845aa..5010f72b2c 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -28,8 +28,7 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;
 
-		appendStringInfo(buf, "copy dir %u/%u to %u/%u",
-						 xlrec->src_tablespace_id, xlrec->src_db_id,
+		appendStringInfo(buf, "create dir %u/%u",
 						 xlrec->tablespace_id, xlrec->db_id);
 	}
 	else if (info == XLOG_DBASE_DROP)
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 88a1bfd939..a7a8b79d6e 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -483,8 +483,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	if (blkno < lastblock)
 	{
 		/* page exists in file */
-		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-										   mode, NULL);
+		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno, mode, NULL,
+										   RELPERSISTENCE_PERMANENT);
 	}
 	else
 	{
@@ -508,8 +508,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 					LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 				ReleaseBuffer(buffer);
 			}
-			buffer = ReadBufferWithoutRelcache(rnode, forknum,
-											   P_NEW, mode, NULL);
+			buffer = ReadBufferWithoutRelcache(rnode, forknum, P_NEW, mode,
+											   NULL, RELPERSISTENCE_PERMANENT);
 		}
 		while (BufferGetBlockNumber(buffer) < blkno);
 		/* Handle the corner case that P_NEW returns non-consecutive pages */
@@ -518,8 +518,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 			if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			ReleaseBuffer(buffer);
-			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-											   mode, NULL);
+			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno, mode,
+											   NULL, RELPERSISTENCE_PERMANENT);
 		}
 	}
 
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 029fab48df..2b70d4d388 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -45,13 +45,13 @@
 #include "commands/dbcommands_xlog.h"
 #include "commands/defrem.h"
 #include "commands/seclabel.h"
+#include "commands/tablecmds.h"
 #include "commands/tablespace.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/bgwriter.h"
 #include "replication/slot.h"
-#include "storage/copydir.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
@@ -62,6 +62,7 @@
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/pg_locale.h"
+#include "utils/relmapper.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
@@ -77,6 +78,19 @@ typedef struct
 	Oid			dest_tsoid;		/* tablespace we are trying to move to */
 } movedb_failure_params;
 
+/*
+ * When creating a database, we scan the pg_class of the source database to
+ * identify all the relations to be copied.  The structure is used for storing
+ * information about each relation of the source database.
+ */
+typedef struct CreateDBRelInfo
+{
+	RelFileNode		rnode;				/* physical relation identifier */
+	Oid				reloid;				/* relation oid */
+	char			relpersistence;		/* relation's persistence level */
+} CreateDBRelInfo;
+
+
 /* non-export function prototypes */
 static void createdb_failure_callback(int code, Datum arg);
 static void movedb(const char *dbname, const char *tblspcname);
@@ -91,6 +105,426 @@ static bool have_createdb_privilege(void);
 static void remove_dbtablespaces(Oid db_id);
 static bool check_db_file_conflict(Oid db_id);
 static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
+static void CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid,
+									bool isRedo);
+static List *GetDatabaseRelationList(Oid srctbid, Oid srcdbid, char *srcpath);
+void RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+									ForkNumber forkNum, char relpersistence);
+static void CopyDatabase(Oid src_dboid, Oid dboid, Oid src_tsid, Oid dst_tsid);
+
+/*
+ * CreateDirAndVersionFile - Create database directory and write out the
+ *							 PG_VERSION file in the database path.
+ *
+ * If isRedo is true, it's okay for the database directory to exist already.
+ *
+ * We can directly write PG_MAJORVERSION in the version file instead of copying
+ * from the source database file because these two must be the same.
+ */
+static void
+CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
+{
+	int		fd;
+	int		nbytes = strlen(PG_MAJORVERSION);
+	char	versionfile[MAXPGPATH];
+
+	/* If we are not in WAL replay then write the WAL. */
+	if (!isRedo)
+	{
+		xl_dbase_create_rec xlrec;
+		XLogRecPtr	lsn;
+
+		/* Now errors are fatal ... */
+		START_CRIT_SECTION();
+
+		xlrec.db_id = dbid;
+		xlrec.tablespace_id = tsid;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), sizeof(xl_dbase_create_rec));
+
+		lsn = XLogInsert(RM_DBASE_ID, XLOG_DBASE_CREATE);
+
+		/* As always, WAL must hit the disk before the data update does. */
+		XLogFlush(lsn);
+	}
+
+	/* Create database directory. */
+	if (MakePGDirectory(dbpath) < 0)
+	{
+		/* Failure other than already exists or not in WAL replay? */
+		if (errno != EEXIST || !isRedo)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create directory \"%s\": %m", dbpath)));
+	}
+
+	/*
+	 * Create PG_VERSION file in the database path.  If the file already exists
+	 * and we are in WAL replay, then try again to open it in write mode.
+	 */
+	snprintf(versionfile, sizeof(versionfile), "%s/%s", dbpath, "PG_VERSION");
+
+	fd = OpenTransientFile(versionfile, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+	if (fd < 0 && errno == EEXIST && isRedo)
+		fd = OpenTransientFile(versionfile, O_RDWR | PG_BINARY);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m", versionfile)));
+
+	/* Write PG_MAJORVERSION in the PG_VERSION file. */
+	pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_WRITE);
+	errno = 0;
+	if ((int) write(fd, (char *) PG_MAJORVERSION, nbytes) != nbytes)
+	{
+		/* If write didn't set errno, assume problem is no disk space. */
+		if (errno == 0)
+			errno = ENOSPC;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", versionfile)));
+	}
+	pgstat_report_wait_end();
+
+	/* Close the version file. */
+	CloseTransientFile(fd);
+
+	/* Critical section done. */
+	if (!isRedo)
+		END_CRIT_SECTION();
+}
+
+/*
+ * GetDatabaseRelationList - Get relfilenode list to be copied.
+ *
+ * Iterate over each block of the pg_class relation.  From there, we will check
+ * all the visible tuples in order to get a list of all the valid relfilenodes
+ * in the source database that should be copied to the target database.
+ */
+static List *
+GetDatabaseRelationList(Oid tbid, Oid dbid, char *srcpath)
+{
+	SMgrRelation	rd_smgr;
+	RelFileNode		rnode;
+	BlockNumber		nblocks;
+	BlockNumber		blkno;
+	OffsetNumber	offnum;
+	OffsetNumber	maxoff;
+	Buffer			buf;
+	Oid				relfilenode;
+	Page			page;
+	List		   *rnodelist = NIL;
+	HeapTupleData	tuple;
+	Form_pg_class	classForm;
+	LockRelId		relid;
+	BufferAccessStrategy bstrategy;
+
+	/* Get pg_class relfilenode. */
+	relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+													  RelationRelationId);
+	/*
+	 * We are going to read the buffers associated with the pg_class relation.
+	 * Thus, acquire the relation-level lock before starting the scan.  As we
+	 * are not connected to the database, we cannot use relation_open directly,
+	 * so we have to lock using the relation id.
+	 */
+	relid.dbId = dbid;
+	relid.relId = RelationRelationId;
+	LockRelationId(&relid, AccessShareLock);
+
+	/* Prepare an rnode for the pg_class relation. */
+	rnode.spcNode = tbid;
+	rnode.dbNode = dbid;
+	rnode.relNode = relfilenode;
+
+	/*
+	 * We are not connected to the source database so open the pg_class
+	 * relation at the smgr level and get the block count.
+	 */
+	rd_smgr = smgropen(rnode, InvalidBackendId);
+	nblocks = smgrnblocks(rd_smgr, MAIN_FORKNUM);
+
+	/*
+	 * We're going to read the whole of pg_class, so use the bulk-read buffer
+	 * access strategy.
+	 */
+	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	/* Iterate over each block on the pg_class relation. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		/*
+		 * We are not connected to the source database so directly use the lower
+		 * level bufmgr interface which operates on the rnode.
+		 */
+		buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
+										RBM_NORMAL, bstrategy,
+										RELPERSISTENCE_PERMANENT);
+
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buf);
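+		if (PageIsNew(page) || PageIsEmpty(page))
+		{
+			/* Don't leak the buffer lock and pin when skipping the page. */
+			UnlockReleaseBuffer(buf);
+			continue;
+		}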
+
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		/* Iterate over each tuple on the page. */
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid;
+
+			itemid = PageGetItemId(page, offnum);
+
+			/* Nothing to do if slot is empty or already dead. */
+			if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid) ||
+				ItemIdIsRedirected(itemid))
+				continue;
+
+			Assert(ItemIdIsNormal(itemid));
+			ItemPointerSet(&(tuple.t_self), blkno, offnum);
+
+			tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+			tuple.t_len = ItemIdGetLength(itemid);
+			tuple.t_tableOid = RelationRelationId;
+
+			/*
+			 * If the tuple is visible then add its relfilenode info to the
+			 * list.
+			 */
+			if (HeapTupleSatisfiesVisibility(&tuple, GetActiveSnapshot(), buf))
+			{
+				Oid				relfilenode = InvalidOid;
+				CreateDBRelInfo   *relinfo;
+
+				classForm = (Form_pg_class) GETSTRUCT(&tuple);
+
+				/* We don't need to copy the shared objects to the target. */
+				if (classForm->reltablespace == GLOBALTABLESPACE_OID)
+					continue;
+
+				/*
+				 * If the object doesn't have storage then there is nothing to
+				 * copy, so just ignore it.
+				 */
+				if (!RELKIND_HAS_STORAGE(classForm->relkind))
+					continue;
+
+				/*
+				 * If relfilenode is valid then directly use it.  Otherwise,
+				 * consult the relmapper for the mapped relation.
+				 *
+				 * XXX We can optimize RelationMapOidToFilenodeForDatabase API
+				 * so that instead of reading the relmap file every time, it can
+				 * save it in a temporary variable and use it for subsequent
+				 * calls.  Then later reset it once we're done or at the
+				 * transaction end.
+				 */
+				if (OidIsValid(classForm->relfilenode))
+					relfilenode = classForm->relfilenode;
+				else
+					relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+													classForm->oid);
+
+				/* We must have a valid relfilenode oid. */
+				Assert(OidIsValid(relfilenode));
+
+				/* Prepare a rel info element and add it to the list. */
+				relinfo = (CreateDBRelInfo *) palloc(sizeof(CreateDBRelInfo));
+				if (OidIsValid(classForm->reltablespace))
+					relinfo->rnode.spcNode = classForm->reltablespace;
+				else
+					relinfo->rnode.spcNode = tbid;
+
+				relinfo->rnode.dbNode = dbid;
+				relinfo->rnode.relNode = relfilenode;
+				relinfo->reloid = classForm->oid;
+				relinfo->relpersistence = classForm->relpersistence;
+
+				if (rnodelist == NULL)
+					rnodelist = list_make1(relinfo);
+				else
+					rnodelist = lappend(rnodelist, relinfo);
+			}
+		}
+
+		/* Release the buffer lock. */
+		UnlockReleaseBuffer(buf);
+	}
+
+	/* Release the lock. */
+	UnlockRelationId(&relid, AccessShareLock);
+
+	return rnodelist;
+}
+
+/*
+ * RelationCopyStorageUsingBuffer - Copy fork's data using bufmgr.
+ *
+ * Same as RelationCopyStorage, but instead of using smgrread and smgrextend
+ * this copies using the bufmgr APIs.
+ */
+void
+RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+							   ForkNumber forkNum, char relpersistence)
+{
+	Buffer		srcBuf;
+	Buffer		dstBuf;
+	Page		srcPage;
+	Page		dstPage;
+	bool		use_wal;
+	bool		copying_initfork;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	BufferAccessStrategy bstrategy_src;
+	BufferAccessStrategy bstrategy_dst;
+
+	/* Refer to the comments in RelationCopyStorage. */
+	copying_initfork = relpersistence == RELPERSISTENCE_UNLOGGED &&
+		forkNum == INIT_FORKNUM;
+	use_wal = XLogIsNeeded() &&
+		(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
+
+	/* Get number of blocks in the source relation. */
+	nblocks = smgrnblocks(src, forkNum);
+
+	/*
+	 * We are going to copy the whole relation from the source to the
+	 * destination, so use the BAS_BULKREAD strategy for the source relation
+	 * and the BAS_BULKWRITE strategy for the destination relation.
+	 */
+	bstrategy_src = GetAccessStrategy(BAS_BULKREAD);
+	bstrategy_dst = GetAccessStrategy(BAS_BULKWRITE);
+
+	/* Iterate over each block of the source relation file. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		/* If we got a cancel signal during the copy of the data, quit */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Read block from source relation. */
+		srcBuf = ReadBufferWithoutRelcache(src->smgr_rnode.node, forkNum,
+										   blkno, RBM_NORMAL, bstrategy_src,
+										   relpersistence);
+		srcPage = BufferGetPage(srcBuf);
+		if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
+		{
+			ReleaseBuffer(srcBuf);
+			continue;
+		}
+
+		/* Use P_NEW to extend the relation. */
+		dstBuf = ReadBufferWithoutRelcache(dst->smgr_rnode.node, forkNum,
+										   P_NEW, RBM_NORMAL, bstrategy_dst,
+										   relpersistence);
+		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
+
+		START_CRIT_SECTION();
+
+		/* Initialize the page and write the data. */
+		dstPage = BufferGetPage(dstBuf);
+		PageInit(dstPage, BufferGetPageSize(dstBuf), 0);
+		memcpy(dstPage, srcPage, BLCKSZ);
+		MarkBufferDirty(dstBuf);
+
+		/* WAL-log the copied page. */
+		if (use_wal)
+			log_newpage_buffer(dstBuf, true);
+
+		END_CRIT_SECTION();
+
+		UnlockReleaseBuffer(dstBuf);
+		ReleaseBuffer(srcBuf);
+	}
+}
+
+/*
+ * CopyDatabase - Copy source database to the target database.
+ *
+ * Create target database directory and copy data files from the source database
+ * to the target database, block by block and WAL log all the operations.
+ */
+static void
+CopyDatabase(Oid src_dboid, Oid dst_dboid, Oid src_tsid, Oid dst_tsid)
+{
+	char	   *srcpath;
+	char	   *dstpath;
+	List	   *rnodelist = NULL;
+	ListCell   *cell;
+	LockRelId	relid;
+	RelFileNode	srcrnode;
+	RelFileNode	dstrnode;
+	CreateDBRelInfo	*relinfo;
+
+	/* Get the source database path. */
+	srcpath = GetDatabasePath(src_dboid, src_tsid);
+
+	/* Get the destination database path. */
+	dstpath = GetDatabasePath(dst_dboid, dst_tsid);
+
+	/* Create database directory and write PG_VERSION file. */
+	CreateDirAndVersionFile(dstpath, dst_dboid, dst_tsid, false);
+
+	/* Copy relmap file from source database to the destination database. */
+	CopyRelationMap(dst_dboid, dst_tsid, srcpath, dstpath);
+
+	/* Get the list of all valid relfilenodes from the source database. */
+	rnodelist = GetDatabaseRelationList(src_tsid, src_dboid, srcpath);
+	Assert(rnodelist != NIL);
+
+	/*
+	 * The database id is the same for all relations, so set it before
+	 * entering the loop.
+	 */
+	relid.dbId = src_dboid;
+
+	/*
+	 * Iterate over each relfilenode and copy the relation data block by block
+	 * from source database to the destination database.
+	 */
+	foreach(cell, rnodelist)
+	{
+		SMgrRelation	src_smgr;
+		SMgrRelation	dst_smgr;
+
+		relinfo = lfirst(cell);
+		srcrnode = relinfo->rnode;
+
+		/*
+		 * If the relation is in the source database's default tablespace then
+		 * we need to create it in the destination database's default
+		 * tablespace.  Otherwise, we need to create it in the same tablespace
+		 * as in the source database.
+		 */
+		if (srcrnode.spcNode != src_tsid)
+			dstrnode.spcNode = srcrnode.spcNode;
+		else
+			dstrnode.spcNode = dst_tsid;
+
+		dstrnode.dbNode = dst_dboid;
+		dstrnode.relNode = srcrnode.relNode;
+
+		/* Acquire a lock on the relation before starting the copy. */
+		relid.relId = relinfo->reloid;
+		LockRelationId(&relid, AccessShareLock);
+
+		/* Open the source and the destination relation at smgr level. */
+		src_smgr = smgropen(srcrnode, InvalidBackendId);
+		dst_smgr = smgropen(dstrnode, InvalidBackendId);
+
+		/* Copy relation storage from source to the destination. */
+		RelationCopyAllFork(src_smgr, dst_smgr, relinfo->relpersistence,
+							RelationCopyStorageUsingBuffer);
+
+		/* Release the lock. */
+		UnlockRelationId(&relid, AccessShareLock);
+	}
+
+	list_free_deep(rnodelist);
+}
 
 
 /*
@@ -99,8 +533,6 @@ static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
 Oid
 createdb(ParseState *pstate, const CreatedbStmt *stmt)
 {
-	TableScanDesc scan;
-	Relation	rel;
 	Oid			src_dboid;
 	Oid			src_owner;
 	int			src_encoding = -1;
@@ -563,139 +995,27 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	InvokeObjectPostCreateHook(DatabaseRelationId, dboid, 0);
 
 	/*
-	 * Force a checkpoint before starting the copy. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, if a checkpoint
-	 * happened while we're copying files, a file might be deleted just when
-	 * we're about to copy it, causing the lstat() call in copydir() to fail
-	 * with ENOENT.
-	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
-
-	/*
-	 * Once we start copying subdirectories, we need to be able to clean 'em
-	 * up if we fail.  Use an ENSURE block to make sure this happens.  (This
-	 * is not a 100% solution, because of the possibility of failure during
-	 * transaction commit after we leave this routine, but it should handle
-	 * most scenarios.)
+	 * Once we start copying files from the source database, we need to be able
+	 * to clean 'em up if we fail.  Use an ENSURE block to make sure this
+	 * happens.  (This is not a 100% solution, because of the possibility of
+	 * failure during transaction commit after we leave this routine, but it
+	 * should handle most scenarios.)
 	 */
 	fparms.src_dboid = src_dboid;
 	fparms.dest_dboid = dboid;
 	PG_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
-		/*
-		 * Iterate through all tablespaces of the template database, and copy
-		 * each one to the new database.
-		 */
-		rel = table_open(TableSpaceRelationId, AccessShareLock);
-		scan = table_beginscan_catalog(rel, 0, NULL);
-		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
-		{
-			Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
-			Oid			srctablespace = spaceform->oid;
-			Oid			dsttablespace;
-			char	   *srcpath;
-			char	   *dstpath;
-			struct stat st;
-
-			/* No need to copy global tablespace */
-			if (srctablespace == GLOBALTABLESPACE_OID)
-				continue;
-
-			srcpath = GetDatabasePath(src_dboid, srctablespace);
-
-			if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
-				directory_is_empty(srcpath))
-			{
-				/* Assume we can ignore it */
-				pfree(srcpath);
-				continue;
-			}
-
-			if (srctablespace == src_deftablespace)
-				dsttablespace = dst_deftablespace;
-			else
-				dsttablespace = srctablespace;
-
-			dstpath = GetDatabasePath(dboid, dsttablespace);
-
-			/*
-			 * Copy this subdirectory to the new location
-			 *
-			 * We don't need to copy subdirectories
-			 */
-			copydir(srcpath, dstpath, false);
-
-			/* Record the filesystem change in XLOG */
-			{
-				xl_dbase_create_rec xlrec;
-
-				xlrec.db_id = dboid;
-				xlrec.tablespace_id = dsttablespace;
-				xlrec.src_db_id = src_dboid;
-				xlrec.src_tablespace_id = srctablespace;
-
-				XLogBeginInsert();
-				XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-				(void) XLogInsert(RM_DBASE_ID,
-								  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-			}
-		}
-		table_endscan(scan);
-		table_close(rel, AccessShareLock);
-
-		/*
-		 * We force a checkpoint before committing.  This effectively means
-		 * that committed XLOG_DBASE_CREATE operations will never need to be
-		 * replayed (at least not in ordinary crash recovery; we still have to
-		 * make the XLOG entry for the benefit of PITR operations). This
-		 * avoids two nasty scenarios:
-		 *
-		 * #1: When PITR is off, we don't XLOG the contents of newly created
-		 * indexes; therefore the drop-and-recreate-whole-directory behavior
-		 * of DBASE_CREATE replay would lose such indexes.
-		 *
-		 * #2: Since we have to recopy the source database during DBASE_CREATE
-		 * replay, we run the risk of copying changes in it that were
-		 * committed after the original CREATE DATABASE command but before the
-		 * system crash that led to the replay.  This is at least unexpected
-		 * and at worst could lead to inconsistencies, eg duplicate table
-		 * names.
-		 *
-		 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
-		 *
-		 * In PITR replay, the first of these isn't an issue, and the second
-		 * is only a risk if the CREATE DATABASE and subsequent template
-		 * database change both occur while a base backup is being taken.
-		 * There doesn't seem to be much we can do about that except document
-		 * it as a limitation.
-		 *
-		 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
-		 * we can avoid this.
-		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
-
-		/*
-		 * Close pg_database, but keep lock till commit.
-		 */
-		table_close(pg_database_rel, NoLock);
-
-		/*
-		 * Force synchronous commit, thus minimizing the window between
-		 * creation of the database files and committal of the transaction. If
-		 * we crash before committing, we'll have a DB that's taking up disk
-		 * space but is not in pg_database, which is not good.
-		 */
-		ForceSyncCommit();
+		CopyDatabase(src_dboid, dboid, src_deftablespace, dst_deftablespace);
 	}
 	PG_END_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 								PointerGetDatum(&fparms));
 
+	/*
+	 * Close pg_database, but keep lock till commit.
+	 */
+	table_close(pg_database_rel, NoLock);
+
 	return dboid;
 }
 
@@ -938,7 +1258,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
 	 * is important to ensure that no remaining backend tries to write out a
 	 * dirty buffer to the dead database later...
 	 */
-	DropDatabaseBuffers(db_id);
+	DropDatabaseBuffers(db_id, InvalidOid);
 
 	/*
 	 * Tell the stats collector to forget it immediately, too.
@@ -1195,37 +1515,6 @@ movedb(const char *dbname, const char *tblspcname)
 	src_dbpath = GetDatabasePath(db_id, src_tblspcoid);
 	dst_dbpath = GetDatabasePath(db_id, dst_tblspcoid);
 
-	/*
-	 * Force a checkpoint before proceeding. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, the check for existing
-	 * files in the target directory might fail unnecessarily, not to mention
-	 * that the copy might fail due to source files getting deleted under it.
-	 * On Windows, this also ensures that background procs don't hold any open
-	 * files, which would cause rmdir() to fail.
-	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
-
-	/*
-	 * Now drop all buffers holding data of the target database; they should
-	 * no longer be dirty so DropDatabaseBuffers is safe.
-	 *
-	 * It might seem that we could just let these buffers age out of shared
-	 * buffers naturally, since they should not get referenced anymore.  The
-	 * problem with that is that if the user later moves the database back to
-	 * its original tablespace, any still-surviving buffers would appear to
-	 * contain valid data again --- but they'd be missing any changes made in
-	 * the database while it was in the new tablespace.  In any case, freeing
-	 * buffers that should never be used again seems worth the cycles.
-	 *
-	 * Note: it'd be sufficient to get rid of buffers matching db_id and
-	 * src_tblspcoid, but bufmgr.c presently provides no API for that.
-	 */
-	DropDatabaseBuffers(db_id);
-
 	/*
 	 * Check for existence of files in the target directory, i.e., objects of
 	 * this database that are already in the target tablespace.  We can't
@@ -1261,38 +1550,16 @@ movedb(const char *dbname, const char *tblspcname)
 	}
 
 	/*
-	 * Use an ENSURE block to make sure we remove the debris if the copy fails
-	 * (eg, due to out-of-disk-space).  This is not a 100% solution, because
-	 * of the possibility of failure during transaction commit, but it should
-	 * handle most scenarios.
+	 * Use an ENSURE block to make sure we remove the debris if the copy fails.
+	 * This is not a 100% solution, because of the possibility of failure
+	 * during transaction commit, but it should handle most scenarios.
 	 */
 	fparms.dest_dboid = db_id;
 	fparms.dest_tsoid = dst_tblspcoid;
 	PG_ENSURE_ERROR_CLEANUP(movedb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
-		/*
-		 * Copy files from the old tablespace to the new one
-		 */
-		copydir(src_dbpath, dst_dbpath, false);
-
-		/*
-		 * Record the filesystem change in XLOG
-		 */
-		{
-			xl_dbase_create_rec xlrec;
-
-			xlrec.db_id = db_id;
-			xlrec.tablespace_id = dst_tblspcoid;
-			xlrec.src_db_id = db_id;
-			xlrec.src_tablespace_id = src_tblspcoid;
-
-			XLogBeginInsert();
-			XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-			(void) XLogInsert(RM_DBASE_ID,
-							  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-		}
+		CopyDatabase(db_id, db_id, src_tblspcoid, dst_tblspcoid);
 
 		/*
 		 * Update the database's pg_database tuple
@@ -1325,22 +1592,6 @@ movedb(const char *dbname, const char *tblspcname)
 
 		systable_endscan(sysscan);
 
-		/*
-		 * Force another checkpoint here.  As in CREATE DATABASE, this is to
-		 * ensure that we don't have to replay a committed XLOG_DBASE_CREATE
-		 * operation, which would cause us to lose any unlogged operations
-		 * done in the new DB tablespace before the next checkpoint.
-		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
-
-		/*
-		 * Force synchronous commit, thus minimizing the window between
-		 * copying the database files and committal of the transaction. If we
-		 * crash before committing, we'll leave an orphaned set of files on
-		 * disk, which is not fatal but not good either.
-		 */
-		ForceSyncCommit();
-
 		/*
 		 * Close pg_database, but keep lock till commit.
 		 */
@@ -1349,6 +1600,21 @@ movedb(const char *dbname, const char *tblspcname)
 	PG_END_ENSURE_ERROR_CLEANUP(movedb_failure_callback,
 								PointerGetDatum(&fparms));
 
+	/*
+	 * Now drop all buffers holding data of the target database for the old
+	 * tablespace oid.  We have already copied all the data to the new
+	 * tablespace, so we no longer require the old buffers.
+	 *
+	 * It might seem that we could just let these buffers age out of shared
+	 * buffers naturally, since they should not get referenced anymore.  The
+	 * problem with that is that if the user later moves the database back to
+	 * its original tablespace, any still-surviving buffers would appear to
+	 * contain valid data again --- but they'd be missing any changes made in
+	 * the database while it was in the new tablespace.  In any case, freeing
+	 * buffers that should never be used again seems worth the cycles.
+	 */
+	DropDatabaseBuffers(db_id, src_tblspcoid);
+
 	/*
 	 * Commit the transaction so that the pg_database update is committed. If
 	 * we crash while removing files, the database won't be corrupt, we'll
@@ -2141,39 +2407,11 @@ dbase_redo(XLogReaderState *record)
 	if (info == XLOG_DBASE_CREATE)
 	{
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
-		char	   *src_path;
-		char	   *dst_path;
-		struct stat st;
-
-		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
-		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
-
-		/*
-		 * Our theory for replaying a CREATE is to forcibly drop the target
-		 * subdirectory if present, then re-copy the source data. This may be
-		 * more work than needed, but it is simple to implement.
-		 */
-		if (stat(dst_path, &st) == 0 && S_ISDIR(st.st_mode))
-		{
-			if (!rmtree(dst_path, true))
-				/* If this failed, copydir() below is going to error. */
-				ereport(WARNING,
-						(errmsg("some useless files may be left behind in old database directory \"%s\"",
-								dst_path)));
-		}
-
-		/*
-		 * Force dirty buffers out to disk, to ensure source database is
-		 * up-to-date for the copy.
-		 */
-		FlushDatabaseBuffers(xlrec->src_db_id);
+		char	   *dbpath;
 
-		/*
-		 * Copy this subdirectory to the new location
-		 *
-		 * We don't need to copy subdirectories
-		 */
-		copydir(src_path, dst_path, false);
+		dbpath = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		CreateDirAndVersionFile(dbpath, xlrec->db_id, xlrec->tablespace_id,
+								true);
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{
@@ -2201,7 +2439,7 @@ dbase_redo(XLogReaderState *record)
 		ReplicationSlotsDropDBSlots(xlrec->db_id);
 
 		/* Drop pages for this database that are in the shared buffer cache */
-		DropDatabaseBuffers(xlrec->db_id);
+		DropDatabaseBuffers(xlrec->db_id, InvalidOid);
 
 		/* Also, clean out any fsync requests that might be pending in md.c */
 		ForgetDatabaseSyncRequests(xlrec->db_id);
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 59ebac7d6a..f71b3446f8 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -23,6 +23,7 @@
 #include "fe_utils/archive.h"
 #include "filemap.h"
 #include "pg_rewind.h"
+#include "utils/relmapper.h"
 
 /*
  * RmgrNames is an array of resource manager names, to make error messages
diff --git a/src/include/commands/dbcommands_xlog.h b/src/include/commands/dbcommands_xlog.h
index f5ed762677..21dc58ea5d 100644
--- a/src/include/commands/dbcommands_xlog.h
+++ b/src/include/commands/dbcommands_xlog.h
@@ -23,11 +23,8 @@
 
 typedef struct xl_dbase_create_rec
 {
-	/* Records copying of a single subdirectory incl. contents */
 	Oid			db_id;
 	Oid			tablespace_id;
-	Oid			src_db_id;
-	Oid			src_tablespace_id;
 } xl_dbase_create_rec;
 
 typedef struct xl_dbase_drop_rec
-- 
2.23.0
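
Two small decisions in the patch above are easy to get wrong when reading the diff: which tablespace a copied relation lands in (the spcNode choice in CopyDatabase) and when a copied block is WAL-logged (the use_wal computation in RelationCopyStorageUsingBuffer). The following is only a standalone sketch of those two rules, not code from the patch. The helper names choose_dst_tablespace and copy_needs_wal are hypothetical, the Oid typedef is a stand-in for the PostgreSQL type, and the wal_needed argument stands in for the result of XLogIsNeeded().

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int Oid;		/* stand-in for PostgreSQL's Oid */

/* Same values as in pg_class.h */
#define RELPERSISTENCE_PERMANENT	'p'
#define RELPERSISTENCE_UNLOGGED		'u'

/*
 * Hypothetical helper mirroring the spcNode choice in CopyDatabase:
 * a relation stored in the source database's default tablespace moves
 * to the destination database's default tablespace; a relation stored
 * anywhere else stays in that same tablespace.
 */
Oid
choose_dst_tablespace(Oid rel_spc, Oid src_default_spc, Oid dst_default_spc)
{
	if (rel_spc != src_default_spc)
		return rel_spc;
	return dst_default_spc;
}

/*
 * Hypothetical helper mirroring the use_wal computation: a copied block
 * is WAL-logged only if WAL is needed at all, and the relation is either
 * permanent or we are copying the init fork of an unlogged relation.
 */
bool
copy_needs_wal(bool wal_needed, char relpersistence, bool is_init_fork)
{
	bool		copying_initfork;

	copying_initfork = (relpersistence == RELPERSISTENCE_UNLOGGED) &&
		is_init_fork;

	return wal_needed &&
		(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
}
```

For instance, with 1663 as the source default tablespace and 16400 as the destination default, choose_dst_tablespace(1663, 1663, 16400) yields 16400, while a relation in tablespace 16500 stays in 16500. The OIDs here are illustrative values only.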

#39

Dilip Kumar

dilipbalaut@gmail.com

over 4 years ago

In reply to: Dilip Kumar (#38)

6 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Mon, Sep 27, 2021 at 12:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Open questions:
- Scan pg_class vs. scan directories.
- Whether to retain the old CREATE DATABASE mechanism as an option or not.

I have made some code improvements in 0001 and 0002.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
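
For the "scan pg_class" side of the open question, the per-tuple decision made in GetDatabaseRelationList can be sketched in isolation. This is a simplified illustration under stated assumptions, not the patch code: ClassRow is a hypothetical reduced view of a pg_class row, has_storage stands in for RELKIND_HAS_STORAGE(relkind), and demo_map_lookup is a made-up stand-in for RelationMapOidToFilenodeForDatabase that returns an arbitrary value.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int Oid;		/* stand-in for PostgreSQL's Oid */

#define InvalidOid				((Oid) 0)
#define GLOBALTABLESPACE_OID	((Oid) 1664)

/* Hypothetical reduced view of one visible pg_class tuple. */
typedef struct ClassRow
{
	Oid			oid;
	Oid			reltablespace;
	Oid			relfilenode;	/* InvalidOid for mapped relations */
	bool		has_storage;	/* stands in for RELKIND_HAS_STORAGE() */
} ClassRow;

/* Callback type standing in for the relmap lookup. */
typedef Oid (*map_lookup_fn) (Oid reloid);

/* Made-up lookup used only for illustration; the value is arbitrary. */
Oid
demo_map_lookup(Oid reloid)
{
	return 90000 + reloid;
}

/*
 * Decide whether a row needs copying and, if so, which relfilenode to
 * copy.  Returns false for shared relations and for relations without
 * storage; otherwise stores the relfilenode in *out and returns true.
 */
bool
resolve_relfilenode(const ClassRow *row, map_lookup_fn lookup, Oid *out)
{
	/* Shared objects are not copied to the target database. */
	if (row->reltablespace == GLOBALTABLESPACE_OID)
		return false;

	/* Nothing to copy if the object has no storage. */
	if (!row->has_storage)
		return false;

	/* Use the catalog value if valid, else consult the relation map. */
	if (row->relfilenode != InvalidOid)
		*out = row->relfilenode;
	else
		*out = lookup(row->oid);

	return true;
}
```

Scanning directories instead would avoid this decoding step entirely, which is part of what the open question weighs against the visibility checking that a pg_class scan provides.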

Attachments:

v4-0001-Refactor-relmap-load-and-relmap-write-functions.patch (text/x-patch; charset=US-ASCII)

From fce1a87e25d20bf4b1a85c6cc535db42b5bdfc73 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Wed, 1 Sep 2021 14:06:29 +0530
Subject: [PATCH v4 1/6] Refactor relmap load and relmap write functions

Currently, write_relmap_file and load_relmap_file are tightly
coupled with shared_map and local_map.  As part of the higher
level patch set we need relmap read/write interfaces that are
not dependent upon shared_map and local_map, and we should be
able to pass map memory as an external parameter instead.
---
 src/backend/utils/cache/relmapper.c | 163 ++++++++++++++++++++++--------------
 1 file changed, 99 insertions(+), 64 deletions(-)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index a6e38ad..bb39632 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -136,6 +136,12 @@ static void apply_map_update(RelMapFile *map, Oid relationId, Oid fileNode,
 							 bool add_okay);
 static void merge_map_updates(RelMapFile *map, const RelMapFile *updates,
 							  bool add_okay);
+static void read_relmap_file(char *mapfilename, RelMapFile *map,
+							 bool lock_held);
+static void write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
+									   bool write_wal, bool send_sinval,
+									   bool preserve_files, Oid dbid, Oid tsid,
+									   const char *dbpath);
 static void load_relmap_file(bool shared, bool lock_held);
 static void write_relmap_file(bool shared, RelMapFile *newmap,
 							  bool write_wal, bool send_sinval, bool preserve_files,
@@ -687,36 +693,19 @@ RestoreRelationMap(char *startAddress)
 }
 
 /*
- * load_relmap_file -- load data from the shared or local map file
+ * read_relmap_file -- read data from given mapfilename file.
  *
  * Because the map file is essential for access to core system catalogs,
  * failure to read it is a fatal error.
- *
- * Note that the local case requires DatabasePath to be set up.
  */
 static void
-load_relmap_file(bool shared, bool lock_held)
+read_relmap_file(char *mapfilename, RelMapFile *map, bool lock_held)
 {
-	RelMapFile *map;
-	char		mapfilename[MAXPGPATH];
 	pg_crc32c	crc;
 	int			fd;
 	int			r;
 
-	if (shared)
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
-				 RELMAPPER_FILENAME);
-		map = &shared_map;
-	}
-	else
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 DatabasePath, RELMAPPER_FILENAME);
-		map = &local_map;
-	}
-
-	/* Read data ... */
+	/* Open the relmap file for reading. */
 	fd = OpenTransientFile(mapfilename, O_RDONLY | PG_BINARY);
 	if (fd < 0)
 		ereport(FATAL,
@@ -779,62 +768,50 @@ load_relmap_file(bool shared, bool lock_held)
 }
 
 /*
- * Write out a new shared or local map file with the given contents.
- *
- * The magic number and CRC are automatically updated in *newmap.  On
- * success, we copy the data to the appropriate permanent static variable.
- *
- * If write_wal is true then an appropriate WAL message is emitted.
- * (It will be false for bootstrap and WAL replay cases.)
- *
- * If send_sinval is true then a SI invalidation message is sent.
- * (This should be true except in bootstrap case.)
- *
- * If preserve_files is true then the storage manager is warned not to
- * delete the files listed in the map.
+ * load_relmap_file -- load data from the shared or local map file
  *
- * Because this may be called during WAL replay when MyDatabaseId,
- * DatabasePath, etc aren't valid, we require the caller to pass in suitable
- * values.  The caller is also responsible for being sure no concurrent
- * map update could be happening.
+ * Note that the local case requires DatabasePath to be set up.
  */
 static void
-write_relmap_file(bool shared, RelMapFile *newmap,
-				  bool write_wal, bool send_sinval, bool preserve_files,
-				  Oid dbid, Oid tsid, const char *dbpath)
+load_relmap_file(bool shared, bool lock_held)
 {
-	int			fd;
-	RelMapFile *realmap;
+	RelMapFile *map;
 	char		mapfilename[MAXPGPATH];
 
-	/*
-	 * Fill in the overhead fields and update CRC.
-	 */
-	newmap->magic = RELMAPPER_FILEMAGIC;
-	if (newmap->num_mappings < 0 || newmap->num_mappings > MAX_MAPPINGS)
-		elog(ERROR, "attempt to write bogus relation mapping");
-
-	INIT_CRC32C(newmap->crc);
-	COMP_CRC32C(newmap->crc, (char *) newmap, offsetof(RelMapFile, crc));
-	FIN_CRC32C(newmap->crc);
-
-	/*
-	 * Open the target file.  We prefer to do this before entering the
-	 * critical section, so that an open() failure need not force PANIC.
-	 */
 	if (shared)
 	{
 		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
 				 RELMAPPER_FILENAME);
-		realmap = &shared_map;
+		map = &shared_map;
 	}
 	else
 	{
 		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 dbpath, RELMAPPER_FILENAME);
-		realmap = &local_map;
+				 DatabasePath, RELMAPPER_FILENAME);
+		map = &local_map;
 	}
 
+	/* Read data ... */
+	read_relmap_file(mapfilename, map, lock_held);
+}
+
+/*
+ * Helper function for write_relmap_file.  Read the comments atop
+ * write_relmap_file for more details.  The CRC should be computed by the
+ * caller and stored in the newmap.
+ */
+static void
+write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
+						   bool write_wal, bool send_sinval,
+						   bool preserve_files, Oid dbid, Oid tsid,
+						   const char *dbpath)
+{
+	int			fd;
+
+	/*
+	 * Open the target file.  We prefer to do this before entering the
+	 * critical section, so that an open() failure need not force PANIC.
+	 */
 	fd = OpenTransientFile(mapfilename, O_WRONLY | O_CREAT | PG_BINARY);
 	if (fd < 0)
 		ereport(ERROR,
@@ -934,6 +911,68 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		}
 	}
 
+	/* Critical section done */
+	if (write_wal)
+		END_CRIT_SECTION();
+}
+
+/*
+ * Write out a new shared or local map file with the given contents.
+ *
+ * The magic number and CRC are automatically updated in *newmap.  On
+ * success, we copy the data to the appropriate permanent static variable.
+ *
+ * If write_wal is true then an appropriate WAL message is emitted.
+ * (It will be false for bootstrap and WAL replay cases.)
+ *
+ * If send_sinval is true then a SI invalidation message is sent.
+ * (This should be true except in bootstrap case.)
+ *
+ * If preserve_files is true then the storage manager is warned not to
+ * delete the files listed in the map.
+ *
+ * Because this may be called during WAL replay when MyDatabaseId,
+ * DatabasePath, etc aren't valid, we require the caller to pass in suitable
+ * values.  The caller is also responsible for being sure no concurrent
+ * map update could be happening.
+ */
+static void
+write_relmap_file(bool shared, RelMapFile *newmap,
+				  bool write_wal, bool send_sinval, bool preserve_files,
+				  Oid dbid, Oid tsid, const char *dbpath)
+{
+	RelMapFile *realmap;
+	char		mapfilename[MAXPGPATH];
+
+	/*
+	 * Fill in the overhead fields and update CRC.
+	 */
+	newmap->magic = RELMAPPER_FILEMAGIC;
+	if (newmap->num_mappings < 0 || newmap->num_mappings > MAX_MAPPINGS)
+		elog(ERROR, "attempt to write bogus relation mapping");
+
+	INIT_CRC32C(newmap->crc);
+	COMP_CRC32C(newmap->crc, (char *) newmap, offsetof(RelMapFile, crc));
+	FIN_CRC32C(newmap->crc);
+
+	if (shared)
+	{
+		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
+				 RELMAPPER_FILENAME);
+		realmap = &shared_map;
+	}
+	else
+	{
+		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+				 dbpath, RELMAPPER_FILENAME);
+		realmap = &local_map;
+	}
+
+	/* Write the map to the relmap file. */
+	write_relmap_file_internal(mapfilename, newmap, write_wal,
+							   send_sinval, preserve_files, dbid, tsid,
+							   dbpath);
+
 	/*
 	 * Success, update permanent copy.  During bootstrap, we might be working
 	 * on the permanent copy itself, in which case skip the memcpy() to avoid
@@ -943,10 +982,6 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		memcpy(realmap, newmap, sizeof(RelMapFile));
 	else
 		Assert(!send_sinval);	/* must be bootstrapping */
-
-	/* Critical section done */
-	if (write_wal)
-		END_CRIT_SECTION();
 }
 
 /*
-- 
1.8.3.1
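
The refactoring in this patch follows one pattern throughout: the existing entry points keep choosing between the shared and the per-database map file, and the new read_relmap_file and write_relmap_file_internal functions only ever see a file name. The path selection itself can be sketched on its own as below. This is an illustrative sketch, not part of the patch: build_relmap_path is a hypothetical helper, and only the two snprintf formats and the RELMAPPER_FILENAME value are taken from relmapper.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define RELMAPPER_FILENAME "pg_filenode.map"

/*
 * Hypothetical helper: build the map file name the same way
 * load_relmap_file and write_relmap_file do, so that the file-level
 * functions can take a plain path and stay independent of shared_map
 * and local_map.  dbpath is ignored when shared is true.
 *
 * Returns the snprintf result, i.e. the length the full name needs.
 */
int
build_relmap_path(char *buf, size_t buflen, bool shared, const char *dbpath)
{
	if (shared)
		return snprintf(buf, buflen, "global/%s", RELMAPPER_FILENAME);

	return snprintf(buf, buflen, "%s/%s", dbpath, RELMAPPER_FILENAME);
}
```

With this split, a caller that is not connected to a database can pass any database directory, for example a source database path obtained from GetDatabasePath, which is what the later patches in the series rely on.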

v4-0003-Refactor-index_copy_data.patch (text/x-patch; charset=US-ASCII)

From 27743d5cf737dd10e54c4231e3d2c5cace3b87ea Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 24 Sep 2021 18:13:25 +0530
Subject: [PATCH v4 3/6] Refactor index_copy_data

Make a separate interface for copying relation storage; this will
be used by a later patch for copying the database relations.
---
 src/backend/commands/tablecmds.c | 59 ++++++++++++++++++++++++----------------
 src/include/commands/tablecmds.h |  5 ++++
 2 files changed, 40 insertions(+), 24 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index ff97b61..426d1b0 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14189,21 +14189,13 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
 	return new_tablespaceoid;
 }
 
-static void
-index_copy_data(Relation rel, RelFileNode newrnode)
+/*
+ * Copy the data of all forks of the source smgr relation to the destination.
+ */
+void
+RelationCopyAllFork(SMgrRelation src_smgr, SMgrRelation dst_smgr,
+					char relpersistence, copy_relation_storage copy_storage)
 {
-	SMgrRelation dstrel;
-
-	dstrel = smgropen(newrnode, rel->rd_backend);
-
-	/*
-	 * Since we copy the file directly without looking at the shared buffers,
-	 * we'd better first flush out any pages of the source relation that are
-	 * in shared buffers.  We assume no new changes will be made while we are
-	 * holding exclusive lock on the rel.
-	 */
-	FlushRelationBuffers(rel);
-
 	/*
 	 * Create and copy all forks of the relation, and schedule unlinking of
 	 * old physical files.
@@ -14211,32 +14203,51 @@ index_copy_data(Relation rel, RelFileNode newrnode)
 	 * NOTE: any conflict in relfilenode value will be caught in
 	 * RelationCreateStorage().
 	 */
-	RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
+	RelationCreateStorage(dst_smgr->smgr_rnode.node, relpersistence);
 
 	/* copy main fork */
-	RelationCopyStorage(RelationGetSmgr(rel), dstrel, MAIN_FORKNUM,
-						rel->rd_rel->relpersistence);
+	copy_storage(src_smgr, dst_smgr, MAIN_FORKNUM, relpersistence);
 
 	/* copy those extra forks that exist */
 	for (ForkNumber forkNum = MAIN_FORKNUM + 1;
 		 forkNum <= MAX_FORKNUM; forkNum++)
 	{
-		if (smgrexists(RelationGetSmgr(rel), forkNum))
+		if (smgrexists(src_smgr, forkNum))
 		{
-			smgrcreate(dstrel, forkNum, false);
+			smgrcreate(dst_smgr, forkNum, false);
 
 			/*
 			 * WAL log creation if the relation is persistent, or this is the
 			 * init fork of an unlogged relation.
 			 */
-			if (RelationIsPermanent(rel) ||
-				(rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
+			if (relpersistence == RELPERSISTENCE_PERMANENT ||
+				(relpersistence == RELPERSISTENCE_UNLOGGED &&
 				 forkNum == INIT_FORKNUM))
-				log_smgrcreate(&newrnode, forkNum);
-			RelationCopyStorage(RelationGetSmgr(rel), dstrel, forkNum,
-								rel->rd_rel->relpersistence);
+				log_smgrcreate(&dst_smgr->smgr_rnode.node, forkNum);
+
+			/* Copy a fork's data, block by block. */
+			copy_storage(src_smgr, dst_smgr, forkNum, relpersistence);
 		}
 	}
+}
+
+static void
+index_copy_data(Relation rel, RelFileNode newrnode)
+{
+	SMgrRelation dstrel;
+
+	dstrel = smgropen(newrnode, rel->rd_backend);
+
+	/*
+	 * Since we copy the file directly without looking at the shared buffers,
+	 * we'd better first flush out any pages of the source relation that are
+	 * in shared buffers.  We assume no new changes will be made while we are
+	 * holding exclusive lock on the rel.
+	 */
+	FlushRelationBuffers(rel);
+
+	RelationCopyAllFork(RelationGetSmgr(rel), dstrel,
+						rel->rd_rel->relpersistence, RelationCopyStorage);
 
 	/* drop old relation, and close new one */
 	RelationDropStorage(rel);
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 336549c..e0e0aa5 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -19,10 +19,13 @@
 #include "catalog/objectaddress.h"
 #include "nodes/parsenodes.h"
 #include "storage/lock.h"
+#include "storage/smgr.h"
 #include "utils/relcache.h"
 
 struct AlterTableUtilityContext;	/* avoid including tcop/utility.h here */
 
+typedef void (*copy_relation_storage) (SMgrRelation src, SMgrRelation dst,
+									  ForkNumber forkNum, char relpersistence);
 
 extern ObjectAddress DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 									ObjectAddress *typaddress, const char *queryString);
@@ -42,6 +45,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
 
 extern Oid	AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
 
+extern void RelationCopyAllFork(SMgrRelation src_smgr, SMgrRelation dst_smgr,
+								char relpersistence, copy_relation_storage copy_storage);
 extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
 										 Oid *oldschema);
 
-- 
1.8.3.1
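
The shape of the refactoring in 0003 is: one routine walks the forks, and the caller supplies the function that copies a single fork. A rough standalone sketch of that idea follows. DemoRel, demo_copy_fork and the other demo_* names are invented for the example. The real code works on SMgrRelation and also creates storage and WAL-logs fork creation, which this sketch leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fork numbering, loosely modelled on ForkNumber. */
enum
{
	DEMO_MAIN_FORK = 0,
	DEMO_FSM_FORK,
	DEMO_VM_FORK,
	DEMO_INIT_FORK,
	DEMO_NFORKS
};

/* Hypothetical stand-in for a relation's storage. */
typedef struct DemoRel
{
	bool		exists[DEMO_NFORKS];	/* which forks are present */
	int			nblocks[DEMO_NFORKS];	/* block count per fork */
} DemoRel;

/* Callback that copies one fork, in the spirit of copy_relation_storage. */
typedef void (*demo_copy_fork) (const DemoRel *src, DemoRel *dst, int fork);

/* One possible callback: "copy" by carrying over the block count. */
static void
demo_copy_blocks(const DemoRel *src, DemoRel *dst, int fork)
{
	dst->nblocks[fork] = src->nblocks[fork];
}

/*
 * Copy the main fork unconditionally, then every extra fork that exists in
 * the source.  Returns the number of forks copied.
 */
static int
demo_copy_all_forks(const DemoRel *src, DemoRel *dst, demo_copy_fork copy)
{
	int			copied = 1;

	assert(copy != NULL);

	dst->exists[DEMO_MAIN_FORK] = true;
	copy(src, dst, DEMO_MAIN_FORK);

	for (int fork = DEMO_MAIN_FORK + 1; fork < DEMO_NFORKS; fork++)
	{
		if (!src->exists[fork])
			continue;
		dst->exists[fork] = true;
		copy(src, dst, fork);
		copied++;
	}
	return copied;
}
```

With this split, a caller that wants a different copy mechanism only has to pass a different callback; the fork-walking logic stays in one place.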

v4-0004-Extend-bufmgr-interfaces.patch (text/x-patch; charset=US-ASCII)

From 5f174c31b9218c0f30d30ef9102ca3268fa40094 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 24 Sep 2021 18:23:39 +0530
Subject: [PATCH v4 4/6] Extend bufmgr interfaces

Extend the ReadBufferWithoutRelcache interface to take relpersistence
as an input, and extend DropDatabaseBuffers to take a tablespace OID
as an input.
---
 src/backend/storage/buffer/bufmgr.c | 24 +++++++++++-------------
 src/include/storage/bufmgr.h        |  5 +++--
 2 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e88e4e9..ed54c34 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -770,24 +770,17 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 /*
  * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
  *		a relcache entry for the relation.
- *
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay.  If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
  */
 Buffer
 ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
-						  BufferAccessStrategy strategy)
+						  BufferAccessStrategy strategy, char relpersistence)
 {
 	bool		hit;
 
 	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
 
-	Assert(InRecovery);
-
-	return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
+	return ReadBuffer_common(smgr, relpersistence, forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -797,7 +790,7 @@ ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
  *
  * *hit is set to true if the request was satisfied from shared buffer cache.
  */
-static Buffer
+Buffer
 ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 				  BlockNumber blockNum, ReadBufferMode mode,
 				  BufferAccessStrategy strategy, bool *hit)
@@ -3402,10 +3395,13 @@ FindAndDropRelFileNodeBuffers(RelFileNode rnode, ForkNumber forkNum,
  *		database, to avoid trying to flush data to disk when the directory
  *		tree no longer exists.  Implementation is pretty similar to
  *		DropRelFileNodeBuffers() which is for destroying just one relation.
+ *
+ *		If tbsid is valid, drop only the buffers that also match that
+ *		tablespace; otherwise match on the database OID alone.
  * --------------------------------------------------------------------
  */
 void
-DropDatabaseBuffers(Oid dbid)
+DropDatabaseBuffers(Oid dbid, Oid tbsid)
 {
 	int			i;
 
@@ -3423,11 +3419,13 @@ DropDatabaseBuffers(Oid dbid)
 		 * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
 		 * and saves some cycles.
 		 */
-		if (bufHdr->tag.rnode.dbNode != dbid)
+		if (bufHdr->tag.rnode.dbNode != dbid ||
+			(OidIsValid(tbsid) && bufHdr->tag.rnode.spcNode != tbsid))
 			continue;
 
 		buf_state = LockBufHdr(bufHdr);
-		if (bufHdr->tag.rnode.dbNode == dbid)
+		if (bufHdr->tag.rnode.dbNode == dbid &&
+			(!OidIsValid(tbsid) || bufHdr->tag.rnode.spcNode == tbsid))
 			InvalidateBuffer(bufHdr);	/* releases spinlock */
 		else
 			UnlockBufHdr(bufHdr, buf_state);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index cfce23e..237c6a9 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -184,7 +184,8 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BufferAccessStrategy strategy);
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 										ForkNumber forkNum, BlockNumber blockNum,
-										ReadBufferMode mode, BufferAccessStrategy strategy);
+										ReadBufferMode mode, BufferAccessStrategy strategy,
+										char relpersistence);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
@@ -207,7 +208,7 @@ extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
 								   int nforks, BlockNumber *firstDelBlock);
 extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes);
-extern void DropDatabaseBuffers(Oid dbid);
+extern void DropDatabaseBuffers(Oid dbid, Oid tbsid);
 
 #define RelationGetNumberOfBlocks(reln) \
 	RelationGetNumberOfBlocksInFork(reln, MAIN_FORKNUM)
-- 
1.8.3.1
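
The new DropDatabaseBuffers filter boils down to one predicate: the database must match, and the tablespace must match only when a valid tablespace OID was passed. A minimal standalone sketch of that predicate, with made-up names (DemoBufTag, demo_should_drop, demo_count_dropped) and none of the buffer-header locking the real function does:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t DemoOid;

#define DEMO_INVALID_OID ((DemoOid) 0)

/* Only the two tag fields the filter looks at (hypothetical type). */
typedef struct DemoBufTag
{
	DemoOid		spcNode;		/* tablespace */
	DemoOid		dbNode;			/* database */
} DemoBufTag;

/*
 * True if a buffer with this tag should be dropped.  An invalid tbsid means
 * "any tablespace", so the match falls back to the database OID alone.
 */
static bool
demo_should_drop(DemoBufTag tag, DemoOid dbid, DemoOid tbsid)
{
	if (tag.dbNode != dbid)
		return false;
	return tbsid == DEMO_INVALID_OID || tag.spcNode == tbsid;
}

/* Count how many of the n tags would be dropped. */
static int
demo_count_dropped(const DemoBufTag *tags, int n, DemoOid dbid, DemoOid tbsid)
{
	int			ndropped = 0;

	assert(n >= 0);
	for (int i = 0; i < n; i++)
	{
		if (demo_should_drop(tags[i], dbid, tbsid))
			ndropped++;
	}
	return ndropped;
}
```

In the patch the same condition appears twice, once as an unlocked precheck and once after taking the buffer header lock, so the two tests have to stay exact complements of each other.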

v4-0005-New-interface-to-lock-relation-id.patch (text/x-patch; charset=US-ASCII)

From 28e9dc7185b0553a5604ac49ac0092ffad2306b7 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 24 Sep 2021 18:29:17 +0530
Subject: [PATCH v4 5/6] New interface to lock relation id

Same as LockRelationOid, but takes a LockRelId instead of a relation
OID and uses the database OID stored in it rather than MyDatabaseId.
This makes it possible to lock a relation in a database we are not
connected to.
---
 src/backend/storage/lmgr/lmgr.c | 28 ++++++++++++++++++++++++++++
 src/include/storage/lmgr.h      |  1 +
 2 files changed, 29 insertions(+)

diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index cdf2266..4a321aa 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -176,6 +176,34 @@ ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode)
 }
 
 /*
+ *		LockRelationId
+ *
+ * Lock, given a LockRelId.  Same as LockRelationOid, but takes a LockRelId
+ * as input.
+ */
+void
+LockRelationId(LockRelId *relid, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+	LOCALLOCK  *locallock;
+	LockAcquireResult res;
+
+	SET_LOCKTAG_RELATION(tag, relid->dbId, relid->relId);
+
+	res = LockAcquireExtended(&tag, lockmode, false, false, true, &locallock);
+
+	/*
+	 * Now that we have the lock, check for invalidation messages; see notes
+	 * in LockRelationOid.
+	 */
+	if (res != LOCKACQUIRE_ALREADY_CLEAR)
+	{
+		AcceptInvalidationMessages();
+		MarkLockClear(locallock);
+	}
+}
+
+/*
  *		UnlockRelationId
  *
  * Unlock, given a LockRelId.  This is preferred over UnlockRelationOid
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index b009559..092ee93 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -38,6 +38,7 @@ extern void RelationInitLockInfo(Relation relation);
 
 /* Lock a relation */
 extern void LockRelationOid(Oid relid, LOCKMODE lockmode);
+extern void LockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern bool ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode);
 extern void UnlockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern void UnlockRelationOid(Oid relid, LOCKMODE lockmode);
-- 
1.8.3.1
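
The only real difference between the two locking entry points is where the database half of the lock tag comes from. A tiny standalone sketch of that difference follows. The types and the demo_my_database_id variable are hypothetical stand-ins, and the sketch ignores the special handling of shared catalogs in the real LockRelationOid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t DemoOid;

/* Hypothetical stand-in for LockRelId: relation plus its database. */
typedef struct DemoLockRelId
{
	DemoOid		relId;
	DemoOid		dbId;
} DemoLockRelId;

/* Hypothetical, reduced lock tag: just the two fields that matter here. */
typedef struct DemoLockTag
{
	DemoOid		dbid;
	DemoOid		relid;
} DemoLockTag;

/* Hypothetical stand-in for the backend-global MyDatabaseId. */
static DemoOid demo_my_database_id = 5;

/* OID-based flavour: the database is implied by the current connection. */
static DemoLockTag
demo_tag_from_oid(DemoOid relid)
{
	DemoLockTag tag = {demo_my_database_id, relid};

	return tag;
}

/* LockRelId-based flavour: the caller names the database explicitly. */
static DemoLockTag
demo_tag_from_relid(const DemoLockRelId *relid)
{
	DemoLockTag tag;

	assert(relid != NULL);
	tag.dbid = relid->dbId;
	tag.relid = relid->relId;
	return tag;
}
```

Locking pg_class of a source database we are not connected to therefore needs the second form: the first would silently build a tag for the current database's pg_class instead.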

v4-0002-Extend-relmap-interfaces.patch (text/x-patch; charset=US-ASCII)

From 8b113b9d59ef9aa75c8ed3862b7cf8058da157ee Mon Sep 17 00:00:00 2001
From: dilipkumar <dilipbalaut@gmail.com>
Date: Mon, 4 Oct 2021 13:50:44 +0530
Subject: [PATCH v4 2/6] Extend relmap interfaces

Add new interfaces to the relmapper: 1) copy the relmap file from one
database path to another database path; 2) like
RelationMapOidToFilenode, look up a relation's filenode, but for the
database at the given path instead of the database we are connected
to.

These interfaces are required by the next patch, which implements
WAL-logged CREATE DATABASE.
---
 src/backend/utils/cache/relmapper.c | 122 +++++++++++++++++++++++++++++++-----
 src/include/utils/relmapper.h       |   6 +-
 2 files changed, 112 insertions(+), 16 deletions(-)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index bb39632..51f361c 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -141,7 +141,7 @@ static void read_relmap_file(char *mapfilename, RelMapFile *map,
 static void write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
 									   bool write_wal, bool send_sinval,
 									   bool preserve_files, Oid dbid, Oid tsid,
-									   const char *dbpath);
+									   const char *dbpath, bool create);
 static void load_relmap_file(bool shared, bool lock_held);
 static void write_relmap_file(bool shared, RelMapFile *newmap,
 							  bool write_wal, bool send_sinval, bool preserve_files,
@@ -256,6 +256,36 @@ RelationMapFilenodeToOid(Oid filenode, bool shared)
 }
 
 /*
+ * RelationMapOidToFilenodeForDatabase
+ *
+ * Find relfilenode for the given relation id in the dbpath.  Returns
+ * InvalidOid if the relationId is not found in the relmap.
+ */
+Oid
+RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId)
+{
+	RelMapFile	map;
+	int			i;
+	char		mapfilename[MAXPGPATH];
+
+	/* Relmap file path for the given dbpath. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 dbpath, RELMAPPER_FILENAME);
+
+	/* Read the relmap file from the source database. */
+	read_relmap_file(mapfilename, &map, false);
+
+	/* Iterate over the relmap entries to find the input relation oid. */
+	for (i = 0; i < map.num_mappings; i++)
+	{
+		if (relationId == map.mappings[i].mapoid)
+			return map.mappings[i].mapfilenode;
+	}
+
+	return InvalidOid;
+}
+
+/*
  * RelationMapUpdateMap
  *
  * Install a new relfilenode mapping for the specified relation.
@@ -693,7 +723,43 @@ RestoreRelationMap(char *startAddress)
 }
 
 /*
- * read_relmap_file -- read data from given mapfilename file.
+ * CopyRelationMap
+ *
+ * Copy the relmap file from the source db path to the destination db path and
+ * WAL-log the operation.  This function is only called during CREATE DATABASE,
+ * so the destination database is not yet visible to anyone else; thus we don't
+ * need to acquire the relmap lock while updating the destination relmap.
+ */
+void
+CopyRelationMap(Oid dbid, Oid tsid, char *srcdbpath, char *dstdbpath)
+{
+	RelMapFile map;
+	char mapfilename[MAXPGPATH];
+
+	/* Relmap file path of the source database. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 srcdbpath, RELMAPPER_FILENAME);
+
+	/* Read the relmap file from the source database. */
+	read_relmap_file(mapfilename, &map, false);
+
+	/* Relmap file path of the destination database. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 dstdbpath, RELMAPPER_FILENAME);
+
+	/*
+	 * Write map contents into the destination database's relmap file.
+	 * write_relmap_file_internal expects the CRC to have been computed and
+	 * stored in the input map.  Since we read this map from the source
+	 * database and write it to the destination file unmodified, the stored
+	 * CRC is still valid and need not be recomputed.
+	 */
+	write_relmap_file_internal(mapfilename, &map, true, false, true, dbid,
+							   tsid, dstdbpath, true);
+}
+
+/*
+ * read_relmap_file - read data from given mapfilename file.
  *
  * Because the map file is essential for access to core system catalogs,
  * failure to read it is a fatal error.
@@ -796,15 +862,18 @@ load_relmap_file(bool shared, bool lock_held)
 }
 
 /*
- * Helper function for write_relmap_file, Read comments atop write_relmap_file
- * for more details.  The CRC should be computed by the caller and stored in
- * the newmap.
+ * Helper function for write_relmap_file and CopyRelationMap; see the comments
+ * atop write_relmap_file for more details.  The CRC should be computed by the
+ * caller and stored in the newmap.
+ *
+ * Pass create = true if we are copying the relmap file during a CREATE
+ * DATABASE command.
  */
 static void
 write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
 						   bool write_wal, bool send_sinval,
 						   bool preserve_files, Oid dbid, Oid tsid,
-						   const char *dbpath)
+						   const char *dbpath, bool create)
 {
 	int			fd;
 
@@ -830,6 +899,7 @@ write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
 		xlrec.dbid = dbid;
 		xlrec.tsid = tsid;
 		xlrec.nbytes = sizeof(RelMapFile);
+		xlrec.create = create;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&xlrec), MinSizeOfRelmapUpdate);
@@ -971,7 +1041,7 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 	/* Write the map to the relmap file. */
 	write_relmap_file_internal(mapfilename, newmap, write_wal,
 							   send_sinval, preserve_files, dbid, tsid,
-							   dbpath);
+							   dbpath, false);
 
 	/*
 	 * Success, update permanent copy.  During bootstrap, we might be working
@@ -1063,15 +1133,37 @@ relmap_redo(XLogReaderState *record)
 		 * Write out the new map and send sinval, but of course don't write a
 		 * new WAL entry.  There's no surrounding transaction to tell to
 		 * preserve files, either.
-		 *
-		 * There shouldn't be anyone else updating relmaps during WAL replay,
-		 * but grab the lock to interlock against load_relmap_file().
 		 */
-		LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
-		write_relmap_file((xlrec->dbid == InvalidOid), &newmap,
-						  false, true, false,
-						  xlrec->dbid, xlrec->tsid, dbpath);
-		LWLockRelease(RelationMappingLock);
+		if (!xlrec->create)
+		{
+			/*
+			 * There shouldn't be anyone else updating relmaps during WAL
+			 * replay, but grab the lock to interlock against
+			 * load_relmap_file().
+			 */
+			LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
+			write_relmap_file((xlrec->dbid == InvalidOid), &newmap,
+							false, true, false,
+							xlrec->dbid, xlrec->tsid, dbpath);
+			LWLockRelease(RelationMappingLock);
+		}
+		else
+		{
+			char		mapfilename[MAXPGPATH];
+
+			/* Construct the mapfilename. */
+			snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+					 dbpath, RELMAPPER_FILENAME);
+
+			/*
+			 * We don't need to take the relmap lock because this WAL record is
+			 * logged while creating a new database, so no one else can be
+			 * reading or writing the relmap file.
+			 */
+			write_relmap_file_internal(mapfilename, &newmap, false, false,
+									   false, xlrec->dbid, xlrec->tsid, dbpath,
+									   true);
+		}
 
 		pfree(dbpath);
 	}
diff --git a/src/include/utils/relmapper.h b/src/include/utils/relmapper.h
index c0d14da..4165f09 100644
--- a/src/include/utils/relmapper.h
+++ b/src/include/utils/relmapper.h
@@ -29,6 +29,7 @@ typedef struct xl_relmap_update
 	Oid			dbid;			/* database ID, or 0 for shared map */
 	Oid			tsid;			/* database's tablespace, or pg_global */
 	int32		nbytes;			/* size of relmap data */
+	bool		create;			/* true if creating new relmap */
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } xl_relmap_update;
 
@@ -39,6 +40,8 @@ extern Oid	RelationMapOidToFilenode(Oid relationId, bool shared);
 
 extern Oid	RelationMapFilenodeToOid(Oid relationId, bool shared);
 
+extern Oid RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId);
+
 extern void RelationMapUpdateMap(Oid relationId, Oid fileNode, bool shared,
 								 bool immediate);
 
@@ -62,7 +65,8 @@ extern void RelationMapInitializePhase3(void);
 extern Size EstimateRelationMapSpace(void);
 extern void SerializeRelationMap(Size maxSize, char *startAddress);
 extern void RestoreRelationMap(char *startAddress);
-
+extern void CopyRelationMap(Oid dbid, Oid tsid, char *srcdbpath,
+							char *dstdbpath);
 extern void relmap_redo(XLogReaderState *record);
 extern void relmap_desc(StringInfo buf, XLogReaderState *record);
 extern const char *relmap_identify(uint8 info);
-- 
1.8.3.1
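
RelationMapOidToFilenodeForDatabase is a plain linear search over the mappings read from the relmap file at the given path. The standalone sketch below shows just the search part over an in-memory map. DemoRelMap, DemoMapping and demo_map_oid_to_filenode are illustrative names only, and reading or validating the file is out of scope here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t DemoOid;

#define DEMO_INVALID_OID ((DemoOid) 0)
#define DEMO_MAX_MAPPINGS 8		/* hypothetical capacity */

/* One (catalog OID -> filenode) entry. */
typedef struct DemoMapping
{
	DemoOid		mapoid;
	DemoOid		mapfilenode;
} DemoMapping;

/* Hypothetical in-memory image of a relmap file. */
typedef struct DemoRelMap
{
	int			num_mappings;
	DemoMapping mappings[DEMO_MAX_MAPPINGS];
} DemoRelMap;

/*
 * Return the filenode mapped to relationId, or DEMO_INVALID_OID if the
 * relation is not a mapped one.
 */
static DemoOid
demo_map_oid_to_filenode(const DemoRelMap *map, DemoOid relationId)
{
	assert(map != NULL);
	assert(map->num_mappings >= 0 && map->num_mappings <= DEMO_MAX_MAPPINGS);

	for (int i = 0; i < map->num_mappings; i++)
	{
		if (map->mappings[i].mapoid == relationId)
			return map->mappings[i].mapfilenode;
	}
	return DEMO_INVALID_OID;
}
```

Since the number of mapped catalogs is small, a linear scan is cheap; the cost that matters in the real function is re-reading the file on every call, not the search.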

v4-0006-WAL-logged-CREATE-DATABASE.patch (text/x-patch; charset=US-ASCII)

From b236bf6517b799c1ba7a3c74bfb50505b9d7accd Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 24 Sep 2021 18:33:01 +0530
Subject: [PATCH v4 6/6] WAL logged CREATE DATABASE

Currently, CREATE DATABASE forces a checkpoint, then copies all the files,
then forces another checkpoint. The comments in the createdb() function
explain the reasons for this. This patch makes CREATE DATABASE completely
WAL-logged so that we can avoid the checkpoints.

This can also be useful for supporting TDE. For example, if the source and
target databases need different encryption, we cannot re-encrypt the page
data when copying the whole directory.  With this patch we copy page by page,
so we have an opportunity to re-encrypt each page before writing it to the
target database.
---
 src/backend/access/rmgrdesc/dbasedesc.c |   3 +-
 src/backend/access/transam/xlogutils.c  |  12 +-
 src/backend/commands/dbcommands.c       | 704 +++++++++++++++++++++-----------
 src/bin/pg_rewind/parsexlog.c           |   1 +
 src/include/commands/dbcommands_xlog.h  |   3 -
 5 files changed, 479 insertions(+), 244 deletions(-)

diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index 2660984..5010f72 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -28,8 +28,7 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;
 
-		appendStringInfo(buf, "copy dir %u/%u to %u/%u",
-						 xlrec->src_tablespace_id, xlrec->src_db_id,
+		appendStringInfo(buf, "create dir %u/%u",
 						 xlrec->tablespace_id, xlrec->db_id);
 	}
 	else if (info == XLOG_DBASE_DROP)
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 88a1bfd..a7a8b79 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -483,8 +483,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	if (blkno < lastblock)
 	{
 		/* page exists in file */
-		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-										   mode, NULL);
+		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno, mode, NULL,
+										   RELPERSISTENCE_PERMANENT);
 	}
 	else
 	{
@@ -508,8 +508,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 					LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 				ReleaseBuffer(buffer);
 			}
-			buffer = ReadBufferWithoutRelcache(rnode, forknum,
-											   P_NEW, mode, NULL);
+			buffer = ReadBufferWithoutRelcache(rnode, forknum, P_NEW, mode,
+											   NULL, RELPERSISTENCE_PERMANENT);
 		}
 		while (BufferGetBlockNumber(buffer) < blkno);
 		/* Handle the corner case that P_NEW returns non-consecutive pages */
@@ -518,8 +518,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 			if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			ReleaseBuffer(buffer);
-			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-											   mode, NULL);
+			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno, mode,
+											   NULL, RELPERSISTENCE_PERMANENT);
 		}
 	}
 
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 029fab4..2b70d4d 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -45,13 +45,13 @@
 #include "commands/dbcommands_xlog.h"
 #include "commands/defrem.h"
 #include "commands/seclabel.h"
+#include "commands/tablecmds.h"
 #include "commands/tablespace.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/bgwriter.h"
 #include "replication/slot.h"
-#include "storage/copydir.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
@@ -62,6 +62,7 @@
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/pg_locale.h"
+#include "utils/relmapper.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
@@ -77,6 +78,19 @@ typedef struct
 	Oid			dest_tsoid;		/* tablespace we are trying to move to */
 } movedb_failure_params;
 
+/*
+ * When creating a database, we scan the pg_class of the source database to
+ * identify all the relations to be copied.  The structure is used for storing
+ * information about each relation of the source database.
+ */
+typedef struct CreateDBRelInfo
+{
+	RelFileNode		rnode;				/* physical relation identifier */
+	Oid				reloid;				/* relation oid */
+	char			relpersistence;		/* relation's persistence level */
+} CreateDBRelInfo;
+
+
 /* non-export function prototypes */
 static void createdb_failure_callback(int code, Datum arg);
 static void movedb(const char *dbname, const char *tblspcname);
@@ -91,6 +105,426 @@ static bool have_createdb_privilege(void);
 static void remove_dbtablespaces(Oid db_id);
 static bool check_db_file_conflict(Oid db_id);
 static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
+static void CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid,
+									bool isRedo);
+static List *GetDatabaseRelationList(Oid srctbid, Oid srcdbid, char *srcpath);
+void RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+									ForkNumber forkNum, char relpersistence);
+static void CopyDatabase(Oid src_dboid, Oid dboid, Oid src_tsid, Oid dst_tsid);
+
+/*
+ * CreateDirAndVersionFile - Create database directory and write out the
+ *							 PG_VERSION file in the database path.
+ *
+ * If isRedo is true, it's okay for the database directory to exist already.
+ *
+ * We can directly write PG_MAJORVERSION in the version file instead of copying
+ * from the source database file because these two must be the same.
+ */
+static void
+CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
+{
+	int		fd;
+	int		nbytes = strlen(PG_MAJORVERSION);
+	char	versionfile[MAXPGPATH];
+
+	/* If we are not in WAL replay then write the WAL. */
+	if (!isRedo)
+	{
+		xl_dbase_create_rec xlrec;
+		XLogRecPtr	lsn;
+
+		/* Now errors are fatal ... */
+		START_CRIT_SECTION();
+
+		xlrec.db_id = dbid;
+		xlrec.tablespace_id = tsid;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), sizeof(xl_dbase_create_rec));
+
+		lsn = XLogInsert(RM_DBASE_ID, XLOG_DBASE_CREATE);
+
+		/* As always, WAL must hit the disk before the data update does. */
+		XLogFlush(lsn);
+	}
+
+	/* Create database directory. */
+	if (MakePGDirectory(dbpath) < 0)
+	{
+		/* Failure other than already exists or not in WAL replay? */
+		if (errno != EEXIST || !isRedo)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create directory \"%s\": %m", dbpath)));
+	}
+
+	/*
+	 * Create PG_VERSION file in the database path.  If the file already exists
+	 * and we are in WAL replay then try again to open it in the write mode.
+	 */
+	snprintf(versionfile, sizeof(versionfile), "%s/%s", dbpath, "PG_VERSION");
+
+	fd = OpenTransientFile(versionfile, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+	if (fd < 0 && errno == EEXIST && isRedo)
+		fd = OpenTransientFile(versionfile, O_RDWR | PG_BINARY);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m", versionfile)));
+
+	/* Write PG_MAJORVERSION in the PG_VERSION file. */
+	pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_WRITE);
+	errno = 0;
+	if ((int) write(fd, (char *) PG_MAJORVERSION, nbytes) != nbytes)
+	{
+		/* If write didn't set errno, assume problem is no disk space. */
+		if (errno == 0)
+			errno = ENOSPC;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", versionfile)));
+	}
+	pgstat_report_wait_end();
+
+	/* Close the version file. */
+	CloseTransientFile(fd);
+
+	/* Critical section done. */
+	if (!isRedo)
+		END_CRIT_SECTION();
+}
+
+/*
+ * GetDatabaseRelationList - Get relfilenode list to be copied.
+ *
+ * Iterate over each block of the pg_class relation.  From there, we will check
+ * all the visible tuples in order to get a list of all the valid relfilenodes
+ * in the source database that should be copied to the target database.
+ */
+static List *
+GetDatabaseRelationList(Oid tbid, Oid dbid, char *srcpath)
+{
+	SMgrRelation	rd_smgr;
+	RelFileNode		rnode;
+	BlockNumber		nblocks;
+	BlockNumber		blkno;
+	OffsetNumber	offnum;
+	OffsetNumber	maxoff;
+	Buffer			buf;
+	Oid				relfilenode;
+	Page			page;
+	List		   *rnodelist = NIL;
+	HeapTupleData	tuple;
+	Form_pg_class	classForm;
+	LockRelId		relid;
+	BufferAccessStrategy bstrategy;
+
+	/* Get pg_class relfilenode. */
+	relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+													  RelationRelationId);
+	/*
+	 * We are going to read the buffers associated with the pg_class relation.
+	 * Thus, acquire the relation-level lock before scanning.  As we are
+	 * not connected to the database, we cannot use relation_open directly, so
+	 * we have to lock using the relation id.
+	 */
+	relid.dbId = dbid;
+	relid.relId = RelationRelationId;
+	LockRelationId(&relid, AccessShareLock);
+
+	/* Prepare a relnode for pg_class relation. */
+	rnode.spcNode = tbid;
+	rnode.dbNode = dbid;
+	rnode.relNode = relfilenode;
+
+	/*
+	 * We are not connected to the source database so open the pg_class
+	 * relation at the smgr level and get the block count.
+	 */
+	rd_smgr = smgropen(rnode, InvalidBackendId);
+	nblocks = smgrnblocks(rd_smgr, MAIN_FORKNUM);
+
+	/*
+	 * We're going to read the whole pg_class so better to use bulk-read buffer
+	 * access strategy.
+	 */
+	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	/* Iterate over each block on the pg_class relation. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		/*
+		 * We are not connected to the source database so directly use the lower
+		 * level bufmgr interface which operates on the rnode.
+		 */
+		buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
+										RBM_NORMAL, bstrategy,
+										RELPERSISTENCE_PERMANENT);
+
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buf);
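+		if (PageIsNew(page) || PageIsEmpty(page))
+		{
+			/* Nothing to scan here; drop the lock and pin before moving on. */
+			UnlockReleaseBuffer(buf);
+			continue;
+		}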
+
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		/* Iterate over each tuple on the page. */
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid;
+
+			itemid = PageGetItemId(page, offnum);
+
+			/* Nothing to do if slot is empty or already dead. */
+			if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid) ||
+				ItemIdIsRedirected(itemid))
+				continue;
+
+			Assert(ItemIdIsNormal(itemid));
+			ItemPointerSet(&(tuple.t_self), blkno, offnum);
+
+			tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+			tuple.t_len = ItemIdGetLength(itemid);
+			tuple.t_tableOid = RelationRelationId;
+
+			/*
+			 * If the tuple is visible then add its relfilenode info to the
+			 * list.
+			 */
+			if (HeapTupleSatisfiesVisibility(&tuple, GetActiveSnapshot(), buf))
+			{
+				Oid				relfilenode = InvalidOid;
+				CreateDBRelInfo   *relinfo;
+
+				classForm = (Form_pg_class) GETSTRUCT(&tuple);
+
+				/* We don't need to copy the shared objects to the target. */
+				if (classForm->reltablespace == GLOBALTABLESPACE_OID)
+					continue;
+
+				/*
+				 * If the object has no storage, there is nothing to copy, so
+				 * just ignore it.
+				 */
+				if (!RELKIND_HAS_STORAGE(classForm->relkind))
+					continue;
+
+				/*
+				 * If relfilenode is valid then directly use it.  Otherwise,
+				 * consult the relmapper for the mapped relation.
+				 *
+				 * XXX We can optimize RelationMapOidToFilenodeForDatabase API
+				 * so that instead of reading the relmap file every time, it can
+				 * save it in a temporary variable and use it for subsequent
+				 * calls.  Then later reset it once we're done or at the
+				 * transaction end.
+				 */
+				if (OidIsValid(classForm->relfilenode))
+					relfilenode = classForm->relfilenode;
+				else
+					relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+													classForm->oid);
+
+				/* We must have a valid relfilenode oid. */
+				Assert(OidIsValid(relfilenode));
+
+				/* Prepare a rel info element and add it to the list. */
+				relinfo = (CreateDBRelInfo *) palloc(sizeof(CreateDBRelInfo));
+				if (OidIsValid(classForm->reltablespace))
+					relinfo->rnode.spcNode = classForm->reltablespace;
+				else
+					relinfo->rnode.spcNode = tbid;
+
+				relinfo->rnode.dbNode = dbid;
+				relinfo->rnode.relNode = relfilenode;
+				relinfo->reloid = classForm->oid;
+				relinfo->relpersistence = classForm->relpersistence;
+
+				if (rnodelist == NULL)
+					rnodelist = list_make1(relinfo);
+				else
+					rnodelist = lappend(rnodelist, relinfo);
+			}
+		}
+
+		/* Release the buffer lock. */
+		UnlockReleaseBuffer(buf);
+	}
+
+	/* Release the lock. */
+	UnlockRelationId(&relid, AccessShareLock);
+
+	return rnodelist;
+}
+
+/*
+ * RelationCopyStorageUsingBuffer - Copy fork's data using bufmgr.
+ *
+ * Same as RelationCopyStorage, but instead of using smgrread and smgrextend
+ * this copies the data using bufmgr APIs.
+ */
+void
+RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+							   ForkNumber forkNum, char relpersistence)
+{
+	Buffer		srcBuf;
+	Buffer		dstBuf;
+	Page		srcPage;
+	Page		dstPage;
+	bool		use_wal;
+	bool		copying_initfork;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	BufferAccessStrategy bstrategy_src;
+	BufferAccessStrategy bstrategy_dst;
+
+	/* See the comments in RelationCopyStorage. */
+	copying_initfork = relpersistence == RELPERSISTENCE_UNLOGGED &&
+		forkNum == INIT_FORKNUM;
+	use_wal = XLogIsNeeded() &&
+		(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
+
+	/* Get number of blocks in the source relation. */
+	nblocks = smgrnblocks(src, forkNum);
+
+	/*
+	 * We are going to copy the whole relation from the source to the
+	 * destination, so use BAS_BULKREAD strategy for the source relation and
+	 * BAS_BULKWRITE strategy for the destination relation.
+	 */
+	bstrategy_src = GetAccessStrategy(BAS_BULKREAD);
+	bstrategy_dst = GetAccessStrategy(BAS_BULKWRITE);
+
+	/* Iterate over each block of the source relation file. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		/* If we got a cancel signal during the copy of the data, quit */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Read block from source relation. */
+		srcBuf = ReadBufferWithoutRelcache(src->smgr_rnode.node, forkNum,
+										   blkno, RBM_NORMAL, bstrategy_src,
+										   relpersistence);
+		srcPage = BufferGetPage(srcBuf);
+		if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
+		{
+			ReleaseBuffer(srcBuf);
+			continue;
+		}
+
+		/* Use P_NEW to extend the relation. */
+		dstBuf = ReadBufferWithoutRelcache(dst->smgr_rnode.node, forkNum,
+										   P_NEW, RBM_NORMAL, bstrategy_dst,
+										   relpersistence);
+		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
+
+		START_CRIT_SECTION();
+
+		/* Initialize the page and write the data. */
+		dstPage = BufferGetPage(dstBuf);
+		PageInit(dstPage, BufferGetPageSize(dstBuf), 0);
+		memcpy(dstPage, srcPage, BLCKSZ);
+		MarkBufferDirty(dstBuf);
+
+		/* WAL-log the copied page. */
+		if (use_wal)
+			log_newpage_buffer(dstBuf, true);
+
+		END_CRIT_SECTION();
+
+		UnlockReleaseBuffer(dstBuf);
+		ReleaseBuffer(srcBuf);
+	}
+}
+
+/*
+ * CopyDatabase - Copy source database to the target database.
+ *
+ * Create target database directory and copy data files from the source database
+ * to the target database, block by block and WAL log all the operations.
+ */
+static void
+CopyDatabase(Oid src_dboid, Oid dst_dboid, Oid src_tsid, Oid dst_tsid)
+{
+	char	   *srcpath;
+	char	   *dstpath;
+	List	   *rnodelist = NULL;
+	ListCell   *cell;
+	LockRelId	relid;
+	RelFileNode	srcrnode;
+	RelFileNode	dstrnode;
+	CreateDBRelInfo	*relinfo;
+
+	/* Get the source database path. */
+	srcpath = GetDatabasePath(src_dboid, src_tsid);
+
+	/* Get the destination database path. */
+	dstpath = GetDatabasePath(dst_dboid, dst_tsid);
+
+	/* Create database directory and write PG_VERSION file. */
+	CreateDirAndVersionFile(dstpath, dst_dboid, dst_tsid, false);
+
+	/* Copy relmap file from source database to the destination database. */
+	CopyRelationMap(dst_dboid, dst_tsid, srcpath, dstpath);
+
+	/* Get the list of all valid relfilenodes from the source database. */
+	rnodelist = GetDatabaseRelationList(src_tsid, src_dboid, srcpath);
+	Assert(rnodelist != NIL);
+
+	/*
+	 * The database id is common to all the relations, so set it before
+	 * entering the loop.
+	 */
+	relid.dbId = src_dboid;
+
+	/*
+	 * Iterate over each relfilenode and copy the relation data block by block
+	 * from source database to the destination database.
+	 */
+	foreach(cell, rnodelist)
+	{
+		SMgrRelation	src_smgr;
+		SMgrRelation	dst_smgr;
+
+		relinfo = lfirst(cell);
+		srcrnode = relinfo->rnode;
+
+		/*
+		 * If the relation is in the source database's default tablespace, we
+		 * need to create it in the destination database's default tablespace.
+		 * Otherwise, we need to create it in the same tablespace that it
+		 * occupies in the source database.
+		 */
+		if (srcrnode.spcNode != src_tsid)
+			dstrnode.spcNode = srcrnode.spcNode;
+		else
+			dstrnode.spcNode = dst_tsid;
+
+		dstrnode.dbNode = dst_dboid;
+		dstrnode.relNode = srcrnode.relNode;
+
+		/* Acquire a lock on the relation before starting the copy. */
+		relid.relId = relinfo->reloid;
+		LockRelationId(&relid, AccessShareLock);
+
+		/* Open the source and the destination relation at smgr level. */
+		src_smgr = smgropen(srcrnode, InvalidBackendId);
+		dst_smgr = smgropen(dstrnode, InvalidBackendId);
+
+		/* Copy relation storage from source to the destination. */
+		RelationCopyAllFork(src_smgr, dst_smgr, relinfo->relpersistence,
+							RelationCopyStorageUsingBuffer);
+
+		/* Release the lock. */
+		UnlockRelationId(&relid, AccessShareLock);
+	}
+
+	list_free_deep(rnodelist);
+}
 
 
 /*
@@ -99,8 +533,6 @@ static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
 Oid
 createdb(ParseState *pstate, const CreatedbStmt *stmt)
 {
-	TableScanDesc scan;
-	Relation	rel;
 	Oid			src_dboid;
 	Oid			src_owner;
 	int			src_encoding = -1;
@@ -563,139 +995,27 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	InvokeObjectPostCreateHook(DatabaseRelationId, dboid, 0);
 
 	/*
-	 * Force a checkpoint before starting the copy. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, if a checkpoint
-	 * happened while we're copying files, a file might be deleted just when
-	 * we're about to copy it, causing the lstat() call in copydir() to fail
-	 * with ENOENT.
-	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
-
-	/*
-	 * Once we start copying subdirectories, we need to be able to clean 'em
-	 * up if we fail.  Use an ENSURE block to make sure this happens.  (This
-	 * is not a 100% solution, because of the possibility of failure during
-	 * transaction commit after we leave this routine, but it should handle
-	 * most scenarios.)
+	 * Once we start copying files from the source database, we need to be able
+	 * to clean 'em up if we fail.  Use an ENSURE block to make sure this
+	 * happens.  (This is not a 100% solution, because of the possibility of
+	 * failure during transaction commit after we leave this routine, but it
+	 * should handle most scenarios.)
 	 */
 	fparms.src_dboid = src_dboid;
 	fparms.dest_dboid = dboid;
 	PG_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
-		/*
-		 * Iterate through all tablespaces of the template database, and copy
-		 * each one to the new database.
-		 */
-		rel = table_open(TableSpaceRelationId, AccessShareLock);
-		scan = table_beginscan_catalog(rel, 0, NULL);
-		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
-		{
-			Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
-			Oid			srctablespace = spaceform->oid;
-			Oid			dsttablespace;
-			char	   *srcpath;
-			char	   *dstpath;
-			struct stat st;
-
-			/* No need to copy global tablespace */
-			if (srctablespace == GLOBALTABLESPACE_OID)
-				continue;
-
-			srcpath = GetDatabasePath(src_dboid, srctablespace);
-
-			if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
-				directory_is_empty(srcpath))
-			{
-				/* Assume we can ignore it */
-				pfree(srcpath);
-				continue;
-			}
-
-			if (srctablespace == src_deftablespace)
-				dsttablespace = dst_deftablespace;
-			else
-				dsttablespace = srctablespace;
-
-			dstpath = GetDatabasePath(dboid, dsttablespace);
-
-			/*
-			 * Copy this subdirectory to the new location
-			 *
-			 * We don't need to copy subdirectories
-			 */
-			copydir(srcpath, dstpath, false);
-
-			/* Record the filesystem change in XLOG */
-			{
-				xl_dbase_create_rec xlrec;
-
-				xlrec.db_id = dboid;
-				xlrec.tablespace_id = dsttablespace;
-				xlrec.src_db_id = src_dboid;
-				xlrec.src_tablespace_id = srctablespace;
-
-				XLogBeginInsert();
-				XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-				(void) XLogInsert(RM_DBASE_ID,
-								  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-			}
-		}
-		table_endscan(scan);
-		table_close(rel, AccessShareLock);
-
-		/*
-		 * We force a checkpoint before committing.  This effectively means
-		 * that committed XLOG_DBASE_CREATE operations will never need to be
-		 * replayed (at least not in ordinary crash recovery; we still have to
-		 * make the XLOG entry for the benefit of PITR operations). This
-		 * avoids two nasty scenarios:
-		 *
-		 * #1: When PITR is off, we don't XLOG the contents of newly created
-		 * indexes; therefore the drop-and-recreate-whole-directory behavior
-		 * of DBASE_CREATE replay would lose such indexes.
-		 *
-		 * #2: Since we have to recopy the source database during DBASE_CREATE
-		 * replay, we run the risk of copying changes in it that were
-		 * committed after the original CREATE DATABASE command but before the
-		 * system crash that led to the replay.  This is at least unexpected
-		 * and at worst could lead to inconsistencies, eg duplicate table
-		 * names.
-		 *
-		 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
-		 *
-		 * In PITR replay, the first of these isn't an issue, and the second
-		 * is only a risk if the CREATE DATABASE and subsequent template
-		 * database change both occur while a base backup is being taken.
-		 * There doesn't seem to be much we can do about that except document
-		 * it as a limitation.
-		 *
-		 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
-		 * we can avoid this.
-		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
-
-		/*
-		 * Close pg_database, but keep lock till commit.
-		 */
-		table_close(pg_database_rel, NoLock);
-
-		/*
-		 * Force synchronous commit, thus minimizing the window between
-		 * creation of the database files and committal of the transaction. If
-		 * we crash before committing, we'll have a DB that's taking up disk
-		 * space but is not in pg_database, which is not good.
-		 */
-		ForceSyncCommit();
+		CopyDatabase(src_dboid, dboid, src_deftablespace, dst_deftablespace);
 	}
 	PG_END_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 								PointerGetDatum(&fparms));
 
+	/*
+	 * Close pg_database, but keep lock till commit.
+	 */
+	table_close(pg_database_rel, NoLock);
+
 	return dboid;
 }
 
@@ -938,7 +1258,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
 	 * is important to ensure that no remaining backend tries to write out a
 	 * dirty buffer to the dead database later...
 	 */
-	DropDatabaseBuffers(db_id);
+	DropDatabaseBuffers(db_id, InvalidOid);
 
 	/*
 	 * Tell the stats collector to forget it immediately, too.
@@ -1196,37 +1516,6 @@ movedb(const char *dbname, const char *tblspcname)
 	dst_dbpath = GetDatabasePath(db_id, dst_tblspcoid);
 
 	/*
-	 * Force a checkpoint before proceeding. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, the check for existing
-	 * files in the target directory might fail unnecessarily, not to mention
-	 * that the copy might fail due to source files getting deleted under it.
-	 * On Windows, this also ensures that background procs don't hold any open
-	 * files, which would cause rmdir() to fail.
-	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
-
-	/*
-	 * Now drop all buffers holding data of the target database; they should
-	 * no longer be dirty so DropDatabaseBuffers is safe.
-	 *
-	 * It might seem that we could just let these buffers age out of shared
-	 * buffers naturally, since they should not get referenced anymore.  The
-	 * problem with that is that if the user later moves the database back to
-	 * its original tablespace, any still-surviving buffers would appear to
-	 * contain valid data again --- but they'd be missing any changes made in
-	 * the database while it was in the new tablespace.  In any case, freeing
-	 * buffers that should never be used again seems worth the cycles.
-	 *
-	 * Note: it'd be sufficient to get rid of buffers matching db_id and
-	 * src_tblspcoid, but bufmgr.c presently provides no API for that.
-	 */
-	DropDatabaseBuffers(db_id);
-
-	/*
 	 * Check for existence of files in the target directory, i.e., objects of
 	 * this database that are already in the target tablespace.  We can't
 	 * allow the move in such a case, because we would need to change those
@@ -1261,38 +1550,16 @@ movedb(const char *dbname, const char *tblspcname)
 	}
 
 	/*
-	 * Use an ENSURE block to make sure we remove the debris if the copy fails
-	 * (eg, due to out-of-disk-space).  This is not a 100% solution, because
-	 * of the possibility of failure during transaction commit, but it should
-	 * handle most scenarios.
+	 * Use an ENSURE block to make sure we remove the debris if the copy fails.
+	 * This is not a 100% solution, because of the possibility of failure
+	 * during transaction commit, but it should handle most scenarios.
 	 */
 	fparms.dest_dboid = db_id;
 	fparms.dest_tsoid = dst_tblspcoid;
 	PG_ENSURE_ERROR_CLEANUP(movedb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
-		/*
-		 * Copy files from the old tablespace to the new one
-		 */
-		copydir(src_dbpath, dst_dbpath, false);
-
-		/*
-		 * Record the filesystem change in XLOG
-		 */
-		{
-			xl_dbase_create_rec xlrec;
-
-			xlrec.db_id = db_id;
-			xlrec.tablespace_id = dst_tblspcoid;
-			xlrec.src_db_id = db_id;
-			xlrec.src_tablespace_id = src_tblspcoid;
-
-			XLogBeginInsert();
-			XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-			(void) XLogInsert(RM_DBASE_ID,
-							  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-		}
+		CopyDatabase(db_id, db_id, src_tblspcoid, dst_tblspcoid);
 
 		/*
 		 * Update the database's pg_database tuple
@@ -1326,22 +1593,6 @@ movedb(const char *dbname, const char *tblspcname)
 		systable_endscan(sysscan);
 
 		/*
-		 * Force another checkpoint here.  As in CREATE DATABASE, this is to
-		 * ensure that we don't have to replay a committed XLOG_DBASE_CREATE
-		 * operation, which would cause us to lose any unlogged operations
-		 * done in the new DB tablespace before the next checkpoint.
-		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
-
-		/*
-		 * Force synchronous commit, thus minimizing the window between
-		 * copying the database files and committal of the transaction. If we
-		 * crash before committing, we'll leave an orphaned set of files on
-		 * disk, which is not fatal but not good either.
-		 */
-		ForceSyncCommit();
-
-		/*
 		 * Close pg_database, but keep lock till commit.
 		 */
 		table_close(pgdbrel, NoLock);
@@ -1350,6 +1601,21 @@ movedb(const char *dbname, const char *tblspcname)
 								PointerGetDatum(&fparms));
 
 	/*
+	 * Now drop all buffers holding data of the target database for the old
+	 * tablespace oid.  We have already copied all the data to the new
+	 * tablespace, so we no longer need the old buffers.
+	 *
+	 * It might seem that we could just let these buffers age out of shared
+	 * buffers naturally, since they should not get referenced anymore.  The
+	 * problem with that is that if the user later moves the database back to
+	 * its original tablespace, any still-surviving buffers would appear to
+	 * contain valid data again --- but they'd be missing any changes made in
+	 * the database while it was in the new tablespace.  In any case, freeing
+	 * buffers that should never be used again seems worth the cycles.
+	 */
+	DropDatabaseBuffers(db_id, src_tblspcoid);
+
+	/*
 	 * Commit the transaction so that the pg_database update is committed. If
 	 * we crash while removing files, the database won't be corrupt, we'll
 	 * just leave some orphaned files in the old directory.
@@ -2141,39 +2407,11 @@ dbase_redo(XLogReaderState *record)
 	if (info == XLOG_DBASE_CREATE)
 	{
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
-		char	   *src_path;
-		char	   *dst_path;
-		struct stat st;
-
-		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
-		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
-
-		/*
-		 * Our theory for replaying a CREATE is to forcibly drop the target
-		 * subdirectory if present, then re-copy the source data. This may be
-		 * more work than needed, but it is simple to implement.
-		 */
-		if (stat(dst_path, &st) == 0 && S_ISDIR(st.st_mode))
-		{
-			if (!rmtree(dst_path, true))
-				/* If this failed, copydir() below is going to error. */
-				ereport(WARNING,
-						(errmsg("some useless files may be left behind in old database directory \"%s\"",
-								dst_path)));
-		}
-
-		/*
-		 * Force dirty buffers out to disk, to ensure source database is
-		 * up-to-date for the copy.
-		 */
-		FlushDatabaseBuffers(xlrec->src_db_id);
+		char	   *dbpath;
 
-		/*
-		 * Copy this subdirectory to the new location
-		 *
-		 * We don't need to copy subdirectories
-		 */
-		copydir(src_path, dst_path, false);
+		dbpath = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		CreateDirAndVersionFile(dbpath, xlrec->db_id, xlrec->tablespace_id,
+								true);
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{
@@ -2201,7 +2439,7 @@ dbase_redo(XLogReaderState *record)
 		ReplicationSlotsDropDBSlots(xlrec->db_id);
 
 		/* Drop pages for this database that are in the shared buffer cache */
-		DropDatabaseBuffers(xlrec->db_id);
+		DropDatabaseBuffers(xlrec->db_id, InvalidOid);
 
 		/* Also, clean out any fsync requests that might be pending in md.c */
 		ForgetDatabaseSyncRequests(xlrec->db_id);
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 59ebac7..f71b344 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -23,6 +23,7 @@
 #include "fe_utils/archive.h"
 #include "filemap.h"
 #include "pg_rewind.h"
+#include "utils/relmapper.h"
 
 /*
  * RmgrNames is an array of resource manager names, to make error messages
diff --git a/src/include/commands/dbcommands_xlog.h b/src/include/commands/dbcommands_xlog.h
index f5ed762..21dc58e 100644
--- a/src/include/commands/dbcommands_xlog.h
+++ b/src/include/commands/dbcommands_xlog.h
@@ -23,11 +23,8 @@
 
 typedef struct xl_dbase_create_rec
 {
-	/* Records copying of a single subdirectory incl. contents */
 	Oid			db_id;
 	Oid			tablespace_id;
-	Oid			src_db_id;
-	Oid			src_tablespace_id;
 } xl_dbase_create_rec;
 
 typedef struct xl_dbase_drop_rec
-- 
1.8.3.1

#40

Dilip Kumar

dilipbalaut@gmail.com

over 4 years ago

In reply to: Dilip Kumar (#39)

7 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Mon, Oct 4, 2021 at 2:51 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have implemented the patch with approach 2 as well, i.e. instead of
scanning pg_class, we scan the directory.

IMHO, we have already discussed most of the advantages and
disadvantages of both approaches, so I don't want to repeat them.
But I have noticed one more issue with approach 2: if we scan the
directory, we have no way to identify the relation OID, and that is
required in order to acquire the relation lock before copying it,
right?
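
To make the locking concern concrete, here is a minimal standalone
sketch (not code from the patch set). It assumes the usual on-disk
naming of relation files in a database directory,
<relfilenode>[_<fork>][.<segment>], and uses hypothetical names
(ParsedRelFile, parse_relfile_name) that do not exist in PostgreSQL.
It shows what a directory scan can recover from a file name: the
relfilenode, fork and segment, but nothing that identifies the
pg_class OID needed for LockRelationId().

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical struct for this sketch; not a PostgreSQL type. */
typedef struct ParsedRelFile
{
	unsigned int relfilenode;	/* numeric prefix of the file name */
	char		fork[8];		/* "main", "fsm", "vm" or "init" */
	unsigned int segno;			/* segment number, 0 if no ".N" suffix */
} ParsedRelFile;

/*
 * Parse a file name found in a database directory.  Returns false for
 * names that do not look like relation files (PG_VERSION, the relmap
 * file, and so on).  Overflow of the numeric parts is not checked here.
 */
static bool
parse_relfile_name(const char *name, ParsedRelFile *out)
{
	static const char *const forks[] = {"fsm", "vm", "init"};
	char	   *end;

	assert(name != NULL && out != NULL);

	if (!isdigit((unsigned char) name[0]))
		return false;

	out->relfilenode = (unsigned int) strtoul(name, &end, 10);
	strcpy(out->fork, "main");
	out->segno = 0;

	/* Optional "_fsm", "_vm" or "_init" fork suffix. */
	if (*end == '_')
	{
		bool		found = false;

		end++;
		for (size_t i = 0; i < sizeof(forks) / sizeof(forks[0]); i++)
		{
			size_t		len = strlen(forks[i]);

			if (strncmp(end, forks[i], len) == 0 &&
				(end[len] == '\0' || end[len] == '.'))
			{
				strcpy(out->fork, forks[i]);
				end += len;
				found = true;
				break;
			}
		}
		if (!found)
			return false;
	}

	/* Optional ".N" segment suffix. */
	if (*end == '.')
	{
		if (!isdigit((unsigned char) end[1]))
			return false;
		out->segno = (unsigned int) strtoul(end + 1, &end, 10);
	}

	/* Anything left over means this is not a relation file name. */
	return *end == '\0';
}
```

For a name such as "16384_fsm.1" this yields relfilenode 16384, fork
"fsm" and segment 1. Since a relation's relfilenode can differ from
its OID (mapped catalogs, or relations that have been rewritten), the
number in the file name cannot simply be used as the lock tag; that
mapping is what the pg_class scan in approach 1 provides.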

Patch details:
0001 to 0006 implement approach 1.
0007 removes the pg_class scanning code and adds the directory scan.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v5-0001-Refactor-relmap-load-and-relmap-write-functions.patch (application/octet-stream)

From fce1a87e25d20bf4b1a85c6cc535db42b5bdfc73 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Wed, 1 Sep 2021 14:06:29 +0530
Subject: [PATCH v5 1/7] Refactor relmap load and relmap write functions

Currently, write_relmap_file and load_relmap_file are tightly
coupled with shared_map and local_map.  As part of the higher
level patch set we need relmap read/write interfaces that are
not dependent upon shared_map and local_map, and we should be
able to pass map memory as an external parameter instead.
---
 src/backend/utils/cache/relmapper.c | 163 +++++++++++++++++-----------
 1 file changed, 99 insertions(+), 64 deletions(-)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index a6e38adce3..bb39632080 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -136,6 +136,12 @@ static void apply_map_update(RelMapFile *map, Oid relationId, Oid fileNode,
 							 bool add_okay);
 static void merge_map_updates(RelMapFile *map, const RelMapFile *updates,
 							  bool add_okay);
+static void read_relmap_file(char *mapfilename, RelMapFile *map,
+							 bool lock_held);
+static void write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
+									   bool write_wal, bool send_sinval,
+									   bool preserve_files, Oid dbid, Oid tsid,
+									   const char *dbpath);
 static void load_relmap_file(bool shared, bool lock_held);
 static void write_relmap_file(bool shared, RelMapFile *newmap,
 							  bool write_wal, bool send_sinval, bool preserve_files,
@@ -687,36 +693,19 @@ RestoreRelationMap(char *startAddress)
 }
 
 /*
- * load_relmap_file -- load data from the shared or local map file
+ * read_relmap_file -- read data from the given relmap file.
  *
  * Because the map file is essential for access to core system catalogs,
  * failure to read it is a fatal error.
- *
- * Note that the local case requires DatabasePath to be set up.
  */
 static void
-load_relmap_file(bool shared, bool lock_held)
+read_relmap_file(char *mapfilename, RelMapFile *map, bool lock_held)
 {
-	RelMapFile *map;
-	char		mapfilename[MAXPGPATH];
 	pg_crc32c	crc;
 	int			fd;
 	int			r;
 
-	if (shared)
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
-				 RELMAPPER_FILENAME);
-		map = &shared_map;
-	}
-	else
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 DatabasePath, RELMAPPER_FILENAME);
-		map = &local_map;
-	}
-
-	/* Read data ... */
+	/* Open the relmap file for reading. */
 	fd = OpenTransientFile(mapfilename, O_RDONLY | PG_BINARY);
 	if (fd < 0)
 		ereport(FATAL,
@@ -779,62 +768,50 @@ load_relmap_file(bool shared, bool lock_held)
 }
 
 /*
- * Write out a new shared or local map file with the given contents.
- *
- * The magic number and CRC are automatically updated in *newmap.  On
- * success, we copy the data to the appropriate permanent static variable.
- *
- * If write_wal is true then an appropriate WAL message is emitted.
- * (It will be false for bootstrap and WAL replay cases.)
- *
- * If send_sinval is true then a SI invalidation message is sent.
- * (This should be true except in bootstrap case.)
- *
- * If preserve_files is true then the storage manager is warned not to
- * delete the files listed in the map.
+ * load_relmap_file -- load data from the shared or local map file
  *
- * Because this may be called during WAL replay when MyDatabaseId,
- * DatabasePath, etc aren't valid, we require the caller to pass in suitable
- * values.  The caller is also responsible for being sure no concurrent
- * map update could be happening.
+ * Note that the local case requires DatabasePath to be set up.
  */
 static void
-write_relmap_file(bool shared, RelMapFile *newmap,
-				  bool write_wal, bool send_sinval, bool preserve_files,
-				  Oid dbid, Oid tsid, const char *dbpath)
+load_relmap_file(bool shared, bool lock_held)
 {
-	int			fd;
-	RelMapFile *realmap;
+	RelMapFile *map;
 	char		mapfilename[MAXPGPATH];
 
-	/*
-	 * Fill in the overhead fields and update CRC.
-	 */
-	newmap->magic = RELMAPPER_FILEMAGIC;
-	if (newmap->num_mappings < 0 || newmap->num_mappings > MAX_MAPPINGS)
-		elog(ERROR, "attempt to write bogus relation mapping");
-
-	INIT_CRC32C(newmap->crc);
-	COMP_CRC32C(newmap->crc, (char *) newmap, offsetof(RelMapFile, crc));
-	FIN_CRC32C(newmap->crc);
-
-	/*
-	 * Open the target file.  We prefer to do this before entering the
-	 * critical section, so that an open() failure need not force PANIC.
-	 */
 	if (shared)
 	{
 		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
 				 RELMAPPER_FILENAME);
-		realmap = &shared_map;
+		map = &shared_map;
 	}
 	else
 	{
 		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 dbpath, RELMAPPER_FILENAME);
-		realmap = &local_map;
+				 DatabasePath, RELMAPPER_FILENAME);
+		map = &local_map;
 	}
 
+	/* Read data ... */
+	read_relmap_file(mapfilename, map, lock_held);
+}
+
+/*
+ * Helper function for write_relmap_file.  See the comments atop
+ * write_relmap_file for more details.  The CRC should be computed by the
+ * caller and stored in newmap.
+ */
+static void
+write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
+						   bool write_wal, bool send_sinval,
+						   bool preserve_files, Oid dbid, Oid tsid,
+						   const char *dbpath)
+{
+	int			fd;
+
+	/*
+	 * Open the target file.  We prefer to do this before entering the
+	 * critical section, so that an open() failure need not force PANIC.
+	 */
 	fd = OpenTransientFile(mapfilename, O_WRONLY | O_CREAT | PG_BINARY);
 	if (fd < 0)
 		ereport(ERROR,
@@ -934,6 +911,68 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		}
 	}
 
+	/* Critical section done */
+	if (write_wal)
+		END_CRIT_SECTION();
+}
+
+/*
+ * Write out a new shared or local map file with the given contents.
+ *
+ * The magic number and CRC are automatically updated in *newmap.  On
+ * success, we copy the data to the appropriate permanent static variable.
+ *
+ * If write_wal is true then an appropriate WAL message is emitted.
+ * (It will be false for bootstrap and WAL replay cases.)
+ *
+ * If send_sinval is true then a SI invalidation message is sent.
+ * (This should be true except in bootstrap case.)
+ *
+ * If preserve_files is true then the storage manager is warned not to
+ * delete the files listed in the map.
+ *
+ * Because this may be called during WAL replay when MyDatabaseId,
+ * DatabasePath, etc aren't valid, we require the caller to pass in suitable
+ * values.  The caller is also responsible for being sure no concurrent
+ * map update could be happening.
+ */
+static void
+write_relmap_file(bool shared, RelMapFile *newmap,
+				  bool write_wal, bool send_sinval, bool preserve_files,
+				  Oid dbid, Oid tsid, const char *dbpath)
+{
+	RelMapFile *realmap;
+	char		mapfilename[MAXPGPATH];
+
+	/*
+	 * Fill in the overhead fields and update CRC.
+	 */
+	newmap->magic = RELMAPPER_FILEMAGIC;
+	if (newmap->num_mappings < 0 || newmap->num_mappings > MAX_MAPPINGS)
+		elog(ERROR, "attempt to write bogus relation mapping");
+
+	INIT_CRC32C(newmap->crc);
+	COMP_CRC32C(newmap->crc, (char *) newmap, offsetof(RelMapFile, crc));
+	FIN_CRC32C(newmap->crc);
+
+	if (shared)
+	{
+		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
+				 RELMAPPER_FILENAME);
+		realmap = &shared_map;
+	}
+	else
+	{
+		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+				 dbpath, RELMAPPER_FILENAME);
+		realmap = &local_map;
+	}
+
+	/* Write the map to the relmap file. */
+	write_relmap_file_internal(mapfilename, newmap, write_wal,
+							   send_sinval, preserve_files, dbid, tsid,
+							   dbpath);
+
 	/*
 	 * Success, update permanent copy.  During bootstrap, we might be working
 	 * on the permanent copy itself, in which case skip the memcpy() to avoid
@@ -943,10 +982,6 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		memcpy(realmap, newmap, sizeof(RelMapFile));
 	else
 		Assert(!send_sinval);	/* must be bootstrapping */
-
-	/* Critical section done */
-	if (write_wal)
-		END_CRIT_SECTION();
 }
 
 /*
-- 
2.23.0

v5-0003-Refactor-index_copy_data.patch (application/octet-stream)

From 27743d5cf737dd10e54c4231e3d2c5cace3b87ea Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 24 Sep 2021 18:13:25 +0530
Subject: [PATCH v5 3/7] Refactor index_copy_data

Make a separate interface for copying relation storage; this will
be used by a later patch for copying the database relations.
---
 src/backend/commands/tablecmds.c | 59 +++++++++++++++++++-------------
 src/include/commands/tablecmds.h |  5 +++
 2 files changed, 40 insertions(+), 24 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index ff97b618e6..426d1b02bf 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14189,21 +14189,13 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
 	return new_tablespaceoid;
 }
 
-static void
-index_copy_data(Relation rel, RelFileNode newrnode)
+/*
+ * Copy the data of all forks of the source smgr to the destination smgr.
+ */
+void
+RelationCopyAllFork(SMgrRelation src_smgr, SMgrRelation	dst_smgr,
+					char relpersistence, copy_relation_storage copy_storage)
 {
-	SMgrRelation dstrel;
-
-	dstrel = smgropen(newrnode, rel->rd_backend);
-
-	/*
-	 * Since we copy the file directly without looking at the shared buffers,
-	 * we'd better first flush out any pages of the source relation that are
-	 * in shared buffers.  We assume no new changes will be made while we are
-	 * holding exclusive lock on the rel.
-	 */
-	FlushRelationBuffers(rel);
-
 	/*
 	 * Create and copy all forks of the relation, and schedule unlinking of
 	 * old physical files.
@@ -14211,32 +14203,51 @@ index_copy_data(Relation rel, RelFileNode newrnode)
 	 * NOTE: any conflict in relfilenode value will be caught in
 	 * RelationCreateStorage().
 	 */
-	RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
+	RelationCreateStorage(dst_smgr->smgr_rnode.node, relpersistence);
 
 	/* copy main fork */
-	RelationCopyStorage(RelationGetSmgr(rel), dstrel, MAIN_FORKNUM,
-						rel->rd_rel->relpersistence);
+	copy_storage(src_smgr, dst_smgr, MAIN_FORKNUM, relpersistence);
 
 	/* copy those extra forks that exist */
 	for (ForkNumber forkNum = MAIN_FORKNUM + 1;
 		 forkNum <= MAX_FORKNUM; forkNum++)
 	{
-		if (smgrexists(RelationGetSmgr(rel), forkNum))
+		if (smgrexists(src_smgr, forkNum))
 		{
-			smgrcreate(dstrel, forkNum, false);
+			smgrcreate(dst_smgr, forkNum, false);
 
 			/*
 			 * WAL log creation if the relation is persistent, or this is the
 			 * init fork of an unlogged relation.
 			 */
-			if (RelationIsPermanent(rel) ||
-				(rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
+			if (relpersistence == RELPERSISTENCE_PERMANENT ||
+				(relpersistence == RELPERSISTENCE_UNLOGGED &&
 				 forkNum == INIT_FORKNUM))
-				log_smgrcreate(&newrnode, forkNum);
-			RelationCopyStorage(RelationGetSmgr(rel), dstrel, forkNum,
-								rel->rd_rel->relpersistence);
+				log_smgrcreate(&dst_smgr->smgr_rnode.node, forkNum);
+
+			/* Copy a fork's data, block by block. */
+			copy_storage(src_smgr, dst_smgr, forkNum, relpersistence);
 		}
 	}
+}
+
+static void
+index_copy_data(Relation rel, RelFileNode newrnode)
+{
+	SMgrRelation dstrel;
+
+	dstrel = smgropen(newrnode, rel->rd_backend);
+
+	/*
+	 * Since we copy the file directly without looking at the shared buffers,
+	 * we'd better first flush out any pages of the source relation that are
+	 * in shared buffers.  We assume no new changes will be made while we are
+	 * holding exclusive lock on the rel.
+	 */
+	FlushRelationBuffers(rel);
+
+	RelationCopyAllFork(RelationGetSmgr(rel), dstrel,
+						rel->rd_rel->relpersistence, RelationCopyStorage);
 
 	/* drop old relation, and close new one */
 	RelationDropStorage(rel);
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 336549cc5f..e0e0aa5aa0 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -19,10 +19,13 @@
 #include "catalog/objectaddress.h"
 #include "nodes/parsenodes.h"
 #include "storage/lock.h"
+#include "storage/smgr.h"
 #include "utils/relcache.h"
 
 struct AlterTableUtilityContext;	/* avoid including tcop/utility.h here */
 
+typedef void (*copy_relation_storage) (SMgrRelation src, SMgrRelation dst,
+									  ForkNumber forkNum, char relpersistence);
 
 extern ObjectAddress DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 									ObjectAddress *typaddress, const char *queryString);
@@ -42,6 +45,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
 
 extern Oid	AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
 
+extern void RelationCopyAllFork(SMgrRelation src_smgr, SMgrRelation	dst_smgr,
+								char relpersistence, copy_relation_storage copy_storage);
 extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
 										 Oid *oldschema);
 
-- 
2.23.0
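
As a reading aid for the refactoring above, here is a minimal standalone
sketch of the pattern RelationCopyAllFork uses: the fork loop stays in one
place and the per-fork copy strategy is passed in as a function pointer.
Every name below (ToyRel, toy_copy_fork, toy_copy_all_forks, and so on) is
hypothetical and only models the idea; none of it is PostgreSQL code, and
storage creation and WAL logging are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; these are NOT the PostgreSQL definitions. */
enum { MAIN_FORK = 0, FSM_FORK, VM_FORK, INIT_FORK, NUM_FORKS };

typedef struct ToyRel
{
	bool	exists[NUM_FORKS];	/* does this fork exist? */
	int		nblocks[NUM_FORKS];	/* toy "contents": just a block count */
} ToyRel;

/* Plays the role of the patch's copy_relation_storage callback type. */
typedef void (*toy_copy_fork) (const ToyRel *src, ToyRel *dst, int fork);

/* One possible strategy: copy the block count verbatim. */
void
toy_copy_plain(const ToyRel *src, ToyRel *dst, int fork)
{
	dst->nblocks[fork] = src->nblocks[fork];
}

/* Copy the main fork unconditionally, then every extra fork that exists. */
void
toy_copy_all_forks(const ToyRel *src, ToyRel *dst, toy_copy_fork copy_fork)
{
	assert(src != NULL && dst != NULL && copy_fork != NULL);
	assert(src->exists[MAIN_FORK]);

	memset(dst, 0, sizeof(*dst));
	dst->exists[MAIN_FORK] = true;
	copy_fork(src, dst, MAIN_FORK);

	for (int fork = MAIN_FORK + 1; fork < NUM_FORKS; fork++)
	{
		if (!src->exists[fork])
			continue;			/* nothing to create or copy */
		dst->exists[fork] = true;
		copy_fork(src, dst, fork);
	}
}
```

The point of the callback is that index_copy_data can keep passing the
smgr-level RelationCopyStorage while a different caller supplies another
strategy with the same signature.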

v5-0004-Extend-bufmgr-interfaces.patch (application/octet-stream)

From 57030c4cc7ca6d2af8581c04a4307a46b383f01b Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 24 Sep 2021 18:23:39 +0530
Subject: [PATCH v5 4/7] Extend bufmgr interfaces

Extend the ReadBufferWithoutRelcache interface to take relpersistence
as an input, and extend DropDatabaseBuffers to take a tablespace oid
as an input.
---
 src/backend/access/transam/xlogutils.c |  9 ++++++---
 src/backend/commands/dbcommands.c      |  9 +++------
 src/backend/storage/buffer/bufmgr.c    | 24 +++++++++++-------------
 src/include/storage/bufmgr.h           |  5 +++--
 4 files changed, 23 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 88a1bfd939..e734a91bf7 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -484,7 +484,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	{
 		/* page exists in file */
 		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-										   mode, NULL);
+										   mode, NULL,
+										   RELPERSISTENCE_PERMANENT);
 	}
 	else
 	{
@@ -509,7 +510,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				ReleaseBuffer(buffer);
 			}
 			buffer = ReadBufferWithoutRelcache(rnode, forknum,
-											   P_NEW, mode, NULL);
+											   P_NEW, mode, NULL,
+											   RELPERSISTENCE_PERMANENT);
 		}
 		while (BufferGetBlockNumber(buffer) < blkno);
 		/* Handle the corner case that P_NEW returns non-consecutive pages */
@@ -519,7 +521,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			ReleaseBuffer(buffer);
 			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-											   mode, NULL);
+											   mode, NULL,
+											   RELPERSISTENCE_PERMANENT);
 		}
 	}
 
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 029fab48df..1d963d8428 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -938,7 +938,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
 	 * is important to ensure that no remaining backend tries to write out a
 	 * dirty buffer to the dead database later...
 	 */
-	DropDatabaseBuffers(db_id);
+	DropDatabaseBuffers(db_id, InvalidOid);
 
 	/*
 	 * Tell the stats collector to forget it immediately, too.
@@ -1220,11 +1220,8 @@ movedb(const char *dbname, const char *tblspcname)
 	 * contain valid data again --- but they'd be missing any changes made in
 	 * the database while it was in the new tablespace.  In any case, freeing
 	 * buffers that should never be used again seems worth the cycles.
-	 *
-	 * Note: it'd be sufficient to get rid of buffers matching db_id and
-	 * src_tblspcoid, but bufmgr.c presently provides no API for that.
 	 */
-	DropDatabaseBuffers(db_id);
+	DropDatabaseBuffers(db_id, src_tblspcoid);
 
 	/*
 	 * Check for existence of files in the target directory, i.e., objects of
@@ -2201,7 +2198,7 @@ dbase_redo(XLogReaderState *record)
 		ReplicationSlotsDropDBSlots(xlrec->db_id);
 
 		/* Drop pages for this database that are in the shared buffer cache */
-		DropDatabaseBuffers(xlrec->db_id);
+		DropDatabaseBuffers(xlrec->db_id, InvalidOid);
 
 		/* Also, clean out any fsync requests that might be pending in md.c */
 		ForgetDatabaseSyncRequests(xlrec->db_id);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e88e4e918b..ed54c34031 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -770,24 +770,17 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 /*
  * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
  *		a relcache entry for the relation.
- *
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay.  If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
  */
 Buffer
 ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
-						  BufferAccessStrategy strategy)
+						  BufferAccessStrategy strategy, char relpersistence)
 {
 	bool		hit;
 
 	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
 
-	Assert(InRecovery);
-
-	return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
+	return ReadBuffer_common(smgr, relpersistence, forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -797,7 +790,7 @@ ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
  *
  * *hit is set to true if the request was satisfied from shared buffer cache.
  */
-static Buffer
+Buffer
 ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 				  BlockNumber blockNum, ReadBufferMode mode,
 				  BufferAccessStrategy strategy, bool *hit)
@@ -3402,10 +3395,13 @@ FindAndDropRelFileNodeBuffers(RelFileNode rnode, ForkNumber forkNum,
  *		database, to avoid trying to flush data to disk when the directory
  *		tree no longer exists.  Implementation is pretty similar to
  *		DropRelFileNodeBuffers() which is for destroying just one relation.
+ *
+ *		If a valid tablespace oid is passed, only buffers that also match
+ *		that tablespace are dropped; otherwise all buffers of the database.
  * --------------------------------------------------------------------
  */
 void
-DropDatabaseBuffers(Oid dbid)
+DropDatabaseBuffers(Oid dbid, Oid tbsid)
 {
 	int			i;
 
@@ -3423,11 +3419,13 @@ DropDatabaseBuffers(Oid dbid)
 		 * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
 		 * and saves some cycles.
 		 */
-		if (bufHdr->tag.rnode.dbNode != dbid)
+		if (bufHdr->tag.rnode.dbNode != dbid ||
+			(OidIsValid(tbsid) && bufHdr->tag.rnode.spcNode != tbsid))
 			continue;
 
 		buf_state = LockBufHdr(bufHdr);
-		if (bufHdr->tag.rnode.dbNode == dbid)
+		if (bufHdr->tag.rnode.dbNode == dbid &&
+			(!OidIsValid(tbsid) || bufHdr->tag.rnode.spcNode == tbsid))
 			InvalidateBuffer(bufHdr);	/* releases spinlock */
 		else
 			UnlockBufHdr(bufHdr, buf_state);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index cfce23ecbc..237c6a9078 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -184,7 +184,8 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BufferAccessStrategy strategy);
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 										ForkNumber forkNum, BlockNumber blockNum,
-										ReadBufferMode mode, BufferAccessStrategy strategy);
+										ReadBufferMode mode, BufferAccessStrategy strategy,
+										char relpersistence);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
@@ -207,7 +208,7 @@ extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
 								   int nforks, BlockNumber *firstDelBlock);
 extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes);
-extern void DropDatabaseBuffers(Oid dbid);
+extern void DropDatabaseBuffers(Oid dbid, Oid tbsid);
 
 #define RelationGetNumberOfBlocks(reln) \
 	RelationGetNumberOfBlocksInFork(reln, MAIN_FORKNUM)
-- 
2.23.0
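
The new DropDatabaseBuffers condition is easy to misread in diff form, so
here is a minimal standalone model of it. It assumes a toy buffer array and
hypothetical names (ToyBufTag, toy_buffer_matches,
toy_drop_database_buffers); it is not the bufmgr code, and it leaves out
the buffer-header spinlock and the locked recheck that the real function
performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int ToyOid;
#define TOY_INVALID_OID ((ToyOid) 0)

/* Hypothetical, simplified buffer tag: tablespace, database, validity. */
typedef struct ToyBufTag
{
	ToyOid	spc;
	ToyOid	db;
	bool	valid;
} ToyBufTag;

/*
 * True if the buffer belongs to database db and, when tbs is a valid oid,
 * also to tablespace tbs.  An invalid tbs means "any tablespace".
 */
bool
toy_buffer_matches(const ToyBufTag *tag, ToyOid db, ToyOid tbs)
{
	if (tag->db != db)
		return false;
	return tbs == TOY_INVALID_OID || tag->spc == tbs;
}

/* Invalidate every matching buffer and return how many were dropped. */
int
toy_drop_database_buffers(ToyBufTag *bufs, size_t n, ToyOid db, ToyOid tbs)
{
	int		dropped = 0;

	assert(bufs != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		if (bufs[i].valid && toy_buffer_matches(&bufs[i], db, tbs))
		{
			bufs[i].valid = false;
			dropped++;
		}
	}
	return dropped;
}
```

In this model, passing TOY_INVALID_OID corresponds to the dropdb() and
dbase_redo() call sites, while passing a real tablespace oid corresponds
to the movedb() call site.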

v5-0005-New-interface-to-lock-relation-id.patch (application/octet-stream)

From 3573561044471110ece483c0c94ae50578cb2ee6 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 24 Sep 2021 18:29:17 +0530
Subject: [PATCH v5 5/7] New interface to lock relation id

Same as LockRelationOid, but takes a LockRelId object instead of a
relation oid.  So instead of using MyDatabaseId it uses the dboid
stored in the LockRelId object.  This makes it possible to lock a
relation even if we are not connected to its database.
---
 src/backend/storage/lmgr/lmgr.c | 28 ++++++++++++++++++++++++++++
 src/include/storage/lmgr.h      |  1 +
 2 files changed, 29 insertions(+)

diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index cdf2266d6d..4a321aa4b2 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -175,6 +175,34 @@ ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode)
 	return true;
 }
 
+/*
+ *		LockRelationId
+ *
+ * Lock, given a LockRelId.  Same as LockRelationOid, but takes a LockRelId
+ * as input.
+ */
+void
+LockRelationId(LockRelId *relid, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+	LOCALLOCK  *locallock;
+	LockAcquireResult res;
+
+	SET_LOCKTAG_RELATION(tag, relid->dbId, relid->relId);
+
+	res = LockAcquireExtended(&tag, lockmode, false, false, true, &locallock);
+
+	/*
+	 * Now that we have the lock, check for invalidation messages; see notes
+	 * in LockRelationOid.
+	 */
+	if (res != LOCKACQUIRE_ALREADY_CLEAR)
+	{
+		AcceptInvalidationMessages();
+		MarkLockClear(locallock);
+	}
+}
+
 /*
  *		UnlockRelationId
  *
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index b009559229..092ee934b4 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -38,6 +38,7 @@ extern void RelationInitLockInfo(Relation relation);
 
 /* Lock a relation */
 extern void LockRelationOid(Oid relid, LOCKMODE lockmode);
+extern void LockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern bool ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode);
 extern void UnlockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern void UnlockRelationOid(Oid relid, LOCKMODE lockmode);
-- 
2.23.0
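
To make the difference between the two locking entry points concrete, the
following standalone sketch builds a toy lock tag both ways. All names
(ToyLockTag, ToyLockRelId, toy_my_database_id, toy_tag_from_oid,
toy_tag_from_relid) are hypothetical, and the sketch ignores shared
relations and the lock manager itself; it only shows which database oid
ends up in the tag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int ToyOid;

/* Hypothetical stand-ins for LockRelId and LOCKTAG. */
typedef struct ToyLockRelId
{
	ToyOid	relId;
	ToyOid	dbId;
} ToyLockRelId;

typedef struct ToyLockTag
{
	ToyOid	db;
	ToyOid	rel;
} ToyLockTag;

/* Stand-in for the session's own database (MyDatabaseId in the server). */
ToyOid		toy_my_database_id = 5;

/* Oid-based variant: the database always comes from the session. */
ToyLockTag
toy_tag_from_oid(ToyOid relid)
{
	ToyLockTag	tag = {toy_my_database_id, relid};

	return tag;
}

/* LockRelId-based variant: the caller names the database explicitly. */
ToyLockTag
toy_tag_from_relid(const ToyLockRelId *relid)
{
	ToyLockTag	tag;

	assert(relid != NULL);
	tag.db = relid->dbId;
	tag.rel = relid->relId;
	return tag;
}

bool
toy_tags_equal(ToyLockTag a, ToyLockTag b)
{
	return a.db == b.db && a.rel == b.rel;
}
```

The same relation oid therefore yields two different tags, which is why
the oid-based call cannot be used to lock pg_class of a database other
than the one the backend is connected to.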

v5-0002-Extend-relmap-interfaces.patch (application/octet-stream)

From 8b113b9d59ef9aa75c8ed3862b7cf8058da157ee Mon Sep 17 00:00:00 2001
From: dilipkumar <dilipbalaut@gmail.com>
Date: Mon, 4 Oct 2021 13:50:44 +0530
Subject: [PATCH v5 2/7] Extend relmap interfaces

Add two new interfaces to the relmapper: 1) Copy the relmap file
from one database path to another database path.  2) Like
RelationMapOidToFilenode, look up the relfilenode for a relation
oid, but in the relmap of the given database path instead of the
database we are connected to.

These interfaces are required by a later patch in this series that
adds WAL-logged CREATE DATABASE.
---
 src/backend/utils/cache/relmapper.c | 122 ++++++++++++++++++++++++----
 src/include/utils/relmapper.h       |   6 +-
 2 files changed, 112 insertions(+), 16 deletions(-)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index bb39632080..51f361cf64 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -141,7 +141,7 @@ static void read_relmap_file(char *mapfilename, RelMapFile *map,
 static void write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
 									   bool write_wal, bool send_sinval,
 									   bool preserve_files, Oid dbid, Oid tsid,
-									   const char *dbpath);
+									   const char *dbpath, bool create);
 static void load_relmap_file(bool shared, bool lock_held);
 static void write_relmap_file(bool shared, RelMapFile *newmap,
 							  bool write_wal, bool send_sinval, bool preserve_files,
@@ -255,6 +255,36 @@ RelationMapFilenodeToOid(Oid filenode, bool shared)
 	return InvalidOid;
 }
 
+/*
+ * RelationMapOidToFilenodeForDatabase
+ *
+ * Find relfilenode for the given relation id in the dbpath.  Returns
+ * InvalidOid if the relationId is not found in the relmap.
+ */
+Oid
+RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId)
+{
+	RelMapFile	map;
+	int			i;
+	char		mapfilename[MAXPGPATH];
+
+	/* Relmap file path for the given dbpath. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 dbpath, RELMAPPER_FILENAME);
+
+	/* Read the relmap file from the source database. */
+	read_relmap_file(mapfilename, &map, false);
+
+	/* Iterate over the relmap entries to find the input relation oid. */
+	for (i = 0; i < map.num_mappings; i++)
+	{
+		if (relationId == map.mappings[i].mapoid)
+			return map.mappings[i].mapfilenode;
+	}
+
+	return InvalidOid;
+}
+
 /*
  * RelationMapUpdateMap
  *
@@ -693,7 +723,43 @@ RestoreRelationMap(char *startAddress)
 }
 
 /*
- * read_relmap_file -- read data from given mapfilename file.
+ * CopyRelationMap
+ *
+ * Copy relmapfile from source db path to the destination db path and WAL log
+ * the operation.  This function is only called during the create database, so
+ * the destination database is not yet visible to anyone else, thus we don't
+ * need to acquire the relmap lock while updating the destination relmap.
+ */
+void
+CopyRelationMap(Oid dbid, Oid tsid, char *srcdbpath, char *dstdbpath)
+{
+	RelMapFile map;
+	char mapfilename[MAXPGPATH];
+
+	/* Relmap file path of the source database. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 srcdbpath, RELMAPPER_FILENAME);
+
+	/* Read the relmap file from the source database. */
+	read_relmap_file(mapfilename, &map, false);
+
+	/* Relmap file path of the destination database. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 dstdbpath, RELMAPPER_FILENAME);
+
+	/*
+	 * Write map contents into the destination database's relmap file.
+	 * write_relmap_file_internal expects the CRC to have been computed and
+	 * stored in the input map.  Since we read this map from the source
+	 * database and write it to the destination file unmodified, the stored
+	 * CRC is still valid and need not be recomputed.
+	 */
+	write_relmap_file_internal(mapfilename, &map, true, false, true, dbid,
+							   tsid, dstdbpath, true);
+}
+
+/*
+ * read_relmap_file - read data from given mapfilename file.
  *
  * Because the map file is essential for access to core system catalogs,
  * failure to read it is a fatal error.
@@ -796,15 +862,18 @@ load_relmap_file(bool shared, bool lock_held)
 }
 
 /*
- * Helper function for write_relmap_file, Read comments atop write_relmap_file
- * for more details.  The CRC should be computed by the caller and stored in
- * the newmap.
+ * Helper function for write_relmap_file and CopyRelationMap.  See the
+ * comments atop write_relmap_file for more details.  The CRC should be
+ * computed by the caller and stored in the newmap.
+ *
+ * Pass create = true if we are copying the relmap file during the CREATE
+ * DATABASE command.
  */
 static void
 write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
 						   bool write_wal, bool send_sinval,
 						   bool preserve_files, Oid dbid, Oid tsid,
-						   const char *dbpath)
+						   const char *dbpath, bool create)
 {
 	int			fd;
 
@@ -830,6 +899,7 @@ write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
 		xlrec.dbid = dbid;
 		xlrec.tsid = tsid;
 		xlrec.nbytes = sizeof(RelMapFile);
+		xlrec.create = create;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&xlrec), MinSizeOfRelmapUpdate);
@@ -971,7 +1041,7 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 	/* Write the map to the relmap file. */
 	write_relmap_file_internal(mapfilename, newmap, write_wal,
 							   send_sinval, preserve_files, dbid, tsid,
-							   dbpath);
+							   dbpath, false);
 
 	/*
 	 * Success, update permanent copy.  During bootstrap, we might be working
@@ -1063,15 +1133,37 @@ relmap_redo(XLogReaderState *record)
 		 * Write out the new map and send sinval, but of course don't write a
 		 * new WAL entry.  There's no surrounding transaction to tell to
 		 * preserve files, either.
-		 *
-		 * There shouldn't be anyone else updating relmaps during WAL replay,
-		 * but grab the lock to interlock against load_relmap_file().
 		 */
-		LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
-		write_relmap_file((xlrec->dbid == InvalidOid), &newmap,
-						  false, true, false,
-						  xlrec->dbid, xlrec->tsid, dbpath);
-		LWLockRelease(RelationMappingLock);
+		if (!xlrec->create)
+		{
+			/*
+			 * There shouldn't be anyone else updating relmaps during WAL
+			 * replay, but grab the lock to interlock against
+			 * load_relmap_file().
+			 */
+			LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
+			write_relmap_file((xlrec->dbid == InvalidOid), &newmap,
+							false, true, false,
+							xlrec->dbid, xlrec->tsid, dbpath);
+			LWLockRelease(RelationMappingLock);
+		}
+		else
+		{
+			char		mapfilename[MAXPGPATH];
+
+			/* Construct the mapfilename. */
+			snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+					 dbpath, RELMAPPER_FILENAME);
+
+			/*
+			 * We don't need to take the relmap lock because this WAL record
+			 * is logged while creating a new database, so no one else can be
+			 * reading or writing the relmap file.
+			 */
+			write_relmap_file_internal(mapfilename, &newmap, false, false,
+									   false, xlrec->dbid, xlrec->tsid, dbpath,
+									   true);
+		}
 
 		pfree(dbpath);
 	}
diff --git a/src/include/utils/relmapper.h b/src/include/utils/relmapper.h
index c0d14daad9..4165f0990b 100644
--- a/src/include/utils/relmapper.h
+++ b/src/include/utils/relmapper.h
@@ -29,6 +29,7 @@ typedef struct xl_relmap_update
 	Oid			dbid;			/* database ID, or 0 for shared map */
 	Oid			tsid;			/* database's tablespace, or pg_global */
 	int32		nbytes;			/* size of relmap data */
+	bool		create;			/* true if creating new relmap */
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } xl_relmap_update;
 
@@ -39,6 +40,8 @@ extern Oid	RelationMapOidToFilenode(Oid relationId, bool shared);
 
 extern Oid	RelationMapFilenodeToOid(Oid relationId, bool shared);
 
+extern Oid RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId);
+
 extern void RelationMapUpdateMap(Oid relationId, Oid fileNode, bool shared,
 								 bool immediate);
 
@@ -62,7 +65,8 @@ extern void RelationMapInitializePhase3(void);
 extern Size EstimateRelationMapSpace(void);
 extern void SerializeRelationMap(Size maxSize, char *startAddress);
 extern void RestoreRelationMap(char *startAddress);
-
+extern void CopyRelationMap(Oid dbid, Oid tsid, char *srcdbpath,
+							char *dstdbpath);
 extern void relmap_redo(XLogReaderState *record);
 extern void relmap_desc(StringInfo buf, XLogReaderState *record);
 extern const char *relmap_identify(uint8 info);
-- 
2.23.0
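
For reference, the lookup added by RelationMapOidToFilenodeForDatabase is
a linear scan over the mappings read from the relmap file. The standalone
sketch below models that scan, plus the fallback that the later CREATE
DATABASE patch in this message applies (use the relfilenode stored in
pg_class when it is valid, otherwise consult the map). The types and
names (ToyRelMap, toy_map_oid_to_filenode, toy_resolve_filenode) are
hypothetical, the map is held in memory rather than read from disk, and
the oids in any usage are purely illustrative.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int ToyOid;
#define TOY_INVALID_OID ((ToyOid) 0)
#define TOY_MAX_MAPPINGS 8

/* Hypothetical in-memory model of a relmap file's contents. */
typedef struct ToyMapping
{
	ToyOid	mapoid;			/* relation oid */
	ToyOid	mapfilenode;	/* its relfilenode */
} ToyMapping;

typedef struct ToyRelMap
{
	int			num_mappings;
	ToyMapping	mappings[TOY_MAX_MAPPINGS];
} ToyRelMap;

/* Return the mapped relfilenode, or TOY_INVALID_OID if relid is unmapped. */
ToyOid
toy_map_oid_to_filenode(const ToyRelMap *map, ToyOid relid)
{
	assert(map != NULL);
	assert(map->num_mappings >= 0 && map->num_mappings <= TOY_MAX_MAPPINGS);

	for (int i = 0; i < map->num_mappings; i++)
	{
		if (map->mappings[i].mapoid == relid)
			return map->mappings[i].mapfilenode;
	}
	return TOY_INVALID_OID;
}

/*
 * Prefer the relfilenode recorded in the catalog row; fall back to the
 * relmap only when the catalog value is invalid (a mapped relation).
 */
ToyOid
toy_resolve_filenode(const ToyRelMap *map, ToyOid relid,
					 ToyOid class_relfilenode)
{
	if (class_relfilenode != TOY_INVALID_OID)
		return class_relfilenode;
	return toy_map_oid_to_filenode(map, relid);
}
```

Because the real function re-reads the relmap file on every call, a
caller that resolves many mapped relations may want to read the map once
and reuse it, which is what the in-memory ToyRelMap models here.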

v5-0006-WAL-logged-CREATE-DATABASE.patch (application/octet-stream)

From 694ba24fc5697be1830bf587e76b7a04e700e087 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Tue, 5 Oct 2021 11:45:02 +0530
Subject: [PATCH v5 6/7] WAL logged CREATE DATABASE

Currently, CREATE DATABASE forces a checkpoint, then copies all the files,
then forces another checkpoint. The comments in the createdb() function
explain the reasons for this. This patch fixes the problem by making
CREATE DATABASE completely WAL-logged so that we can avoid the checkpoints.

This can also be useful for supporting TDE. For example, if the source and
the target database need different encryption, we cannot re-encrypt the
page data when we copy the whole directory.  With this patch we copy page
by page, so we have an opportunity to re-encrypt each page before copying
it to the target database.
---
 src/backend/access/rmgrdesc/dbasedesc.c |   3 +-
 src/backend/commands/dbcommands.c       | 697 ++++++++++++++++--------
 src/bin/pg_rewind/parsexlog.c           |   1 +
 src/include/commands/dbcommands_xlog.h  |   3 -
 4 files changed, 471 insertions(+), 233 deletions(-)

diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index 26609845aa..5010f72b2c 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -28,8 +28,7 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;
 
-		appendStringInfo(buf, "copy dir %u/%u to %u/%u",
-						 xlrec->src_tablespace_id, xlrec->src_db_id,
+		appendStringInfo(buf, "create dir %u/%u",
 						 xlrec->tablespace_id, xlrec->db_id);
 	}
 	else if (info == XLOG_DBASE_DROP)
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 1d963d8428..2b70d4d388 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -45,13 +45,13 @@
 #include "commands/dbcommands_xlog.h"
 #include "commands/defrem.h"
 #include "commands/seclabel.h"
+#include "commands/tablecmds.h"
 #include "commands/tablespace.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/bgwriter.h"
 #include "replication/slot.h"
-#include "storage/copydir.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
@@ -62,6 +62,7 @@
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/pg_locale.h"
+#include "utils/relmapper.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
@@ -77,6 +78,19 @@ typedef struct
 	Oid			dest_tsoid;		/* tablespace we are trying to move to */
 } movedb_failure_params;
 
+/*
+ * When creating a database, we scan the pg_class of the source database to
+ * identify all the relations to be copied.  The structure is used for storing
+ * information about each relation of the source database.
+ */
+typedef struct CreateDBRelInfo
+{
+	RelFileNode		rnode;				/* physical relation identifier */
+	Oid				reloid;				/* relation oid */
+	char			relpersistence;		/* relation's persistence level */
+} CreateDBRelInfo;
+
+
 /* non-export function prototypes */
 static void createdb_failure_callback(int code, Datum arg);
 static void movedb(const char *dbname, const char *tblspcname);
@@ -91,6 +105,426 @@ static bool have_createdb_privilege(void);
 static void remove_dbtablespaces(Oid db_id);
 static bool check_db_file_conflict(Oid db_id);
 static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
+static void CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid,
+									bool isRedo);
+static List *GetDatabaseRelationList(Oid srctbid, Oid srcdbid, char *srcpath);
+void RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+									ForkNumber forkNum, char relpersistence);
+static void CopyDatabase(Oid src_dboid, Oid dboid, Oid src_tsid, Oid dst_tsid);
+
+/*
+ * CreateDirAndVersionFile - Create database directory and write out the
+ *							 PG_VERSION file in the database path.
+ *
+ * If isRedo is true, it's okay for the database directory to exist already.
+ *
+ * We can directly write PG_MAJORVERSION in the version file instead of copying
+ * from the source database file because these two must be the same.
+ */
+static void
+CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
+{
+	int		fd;
+	int		nbytes = strlen(PG_MAJORVERSION);
+	char	versionfile[MAXPGPATH];
+
+	/* If we are not in WAL replay then write the WAL. */
+	if (!isRedo)
+	{
+		xl_dbase_create_rec xlrec;
+		XLogRecPtr	lsn;
+
+		/* Now errors are fatal ... */
+		START_CRIT_SECTION();
+
+		xlrec.db_id = dbid;
+		xlrec.tablespace_id = tsid;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), sizeof(xl_dbase_create_rec));
+
+		lsn = XLogInsert(RM_DBASE_ID, XLOG_DBASE_CREATE);
+
+		/* As always, WAL must hit the disk before the data update does. */
+		XLogFlush(lsn);
+	}
+
+	/* Create database directory. */
+	if (MakePGDirectory(dbpath) < 0)
+	{
+		/* Failure other than already exists or not in WAL replay? */
+		if (errno != EEXIST || !isRedo)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create directory \"%s\": %m", dbpath)));
+	}
+
+	/*
+	 * Create PG_VERSION file in the database path.  If the file already exists
+	 * and we are in WAL replay then try again to open it in the write mode.
+	 */
+	snprintf(versionfile, sizeof(versionfile), "%s/%s", dbpath, "PG_VERSION");
+
+	fd = OpenTransientFile(versionfile, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+	if (fd < 0 && errno == EEXIST && isRedo)
+		fd = OpenTransientFile(versionfile, O_RDWR | PG_BINARY);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m", versionfile)));
+
+	/* Write PG_MAJORVERSION in the PG_VERSION file. */
+	pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_WRITE);
+	errno = 0;
+	if ((int) write(fd, (char *) PG_MAJORVERSION, nbytes) != nbytes)
+	{
+		/* If write didn't set errno, assume problem is no disk space. */
+		if (errno == 0)
+			errno = ENOSPC;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", versionfile)));
+	}
+	pgstat_report_wait_end();
+
+	/* Close the version file. */
+	CloseTransientFile(fd);
+
+	/* Critical section done. */
+	if (!isRedo)
+		END_CRIT_SECTION();
+}
+
+/*
+ * GetDatabaseRelationList - Get relfilenode list to be copied.
+ *
+ * Iterate over each block of the pg_class relation.  From there, we will check
+ * all the visible tuples in order to get a list of all the valid relfilenodes
+ * in the source database that should be copied to the target database.
+ */
+static List *
+GetDatabaseRelationList(Oid tbid, Oid dbid, char *srcpath)
+{
+	SMgrRelation	rd_smgr;
+	RelFileNode		rnode;
+	BlockNumber		nblocks;
+	BlockNumber		blkno;
+	OffsetNumber	offnum;
+	OffsetNumber	maxoff;
+	Buffer			buf;
+	Oid				relfilenode;
+	Page			page;
+	List		   *rnodelist = NIL;
+	HeapTupleData	tuple;
+	Form_pg_class	classForm;
+	LockRelId		relid;
+	BufferAccessStrategy bstrategy;
+
+	/* Get pg_class relfilenode. */
+	relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+													  RelationRelationId);
+	/*
+	 * We are going to read the buffers associated with the pg_class relation.
+	 * Thus, acquire the relation level lock before start scanning.  As we are
+	 * not connected to the database, we cannot use relation_open directly, so
+	 * we have to lock using relation id.
+	 */
+	relid.dbId = dbid;
+	relid.relId = RelationRelationId;
+	LockRelationId(&relid, AccessShareLock);
+
+	/* Prepare a relnode for pg_class relation. */
+	rnode.spcNode = tbid;
+	rnode.dbNode = dbid;
+	rnode.relNode = relfilenode;
+
+	/*
+	 * We are not connected to the source database so open the pg_class
+	 * relation at the smgr level and get the block count.
+	 */
+	rd_smgr = smgropen(rnode, InvalidBackendId);
+	nblocks = smgrnblocks(rd_smgr, MAIN_FORKNUM);
+
+	/*
+	 * We're going to read the whole pg_class so better to use bulk-read buffer
+	 * access strategy.
+	 */
+	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	/* Iterate over each block on the pg_class relation. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		/*
+		 * We are not connected to the source database so directly use the lower
+		 * level bufmgr interface which operates on the rnode.
+		 */
+		buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
+										RBM_NORMAL, bstrategy,
+										RELPERSISTENCE_PERMANENT);
+
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buf);
+		if (PageIsNew(page) || PageIsEmpty(page))
+			continue;
+
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		/* Iterate over each tuple on the page. */
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid;
+
+			itemid = PageGetItemId(page, offnum);
+
+			/* Nothing to do if slot is empty or already dead. */
+			if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid) ||
+				ItemIdIsRedirected(itemid))
+				continue;
+
+			Assert(ItemIdIsNormal(itemid));
+			ItemPointerSet(&(tuple.t_self), blkno, offnum);
+
+			tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+			tuple.t_len = ItemIdGetLength(itemid);
+			tuple.t_tableOid = RelationRelationId;
+
+			/*
+			 * If the tuple is visible then add its relfilenode info to the
+			 * list.
+			 */
+			if (HeapTupleSatisfiesVisibility(&tuple, GetActiveSnapshot(), buf))
+			{
+				Oid				relfilenode = InvalidOid;
+				CreateDBRelInfo   *relinfo;
+
+				classForm = (Form_pg_class) GETSTRUCT(&tuple);
+
+				/* We don't need to copy the shared objects to the target. */
+				if (classForm->reltablespace == GLOBALTABLESPACE_OID)
+					continue;
+
+				/*
+				 * If the object doesn't have the storage then nothing to be
+				 * done for that object so just ignore it.
+				 */
+				if (!RELKIND_HAS_STORAGE(classForm->relkind))
+					continue;
+
+				/*
+				 * If relfilenode is valid then directly use it.  Otherwise,
+				 * consult the relmapper for the mapped relation.
+				 *
+				 * XXX We can optimize RelationMapOidToFilenodeForDatabase API
+				 * so that instead of reading the relmap file every time, it can
+				 * save it in a temporary variable and use it for subsequent
+				 * calls.  Then later reset it once we're done or at the
+				 * transaction end.
+				 */
+				if (OidIsValid(classForm->relfilenode))
+					relfilenode = classForm->relfilenode;
+				else
+					relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+													classForm->oid);
+
+				/* We must have a valid relfilenode oid. */
+				Assert(OidIsValid(relfilenode));
+
+				/* Prepare a rel info element and add it to the list. */
+				relinfo = (CreateDBRelInfo *) palloc(sizeof(CreateDBRelInfo));
+				if (OidIsValid(classForm->reltablespace))
+					relinfo->rnode.spcNode = classForm->reltablespace;
+				else
+					relinfo->rnode.spcNode = tbid;
+
+				relinfo->rnode.dbNode = dbid;
+				relinfo->rnode.relNode = relfilenode;
+				relinfo->reloid = classForm->oid;
+				relinfo->relpersistence = classForm->relpersistence;
+
+				if (rnodelist == NULL)
+					rnodelist = list_make1(relinfo);
+				else
+					rnodelist = lappend(rnodelist, relinfo);
+			}
+		}
+
+		/* Release the buffer lock. */
+		UnlockReleaseBuffer(buf);
+	}
+
+	/* Release the lock. */
+	UnlockRelationId(&relid, AccessShareLock);
+
+	return rnodelist;
+}
+
+/*
+ * RelationCopyStorageUsingBuffer - Copy fork's data using bufmgr.
+ *
+ * Same as RelationCopyStorage but instead of using smgrread and smgrextend this
+ * will copy using bufmgr apis.
+ */
+void
+RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+							   ForkNumber forkNum, char relpersistence)
+{
+	Buffer		srcBuf;
+	Buffer		dstBuf;
+	Page		srcPage;
+	Page		dstPage;
+	bool		use_wal;
+	bool		copying_initfork;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	BufferAccessStrategy bstrategy_src;
+	BufferAccessStrategy bstrategy_dst;
+
+	/* See the comments in RelationCopyStorage. */
+	copying_initfork = relpersistence == RELPERSISTENCE_UNLOGGED &&
+		forkNum == INIT_FORKNUM;
+	use_wal = XLogIsNeeded() &&
+		(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
+
+	/* Get number of blocks in the source relation. */
+	nblocks = smgrnblocks(src, forkNum);
+
+	/*
+	 * We are going to copy whole relation from the source to the destination
+	 * so use BAS_BULKREAD strategy for the source relation and BAS_BULKWRITE
+	 * strategy for the destination relation.
+	 */
+	bstrategy_src = GetAccessStrategy(BAS_BULKREAD);
+	bstrategy_dst = GetAccessStrategy(BAS_BULKWRITE);
+
+	/* Iterate over each block of the source relation file. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		/* If we got a cancel signal during the copy of the data, quit */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Read block from source relation. */
+		srcBuf = ReadBufferWithoutRelcache(src->smgr_rnode.node, forkNum,
+										   blkno, RBM_NORMAL, bstrategy_src,
+										   relpersistence);
+		srcPage = BufferGetPage(srcBuf);
+		if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
+		{
+			ReleaseBuffer(srcBuf);
+			continue;
+		}
+
+		/* Use P_NEW to extend the relation. */
+		dstBuf = ReadBufferWithoutRelcache(dst->smgr_rnode.node, forkNum,
+										   P_NEW, RBM_NORMAL, bstrategy_dst,
+										   relpersistence);
+		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
+
+		START_CRIT_SECTION();
+
+		/* Initialize the page and write the data. */
+		dstPage = BufferGetPage(dstBuf);
+		PageInit(dstPage, BufferGetPageSize(dstBuf), 0);
+		memcpy(dstPage, srcPage, BLCKSZ);
+		MarkBufferDirty(dstBuf);
+
+		/* WAL-log the copied page. */
+		if (use_wal)
+			log_newpage_buffer(dstBuf, true);
+
+		END_CRIT_SECTION();
+
+		UnlockReleaseBuffer(dstBuf);
+		ReleaseBuffer(srcBuf);
+	}
+}
+
+/*
+ * CopyDatabase - Copy source database to the target database.
+ *
+ * Create target database directory and copy data files from the source database
+ * to the target database, block by block and WAL log all the operations.
+ */
+static void
+CopyDatabase(Oid src_dboid, Oid dst_dboid, Oid src_tsid, Oid dst_tsid)
+{
+	char	   *srcpath;
+	char	   *dstpath;
+	List	   *rnodelist = NULL;
+	ListCell   *cell;
+	LockRelId	relid;
+	RelFileNode	srcrnode;
+	RelFileNode	dstrnode;
+	CreateDBRelInfo	*relinfo;
+
+	/* Get the source database path. */
+	srcpath = GetDatabasePath(src_dboid, src_tsid);
+
+	/* Get the destination database path. */
+	dstpath = GetDatabasePath(dst_dboid, dst_tsid);
+
+	/* Create database directory and write PG_VERSION file. */
+	CreateDirAndVersionFile(dstpath, dst_dboid, dst_tsid, false);
+
+	/* Copy relmap file from source database to the destination database. */
+	CopyRelationMap(dst_dboid, dst_tsid, srcpath, dstpath);
+
+	/* Get list of all valid relnode from the source database. */
+	rnodelist = GetDatabaseRelationList(src_tsid, src_dboid, srcpath);
+	Assert(rnodelist != NIL);
+
+	/*
+	 * Database id is common for all the relation so set it before entering to
+	 * the loop.
+	 */
+	relid.dbId = src_dboid;
+
+	/*
+	 * Iterate over each relfilenode and copy the relation data block by block
+	 * from source database to the destination database.
+	 */
+	foreach(cell, rnodelist)
+	{
+		SMgrRelation	src_smgr;
+		SMgrRelation	dst_smgr;
+
+		relinfo = lfirst(cell);
+		srcrnode = relinfo->rnode;
+
+		/*
+		 * If the relation is from the default tablespace then we need to
+		 * create it in the destinations db's default tablespace.  Otherwise,
+		 * we need to create in the same tablespace as it is in the source
+		 * database.
+		 */
+		if (srcrnode.spcNode != src_tsid)
+			dstrnode.spcNode = srcrnode.spcNode;
+		else
+			dstrnode.spcNode = dst_tsid;
+
+		dstrnode.dbNode = dst_dboid;
+		dstrnode.relNode = srcrnode.relNode;
+
+		/* Acquire the lock on relation before start copying. */
+		relid.relId = relinfo->reloid;
+		LockRelationId(&relid, AccessShareLock);
+
+		/* Open the source and the destination relation at smgr level. */
+		src_smgr = smgropen(srcrnode, InvalidBackendId);
+		dst_smgr = smgropen(dstrnode, InvalidBackendId);
+
+		/* Copy relation storage from source to the destination. */
+		RelationCopyAllFork(src_smgr, dst_smgr, relinfo->relpersistence,
+							RelationCopyStorageUsingBuffer);
+
+		/* Release the lock. */
+		UnlockRelationId(&relid, AccessShareLock);
+	}
+
+	list_free_deep(rnodelist);
+}
 
 
 /*
@@ -99,8 +533,6 @@ static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
 Oid
 createdb(ParseState *pstate, const CreatedbStmt *stmt)
 {
-	TableScanDesc scan;
-	Relation	rel;
 	Oid			src_dboid;
 	Oid			src_owner;
 	int			src_encoding = -1;
@@ -563,139 +995,27 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	InvokeObjectPostCreateHook(DatabaseRelationId, dboid, 0);
 
 	/*
-	 * Force a checkpoint before starting the copy. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, if a checkpoint
-	 * happened while we're copying files, a file might be deleted just when
-	 * we're about to copy it, causing the lstat() call in copydir() to fail
-	 * with ENOENT.
-	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
-
-	/*
-	 * Once we start copying subdirectories, we need to be able to clean 'em
-	 * up if we fail.  Use an ENSURE block to make sure this happens.  (This
-	 * is not a 100% solution, because of the possibility of failure during
-	 * transaction commit after we leave this routine, but it should handle
-	 * most scenarios.)
+	 * Once we start copying files from the source database, we need to be able
+	 * to clean 'em up if we fail.  Use an ENSURE block to make sure this
+	 * happens.  (This is not a 100% solution, because of the possibility of
+	 * failure during transaction commit after we leave this routine, but it
+	 * should handle most scenarios.)
 	 */
 	fparms.src_dboid = src_dboid;
 	fparms.dest_dboid = dboid;
 	PG_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
-		/*
-		 * Iterate through all tablespaces of the template database, and copy
-		 * each one to the new database.
-		 */
-		rel = table_open(TableSpaceRelationId, AccessShareLock);
-		scan = table_beginscan_catalog(rel, 0, NULL);
-		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
-		{
-			Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
-			Oid			srctablespace = spaceform->oid;
-			Oid			dsttablespace;
-			char	   *srcpath;
-			char	   *dstpath;
-			struct stat st;
-
-			/* No need to copy global tablespace */
-			if (srctablespace == GLOBALTABLESPACE_OID)
-				continue;
-
-			srcpath = GetDatabasePath(src_dboid, srctablespace);
-
-			if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
-				directory_is_empty(srcpath))
-			{
-				/* Assume we can ignore it */
-				pfree(srcpath);
-				continue;
-			}
-
-			if (srctablespace == src_deftablespace)
-				dsttablespace = dst_deftablespace;
-			else
-				dsttablespace = srctablespace;
-
-			dstpath = GetDatabasePath(dboid, dsttablespace);
-
-			/*
-			 * Copy this subdirectory to the new location
-			 *
-			 * We don't need to copy subdirectories
-			 */
-			copydir(srcpath, dstpath, false);
-
-			/* Record the filesystem change in XLOG */
-			{
-				xl_dbase_create_rec xlrec;
-
-				xlrec.db_id = dboid;
-				xlrec.tablespace_id = dsttablespace;
-				xlrec.src_db_id = src_dboid;
-				xlrec.src_tablespace_id = srctablespace;
-
-				XLogBeginInsert();
-				XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-				(void) XLogInsert(RM_DBASE_ID,
-								  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-			}
-		}
-		table_endscan(scan);
-		table_close(rel, AccessShareLock);
-
-		/*
-		 * We force a checkpoint before committing.  This effectively means
-		 * that committed XLOG_DBASE_CREATE operations will never need to be
-		 * replayed (at least not in ordinary crash recovery; we still have to
-		 * make the XLOG entry for the benefit of PITR operations). This
-		 * avoids two nasty scenarios:
-		 *
-		 * #1: When PITR is off, we don't XLOG the contents of newly created
-		 * indexes; therefore the drop-and-recreate-whole-directory behavior
-		 * of DBASE_CREATE replay would lose such indexes.
-		 *
-		 * #2: Since we have to recopy the source database during DBASE_CREATE
-		 * replay, we run the risk of copying changes in it that were
-		 * committed after the original CREATE DATABASE command but before the
-		 * system crash that led to the replay.  This is at least unexpected
-		 * and at worst could lead to inconsistencies, eg duplicate table
-		 * names.
-		 *
-		 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
-		 *
-		 * In PITR replay, the first of these isn't an issue, and the second
-		 * is only a risk if the CREATE DATABASE and subsequent template
-		 * database change both occur while a base backup is being taken.
-		 * There doesn't seem to be much we can do about that except document
-		 * it as a limitation.
-		 *
-		 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
-		 * we can avoid this.
-		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
-
-		/*
-		 * Close pg_database, but keep lock till commit.
-		 */
-		table_close(pg_database_rel, NoLock);
-
-		/*
-		 * Force synchronous commit, thus minimizing the window between
-		 * creation of the database files and committal of the transaction. If
-		 * we crash before committing, we'll have a DB that's taking up disk
-		 * space but is not in pg_database, which is not good.
-		 */
-		ForceSyncCommit();
+		CopyDatabase(src_dboid, dboid, src_deftablespace, dst_deftablespace);
 	}
 	PG_END_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 								PointerGetDatum(&fparms));
 
+	/*
+	 * Close pg_database, but keep lock till commit.
+	 */
+	table_close(pg_database_rel, NoLock);
+
 	return dboid;
 }
 
@@ -1195,34 +1515,6 @@ movedb(const char *dbname, const char *tblspcname)
 	src_dbpath = GetDatabasePath(db_id, src_tblspcoid);
 	dst_dbpath = GetDatabasePath(db_id, dst_tblspcoid);
 
-	/*
-	 * Force a checkpoint before proceeding. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, the check for existing
-	 * files in the target directory might fail unnecessarily, not to mention
-	 * that the copy might fail due to source files getting deleted under it.
-	 * On Windows, this also ensures that background procs don't hold any open
-	 * files, which would cause rmdir() to fail.
-	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
-
-	/*
-	 * Now drop all buffers holding data of the target database; they should
-	 * no longer be dirty so DropDatabaseBuffers is safe.
-	 *
-	 * It might seem that we could just let these buffers age out of shared
-	 * buffers naturally, since they should not get referenced anymore.  The
-	 * problem with that is that if the user later moves the database back to
-	 * its original tablespace, any still-surviving buffers would appear to
-	 * contain valid data again --- but they'd be missing any changes made in
-	 * the database while it was in the new tablespace.  In any case, freeing
-	 * buffers that should never be used again seems worth the cycles.
-	 */
-	DropDatabaseBuffers(db_id, src_tblspcoid);
-
 	/*
 	 * Check for existence of files in the target directory, i.e., objects of
 	 * this database that are already in the target tablespace.  We can't
@@ -1258,38 +1550,16 @@ movedb(const char *dbname, const char *tblspcname)
 	}
 
 	/*
-	 * Use an ENSURE block to make sure we remove the debris if the copy fails
-	 * (eg, due to out-of-disk-space).  This is not a 100% solution, because
-	 * of the possibility of failure during transaction commit, but it should
-	 * handle most scenarios.
+	 * Use an ENSURE block to make sure we remove the debris if the copy fails.
+	 * This is not a 100% solution, because of the possibility of failure
+	 * during transaction commit, but it should handle most scenarios.
 	 */
 	fparms.dest_dboid = db_id;
 	fparms.dest_tsoid = dst_tblspcoid;
 	PG_ENSURE_ERROR_CLEANUP(movedb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
-		/*
-		 * Copy files from the old tablespace to the new one
-		 */
-		copydir(src_dbpath, dst_dbpath, false);
-
-		/*
-		 * Record the filesystem change in XLOG
-		 */
-		{
-			xl_dbase_create_rec xlrec;
-
-			xlrec.db_id = db_id;
-			xlrec.tablespace_id = dst_tblspcoid;
-			xlrec.src_db_id = db_id;
-			xlrec.src_tablespace_id = src_tblspcoid;
-
-			XLogBeginInsert();
-			XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-			(void) XLogInsert(RM_DBASE_ID,
-							  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-		}
+		CopyDatabase(db_id, db_id, src_tblspcoid, dst_tblspcoid);
 
 		/*
 		 * Update the database's pg_database tuple
@@ -1322,22 +1592,6 @@ movedb(const char *dbname, const char *tblspcname)
 
 		systable_endscan(sysscan);
 
-		/*
-		 * Force another checkpoint here.  As in CREATE DATABASE, this is to
-		 * ensure that we don't have to replay a committed XLOG_DBASE_CREATE
-		 * operation, which would cause us to lose any unlogged operations
-		 * done in the new DB tablespace before the next checkpoint.
-		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
-
-		/*
-		 * Force synchronous commit, thus minimizing the window between
-		 * copying the database files and committal of the transaction. If we
-		 * crash before committing, we'll leave an orphaned set of files on
-		 * disk, which is not fatal but not good either.
-		 */
-		ForceSyncCommit();
-
 		/*
 		 * Close pg_database, but keep lock till commit.
 		 */
@@ -1346,6 +1600,21 @@ movedb(const char *dbname, const char *tblspcname)
 	PG_END_ENSURE_ERROR_CLEANUP(movedb_failure_callback,
 								PointerGetDatum(&fparms));
 
+	/*
+	 * Now drop all buffers holding data of the target database for the old
+	 * tablespace oid.  We have already copied all the data to the new
+	 * tablespace, so we no longer require the old buffers.
+	 *
+	 * It might seem that we could just let these buffers age out of shared
+	 * buffers naturally, since they should not get referenced anymore.  The
+	 * problem with that is that if the user later moves the database back to
+	 * its original tablespace, any still-surviving buffers would appear to
+	 * contain valid data again --- but they'd be missing any changes made in
+	 * the database while it was in the new tablespace.  In any case, freeing
+	 * buffers that should never be used again seems worth the cycles.
+	 */
+	DropDatabaseBuffers(db_id, src_tblspcoid);
+
 	/*
 	 * Commit the transaction so that the pg_database update is committed. If
 	 * we crash while removing files, the database won't be corrupt, we'll
@@ -2138,39 +2407,11 @@ dbase_redo(XLogReaderState *record)
 	if (info == XLOG_DBASE_CREATE)
 	{
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
-		char	   *src_path;
-		char	   *dst_path;
-		struct stat st;
-
-		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
-		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
-
-		/*
-		 * Our theory for replaying a CREATE is to forcibly drop the target
-		 * subdirectory if present, then re-copy the source data. This may be
-		 * more work than needed, but it is simple to implement.
-		 */
-		if (stat(dst_path, &st) == 0 && S_ISDIR(st.st_mode))
-		{
-			if (!rmtree(dst_path, true))
-				/* If this failed, copydir() below is going to error. */
-				ereport(WARNING,
-						(errmsg("some useless files may be left behind in old database directory \"%s\"",
-								dst_path)));
-		}
-
-		/*
-		 * Force dirty buffers out to disk, to ensure source database is
-		 * up-to-date for the copy.
-		 */
-		FlushDatabaseBuffers(xlrec->src_db_id);
+		char	   *dbpath;
 
-		/*
-		 * Copy this subdirectory to the new location
-		 *
-		 * We don't need to copy subdirectories
-		 */
-		copydir(src_path, dst_path, false);
+		dbpath = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		CreateDirAndVersionFile(dbpath, xlrec->db_id, xlrec->tablespace_id,
+								true);
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 59ebac7d6a..f71b3446f8 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -23,6 +23,7 @@
 #include "fe_utils/archive.h"
 #include "filemap.h"
 #include "pg_rewind.h"
+#include "utils/relmapper.h"
 
 /*
  * RmgrNames is an array of resource manager names, to make error messages
diff --git a/src/include/commands/dbcommands_xlog.h b/src/include/commands/dbcommands_xlog.h
index f5ed762677..21dc58ea5d 100644
--- a/src/include/commands/dbcommands_xlog.h
+++ b/src/include/commands/dbcommands_xlog.h
@@ -23,11 +23,8 @@
 
 typedef struct xl_dbase_create_rec
 {
-	/* Records copying of a single subdirectory incl. contents */
 	Oid			db_id;
 	Oid			tablespace_id;
-	Oid			src_db_id;
-	Oid			src_tablespace_id;
 } xl_dbase_create_rec;
 
 typedef struct xl_dbase_drop_rec
-- 
2.23.0

v5-0007-POC-WAL-LOG-CREATE-DATABASE-APPROACH-2.patch (application/octet-stream)

From e82dc3f212c06990d15d495232f19e296ddb1afd Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 4 Oct 2021 19:08:51 +0530
Subject: [PATCH v5 7/7] POC WAL LOG CREATE DATABASE-APPROACH-2

Modify the previous patch so that, instead of scanning pg_class to
identify the relations in the source database, it scans the database
directory directly.
---
 src/backend/commands/dbcommands.c | 347 ++++++++++--------------------
 1 file changed, 119 insertions(+), 228 deletions(-)

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 2b70d4d388..eeba3fd004 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -78,19 +78,6 @@ typedef struct
 	Oid			dest_tsoid;		/* tablespace we are trying to move to */
 } movedb_failure_params;
 
-/*
- * When creating a database, we scan the pg_class of the source database to
- * identify all the relations to be copied.  The structure is used for storing
- * information about each relation of the source database.
- */
-typedef struct CreateDBRelInfo
-{
-	RelFileNode		rnode;				/* physical relation identifier */
-	Oid				reloid;				/* relation oid */
-	char			relpersistence;		/* relation's persistence level */
-} CreateDBRelInfo;
-
-
 /* non-export function prototypes */
 static void createdb_failure_callback(int code, Datum arg);
 static void movedb(const char *dbname, const char *tblspcname);
@@ -107,7 +94,6 @@ static bool check_db_file_conflict(Oid db_id);
 static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
 static void CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid,
 									bool isRedo);
-static List *GetDatabaseRelationList(Oid srctbid, Oid srcdbid, char *srcpath);
 void RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
 									ForkNumber forkNum, char relpersistence);
 static void CopyDatabase(Oid src_dboid, Oid dboid, Oid src_tsid, Oid dst_tsid);
@@ -196,171 +182,6 @@ CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
 		END_CRIT_SECTION();
 }
 
-/*
- * GetDatabaseRelationList - Get relfilenode list to be copied.
- *
- * Iterate over each block of the pg_class relation.  From there, we will check
- * all the visible tuples in order to get a list of all the valid relfilenodes
- * in the source database that should be copied to the target database.
- */
-static List *
-GetDatabaseRelationList(Oid tbid, Oid dbid, char *srcpath)
-{
-	SMgrRelation	rd_smgr;
-	RelFileNode		rnode;
-	BlockNumber		nblocks;
-	BlockNumber		blkno;
-	OffsetNumber	offnum;
-	OffsetNumber	maxoff;
-	Buffer			buf;
-	Oid				relfilenode;
-	Page			page;
-	List		   *rnodelist = NIL;
-	HeapTupleData	tuple;
-	Form_pg_class	classForm;
-	LockRelId		relid;
-	BufferAccessStrategy bstrategy;
-
-	/* Get pg_class relfilenode. */
-	relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
-													  RelationRelationId);
-	/*
-	 * We are going to read the buffers associated with the pg_class relation.
-	 * Thus, acquire the relation level lock before start scanning.  As we are
-	 * not connected to the database, we cannot use relation_open directly, so
-	 * we have to lock using relation id.
-	 */
-	relid.dbId = dbid;
-	relid.relId = RelationRelationId;
-	LockRelationId(&relid, AccessShareLock);
-
-	/* Prepare a relnode for pg_class relation. */
-	rnode.spcNode = tbid;
-	rnode.dbNode = dbid;
-	rnode.relNode = relfilenode;
-
-	/*
-	 * We are not connected to the source database so open the pg_class
-	 * relation at the smgr level and get the block count.
-	 */
-	rd_smgr = smgropen(rnode, InvalidBackendId);
-	nblocks = smgrnblocks(rd_smgr, MAIN_FORKNUM);
-
-	/*
-	 * We're going to read the whole pg_class so better to use bulk-read buffer
-	 * access strategy.
-	 */
-	bstrategy = GetAccessStrategy(BAS_BULKREAD);
-
-	/* Iterate over each block on the pg_class relation. */
-	for (blkno = 0; blkno < nblocks; blkno++)
-	{
-		/*
-		 * We are not connected to the source database so directly use the lower
-		 * level bufmgr interface which operates on the rnode.
-		 */
-		buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
-										RBM_NORMAL, bstrategy,
-										RELPERSISTENCE_PERMANENT);
-
-		LockBuffer(buf, BUFFER_LOCK_SHARE);
-		page = BufferGetPage(buf);
-		if (PageIsNew(page) || PageIsEmpty(page))
-			continue;
-
-		maxoff = PageGetMaxOffsetNumber(page);
-
-		/* Iterate over each tuple on the page. */
-		for (offnum = FirstOffsetNumber;
-			 offnum <= maxoff;
-			 offnum = OffsetNumberNext(offnum))
-		{
-			ItemId		itemid;
-
-			itemid = PageGetItemId(page, offnum);
-
-			/* Nothing to do if slot is empty or already dead. */
-			if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid) ||
-				ItemIdIsRedirected(itemid))
-				continue;
-
-			Assert(ItemIdIsNormal(itemid));
-			ItemPointerSet(&(tuple.t_self), blkno, offnum);
-
-			tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
-			tuple.t_len = ItemIdGetLength(itemid);
-			tuple.t_tableOid = RelationRelationId;
-
-			/*
-			 * If the tuple is visible then add its relfilenode info to the
-			 * list.
-			 */
-			if (HeapTupleSatisfiesVisibility(&tuple, GetActiveSnapshot(), buf))
-			{
-				Oid				relfilenode = InvalidOid;
-				CreateDBRelInfo   *relinfo;
-
-				classForm = (Form_pg_class) GETSTRUCT(&tuple);
-
-				/* We don't need to copy the shared objects to the target. */
-				if (classForm->reltablespace == GLOBALTABLESPACE_OID)
-					continue;
-
-				/*
-				 * If the object doesn't have the storage then nothing to be
-				 * done for that object so just ignore it.
-				 */
-				if (!RELKIND_HAS_STORAGE(classForm->relkind))
-					continue;
-
-				/*
-				 * If relfilenode is valid then directly use it.  Otherwise,
-				 * consult the relmapper for the mapped relation.
-				 *
-				 * XXX We can optimize RelationMapOidToFileenodeForDatabase API
-				 * so that instead of reading the relmap file every time, it can
-				 * save it in a temporary variable and use it for subsequent
-				 * calls.  Then later reset it once we're done or at the
-				 * transaction end.
-				 */
-				if (OidIsValid(classForm->relfilenode))
-					relfilenode = classForm->relfilenode;
-				else
-					relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
-													classForm->oid);
-
-				/* We must have a valid relfilenode oid. */
-				Assert(OidIsValid(relfilenode));
-
-				/* Prepare a rel info element and add it to the list. */
-				relinfo = (CreateDBRelInfo *) palloc(sizeof(CreateDBRelInfo));
-				if (OidIsValid(classForm->reltablespace))
-					relinfo->rnode.spcNode = classForm->reltablespace;
-				else
-					relinfo->rnode.spcNode = tbid;
-
-				relinfo->rnode.dbNode = dbid;
-				relinfo->rnode.relNode = relfilenode;
-				relinfo->reloid = classForm->oid;
-				relinfo->relpersistence = classForm->relpersistence;
-
-				if (rnodelist == NULL)
-					rnodelist = list_make1(relinfo);
-				else
-					rnodelist = lappend(rnodelist, relinfo);
-			}
-		}
-
-		/* Release the buffer lock. */
-		UnlockReleaseBuffer(buf);
-	}
-
-	/* Release the lock. */
-	UnlockRelationId(&relid, AccessShareLock);
-
-	return rnodelist;
-}
-
 /*
  * RelationCopyStorageUsingBuffer - Copy fork's data using bufmgr.
  *
@@ -441,6 +262,44 @@ RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
 	}
 }
 
+/*
+ * GetRelfileNodeFromFileName - Get relfilenode from the filename
+ *
+ * Return InvalidOid for the temp relation.  Also return InvalidOid if the file
+ * is not for the first segment or for the MAIN_FORKNUM.  Basically, for each
+ * relation we want to return a valid relfilenode only for the MAIN_FORKNUM and
+ * for the first segment.  And, based on the relfilenode the caller will take
+ * care of copying all the forks.
+ */
+static Oid
+GetRelfileNodeFromFileName(char *filename)
+{
+	int		nmatch;
+	int		segno;
+	int		backendId;
+	Oid		relfilenode;
+	char	forkname[FORKNAMECHARS + 1];
+
+	/* Return InvalidOid if it's file for temp relation. */
+	nmatch = sscanf(filename, "%d_%u", &backendId, &relfilenode);
+	if (nmatch == 2)
+		return InvalidOid;
+
+	/* If not first segment, return InvalidOid. */
+	nmatch = sscanf(filename, "%u.%u", &relfilenode, &segno);
+	if (nmatch == 2)
+		return InvalidOid;
+
+	/* If not main fork, return InvalidOid. */
+	nmatch = sscanf(filename, "%u_%s", &relfilenode, forkname);
+	if (nmatch == 2)
+		return InvalidOid;
+	else if (nmatch == 1)
+		return relfilenode;
+
+	return InvalidOid;
+}
+
 /*
  * CopyDatabase - Copy source database to the target database.
  *
@@ -450,14 +309,14 @@ RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
 static void
 CopyDatabase(Oid src_dboid, Oid dst_dboid, Oid src_tsid, Oid dst_tsid)
 {
+	DIR		   *xldir;
+	struct dirent *xlde;
 	char	   *srcpath;
 	char	   *dstpath;
-	List	   *rnodelist = NULL;
-	ListCell   *cell;
-	LockRelId	relid;
+	char		fromfile[MAXPGPATH * 2];
+	Oid			relfilenode;
 	RelFileNode	srcrnode;
 	RelFileNode	dstrnode;
-	CreateDBRelInfo	*relinfo;
 
 	/* Get the source database path. */
 	srcpath = GetDatabasePath(src_dboid, src_tsid);
@@ -471,61 +330,55 @@ CopyDatabase(Oid src_dboid, Oid dst_dboid, Oid src_tsid, Oid dst_tsid)
 	/* Copy relmap file from source database to the destination database. */
 	CopyRelationMap(dst_dboid, dst_tsid, srcpath, dstpath);
 
-	/* Get list of all valid relnode from the source database. */
-	rnodelist = GetDatabaseRelationList(src_tsid, src_dboid, srcpath);
-	Assert(rnodelist != NIL);
+	xldir = AllocateDir(srcpath);
+	srcrnode.spcNode = src_tsid;
+	srcrnode.dbNode = src_dboid;
+	dstrnode.spcNode = dst_tsid;
+	dstrnode.dbNode = dst_dboid;
 
-	/*
-	 * Database id is common for all the relation so set it before entering to
-	 * the loop.
-	 */
-	relid.dbId = src_dboid;
-
-	/*
-	 * Iterate over each relfilenode and copy the relation data block by block
-	 * from source database to the destination database.
-	 */
-	foreach(cell, rnodelist)
+	while ((xlde = ReadDir(xldir, srcpath)) != NULL)
 	{
-		SMgrRelation	src_smgr;
-		SMgrRelation	dst_smgr;
+		struct stat fst;
 
-		relinfo = lfirst(cell);
-		srcrnode = relinfo->rnode;
+		/* If we got a cancel signal during the copy of the directory, quit */
+		CHECK_FOR_INTERRUPTS();
 
-		/*
-		 * If the relation is from the default tablespace then we need to
-		 * create it in the destinations db's default tablespace.  Otherwise,
-		 * we need to create in the same tablespace as it is in the source
-		 * database.
-		 */
-		if (srcrnode.spcNode != src_tsid)
-			dstrnode.spcNode = srcrnode.spcNode;
-		else
-			dstrnode.spcNode = dst_tsid;
+		if (strcmp(xlde->d_name, ".") == 0 ||
+			strcmp(xlde->d_name, "..") == 0)
+			continue;
 
-		dstrnode.dbNode = dst_dboid;
-		dstrnode.relNode = srcrnode.relNode;
+		snprintf(fromfile, sizeof(fromfile), "%s/%s", srcpath, xlde->d_name);
 
-		/* Acquire the lock on relation before start copying. */
-		relid.relId = relinfo->reloid;
-		LockRelationId(&relid, AccessShareLock);
+		if (lstat(fromfile, &fst) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not stat file \"%s\": %m", fromfile)));
 
-		/* Open the source and the destination relation at smgr level. */
-		src_smgr = smgropen(srcrnode, InvalidBackendId);
-		dst_smgr = smgropen(dstrnode, InvalidBackendId);
+		relfilenode = GetRelfileNodeFromFileName(xlde->d_name);
 
-		/* Copy relation storage from source to the destination. */
-		RelationCopyAllFork(src_smgr, dst_smgr, relinfo->relpersistence,
-							RelationCopyStorageUsingBuffer);
+		if (OidIsValid(relfilenode))
+		{
+			SMgrRelation	src_smgr;
+			SMgrRelation	dst_smgr;
+			char			relpersistence;
 
-		/* Release the lock. */
-		UnlockRelationId(&relid, AccessShareLock);
-	}
+			dstrnode.relNode = srcrnode.relNode = relfilenode;
 
-	list_free_deep(rnodelist);
-}
+			/* Open the source and the destination relation at smgr level. */
+			src_smgr = smgropen(srcrnode, InvalidBackendId);
+			dst_smgr = smgropen(dstrnode, InvalidBackendId);
+
+			if (smgrexists(src_smgr, INIT_FORKNUM))
+				relpersistence = RELPERSISTENCE_UNLOGGED;
+			else
+				relpersistence = RELPERSISTENCE_PERMANENT;
 
+			RelationCopyAllFork(src_smgr, dst_smgr, relpersistence,
+								RelationCopyStorageUsingBuffer);
+		}
+	}
+	FreeDir(xldir);
+}
 
 /*
  * CREATE DATABASE
@@ -533,6 +386,8 @@ CopyDatabase(Oid src_dboid, Oid dst_dboid, Oid src_tsid, Oid dst_tsid)
 Oid
 createdb(ParseState *pstate, const CreatedbStmt *stmt)
 {
+	TableScanDesc scan;
+	Relation	rel;
 	Oid			src_dboid;
 	Oid			src_owner;
 	int			src_encoding = -1;
@@ -1006,7 +861,43 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	PG_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
-		CopyDatabase(src_dboid, dboid, src_deftablespace, dst_deftablespace);
+		/*
+		 * Iterate through all tablespaces of the template database, and copy
+		 * each one to the new database.
+		 */
+		rel = table_open(TableSpaceRelationId, AccessShareLock);
+		scan = table_beginscan_catalog(rel, 0, NULL);
+		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+		{
+			Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
+			Oid			srctablespace = spaceform->oid;
+			Oid			dsttablespace;
+			char	   *srcpath;
+			struct stat st;
+
+			/* No need to copy global tablespace */
+			if (srctablespace == GLOBALTABLESPACE_OID)
+				continue;
+
+			srcpath = GetDatabasePath(src_dboid, srctablespace);
+
+			if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
+				directory_is_empty(srcpath))
+			{
+				/* Assume we can ignore it */
+				pfree(srcpath);
+				continue;
+			}
+
+			if (srctablespace == src_deftablespace)
+				dsttablespace = dst_deftablespace;
+			else
+				dsttablespace = srctablespace;
+
+			CopyDatabase(src_dboid, dboid, srctablespace, dsttablespace);
+		}
+		table_endscan(scan);
+		table_close(rel, AccessShareLock);
 	}
 	PG_END_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 								PointerGetDatum(&fparms));
-- 
2.23.0

#41

John Naylor

john.naylor@enterprisedb.com

about 4 years ago

In reply to: Dilip Kumar (#40)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hi,

I've looked over this patch set and email thread a couple times, and I
don't see anything amiss, but I'm also not terribly familiar with the
subsystems this part of the code relies on. I haven't yet tried to stress
test with a large database, but it seems like a good idea to do so.

I have a couple comments and questions:

0006:

+ * XXX We can optimize RelationMapOidToFileenodeForDatabase API
+ * so that instead of reading the relmap file every time, it can
+ * save it in a temporary variable and use it for subsequent
+ * calls.  Then later reset it once we're done or at the
+ * transaction end.

Do we really need to consider optimizing here? Only a handful of relations
will be found in the relmap, right?

+ * Once we start copying files from the source database, we need to be able
+ * to clean 'em up if we fail.  Use an ENSURE block to make sure this
+ * happens.  (This is not a 100% solution, because of the possibility of
+ * failure during transaction commit after we leave this routine, but it
+ * should handle most scenarios.)

This comment in master started with

- * Once we start copying subdirectories, we need to be able to clean 'em

Is the distinction important enough to change this comment? Also, is "most
scenarios" still true with the patch? I haven't read into how ENSURE works.

Same with this comment change, seems fine the way it was:

- * Use an ENSURE block to make sure we remove the debris if the copy fails
- * (eg, due to out-of-disk-space).  This is not a 100% solution, because
- * of the possibility of failure during transaction commit, but it should
- * handle most scenarios.
+ * Use an ENSURE block to make sure we remove the debris if the copy fails.
+ * This is not a 100% solution, because of the possibility of failure
+ * during transaction commit, but it should handle most scenarios.

And do we need additional tests? Maybe we don't, but it seems good to make
sure.

I haven't looked at 0007, and I have no opinion on which approach is better.

--
John Naylor
EDB: http://www.enterprisedb.com

#42

Dilip Kumar

dilipbalaut@gmail.com

about 4 years ago

In reply to: John Naylor (#41)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Tue, Nov 23, 2021 at 10:29 PM John Naylor
<john.naylor@enterprisedb.com> wrote:

Hi,

I've looked over this patch set and email thread a couple times, and I don't see anything amiss, but I'm also not terribly familiar with the subsystems this part of the code relies on. I haven't yet tried to stress test with a large database, but it seems like a good idea to do so.

Thanks, John, for looking into the patches. Yes, that makes sense.
Next week I will try to test with a large database, and maybe with
multiple tablespaces as well, to see how this behaves.

I have a couple comments and questions:

0006:
+ * XXX We can optimize RelationMapOidToFileenodeForDatabase API
+ * so that instead of reading the relmap file every time, it can
+ * save it in a temporary variable and use it for subsequent
+ * calls.  Then later reset it once we're done or at the
+ * transaction end.
Do we really need to consider optimizing here? Only a handful of relations will be found in the relmap, right?

You are right, it is actually not required; I will remove this comment.

+ * Once we start copying files from the source database, we need to be able
+ * to clean 'em up if we fail.  Use an ENSURE block to make sure this
+ * happens.  (This is not a 100% solution, because of the possibility of
+ * failure during transaction commit after we leave this routine, but it
+ * should handle most scenarios.)
This comment in master started with

- * Once we start copying subdirectories, we need to be able to clean 'em

Is the distinction important enough to change this comment? Also, is "most scenarios" still true with the patch? I haven't read into how ENSURE works.

Actually, it is like a PG_TRY()/PG_CATCH() block with the extra
assurance that the cleanup also runs on shmem_exit. In the cleanup
function, we go through all the tablespaces and remove the new
database's directory that we were trying to create. And you are right,
we don't actually need to change the comments.
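
To make the PG_TRY() comparison above concrete, here is a minimal standalone sketch of the pattern. It is not PostgreSQL's actual macro: setjmp/longjmp stand in for the error machinery, and `exit_callback`, `fire_exit_callback()`, `run_with_cleanup()` and the `copy_*` functions are hypothetical names invented for illustration. The cleanup stays registered as an exit callback while the work runs, is called explicitly if the work errors out, and is deregistered on success.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

typedef void (*cleanup_fn) (void *arg);

/* Hypothetical single-slot registry of "run this at process exit". */
static cleanup_fn exit_callback = NULL;
static void *exit_callback_arg = NULL;

/* Jump target standing in for the error-recovery stack. */
static jmp_buf error_context;

/* Stand-in for raising an error: unwinds to the enclosing setjmp(). */
static void
raise_error(void)
{
    longjmp(error_context, 1);
}

/* Stand-in for exit processing: runs the callback if one is registered. */
static void
fire_exit_callback(void)
{
    if (exit_callback != NULL)
        exit_callback(exit_callback_arg);
}

/* Run work(); guarantee cleanup(arg) on error or on exit during the work. */
static int
run_with_cleanup(void (*work) (void), cleanup_fn cleanup, void *arg)
{
    assert(work != NULL && cleanup != NULL);

    /* Register first, so an exit in the middle of the work cleans up. */
    exit_callback = cleanup;
    exit_callback_arg = arg;

    if (setjmp(error_context) == 0)
    {
        work();

        /* Success: nothing to remove, just deregister. */
        exit_callback = NULL;
        exit_callback_arg = NULL;
        return 0;
    }

    /* Error path: deregister, then clean up explicitly. */
    exit_callback = NULL;
    exit_callback_arg = NULL;
    cleanup(arg);
    return -1;
}

/* Demo pieces: the cleanup counts how often it ran. */
static void remove_debris(void *arg) { (*(int *) arg)++; }
static void copy_ok(void) { }
static void copy_fails(void) { raise_error(); }
static void copy_interrupted(void) { fire_exit_callback(); }
```

As in the comment under discussion, this covers failures inside the guarded region only; a failure after `run_with_cleanup()` returns (the analogue of a failure during transaction commit) is not handled.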

Same with this comment change, seems fine the way it was:

Correct.

- * Use an ENSURE block to make sure we remove the debris if the copy fails
- * (eg, due to out-of-disk-space).  This is not a 100% solution, because
- * of the possibility of failure during transaction commit, but it should
- * handle most scenarios.
+ * Use an ENSURE block to make sure we remove the debris if the copy fails.
+ * This is not a 100% solution, because of the possibility of failure
+ * during transaction commit, but it should handle most scenarios.

And do we need additional tests? Maybe we don't, but it seems good to make sure.

I haven't looked at 0007, and I have no opinion on which approach is better.

Okay. I prefer the approach in 0006, mainly for two reasons: 1) it
does not scan the raw directory to identify which files to copy, so it
seems cleaner to me; 2) with 0007, if we scan the directory directly we
don't know the relation OID, so there is no way to acquire the
relation lock before acquiring the buffer lock.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#43

Greg Nancarrow

gregn4422@gmail.com

about 4 years ago

In reply to: Dilip Kumar (#40)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Tue, Oct 5, 2021 at 7:07 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Patch details:
0001 to 0006 implement approach 1.
0007 removes the pg_class scanning code and adds a directory scan instead.

I had a scan through the patches, though have not yet actually run any
tests to try to better gauge their benefit.
I do have some initial review comments though:

0003

src/backend/commands/tablecmds.c
(1) RelationCopyAllFork()
In the following comment:

+/*
+ * Copy source smgr all fork's data to the destination smgr.
+ */

Shouldn't it say "smgr relation"?
Also, you could additionally say ", using a specified fork data
copying function." or something like that, to account for the
additional argument.
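
The role of that additional argument can be sketched in standalone C. This is only an illustration of passing the fork-copying function as a parameter, not the patch's code: `ToyRelation`, `copy_fork_fn`, `copy_fork_plain()`, `copy_fork_blockwise()` and `copy_all_forks()` are hypothetical stand-ins for the smgr relation, the callback typedef and the refactored function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Fork numbers, mirroring the idea of main/fsm/vm/init forks. */
enum { MAIN_FORK = 0, FSM_FORK, VM_FORK, INIT_FORK, NUM_FORKS };

/* Hypothetical stand-in for an smgr relation. */
typedef struct ToyRelation
{
    bool        exists[NUM_FORKS];  /* which forks are present */
    int         nblocks[NUM_FORKS]; /* size of each fork in blocks */
} ToyRelation;

/* Signature that every fork-copying strategy must follow. */
typedef void (*copy_fork_fn) (const ToyRelation *src, ToyRelation *dst,
                              int fork);

/* Strategy 1: copy the whole fork in one step. */
static void
copy_fork_plain(const ToyRelation *src, ToyRelation *dst, int fork)
{
    dst->nblocks[fork] = src->nblocks[fork];
}

/* Strategy 2: copy the fork one block at a time. */
static void
copy_fork_blockwise(const ToyRelation *src, ToyRelation *dst, int fork)
{
    dst->nblocks[fork] = 0;
    for (int blk = 0; blk < src->nblocks[fork]; blk++)
        dst->nblocks[fork]++;
}

/*
 * Copy the main fork unconditionally, then every extra fork that exists
 * in the source, using the caller-supplied strategy.  Returns the number
 * of forks copied.
 */
static int
copy_all_forks(const ToyRelation *src, ToyRelation *dst,
               copy_fork_fn copy_fork)
{
    int         copied = 0;

    assert(src != NULL && dst != NULL && copy_fork != NULL);

    dst->exists[MAIN_FORK] = true;
    copy_fork(src, dst, MAIN_FORK);
    copied++;

    for (int fork = MAIN_FORK + 1; fork < NUM_FORKS; fork++)
    {
        if (!src->exists[fork])
            continue;

        dst->exists[fork] = true;
        copy_fork(src, dst, fork);
        copied++;
    }

    return copied;
}
```

The loop over forks is written once, and each caller chooses how an individual fork is copied by passing a different function.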

0006

src/backend/commands/dbcommands.c
(1) function prototype location

The following prototype is currently located in the "non-export
function prototypes" section of the source file, but it's not static -
shouldn't it be in dbcommands.h?

+void RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+        ForkNumber forkNum, char relpersistence);

(2) CreateDirAndVersionFile()
Shouldn't the following code:

+ fd = OpenTransientFile(versionfile, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+ if (fd < 0 && errno == EEXIST && isRedo)
+   fd = OpenTransientFile(versionfile, O_RDWR | PG_BINARY);

actually be:

+ fd = OpenTransientFile(versionfile, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+ if (fd < 0 && errno == EEXIST && isRedo)
+   fd = OpenTransientFile(versionfile, O_WRONLY | O_TRUNC | PG_BINARY);

since we're only writing to that file descriptor and we want to
truncate the file if it already exists.

The current comment says "... open it in the write mode.", but should
say "... open it in write mode."

Also, shouldn't you be writing a newline (\n) after the
PG_MAJORVERSION ? (compare with code in initdb.c)
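
A rough standalone sketch of the suggested open sequence, using plain POSIX open() rather than OpenTransientFile(). The function name `write_version_file()` and its parameters are hypothetical, and error reporting is reduced to a return code.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Create the version file and write "<major_version>\n" into it.
 * Returns 0 on success, -1 on failure.
 */
static int
write_version_file(const char *path, const char *major_version, bool is_redo)
{
    char        buf[64];
    int         len;
    int         fd;

    assert(path != NULL && major_version != NULL);

    /* Normal case: the file must not exist yet. */
    fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);

    /* During replay it may already exist: reopen and discard old content. */
    if (fd < 0 && errno == EEXIST && is_redo)
        fd = open(path, O_WRONLY | O_TRUNC);
    if (fd < 0)
        return -1;

    /* Version string followed by a newline. */
    len = snprintf(buf, sizeof(buf), "%s\n", major_version);
    if (len < 0 || (size_t) len >= sizeof(buf))
    {
        close(fd);
        return -1;
    }

    /* Treat a short write as a failure. */
    if (write(fd, buf, (size_t) len) != (ssize_t) len)
    {
        close(fd);
        return -1;
    }

    return close(fd);
}
```

Without O_TRUNC, replaying over an existing file that holds a longer string would leave stale trailing bytes behind the new content.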

(3) GetDatabaseRelationList()
Shouldn't:

+  if (PageIsNew(page) || PageIsEmpty(page))
+    continue;

be:

+  if (PageIsNew(page) || PageIsEmpty(page))
+  {
+    UnlockReleaseBuffer(buf);
+    continue;
+  }
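
The underlying rule (release the buffer on every path that leaves the loop body, including the early `continue`) can be shown with a toy model. `ToyBuffer`, `lock_buffer()`, `unlock_release_buffer()` and `scan_pages()` are hypothetical stand-ins, not bufmgr APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a locked and pinned shared buffer. */
typedef struct ToyBuffer
{
    int         lock_count; /* > 0 while someone holds the buffer */
    bool        is_empty;   /* page has no tuples to look at */
    int         ntuples;
} ToyBuffer;

static void
lock_buffer(ToyBuffer *buf)
{
    buf->lock_count++;
}

static void
unlock_release_buffer(ToyBuffer *buf)
{
    assert(buf->lock_count > 0);
    buf->lock_count--;
}

/* Sum the tuples on all pages, skipping empty pages. */
static int
scan_pages(ToyBuffer *pages, int npages)
{
    int         total = 0;

    assert(pages != NULL);

    for (int i = 0; i < npages; i++)
    {
        ToyBuffer  *buf = &pages[i];

        lock_buffer(buf);

        if (buf->is_empty)
        {
            /* Release before skipping, otherwise the lock leaks. */
            unlock_release_buffer(buf);
            continue;
        }

        total += buf->ntuples;
        unlock_release_buffer(buf);
    }

    return total;
}
```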

Also, in the following code:

+  if (rnodelist == NULL)
+    rnodelist = list_make1(relinfo);
+  else
+    rnodelist = lappend(rnodelist, relinfo);

it should really be "== NIL" rather than "== NULL".
But in any case, that code can just be:

rnodelist = lappend(rnodelist, relinfo);

because lappend() will create a list if the first arg is NIL.
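
The same idiom in a self-contained form: an append function that accepts the empty list and allocates on first use, so callers need no special case. This is an analogue written for illustration, not PostgreSQL's List implementation; `IntList`, `EMPTY_LIST`, `int_list_append()` and `int_list_free()` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct IntList
{
    int         length;
    int         capacity;
    int        *items;
} IntList;

/* The empty list is represented by a null pointer. */
#define EMPTY_LIST ((IntList *) NULL)

/* Append value, creating the list if it is empty.  Returns the list. */
static IntList *
int_list_append(IntList *list, int value)
{
    if (list == EMPTY_LIST)
    {
        list = malloc(sizeof(IntList));
        if (list == NULL)
            abort();
        list->length = 0;
        list->capacity = 4;
        list->items = malloc(list->capacity * sizeof(int));
        if (list->items == NULL)
            abort();
    }
    else if (list->length == list->capacity)
    {
        int        *grown;

        grown = realloc(list->items, 2 * list->capacity * sizeof(int));
        if (grown == NULL)
            abort();
        list->items = grown;
        list->capacity *= 2;
    }

    assert(list->length < list->capacity);
    list->items[list->length++] = value;
    return list;
}

static void
int_list_free(IntList *list)
{
    if (list != EMPTY_LIST)
    {
        free(list->items);
        free(list);
    }
}
```

A caller then writes `list = int_list_append(list, x);` unconditionally, starting from `EMPTY_LIST`.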

(4) RelationCopyStorageUsingBuffer()

In the function comments, IMO it is better to use "APIs" instead of "apis".

Also, better to use "get" instead of "got" in the following comment:

+ /* If we got a cancel signal during the copy of the data, quit */

0007

(I think I prefer the first approach rather than this 2nd approach)

src/backend/commands/dbcommands.c
(1) createdb()
pfree(srcpath) seems to be missing, in the case that CopyDatabase() gets called.

(2) GetRelfileNodeFromFileName()
%s in sscanf() stores an unbounded number of characters into the
destination buffer and is considered potentially dangerous (it allows
a buffer overflow), especially here where FORKNAMECHARS is so small.

+ nmatch = sscanf(filename, "%u_%s", &relfilenode, forkname);

how about using the following instead in this case:

+ nmatch = sscanf(filename, "%u_%4s", &relfilenode, forkname);

(even if there were > 4 chars after the underscore, it would still
match and InvalidOid would be returned because nmatch==2)
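
A standalone sketch of bounded parsing along these lines. It goes a step further than the one-line change above by using %n to reject trailing characters, which is an addition for illustration only. `classify_relfile_name()` and `RelFileNameKind` are hypothetical names, FORKNAMECHARS is redefined locally, and segment suffixes such as ".1" are simply treated as invalid here.

```c
#include <assert.h>
#include <stdio.h>

#define FORKNAMECHARS 4         /* longest fork suffix, e.g. "init" */

typedef enum
{
    NAME_INVALID,               /* not a relation file name */
    NAME_MAIN_FORK,             /* "<relfilenode>" */
    NAME_OTHER_FORK             /* "<relfilenode>_<fork>" */
} RelFileNameKind;

static RelFileNameKind
classify_relfile_name(const char *filename, unsigned int *relfilenode)
{
    char        forkname[FORKNAMECHARS + 1];
    int         consumed = 0;

    assert(filename != NULL && relfilenode != NULL);

    /* Leading number; %n records how many characters were consumed. */
    if (sscanf(filename, "%u%n", relfilenode, &consumed) != 1)
        return NAME_INVALID;

    if (filename[consumed] == '\0')
        return NAME_MAIN_FORK;
    if (filename[consumed] != '_')
        return NAME_INVALID;

    /* Step past the underscore and parse the fork suffix. */
    filename += consumed + 1;
    consumed = 0;

    /* The width of 4 caps the store at FORKNAMECHARS bytes plus the NUL. */
    if (sscanf(filename, "%4s%n", forkname, &consumed) != 1)
        return NAME_INVALID;

    /* Anything left over means the suffix was longer than allowed. */
    return (filename[consumed] == '\0') ? NAME_OTHER_FORK : NAME_INVALID;
}
```

The field width keeps the write inside `forkname` no matter how long the input is; the %n check is what distinguishes "init" from an over-long suffix that merely starts with "init".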

Regards,
Greg Nancarrow
Fujitsu Australia

#44

Dilip Kumar

dilipbalaut@gmail.com

about 4 years ago

In reply to: Greg Nancarrow (#43)

6 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Nov 25, 2021 at 1:07 PM Greg Nancarrow <gregn4422@gmail.com> wrote:

On Tue, Oct 5, 2021 at 7:07 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Patch details:
0001 to 0006 implement approach 1.
0007 removes the pg_class scanning code and adds a directory scan instead.

I had a scan through the patches, though have not yet actually run any
tests to try to better gauge their benefit.
I do have some initial review comments though:

0003

src/backend/commands/tablecmds.c
(1) RelationCopyAllFork()
In the following comment:
+/*
+ * Copy source smgr all fork's data to the destination smgr.
+ */
Shouldn't it say "smgr relation"?
Also, you could additionally say ", using a specified fork data
copying function." or something like that, to account for the
additional argument.

0006

src/backend/commands/dbcommands.c
(1) function prototype location

The following prototype is currently located in the "non-export
function prototypes" section of the source file, but it's not static -
shouldn't it be in dbcommands.h?
+void RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+        ForkNumber forkNum, char relpersistence);
(2) CreateDirAndVersionFile()
Shouldn't the following code:
+ fd = OpenTransientFile(versionfile, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+ if (fd < 0 && errno == EEXIST && isRedo)
+   fd = OpenTransientFile(versionfile, O_RDWR | PG_BINARY);
actually be:
+ fd = OpenTransientFile(versionfile, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+ if (fd < 0 && errno == EEXIST && isRedo)
+   fd = OpenTransientFile(versionfile, O_WRONLY | O_TRUNC | PG_BINARY);
since we're only writing to that file descriptor and we want to
truncate the file if it already exists.

The current comment says "... open it in the write mode.", but should
say "... open it in write mode."

Also, shouldn't you be writing a newline (\n) after the
PG_MAJORVERSION ? (compare with code in initdb.c)

(3) GetDatabaseRelationList()
Shouldn't:
+  if (PageIsNew(page) || PageIsEmpty(page))
+    continue;
be:
+  if (PageIsNew(page) || PageIsEmpty(page))
+  {
+    UnlockReleaseBuffer(buf);
+    continue;
+  }
?

Also, in the following code:
+  if (rnodelist == NULL)
+    rnodelist = list_make1(relinfo);
+  else
+    rnodelist = lappend(rnodelist, relinfo);
it should really be "== NIL" rather than "== NULL".
But in any case, that code can just be:

rnodelist = lappend(rnodelist, relinfo);

because lappend() will create a list if the first arg is NIL.

(4) RelationCopyStorageUsingBuffer()

In the function comments, IMO it is better to use "APIs" instead of "apis".

Also, better to use "get" instead of "got" in the following comment:

+ /* If we got a cancel signal during the copy of the data, quit */

Thanks for the review and the many valuable comments. I have fixed
all of them except the one about this comment (/* If we got a cancel
signal during the copy of the data, quit */), because it looks fine to
me as it is. I have dropped 0007 from the patchset for now. I have also
included fixes for the comments given by John.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v6-0005-New-interface-to-lock-relation-id.patch (application/octet-stream)

From 6ffe40687e391518b9e37971781b5b7b559296a4 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 24 Sep 2021 18:29:17 +0530
Subject: [PATCH v6 5/6] New interface to lock relation id

Same as LockRelationOid, but it takes a LockRelId object as input
instead of a relation OID.  It therefore uses the database OID
passed in the LockRelId object rather than MyDatabaseId, which makes
it possible to lock a relation even when we are not connected to
its database.
---
 src/backend/storage/lmgr/lmgr.c | 28 ++++++++++++++++++++++++++++
 src/include/storage/lmgr.h      |  1 +
 2 files changed, 29 insertions(+)

diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 2db0424ad9..89d3ecbeb7 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -175,6 +175,34 @@ ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode)
 	return true;
 }
 
+/*
+ *		LockRelationId
+ *
+ * Lock, given a LockRelId.  Same as LockRelationOid but take LockRelId as an
+ * input.
+ */
+void
+LockRelationId(LockRelId *relid, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+	LOCALLOCK  *locallock;
+	LockAcquireResult res;
+
+	SET_LOCKTAG_RELATION(tag, relid->dbId, relid->relId);
+
+	res = LockAcquireExtended(&tag, lockmode, false, false, true, &locallock);
+
+	/*
+	 * Now that we have the lock, check for invalidation messages; see notes
+	 * in LockRelationOid.
+	 */
+	if (res != LOCKACQUIRE_ALREADY_CLEAR)
+	{
+		AcceptInvalidationMessages();
+		MarkLockClear(locallock);
+	}
+}
+
 /*
  *		UnlockRelationId
  *
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index b009559229..092ee934b4 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -38,6 +38,7 @@ extern void RelationInitLockInfo(Relation relation);
 
 /* Lock a relation */
 extern void LockRelationOid(Oid relid, LOCKMODE lockmode);
+extern void LockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern bool ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode);
 extern void UnlockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern void UnlockRelationOid(Oid relid, LOCKMODE lockmode);
-- 
2.23.0

v6-0003-Refactor-index_copy_data.patch (application/octet-stream)

From b181e676e0446560eabb28cf0cbf9024e98230e1 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 24 Sep 2021 18:13:25 +0530
Subject: [PATCH v6 3/6] Refactor index_copy_data

Make a separate interface for copying relation storage; it will be
used by a later patch to copy the database's relations.
---
 src/backend/commands/tablecmds.c | 61 +++++++++++++++++++-------------
 src/include/commands/tablecmds.h |  5 +++
 2 files changed, 42 insertions(+), 24 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 5e9cae26a0..25f897f4d6 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14237,21 +14237,15 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
 	return new_tablespaceoid;
 }
 
-static void
-index_copy_data(Relation rel, RelFileNode newrnode)
+/*
+ * Copy source smgr relation's all fork's data to the destination.
+ *
+ * copy_storage - storage copy function, which is passed by the caller.
+ */
+void
+RelationCopyAllFork(SMgrRelation src_smgr, SMgrRelation	dst_smgr,
+					char relpersistence, copy_relation_storage copy_storage)
 {
-	SMgrRelation dstrel;
-
-	dstrel = smgropen(newrnode, rel->rd_backend);
-
-	/*
-	 * Since we copy the file directly without looking at the shared buffers,
-	 * we'd better first flush out any pages of the source relation that are
-	 * in shared buffers.  We assume no new changes will be made while we are
-	 * holding exclusive lock on the rel.
-	 */
-	FlushRelationBuffers(rel);
-
 	/*
 	 * Create and copy all forks of the relation, and schedule unlinking of
 	 * old physical files.
@@ -14259,32 +14253,51 @@ index_copy_data(Relation rel, RelFileNode newrnode)
 	 * NOTE: any conflict in relfilenode value will be caught in
 	 * RelationCreateStorage().
 	 */
-	RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
+	RelationCreateStorage(dst_smgr->smgr_rnode.node, relpersistence);
 
 	/* copy main fork */
-	RelationCopyStorage(RelationGetSmgr(rel), dstrel, MAIN_FORKNUM,
-						rel->rd_rel->relpersistence);
+	copy_storage(src_smgr, dst_smgr, MAIN_FORKNUM, relpersistence);
 
 	/* copy those extra forks that exist */
 	for (ForkNumber forkNum = MAIN_FORKNUM + 1;
 		 forkNum <= MAX_FORKNUM; forkNum++)
 	{
-		if (smgrexists(RelationGetSmgr(rel), forkNum))
+		if (smgrexists(src_smgr, forkNum))
 		{
-			smgrcreate(dstrel, forkNum, false);
+			smgrcreate(dst_smgr, forkNum, false);
 
 			/*
 			 * WAL log creation if the relation is persistent, or this is the
 			 * init fork of an unlogged relation.
 			 */
-			if (RelationIsPermanent(rel) ||
-				(rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
+			if (relpersistence == RELPERSISTENCE_PERMANENT ||
+				(relpersistence == RELPERSISTENCE_UNLOGGED &&
 				 forkNum == INIT_FORKNUM))
-				log_smgrcreate(&newrnode, forkNum);
-			RelationCopyStorage(RelationGetSmgr(rel), dstrel, forkNum,
-								rel->rd_rel->relpersistence);
+				log_smgrcreate(&dst_smgr->smgr_rnode.node, forkNum);
+
+			/* Copy a fork's data, block by block. */
+			copy_storage(src_smgr, dst_smgr, forkNum, relpersistence);
 		}
 	}
+}
+
+static void
+index_copy_data(Relation rel, RelFileNode newrnode)
+{
+	SMgrRelation dstrel;
+
+	dstrel = smgropen(newrnode, rel->rd_backend);
+
+	/*
+	 * Since we copy the file directly without looking at the shared buffers,
+	 * we'd better first flush out any pages of the source relation that are
+	 * in shared buffers.  We assume no new changes will be made while we are
+	 * holding exclusive lock on the rel.
+	 */
+	FlushRelationBuffers(rel);
+
+	RelationCopyAllFork(RelationGetSmgr(rel), dstrel,
+						rel->rd_rel->relpersistence, RelationCopyStorage);
 
 	/* drop old relation, and close new one */
 	RelationDropStorage(rel);
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 336549cc5f..e0e0aa5aa0 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -19,10 +19,13 @@
 #include "catalog/objectaddress.h"
 #include "nodes/parsenodes.h"
 #include "storage/lock.h"
+#include "storage/smgr.h"
 #include "utils/relcache.h"
 
 struct AlterTableUtilityContext;	/* avoid including tcop/utility.h here */
 
+typedef void (*copy_relation_storage) (SMgrRelation src, SMgrRelation dst,
+									  ForkNumber forkNum, char relpersistence);
 
 extern ObjectAddress DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 									ObjectAddress *typaddress, const char *queryString);
@@ -42,6 +45,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
 
 extern Oid	AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
 
+extern void RelationCopyAllFork(SMgrRelation src_smgr, SMgrRelation	dst_smgr,
+								char relpersistence, copy_relation_storage copy_storage);
 extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
 										 Oid *oldschema);
 
-- 
2.23.0

v6-0002-Extend-relmap-interfaces.patch (application/octet-stream)

From 241d4a187db666a8f8d4b6755ba738a32bea4830 Mon Sep 17 00:00:00 2001
From: dilipkumar <dilipbalaut@gmail.com>
Date: Mon, 4 Oct 2021 13:50:44 +0530
Subject: [PATCH v6 2/6] Extend relmap interfaces

Add two new interfaces to the relmapper: 1) copy the relmap file
from one database path to another database path; 2) like
RelationMapOidToFilenode, look up a relfilenode, but for the
database at the given path instead of the database we are
connected to.

These interfaces are required by the next patch, which adds
support for WAL-logged CREATE DATABASE.
---
 src/backend/utils/cache/relmapper.c | 122 ++++++++++++++++++++++++----
 src/include/utils/relmapper.h       |   6 +-
 2 files changed, 112 insertions(+), 16 deletions(-)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index bb39632080..51f361cf64 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -141,7 +141,7 @@ static void read_relmap_file(char *mapfilename, RelMapFile *map,
 static void write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
 									   bool write_wal, bool send_sinval,
 									   bool preserve_files, Oid dbid, Oid tsid,
-									   const char *dbpath);
+									   const char *dbpath, bool create);
 static void load_relmap_file(bool shared, bool lock_held);
 static void write_relmap_file(bool shared, RelMapFile *newmap,
 							  bool write_wal, bool send_sinval, bool preserve_files,
@@ -255,6 +255,36 @@ RelationMapFilenodeToOid(Oid filenode, bool shared)
 	return InvalidOid;
 }
 
+/*
+ * RelationMapOidToFilenodeForDatabase
+ *
+ * Find relfilenode for the given relation id in the dbpath.  Returns
+ * InvalidOid if the relationId is not found in the relmap.
+ */
+Oid
+RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId)
+{
+	RelMapFile	map;
+	int			i;
+	char		mapfilename[MAXPGPATH];
+
+	/* Relmap file path for the given dbpath. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 dbpath, RELMAPPER_FILENAME);
+
+	/* Read the relmap file from the source database. */
+	read_relmap_file(mapfilename, &map, false);
+
+	/* Iterate over the relmap entries to find the input relation oid. */
+	for (i = 0; i < map.num_mappings; i++)
+	{
+		if (relationId == map.mappings[i].mapoid)
+			return map.mappings[i].mapfilenode;
+	}
+
+	return InvalidOid;
+}
+
 /*
  * RelationMapUpdateMap
  *
@@ -693,7 +723,43 @@ RestoreRelationMap(char *startAddress)
 }
 
 /*
- * read_relmap_file -- read data from given mapfilename file.
+ * CopyRelationMap
+ *
+ * Copy relmapfile from source db path to the destination db path and WAL log
+ * the operation.  This function is only called during the create database, so
+ * the destination database is not yet visible to anyone else, thus we don't
+ * need to acquire the relmap lock while updating the destination relmap.
+ */
+void
+CopyRelationMap(Oid dbid, Oid tsid, char *srcdbpath, char *dstdbpath)
+{
+	RelMapFile map;
+	char mapfilename[MAXPGPATH];
+
+	/* Relmap file path of the source database. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 srcdbpath, RELMAPPER_FILENAME);
+
+	/* Read the relmap file from the source database. */
+	read_relmap_file(mapfilename, &map, false);
+
+	/* Relmap file path of the destination database. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 dstdbpath, RELMAPPER_FILENAME);
+
+	/*
+	 * Write map contents into the destination database's relmap file.
+	 * write_relmap_file_internal, expects that the CRC should have been
+	 * computed and stored in the input map.  But, since we have read this map
+	 * from the source database and directly writing to the destination file
+	 * without updating it so we don't need to recompute it.
+	 */
+	write_relmap_file_internal(mapfilename, &map, true, false, true, dbid,
+							   tsid, dstdbpath, true);
+}
+
+/*
+ * read_relmap_file - read data from given mapfilename file.
  *
  * Because the map file is essential for access to core system catalogs,
  * failure to read it is a fatal error.
@@ -796,15 +862,18 @@ load_relmap_file(bool shared, bool lock_held)
 }
 
 /*
- * Helper function for write_relmap_file, Read comments atop write_relmap_file
- * for more details.  The CRC should be computed by the caller and stored in
- * the newmap.
+ * Helper function for write_relmap_file and CopyRelationMap, Read comments
+ * atop write_relmap_file for more details.  The CRC should be computed by the
+ * caller and stored in the newmap.
+ *
+ * Pass the create = true, if we are copying the relmap file during CREATE
+ * DATABASE command.
  */
 static void
 write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
 						   bool write_wal, bool send_sinval,
 						   bool preserve_files, Oid dbid, Oid tsid,
-						   const char *dbpath)
+						   const char *dbpath, bool create)
 {
 	int			fd;
 
@@ -830,6 +899,7 @@ write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
 		xlrec.dbid = dbid;
 		xlrec.tsid = tsid;
 		xlrec.nbytes = sizeof(RelMapFile);
+		xlrec.create = create;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&xlrec), MinSizeOfRelmapUpdate);
@@ -971,7 +1041,7 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 	/* Write the map to the relmap file. */
 	write_relmap_file_internal(mapfilename, newmap, write_wal,
 							   send_sinval, preserve_files, dbid, tsid,
-							   dbpath);
+							   dbpath, false);
 
 	/*
 	 * Success, update permanent copy.  During bootstrap, we might be working
@@ -1063,15 +1133,37 @@ relmap_redo(XLogReaderState *record)
 		 * Write out the new map and send sinval, but of course don't write a
 		 * new WAL entry.  There's no surrounding transaction to tell to
 		 * preserve files, either.
-		 *
-		 * There shouldn't be anyone else updating relmaps during WAL replay,
-		 * but grab the lock to interlock against load_relmap_file().
 		 */
-		LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
-		write_relmap_file((xlrec->dbid == InvalidOid), &newmap,
-						  false, true, false,
-						  xlrec->dbid, xlrec->tsid, dbpath);
-		LWLockRelease(RelationMappingLock);
+		if (!xlrec->create)
+		{
+			/*
+			 * There shouldn't be anyone else updating relmaps during WAL
+			 * replay, but grab the lock to interlock against
+			 * load_relmap_file().
+			 */
+			LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
+			write_relmap_file((xlrec->dbid == InvalidOid), &newmap,
+							false, true, false,
+							xlrec->dbid, xlrec->tsid, dbpath);
+			LWLockRelease(RelationMappingLock);
+		}
+		else
+		{
+			char		mapfilename[MAXPGPATH];
+
+			/* Construct the mapfilename. */
+			snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+					 dbpath, RELMAPPER_FILENAME);
+
+			/*
+			 * We don't need to take relmap lock because this wal is logged
+			 * while creating a new database, so there could be no one else
+			 * reading/writing the relmap file.
+			 */
+			write_relmap_file_internal(mapfilename, &newmap, false, false,
+									   false, xlrec->dbid, xlrec->tsid, dbpath,
+									   true);
+		}
 
 		pfree(dbpath);
 	}
diff --git a/src/include/utils/relmapper.h b/src/include/utils/relmapper.h
index c0d14daad9..4165f0990b 100644
--- a/src/include/utils/relmapper.h
+++ b/src/include/utils/relmapper.h
@@ -29,6 +29,7 @@ typedef struct xl_relmap_update
 	Oid			dbid;			/* database ID, or 0 for shared map */
 	Oid			tsid;			/* database's tablespace, or pg_global */
 	int32		nbytes;			/* size of relmap data */
+	bool		create;			/* true if creating new relmap */
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } xl_relmap_update;
 
@@ -39,6 +40,8 @@ extern Oid	RelationMapOidToFilenode(Oid relationId, bool shared);
 
 extern Oid	RelationMapFilenodeToOid(Oid relationId, bool shared);
 
+extern Oid RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId);
+
 extern void RelationMapUpdateMap(Oid relationId, Oid fileNode, bool shared,
 								 bool immediate);
 
@@ -62,7 +65,8 @@ extern void RelationMapInitializePhase3(void);
 extern Size EstimateRelationMapSpace(void);
 extern void SerializeRelationMap(Size maxSize, char *startAddress);
 extern void RestoreRelationMap(char *startAddress);
-
+extern void CopyRelationMap(Oid dbid, Oid tsid, char *srcdbpath,
+							char *dstdbpath);
 extern void relmap_redo(XLogReaderState *record);
 extern void relmap_desc(StringInfo buf, XLogReaderState *record);
 extern const char *relmap_identify(uint8 info);
-- 
2.23.0

v6-0004-Extend-bufmgr-interfaces.patch (application/octet-stream)

From ec040f140f6dcab3e89c2f20b094d539ae1ec54c Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 24 Sep 2021 18:23:39 +0530
Subject: [PATCH v6 4/6] Extend bufmgr interfaces

Extend the ReadBufferWithoutRelcache interface to take relpersistence
as an input, and extend DropDatabaseBuffers to take a tablespace OID as
input.
---
 src/backend/access/transam/xlogutils.c |  9 ++++++---
 src/backend/commands/dbcommands.c      |  9 +++------
 src/backend/storage/buffer/bufmgr.c    | 24 +++++++++++-------------
 src/include/storage/bufmgr.h           |  5 +++--
 4 files changed, 23 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index b33e0531ed..81c192f223 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -484,7 +484,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	{
 		/* page exists in file */
 		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-										   mode, NULL);
+										   mode, NULL,
+										   RELPERSISTENCE_PERMANENT);
 	}
 	else
 	{
@@ -509,7 +510,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				ReleaseBuffer(buffer);
 			}
 			buffer = ReadBufferWithoutRelcache(rnode, forknum,
-											   P_NEW, mode, NULL);
+											   P_NEW, mode, NULL,
+											   RELPERSISTENCE_PERMANENT);
 		}
 		while (BufferGetBlockNumber(buffer) < blkno);
 		/* Handle the corner case that P_NEW returns non-consecutive pages */
@@ -519,7 +521,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			ReleaseBuffer(buffer);
 			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-											   mode, NULL);
+											   mode, NULL,
+											   RELPERSISTENCE_PERMANENT);
 		}
 	}
 
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 029fab48df..1d963d8428 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -938,7 +938,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
 	 * is important to ensure that no remaining backend tries to write out a
 	 * dirty buffer to the dead database later...
 	 */
-	DropDatabaseBuffers(db_id);
+	DropDatabaseBuffers(db_id, InvalidOid);
 
 	/*
 	 * Tell the stats collector to forget it immediately, too.
@@ -1220,11 +1220,8 @@ movedb(const char *dbname, const char *tblspcname)
 	 * contain valid data again --- but they'd be missing any changes made in
 	 * the database while it was in the new tablespace.  In any case, freeing
 	 * buffers that should never be used again seems worth the cycles.
-	 *
-	 * Note: it'd be sufficient to get rid of buffers matching db_id and
-	 * src_tblspcoid, but bufmgr.c presently provides no API for that.
 	 */
-	DropDatabaseBuffers(db_id);
+	DropDatabaseBuffers(db_id, src_tblspcoid);
 
 	/*
 	 * Check for existence of files in the target directory, i.e., objects of
@@ -2201,7 +2198,7 @@ dbase_redo(XLogReaderState *record)
 		ReplicationSlotsDropDBSlots(xlrec->db_id);
 
 		/* Drop pages for this database that are in the shared buffer cache */
-		DropDatabaseBuffers(xlrec->db_id);
+		DropDatabaseBuffers(xlrec->db_id, InvalidOid);
 
 		/* Also, clean out any fsync requests that might be pending in md.c */
 		ForgetDatabaseSyncRequests(xlrec->db_id);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 08ebabfe96..ea3ebcc13c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -770,24 +770,17 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 /*
  * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
  *		a relcache entry for the relation.
- *
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay.  If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
  */
 Buffer
 ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
-						  BufferAccessStrategy strategy)
+						  BufferAccessStrategy strategy, char relpersistence)
 {
 	bool		hit;
 
 	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
 
-	Assert(InRecovery);
-
-	return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
+	return ReadBuffer_common(smgr, relpersistence, forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -797,7 +790,7 @@ ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
  *
  * *hit is set to true if the request was satisfied from shared buffer cache.
  */
-static Buffer
+Buffer
 ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 				  BlockNumber blockNum, ReadBufferMode mode,
 				  BufferAccessStrategy strategy, bool *hit)
@@ -3402,10 +3395,13 @@ FindAndDropRelFileNodeBuffers(RelFileNode rnode, ForkNumber forkNum,
  *		database, to avoid trying to flush data to disk when the directory
  *		tree no longer exists.  Implementation is pretty similar to
  *		DropRelFileNodeBuffers() which is for destroying just one relation.
+ *
+ *		If a valid tablespace oid is passed, drop only the buffers that match
+ *		both the db oid and the tablespace oid; otherwise match on db oid alone.
  * --------------------------------------------------------------------
  */
 void
-DropDatabaseBuffers(Oid dbid)
+DropDatabaseBuffers(Oid dbid, Oid tbsid)
 {
 	int			i;
 
@@ -3423,11 +3419,13 @@ DropDatabaseBuffers(Oid dbid)
 		 * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
 		 * and saves some cycles.
 		 */
-		if (bufHdr->tag.rnode.dbNode != dbid)
+		if (bufHdr->tag.rnode.dbNode != dbid ||
+			(OidIsValid(tbsid) && bufHdr->tag.rnode.spcNode != tbsid))
 			continue;
 
 		buf_state = LockBufHdr(bufHdr);
-		if (bufHdr->tag.rnode.dbNode == dbid)
+		if (bufHdr->tag.rnode.dbNode == dbid &&
+			(!OidIsValid(tbsid) || bufHdr->tag.rnode.spcNode == tbsid))
 			InvalidateBuffer(bufHdr);	/* releases spinlock */
 		else
 			UnlockBufHdr(bufHdr, buf_state);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index cfce23ecbc..237c6a9078 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -184,7 +184,8 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BufferAccessStrategy strategy);
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 										ForkNumber forkNum, BlockNumber blockNum,
-										ReadBufferMode mode, BufferAccessStrategy strategy);
+										ReadBufferMode mode, BufferAccessStrategy strategy,
+										char relpersistence);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
@@ -207,7 +208,7 @@ extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
 								   int nforks, BlockNumber *firstDelBlock);
 extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes);
-extern void DropDatabaseBuffers(Oid dbid);
+extern void DropDatabaseBuffers(Oid dbid, Oid tbsid);
 
 #define RelationGetNumberOfBlocks(reln) \
 	RelationGetNumberOfBlocksInFork(reln, MAIN_FORKNUM)
-- 
2.23.0
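
The buffer-matching rule that the patch above adds to DropDatabaseBuffers()
can be sketched in isolation. This is only a minimal illustration, not the
bufmgr code: DemoBufferTag, demo_tag_matches() and demo_count_matching() are
hypothetical names, Oid is a plain integer stand-in, and the sketch leaves out
the recheck under the buffer header spinlock that the real function performs
before invalidating a buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the PostgreSQL definitions (assumptions for this sketch). */
typedef uint32_t Oid;
#define InvalidOid		((Oid) 0)
#define OidIsValid(oid)	((oid) != InvalidOid)

/* Hypothetical, reduced buffer tag: only the fields the filter looks at. */
typedef struct DemoBufferTag
{
	Oid			spcNode;		/* tablespace oid */
	Oid			dbNode;			/* database oid */
} DemoBufferTag;

/*
 * A buffer matches if it belongs to the database and, when a valid
 * tablespace oid is given, also lives in that tablespace.
 */
bool
demo_tag_matches(const DemoBufferTag *tag, Oid dbid, Oid tbsid)
{
	if (tag->dbNode != dbid)
		return false;
	return !OidIsValid(tbsid) || tag->spcNode == tbsid;
}

/* Count how many tags the filter would select for dropping. */
int
demo_count_matching(const DemoBufferTag *tags, int ntags, Oid dbid, Oid tbsid)
{
	int			i;
	int			nmatch = 0;

	assert(tags != NULL && ntags >= 0);
	for (i = 0; i < ntags; i++)
	{
		if (demo_tag_matches(&tags[i], dbid, tbsid))
			nmatch++;
	}
	return nmatch;
}
```

Passing InvalidOid as the tablespace keeps the old "whole database" behaviour,
which is what dropdb() and dbase_redo() do in the patch, while movedb() passes
the source tablespace so that only buffers in that tablespace are selected.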

v6-0001-Refactor-relmap-load-and-relmap-write-functions.patch (application/octet-stream)

From eb27d159bc2aa41011b6352a514c7981a19a64dc Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Wed, 1 Sep 2021 14:06:29 +0530
Subject: [PATCH v6 1/6] Refactor relmap load and relmap write functions

Currently, write_relmap_file and load_relmap_file are tightly
coupled with shared_map and local_map.  As part of the higher
level patch set we need relmap read/write interfaces that are
not dependent upon shared_map and local_map, and we should be
able to pass map memory as an external parameter instead.
---
 src/backend/utils/cache/relmapper.c | 163 +++++++++++++++++-----------
 1 file changed, 99 insertions(+), 64 deletions(-)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index a6e38adce3..bb39632080 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -136,6 +136,12 @@ static void apply_map_update(RelMapFile *map, Oid relationId, Oid fileNode,
 							 bool add_okay);
 static void merge_map_updates(RelMapFile *map, const RelMapFile *updates,
 							  bool add_okay);
+static void read_relmap_file(char *mapfilename, RelMapFile *map,
+							 bool lock_held);
+static void write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
+									   bool write_wal, bool send_sinval,
+									   bool preserve_files, Oid dbid, Oid tsid,
+									   const char *dbpath);
 static void load_relmap_file(bool shared, bool lock_held);
 static void write_relmap_file(bool shared, RelMapFile *newmap,
 							  bool write_wal, bool send_sinval, bool preserve_files,
@@ -687,36 +693,19 @@ RestoreRelationMap(char *startAddress)
 }
 
 /*
- * load_relmap_file -- load data from the shared or local map file
+ * read_relmap_file -- read data from given mapfilename file.
  *
  * Because the map file is essential for access to core system catalogs,
  * failure to read it is a fatal error.
- *
- * Note that the local case requires DatabasePath to be set up.
  */
 static void
-load_relmap_file(bool shared, bool lock_held)
+read_relmap_file(char *mapfilename, RelMapFile *map, bool lock_held)
 {
-	RelMapFile *map;
-	char		mapfilename[MAXPGPATH];
 	pg_crc32c	crc;
 	int			fd;
 	int			r;
 
-	if (shared)
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
-				 RELMAPPER_FILENAME);
-		map = &shared_map;
-	}
-	else
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 DatabasePath, RELMAPPER_FILENAME);
-		map = &local_map;
-	}
-
-	/* Read data ... */
+	/* Open the relmap file for reading. */
 	fd = OpenTransientFile(mapfilename, O_RDONLY | PG_BINARY);
 	if (fd < 0)
 		ereport(FATAL,
@@ -779,62 +768,50 @@ load_relmap_file(bool shared, bool lock_held)
 }
 
 /*
- * Write out a new shared or local map file with the given contents.
- *
- * The magic number and CRC are automatically updated in *newmap.  On
- * success, we copy the data to the appropriate permanent static variable.
- *
- * If write_wal is true then an appropriate WAL message is emitted.
- * (It will be false for bootstrap and WAL replay cases.)
- *
- * If send_sinval is true then a SI invalidation message is sent.
- * (This should be true except in bootstrap case.)
- *
- * If preserve_files is true then the storage manager is warned not to
- * delete the files listed in the map.
+ * load_relmap_file -- load data from the shared or local map file
  *
- * Because this may be called during WAL replay when MyDatabaseId,
- * DatabasePath, etc aren't valid, we require the caller to pass in suitable
- * values.  The caller is also responsible for being sure no concurrent
- * map update could be happening.
+ * Note that the local case requires DatabasePath to be set up.
  */
 static void
-write_relmap_file(bool shared, RelMapFile *newmap,
-				  bool write_wal, bool send_sinval, bool preserve_files,
-				  Oid dbid, Oid tsid, const char *dbpath)
+load_relmap_file(bool shared, bool lock_held)
 {
-	int			fd;
-	RelMapFile *realmap;
+	RelMapFile *map;
 	char		mapfilename[MAXPGPATH];
 
-	/*
-	 * Fill in the overhead fields and update CRC.
-	 */
-	newmap->magic = RELMAPPER_FILEMAGIC;
-	if (newmap->num_mappings < 0 || newmap->num_mappings > MAX_MAPPINGS)
-		elog(ERROR, "attempt to write bogus relation mapping");
-
-	INIT_CRC32C(newmap->crc);
-	COMP_CRC32C(newmap->crc, (char *) newmap, offsetof(RelMapFile, crc));
-	FIN_CRC32C(newmap->crc);
-
-	/*
-	 * Open the target file.  We prefer to do this before entering the
-	 * critical section, so that an open() failure need not force PANIC.
-	 */
 	if (shared)
 	{
 		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
 				 RELMAPPER_FILENAME);
-		realmap = &shared_map;
+		map = &shared_map;
 	}
 	else
 	{
 		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 dbpath, RELMAPPER_FILENAME);
-		realmap = &local_map;
+				 DatabasePath, RELMAPPER_FILENAME);
+		map = &local_map;
 	}
 
+	/* Read data ... */
+	read_relmap_file(mapfilename, map, lock_held);
+}
+
+/*
+ * Helper function for write_relmap_file.  Read the comments atop write_relmap_file
+ * for more details.  The CRC should be computed by the caller and stored in
+ * the newmap.
+ */
+static void
+write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
+						   bool write_wal, bool send_sinval,
+						   bool preserve_files, Oid dbid, Oid tsid,
+						   const char *dbpath)
+{
+	int			fd;
+
+	/*
+	 * Open the target file.  We prefer to do this before entering the
+	 * critical section, so that an open() failure need not force PANIC.
+	 */
 	fd = OpenTransientFile(mapfilename, O_WRONLY | O_CREAT | PG_BINARY);
 	if (fd < 0)
 		ereport(ERROR,
@@ -934,6 +911,68 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		}
 	}
 
+	/* Critical section done */
+	if (write_wal)
+		END_CRIT_SECTION();
+}
+
+/*
+ * Write out a new shared or local map file with the given contents.
+ *
+ * The magic number and CRC are automatically updated in *newmap.  On
+ * success, we copy the data to the appropriate permanent static variable.
+ *
+ * If write_wal is true then an appropriate WAL message is emitted.
+ * (It will be false for bootstrap and WAL replay cases.)
+ *
+ * If send_sinval is true then a SI invalidation message is sent.
+ * (This should be true except in bootstrap case.)
+ *
+ * If preserve_files is true then the storage manager is warned not to
+ * delete the files listed in the map.
+ *
+ * Because this may be called during WAL replay when MyDatabaseId,
+ * DatabasePath, etc aren't valid, we require the caller to pass in suitable
+ * values.  The caller is also responsible for being sure no concurrent
+ * map update could be happening.
+ */
+static void
+write_relmap_file(bool shared, RelMapFile *newmap,
+				  bool write_wal, bool send_sinval, bool preserve_files,
+				  Oid dbid, Oid tsid, const char *dbpath)
+{
+	RelMapFile *realmap;
+	char		mapfilename[MAXPGPATH];
+
+	/*
+	 * Fill in the overhead fields and update CRC.
+	 */
+	newmap->magic = RELMAPPER_FILEMAGIC;
+	if (newmap->num_mappings < 0 || newmap->num_mappings > MAX_MAPPINGS)
+		elog(ERROR, "attempt to write bogus relation mapping");
+
+	INIT_CRC32C(newmap->crc);
+	COMP_CRC32C(newmap->crc, (char *) newmap, offsetof(RelMapFile, crc));
+	FIN_CRC32C(newmap->crc);
+
+	if (shared)
+	{
+		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
+				 RELMAPPER_FILENAME);
+		realmap = &shared_map;
+	}
+	else
+	{
+		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+				 dbpath, RELMAPPER_FILENAME);
+		realmap = &local_map;
+	}
+
+	/* Write the map to the relmap file. */
+	write_relmap_file_internal(mapfilename, newmap, write_wal,
+							   send_sinval, preserve_files, dbid, tsid,
+							   dbpath);
+
 	/*
 	 * Success, update permanent copy.  During bootstrap, we might be working
 	 * on the permanent copy itself, in which case skip the memcpy() to avoid
@@ -943,10 +982,6 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		memcpy(realmap, newmap, sizeof(RelMapFile));
 	else
 		Assert(!send_sinval);	/* must be bootstrapping */
-
-	/* Critical section done */
-	if (write_wal)
-		END_CRIT_SECTION();
 }
 
 /*
-- 
2.23.0
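
The shape of the refactoring above (a wrapper that fills in the overhead
fields and checksum, plus path-based helpers that know nothing about
shared_map or local_map) can be sketched roughly as follows. Everything here
is an assumption made for illustration: the DemoMapFile layout, the
DEMO_MAGIC value, the demo_* function names, and the toy checksum, which
stands in for the CRC-32C that relmapper.c actually uses. WAL logging, fsync,
sinval and the critical section are left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define DEMO_MAGIC			0x592717u	/* hypothetical magic number */
#define DEMO_MAX_MAPPINGS	4

/* Hypothetical map file layout; the checksum covers all preceding bytes. */
typedef struct DemoMapFile
{
	uint32_t	magic;
	int32_t		num_mappings;
	uint32_t	mappings[DEMO_MAX_MAPPINGS][2];	/* {oid, filenode} pairs */
	uint32_t	checksum;
} DemoMapFile;

/* Toy checksum over the bytes before the checksum field (not CRC-32C). */
uint32_t
demo_checksum(const DemoMapFile *map)
{
	const unsigned char *p = (const unsigned char *) map;
	uint32_t	sum = 0;
	size_t		i;

	for (i = 0; i < offsetof(DemoMapFile, checksum); i++)
		sum = sum * 31u + p[i];
	return sum;
}

/* Path-based writer: the caller must already have set magic and checksum. */
int
demo_write_map_internal(const char *path, const DemoMapFile *map)
{
	FILE	   *fp;
	size_t		nitems;

	assert(path != NULL && map != NULL);
	fp = fopen(path, "wb");
	if (fp == NULL)
		return -1;
	nitems = fwrite(map, sizeof(*map), 1, fp);
	if (fclose(fp) != 0 || nitems != 1)
		return -1;
	return 0;
}

/* Wrapper: validate, fill in the overhead fields, then delegate. */
int
demo_write_map(const char *path, DemoMapFile *map)
{
	assert(path != NULL && map != NULL);
	if (map->num_mappings < 0 || map->num_mappings > DEMO_MAX_MAPPINGS)
		return -1;
	map->magic = DEMO_MAGIC;
	map->checksum = demo_checksum(map);
	return demo_write_map_internal(path, map);
}

/* Path-based reader: load into caller-supplied memory and verify it. */
int
demo_read_map(const char *path, DemoMapFile *map)
{
	FILE	   *fp;
	size_t		nitems;

	assert(path != NULL && map != NULL);
	fp = fopen(path, "rb");
	if (fp == NULL)
		return -1;
	nitems = fread(map, sizeof(*map), 1, fp);
	fclose(fp);
	if (nitems != 1)
		return -1;
	if (map->magic != DEMO_MAGIC ||
		map->num_mappings < 0 || map->num_mappings > DEMO_MAX_MAPPINGS)
		return -1;
	return (map->checksum == demo_checksum(map)) ? 0 : -1;
}
```

Because the helpers take the file name and the map memory as parameters, a
caller that is not connected to a database can read or write another
database's map file, which is presumably what the later patches in this set
rely on when copying the relmap file.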

v6-0006-WAL-logged-CREATE-DATABASE.patch (application/octet-stream)

From 4a726359d9a8e870bb181bf6181fbf507f9616f2 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Tue, 5 Oct 2021 11:45:02 +0530
Subject: [PATCH v6 6/6] WAL logged CREATE DATABASE

Currently, CREATE DATABASE forces a checkpoint, then copies all the files,
then forces another checkpoint. The comments in the createdb() function
explain the reasons for this. The attached patch fixes this problem by making
create database completely WAL logged so that we can avoid the checkpoints.

This can also be useful for supporting TDE. For example, if we need different
encryption for the source and the target database then we cannot re-encrypt the
page data if we copy the whole directory.  But with this patch, we are copying
page by page so we have an opportunity to re-encrypt each page before copying it
to the target database.
---
 src/backend/access/rmgrdesc/dbasedesc.c |   3 +-
 src/backend/commands/dbcommands.c       | 679 ++++++++++++++++--------
 src/bin/pg_rewind/parsexlog.c           |   1 +
 src/include/commands/dbcommands_xlog.h  |   3 -
 4 files changed, 462 insertions(+), 224 deletions(-)

diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index 26609845aa..5010f72b2c 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -28,8 +28,7 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;
 
-		appendStringInfo(buf, "copy dir %u/%u to %u/%u",
-						 xlrec->src_tablespace_id, xlrec->src_db_id,
+		appendStringInfo(buf, "create dir %u/%u",
 						 xlrec->tablespace_id, xlrec->db_id);
 	}
 	else if (info == XLOG_DBASE_DROP)
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 1d963d8428..12f5abcd4d 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -45,13 +45,13 @@
 #include "commands/dbcommands_xlog.h"
 #include "commands/defrem.h"
 #include "commands/seclabel.h"
+#include "commands/tablecmds.h"
 #include "commands/tablespace.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/bgwriter.h"
 #include "replication/slot.h"
-#include "storage/copydir.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
@@ -62,6 +62,7 @@
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/pg_locale.h"
+#include "utils/relmapper.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
@@ -77,6 +78,19 @@ typedef struct
 	Oid			dest_tsoid;		/* tablespace we are trying to move to */
 } movedb_failure_params;
 
+/*
+ * When creating a database, we scan the pg_class of the source database to
+ * identify all the relations to be copied.  The structure is used for storing
+ * information about each relation of the source database.
+ */
+typedef struct CreateDBRelInfo
+{
+	RelFileNode		rnode;				/* physical relation identifier */
+	Oid				reloid;				/* relation oid */
+	char			relpersistence;		/* relation's persistence level */
+} CreateDBRelInfo;
+
+
 /* non-export function prototypes */
 static void createdb_failure_callback(int code, Datum arg);
 static void movedb(const char *dbname, const char *tblspcname);
@@ -91,6 +105,425 @@ static bool have_createdb_privilege(void);
 static void remove_dbtablespaces(Oid db_id);
 static bool check_db_file_conflict(Oid db_id);
 static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
+static void CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid,
+									bool isRedo);
+static List *GetDatabaseRelationList(Oid srctbid, Oid srcdbid, char *srcpath);
+static void RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+									ForkNumber forkNum, char relpersistence);
+static void CopyDatabase(Oid src_dboid, Oid dboid, Oid src_tsid, Oid dst_tsid);
+
+/*
+ * CreateDirAndVersionFile - Create database directory and write out the
+ *							 PG_VERSION file in the database path.
+ *
+ * If isRedo is true, it's okay for the database directory to exist already.
+ *
+ * We can directly write PG_MAJORVERSION in the version file instead of copying
+ * from the source database file because these two must be the same.
+ */
+static void
+CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
+{
+	int		fd;
+	char	versionfile[MAXPGPATH];
+	StringInfoData	buf;
+
+	/* Prepare version data before starting a critical section. */
+	initStringInfo(&buf);
+	appendStringInfo(&buf, "%s\n", PG_MAJORVERSION);
+
+	/* If we are not in WAL replay then write the WAL. */
+	if (!isRedo)
+	{
+		xl_dbase_create_rec xlrec;
+		XLogRecPtr	lsn;
+
+		/* Now errors are fatal ... */
+		START_CRIT_SECTION();
+
+		xlrec.db_id = dbid;
+		xlrec.tablespace_id = tsid;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), sizeof(xl_dbase_create_rec));
+
+		lsn = XLogInsert(RM_DBASE_ID, XLOG_DBASE_CREATE);
+
+		/* As always, WAL must hit the disk before the data update does. */
+		XLogFlush(lsn);
+	}
+
+	/* Create database directory. */
+	if (MakePGDirectory(dbpath) < 0)
+	{
+		/* Failure other than already exists or not in WAL replay? */
+		if (errno != EEXIST || !isRedo)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create directory \"%s\": %m", dbpath)));
+	}
+
+	/*
+	 * Create PG_VERSION file in the database path.  If the file already exists
+	 * and we are in WAL replay then try again to open it in write mode.
+	 */
+	snprintf(versionfile, sizeof(versionfile), "%s/%s", dbpath, "PG_VERSION");
+
+	fd = OpenTransientFile(versionfile, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	if (fd < 0 && errno == EEXIST && isRedo)
+		fd = OpenTransientFile(versionfile, O_WRONLY | PG_BINARY);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m", versionfile)));
+
+	/* Write PG_MAJORVERSION in the PG_VERSION file. */
+	pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_WRITE);
+	errno = 0;
+	if ((int) write(fd, buf.data, buf.len) != buf.len)
+	{
+		/* If write didn't set errno, assume problem is no disk space. */
+		if (errno == 0)
+			errno = ENOSPC;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", versionfile)));
+	}
+	pgstat_report_wait_end();
+
+	/* Close the version file. */
+	CloseTransientFile(fd);
+
+	/* Critical section done. */
+	if (!isRedo)
+		END_CRIT_SECTION();
+}
+
+/*
+ * GetDatabaseRelationList - Get relfilenode list to be copied.
+ *
+ * Iterate over each block of the pg_class relation.  From there, we will check
+ * all the visible tuples in order to get a list of all the valid relfilenodes
+ * in the source database that should be copied to the target database.
+ */
+static List *
+GetDatabaseRelationList(Oid tbid, Oid dbid, char *srcpath)
+{
+	SMgrRelation	rd_smgr;
+	RelFileNode		rnode;
+	BlockNumber		nblocks;
+	BlockNumber		blkno;
+	OffsetNumber	offnum;
+	OffsetNumber	maxoff;
+	Buffer			buf;
+	Oid				relfilenode;
+	Page			page;
+	List		   *rnodelist = NIL;
+	HeapTupleData	tuple;
+	Form_pg_class	classForm;
+	LockRelId		relid;
+	BufferAccessStrategy bstrategy;
+
+	/* Get pg_class relfilenode. */
+	relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+													  RelationRelationId);
+	/*
+	 * We are going to read the buffers associated with the pg_class relation.
+	 * Thus, acquire the relation-level lock before starting the scan.  As we are
+	 * not connected to the database, we cannot use relation_open directly, so
+	 * we have to lock using relation id.
+	 */
+	relid.dbId = dbid;
+	relid.relId = RelationRelationId;
+	LockRelationId(&relid, AccessShareLock);
+
+	/* Prepare a relnode for pg_class relation. */
+	rnode.spcNode = tbid;
+	rnode.dbNode = dbid;
+	rnode.relNode = relfilenode;
+
+	/*
+	 * We are not connected to the source database so open the pg_class
+	 * relation at the smgr level and get the block count.
+	 */
+	rd_smgr = smgropen(rnode, InvalidBackendId);
+	nblocks = smgrnblocks(rd_smgr, MAIN_FORKNUM);
+
+	/*
+	 * We're going to read the whole pg_class so better to use bulk-read buffer
+	 * access strategy.
+	 */
+	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	/* Iterate over each block on the pg_class relation. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		/*
+		 * We are not connected to the source database so directly use the lower
+		 * level bufmgr interface which operates on the rnode.
+		 */
+		buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
+										RBM_NORMAL, bstrategy,
+										RELPERSISTENCE_PERMANENT);
+
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buf);
+		if (PageIsNew(page) || PageIsEmpty(page))
+		{
+			UnlockReleaseBuffer(buf);
+			continue;
+		}
+
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		/* Iterate over each tuple on the page. */
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid;
+
+			itemid = PageGetItemId(page, offnum);
+
+			/* Nothing to do if slot is empty or already dead. */
+			if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid) ||
+				ItemIdIsRedirected(itemid))
+				continue;
+
+			Assert(ItemIdIsNormal(itemid));
+			ItemPointerSet(&(tuple.t_self), blkno, offnum);
+
+			tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+			tuple.t_len = ItemIdGetLength(itemid);
+			tuple.t_tableOid = RelationRelationId;
+
+			/*
+			 * If the tuple is visible then add its relfilenode info to the
+			 * list.
+			 */
+			if (HeapTupleSatisfiesVisibility(&tuple, GetActiveSnapshot(), buf))
+			{
+				Oid				relfilenode = InvalidOid;
+				CreateDBRelInfo   *relinfo;
+
+				classForm = (Form_pg_class) GETSTRUCT(&tuple);
+
+				/* We don't need to copy the shared objects to the target. */
+				if (classForm->reltablespace == GLOBALTABLESPACE_OID)
+					continue;
+
+				/*
+				 * If the object doesn't have the storage then nothing to be
+				 * done for that object so just ignore it.
+				 */
+				if (!RELKIND_HAS_STORAGE(classForm->relkind))
+					continue;
+
+				/*
+				 * If relfilenode is valid then directly use it.  Otherwise,
+				 * consult the relmapper for the mapped relation.
+				 */
+				if (OidIsValid(classForm->relfilenode))
+					relfilenode = classForm->relfilenode;
+				else
+					relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+													classForm->oid);
+
+				/* We must have a valid relfilenode oid. */
+				Assert(OidIsValid(relfilenode));
+
+				/* Prepare a rel info element and add it to the list. */
+				relinfo = (CreateDBRelInfo *) palloc(sizeof(CreateDBRelInfo));
+				if (OidIsValid(classForm->reltablespace))
+					relinfo->rnode.spcNode = classForm->reltablespace;
+				else
+					relinfo->rnode.spcNode = tbid;
+
+				relinfo->rnode.dbNode = dbid;
+				relinfo->rnode.relNode = relfilenode;
+				relinfo->reloid = classForm->oid;
+				relinfo->relpersistence = classForm->relpersistence;
+
+				/* Add it to the list. */
+				rnodelist = lappend(rnodelist, relinfo);
+			}
+		}
+
+		/* Release the buffer lock. */
+		UnlockReleaseBuffer(buf);
+	}
+
+	/* Release the lock. */
+	UnlockRelationId(&relid, AccessShareLock);
+
+	return rnodelist;
+}
+
+/*
+ * RelationCopyStorageUsingBuffer - Copy fork's data using bufmgr.
+ *
+ * Same as RelationCopyStorage but instead of using smgrread and smgrextend
+ * this will copy using bufmgr APIs.
+ */
+static void
+RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+							   ForkNumber forkNum, char relpersistence)
+{
+	Buffer		srcBuf;
+	Buffer		dstBuf;
+	Page		srcPage;
+	Page		dstPage;
+	bool		use_wal;
+	bool		copying_initfork;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	BufferAccessStrategy bstrategy_src;
+	BufferAccessStrategy bstrategy_dst;
+
+	/* Refer to the comments in RelationCopyStorage. */
+	copying_initfork = relpersistence == RELPERSISTENCE_UNLOGGED &&
+		forkNum == INIT_FORKNUM;
+	use_wal = XLogIsNeeded() &&
+		(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
+
+	/* Get number of blocks in the source relation. */
+	nblocks = smgrnblocks(src, forkNum);
+
+	/*
+	 * We are going to copy whole relation from the source to the destination
+	 * so use BAS_BULKREAD strategy for the source relation and BAS_BULKWRITE
+	 * strategy for the destination relation.
+	 */
+	bstrategy_src = GetAccessStrategy(BAS_BULKREAD);
+	bstrategy_dst = GetAccessStrategy(BAS_BULKWRITE);
+
+	/* Iterate over each block of the source relation file. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		/* If we got a cancel signal during the copy of the data, quit */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Read block from source relation. */
+		srcBuf = ReadBufferWithoutRelcache(src->smgr_rnode.node, forkNum,
+										   blkno, RBM_NORMAL, bstrategy_src,
+										   relpersistence);
+		srcPage = BufferGetPage(srcBuf);
+		if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
+		{
+			ReleaseBuffer(srcBuf);
+			continue;
+		}
+
+		/* Use P_NEW to extend the relation. */
+		dstBuf = ReadBufferWithoutRelcache(dst->smgr_rnode.node, forkNum,
+										   P_NEW, RBM_NORMAL, bstrategy_dst,
+										   relpersistence);
+		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
+
+		START_CRIT_SECTION();
+
+		/* Initialize the page and write the data. */
+		dstPage = BufferGetPage(dstBuf);
+		PageInit(dstPage, BufferGetPageSize(dstBuf), 0);
+		memcpy(dstPage, srcPage, BLCKSZ);
+		MarkBufferDirty(dstBuf);
+
+		/* WAL-log the copied page. */
+		if (use_wal)
+			log_newpage_buffer(dstBuf, true);
+
+		END_CRIT_SECTION();
+
+		UnlockReleaseBuffer(dstBuf);
+		ReleaseBuffer(srcBuf);
+	}
+}
+
+/*
+ * CopyDatabase - Copy source database to the target database.
+ *
+ * Create target database directory and copy data files from the source database
+ * to the target database, block by block and WAL log all the operations.
+ */
+static void
+CopyDatabase(Oid src_dboid, Oid dst_dboid, Oid src_tsid, Oid dst_tsid)
+{
+	char	   *srcpath;
+	char	   *dstpath;
+	List	   *rnodelist = NULL;
+	ListCell   *cell;
+	LockRelId	relid;
+	RelFileNode	srcrnode;
+	RelFileNode	dstrnode;
+	CreateDBRelInfo	*relinfo;
+
+	/* Get the source database path. */
+	srcpath = GetDatabasePath(src_dboid, src_tsid);
+
+	/* Get the destination database path. */
+	dstpath = GetDatabasePath(dst_dboid, dst_tsid);
+
+	/* Create database directory and write PG_VERSION file. */
+	CreateDirAndVersionFile(dstpath, dst_dboid, dst_tsid, false);
+
+	/* Copy relmap file from source database to the destination database. */
+	CopyRelationMap(dst_dboid, dst_tsid, srcpath, dstpath);
+
+	/* Get the list of all valid relfilenodes from the source database. */
+	rnodelist = GetDatabaseRelationList(src_tsid, src_dboid, srcpath);
+	Assert(rnodelist != NIL);
+
+	/*
+	 * The database id is common to all the relations, so set it before
+	 * entering the loop.
+	 */
+	relid.dbId = src_dboid;
+
+	/*
+	 * Iterate over each relfilenode and copy the relation data block by block
+	 * from source database to the destination database.
+	 */
+	foreach(cell, rnodelist)
+	{
+		SMgrRelation	src_smgr;
+		SMgrRelation	dst_smgr;
+
+		relinfo = lfirst(cell);
+		srcrnode = relinfo->rnode;
+
+		/*
+		 * If the relation is from the default tablespace then we need to
+		 * create it in the destination db's default tablespace.  Otherwise,
+		 * we need to create in the same tablespace as it is in the source
+		 * database.
+		 */
+		if (srcrnode.spcNode != src_tsid)
+			dstrnode.spcNode = srcrnode.spcNode;
+		else
+			dstrnode.spcNode = dst_tsid;
+
+		dstrnode.dbNode = dst_dboid;
+		dstrnode.relNode = srcrnode.relNode;
+
+		/* Acquire a lock on the relation before starting the copy. */
+		relid.relId = relinfo->reloid;
+		LockRelationId(&relid, AccessShareLock);
+
+		/* Open the source and the destination relation at smgr level. */
+		src_smgr = smgropen(srcrnode, InvalidBackendId);
+		dst_smgr = smgropen(dstrnode, InvalidBackendId);
+
+		/* Copy relation storage from source to the destination. */
+		RelationCopyAllFork(src_smgr, dst_smgr, relinfo->relpersistence,
+							RelationCopyStorageUsingBuffer);
+
+		/* Release the lock. */
+		UnlockRelationId(&relid, AccessShareLock);
+	}
+
+	list_free_deep(rnodelist);
+}
 
 
 /*
@@ -99,8 +532,6 @@ static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
 Oid
 createdb(ParseState *pstate, const CreatedbStmt *stmt)
 {
-	TableScanDesc scan;
-	Relation	rel;
 	Oid			src_dboid;
 	Oid			src_owner;
 	int			src_encoding = -1;
@@ -562,19 +993,6 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	/* Post creation hook for new database */
 	InvokeObjectPostCreateHook(DatabaseRelationId, dboid, 0);
 
-	/*
-	 * Force a checkpoint before starting the copy. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, if a checkpoint
-	 * happened while we're copying files, a file might be deleted just when
-	 * we're about to copy it, causing the lstat() call in copydir() to fail
-	 * with ENOENT.
-	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
-
 	/*
 	 * Once we start copying subdirectories, we need to be able to clean 'em
 	 * up if we fail.  Use an ENSURE block to make sure this happens.  (This
@@ -587,115 +1005,16 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	PG_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
-		/*
-		 * Iterate through all tablespaces of the template database, and copy
-		 * each one to the new database.
-		 */
-		rel = table_open(TableSpaceRelationId, AccessShareLock);
-		scan = table_beginscan_catalog(rel, 0, NULL);
-		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
-		{
-			Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
-			Oid			srctablespace = spaceform->oid;
-			Oid			dsttablespace;
-			char	   *srcpath;
-			char	   *dstpath;
-			struct stat st;
-
-			/* No need to copy global tablespace */
-			if (srctablespace == GLOBALTABLESPACE_OID)
-				continue;
-
-			srcpath = GetDatabasePath(src_dboid, srctablespace);
-
-			if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
-				directory_is_empty(srcpath))
-			{
-				/* Assume we can ignore it */
-				pfree(srcpath);
-				continue;
-			}
-
-			if (srctablespace == src_deftablespace)
-				dsttablespace = dst_deftablespace;
-			else
-				dsttablespace = srctablespace;
-
-			dstpath = GetDatabasePath(dboid, dsttablespace);
-
-			/*
-			 * Copy this subdirectory to the new location
-			 *
-			 * We don't need to copy subdirectories
-			 */
-			copydir(srcpath, dstpath, false);
-
-			/* Record the filesystem change in XLOG */
-			{
-				xl_dbase_create_rec xlrec;
-
-				xlrec.db_id = dboid;
-				xlrec.tablespace_id = dsttablespace;
-				xlrec.src_db_id = src_dboid;
-				xlrec.src_tablespace_id = srctablespace;
-
-				XLogBeginInsert();
-				XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-				(void) XLogInsert(RM_DBASE_ID,
-								  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-			}
-		}
-		table_endscan(scan);
-		table_close(rel, AccessShareLock);
-
-		/*
-		 * We force a checkpoint before committing.  This effectively means
-		 * that committed XLOG_DBASE_CREATE operations will never need to be
-		 * replayed (at least not in ordinary crash recovery; we still have to
-		 * make the XLOG entry for the benefit of PITR operations). This
-		 * avoids two nasty scenarios:
-		 *
-		 * #1: When PITR is off, we don't XLOG the contents of newly created
-		 * indexes; therefore the drop-and-recreate-whole-directory behavior
-		 * of DBASE_CREATE replay would lose such indexes.
-		 *
-		 * #2: Since we have to recopy the source database during DBASE_CREATE
-		 * replay, we run the risk of copying changes in it that were
-		 * committed after the original CREATE DATABASE command but before the
-		 * system crash that led to the replay.  This is at least unexpected
-		 * and at worst could lead to inconsistencies, eg duplicate table
-		 * names.
-		 *
-		 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
-		 *
-		 * In PITR replay, the first of these isn't an issue, and the second
-		 * is only a risk if the CREATE DATABASE and subsequent template
-		 * database change both occur while a base backup is being taken.
-		 * There doesn't seem to be much we can do about that except document
-		 * it as a limitation.
-		 *
-		 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
-		 * we can avoid this.
-		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
-
-		/*
-		 * Close pg_database, but keep lock till commit.
-		 */
-		table_close(pg_database_rel, NoLock);
-
-		/*
-		 * Force synchronous commit, thus minimizing the window between
-		 * creation of the database files and committal of the transaction. If
-		 * we crash before committing, we'll have a DB that's taking up disk
-		 * space but is not in pg_database, which is not good.
-		 */
-		ForceSyncCommit();
+		CopyDatabase(src_dboid, dboid, src_deftablespace, dst_deftablespace);
 	}
 	PG_END_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 								PointerGetDatum(&fparms));
 
+	/*
+	 * Close pg_database, but keep lock till commit.
+	 */
+	table_close(pg_database_rel, NoLock);
+
 	return dboid;
 }
 
@@ -1195,34 +1514,6 @@ movedb(const char *dbname, const char *tblspcname)
 	src_dbpath = GetDatabasePath(db_id, src_tblspcoid);
 	dst_dbpath = GetDatabasePath(db_id, dst_tblspcoid);
 
-	/*
-	 * Force a checkpoint before proceeding. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, the check for existing
-	 * files in the target directory might fail unnecessarily, not to mention
-	 * that the copy might fail due to source files getting deleted under it.
-	 * On Windows, this also ensures that background procs don't hold any open
-	 * files, which would cause rmdir() to fail.
-	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
-
-	/*
-	 * Now drop all buffers holding data of the target database; they should
-	 * no longer be dirty so DropDatabaseBuffers is safe.
-	 *
-	 * It might seem that we could just let these buffers age out of shared
-	 * buffers naturally, since they should not get referenced anymore.  The
-	 * problem with that is that if the user later moves the database back to
-	 * its original tablespace, any still-surviving buffers would appear to
-	 * contain valid data again --- but they'd be missing any changes made in
-	 * the database while it was in the new tablespace.  In any case, freeing
-	 * buffers that should never be used again seems worth the cycles.
-	 */
-	DropDatabaseBuffers(db_id, src_tblspcoid);
-
 	/*
 	 * Check for existence of files in the target directory, i.e., objects of
 	 * this database that are already in the target tablespace.  We can't
@@ -1268,28 +1559,7 @@ movedb(const char *dbname, const char *tblspcname)
 	PG_ENSURE_ERROR_CLEANUP(movedb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
-		/*
-		 * Copy files from the old tablespace to the new one
-		 */
-		copydir(src_dbpath, dst_dbpath, false);
-
-		/*
-		 * Record the filesystem change in XLOG
-		 */
-		{
-			xl_dbase_create_rec xlrec;
-
-			xlrec.db_id = db_id;
-			xlrec.tablespace_id = dst_tblspcoid;
-			xlrec.src_db_id = db_id;
-			xlrec.src_tablespace_id = src_tblspcoid;
-
-			XLogBeginInsert();
-			XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-			(void) XLogInsert(RM_DBASE_ID,
-							  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-		}
+		CopyDatabase(db_id, db_id, src_tblspcoid, dst_tblspcoid);
 
 		/*
 		 * Update the database's pg_database tuple
@@ -1322,22 +1592,6 @@ movedb(const char *dbname, const char *tblspcname)
 
 		systable_endscan(sysscan);
 
-		/*
-		 * Force another checkpoint here.  As in CREATE DATABASE, this is to
-		 * ensure that we don't have to replay a committed XLOG_DBASE_CREATE
-		 * operation, which would cause us to lose any unlogged operations
-		 * done in the new DB tablespace before the next checkpoint.
-		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
-
-		/*
-		 * Force synchronous commit, thus minimizing the window between
-		 * copying the database files and committal of the transaction. If we
-		 * crash before committing, we'll leave an orphaned set of files on
-		 * disk, which is not fatal but not good either.
-		 */
-		ForceSyncCommit();
-
 		/*
 		 * Close pg_database, but keep lock till commit.
 		 */
@@ -1346,6 +1600,21 @@ movedb(const char *dbname, const char *tblspcname)
 	PG_END_ENSURE_ERROR_CLEANUP(movedb_failure_callback,
 								PointerGetDatum(&fparms));
 
+	/*
+	 * Now drop all buffers holding data of the target database for the old
+	 * tablespace oid; we have already copied all the data to the new
+	 * tablespace so we no longer require the old buffers.
+	 *
+	 * It might seem that we could just let these buffers age out of shared
+	 * buffers naturally, since they should not get referenced anymore.  The
+	 * problem with that is that if the user later moves the database back to
+	 * its original tablespace, any still-surviving buffers would appear to
+	 * contain valid data again --- but they'd be missing any changes made in
+	 * the database while it was in the new tablespace.  In any case, freeing
+	 * buffers that should never be used again seems worth the cycles.
+	 */
+	DropDatabaseBuffers(db_id, src_tblspcoid);
+
 	/*
 	 * Commit the transaction so that the pg_database update is committed. If
 	 * we crash while removing files, the database won't be corrupt, we'll
@@ -2138,39 +2407,11 @@ dbase_redo(XLogReaderState *record)
 	if (info == XLOG_DBASE_CREATE)
 	{
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
-		char	   *src_path;
-		char	   *dst_path;
-		struct stat st;
-
-		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
-		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		char	   *dbpath;
 
-		/*
-		 * Our theory for replaying a CREATE is to forcibly drop the target
-		 * subdirectory if present, then re-copy the source data. This may be
-		 * more work than needed, but it is simple to implement.
-		 */
-		if (stat(dst_path, &st) == 0 && S_ISDIR(st.st_mode))
-		{
-			if (!rmtree(dst_path, true))
-				/* If this failed, copydir() below is going to error. */
-				ereport(WARNING,
-						(errmsg("some useless files may be left behind in old database directory \"%s\"",
-								dst_path)));
-		}
-
-		/*
-		 * Force dirty buffers out to disk, to ensure source database is
-		 * up-to-date for the copy.
-		 */
-		FlushDatabaseBuffers(xlrec->src_db_id);
-
-		/*
-		 * Copy this subdirectory to the new location
-		 *
-		 * We don't need to copy subdirectories
-		 */
-		copydir(src_path, dst_path, false);
+		dbpath = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		CreateDirAndVersionFile(dbpath, xlrec->db_id, xlrec->tablespace_id,
+								true);
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 436df54120..a68f6c732b 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -23,6 +23,7 @@
 #include "fe_utils/archive.h"
 #include "filemap.h"
 #include "pg_rewind.h"
+#include "utils/relmapper.h"
 
 /*
  * RmgrNames is an array of resource manager names, to make error messages
diff --git a/src/include/commands/dbcommands_xlog.h b/src/include/commands/dbcommands_xlog.h
index f5ed762677..21dc58ea5d 100644
--- a/src/include/commands/dbcommands_xlog.h
+++ b/src/include/commands/dbcommands_xlog.h
@@ -23,11 +23,8 @@
 
 typedef struct xl_dbase_create_rec
 {
-	/* Records copying of a single subdirectory incl. contents */
 	Oid			db_id;
 	Oid			tablespace_id;
-	Oid			src_db_id;
-	Oid			src_tablespace_id;
 } xl_dbase_create_rec;
 
 typedef struct xl_dbase_drop_rec
-- 
2.23.0

#45

Greg Nancarrow

gregn4422@gmail.com

about 4 years ago

In reply to: Dilip Kumar (#44)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Nov 25, 2021 at 10:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

> Thanks for the review and many valuable comments, I have fixed all of
> them except this comment (/* If we got a cancel signal during the copy
> of the data, quit */) because this looks fine to me. 0007, I have
> dropped from the patchset for now. I have also included fixes for
> comments given by John.

Any progress/results yet on testing against a large database (as
suggested by John Naylor) and multiple tablespaces?

Thanks for the patch updates.
I have some additional minor comments:

0002

(1) Tidy patch comment

I suggest minor tidying of the patch comment, as follows:

Support new interfaces in relmapper:
1) Support copying the relmap file from one database path to
another database path.
2) Like RelationMapOidToFilenode, provide another interface
that does the same lookup for the input database path instead
of for the database we are connected to.

These interfaces are required by the next patch, which supports
WAL-logged CREATE DATABASE.

0003

src/include/commands/tablecmds.h
(1) typedef void (*copy_relation_storage) ...

The new typename "copy_relation_storage" needs to be added to
src/tools/pgindent/typedefs.list

0006

src/backend/commands/dbcommands.c
(1) CreateDirAndVersionFile

After writing to the file, you should probably pfree(buf.data), right?
Actually, I don't think StringInfo (dynamic string allocation) is
needed here, since the version string is so short, so why not just use
a local "char buf[16]" buffer and snprintf() the
PG_MAJORVERSION+newline into that?

Also (as mentioned in my first review), shouldn't the "O_TRUNC" flag
also be specified when OpenTransientFile() is tried a 2nd time
because of errno==EEXIST on the 1st attempt? (Otherwise, if the
existing file already contained something, stale data from it could
remain in the file after the write.)

src/backend/commands/dbcommands.c
(2) typedef struct CreateDBRelInfo ... CreateDBRelInfo

The new typename "CreateDBRelInfo" needs to be added to
src/tools/pgindent/typedefs.list

src/bin/pg_rewind/parsexlog.c
(3) Include additional header file

It seems that the following additional header file is not needed to
compile the source file:

+#include "utils/relmapper.h"
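
For 0006 (1), the two suggestions for CreateDirAndVersionFile (a small
fixed buffer instead of StringInfo, and O_TRUNC on the retry) could look
roughly like the sketch below. It is only an illustration under
assumptions: it uses plain POSIX open() rather than OpenTransientFile(),
DEMO_MAJORVERSION stands in for PG_MAJORVERSION, and write_version_file
is a hypothetical helper, not code from the patch.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-in for PostgreSQL's PG_MAJORVERSION macro (assumed value). */
#define DEMO_MAJORVERSION "15"

/*
 * Hypothetical helper: write "<major version>\n" to versionfile.
 * Returns 0 on success, -1 on failure.
 */
static int
write_version_file(const char *versionfile)
{
	char		buf[16];
	int			len;
	int			fd;

	/* The version string is short, so a small local buffer is enough. */
	len = snprintf(buf, sizeof(buf), "%s\n", DEMO_MAJORVERSION);
	if (len < 0 || (size_t) len >= sizeof(buf))
		return -1;

	/* First attempt: the file must not exist yet. */
	fd = open(versionfile, O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (fd < 0 && errno == EEXIST)
	{
		/* Second attempt: reuse the file, discarding any old contents. */
		fd = open(versionfile, O_WRONLY | O_TRUNC);
	}
	if (fd < 0)
		return -1;

	if (write(fd, buf, (size_t) len) != (ssize_t) len)
	{
		close(fd);
		return -1;
	}

	/* Flush to stable storage before reporting success. */
	if (fsync(fd) != 0)
	{
		close(fd);
		return -1;
	}
	return close(fd);
}
```

With O_TRUNC on the second open, calling write_version_file() on a path
that already holds longer contents leaves only the version string and
the newline in the file.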

Regards,
Greg Nancarrow
Fujitsu Australia

#46

Greg Nancarrow

gregn4422@gmail.com

about 4 years ago

In reply to: Dilip Kumar (#44)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Nov 25, 2021 at 10:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

> Thanks for the review and many valuable comments, I have fixed all of
> them except this comment (/* If we got a cancel signal during the copy
> of the data, quit */) because this looks fine to me. 0007, I have
> dropped from the patchset for now. I have also included fixes for
> comments given by John.

I found the following issue with the patches applied:

A server crash occurs after the following sequence of commands:

create tablespace tbsp1 location '<directory>/tbsp1';
create tablespace tbsp2 location '<directory>/tbsp2';
create database test1 tablespace tbsp1;
create database test2 template test1 tablespace tbsp2;
alter database test2 set tablespace tbsp1;
checkpoint;

The following type of message is seen in the server log:

2021-12-01 16:48:26.623 AEDT [67423] PANIC: could not fsync file
"pg_tblspc/16385/PG_15_202111301/16387/3394": No such file or
directory
2021-12-01 16:48:27.228 AEDT [67422] LOG: checkpointer process (PID
67423) was terminated by signal 6: Aborted
2021-12-01 16:48:27.228 AEDT [67422] LOG: terminating any other
active server processes
2021-12-01 16:48:27.233 AEDT [67422] LOG: all server processes
terminated; reinitializing

Also (prior to running the checkpoint command above) I've seen errors
like the following when running pg_dumpall:

pg_dump: error: connection to server on socket "/tmp/.s.PGSQL.5432"
failed: PANIC: could not open critical system index 2662
pg_dumpall: error: pg_dump failed on database "test2", exiting

Hopefully the above example will help in tracking down the cause.

Regards,
Greg Nancarrow
Fujitsu Australia

#47

Dilip Kumar

dilipbalaut@gmail.com

about 4 years ago

In reply to: Greg Nancarrow (#46)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Dec 1, 2021 at 12:07 PM Greg Nancarrow <gregn4422@gmail.com> wrote:

> On Thu, Nov 25, 2021 at 10:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
>> Thanks for the review and many valuable comments, I have fixed all of
>> them except this comment (/* If we got a cancel signal during the copy
>> of the data, quit */) because this looks fine to me. 0007, I have
>> dropped from the patchset for now. I have also included fixes for
>> comments given by John.
>
> I found the following issue with the patches applied:
>
> A server crash occurs after the following sequence of commands:
>
> create tablespace tbsp1 location '<directory>/tbsp1';
> create tablespace tbsp2 location '<directory>/tbsp2';
> create database test1 tablespace tbsp1;
> create database test2 template test1 tablespace tbsp2;
> alter database test2 set tablespace tbsp1;
> checkpoint;
>
> The following type of message is seen in the server log:
>
> 2021-12-01 16:48:26.623 AEDT [67423] PANIC: could not fsync file
> "pg_tblspc/16385/PG_15_202111301/16387/3394": No such file or
> directory

Thanks a lot for testing this. From the error, it seems that some of
the old buffers for the previous tablespace are not dropped after
movedb. We do call DropDatabaseBuffers() after copying to the new
tablespace, which should drop all of this database's buffers for the
old tablespace, but something seems to be missing. I will reproduce
this and try to fix it by tomorrow. I will also fix the other review
comments you raised in the previous mail.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#48

Dilip Kumar

dilipbalaut@gmail.com

about 4 years ago

In reply to: Dilip Kumar (#47)

7 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Dec 1, 2021 at 6:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

> Thanks a lot for testing this. From the error, it seems like some of
> the old buffer w.r.t. the previous tablespace is not dropped after the
> movedb. Actually, we are calling DropDatabaseBuffers() after copying
> to a new tablespace and dropping all the buffers of this database
> w.r.t the old tablespace. But seems something is missing, I will
> reproduce this and try to fix it by tomorrow. I will also fix the
> other review comments raised by you in the previous mail.

Okay, I found the issue: we drop the database buffers but do not
unregister the pending sync requests for the database's files in the
old tablespace. The attached patch set fixes that. I also had to extend
ForgetDatabaseSyncRequests so that it can forget the sync requests of a
database for one particular tablespace, so I added a separate patch for
that (0006).
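
The fix can be pictured with a toy model of the pending-request queue.
This is a sketch under assumptions, not the patch's code:
DemoSyncRequest and demo_forget_sync_requests are hypothetical names,
the real request queue is not a flat array, and any OIDs a caller
passes are illustrative. The point is only the filter: forget the
requests of one database, optionally restricted to one tablespace,
which is the kind of leftover request that would explain the "could
not fsync file" PANIC reported above.

```c
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

/* Toy stand-in for a queued fsync request; not PostgreSQL's real struct. */
typedef struct DemoSyncRequest
{
	Oid			spcNode;	/* tablespace */
	Oid			dbNode;		/* database */
	Oid			relNode;	/* relation file */
	bool		canceled;	/* true once the request has been forgotten */
} DemoSyncRequest;

/*
 * Cancel pending requests that belong to database dbid.  If tsid is a
 * valid OID, only requests in that tablespace are canceled; with
 * InvalidOid, every request of the database is canceled.
 * Returns the number of requests canceled by this call.
 */
static size_t
demo_forget_sync_requests(DemoSyncRequest *reqs, size_t nreqs,
						  Oid dbid, Oid tsid)
{
	size_t		ncanceled = 0;

	for (size_t i = 0; i < nreqs; i++)
	{
		if (reqs[i].canceled || reqs[i].dbNode != dbid)
			continue;
		if (tsid != InvalidOid && reqs[i].spcNode != tsid)
			continue;
		reqs[i].canceled = true;
		ncanceled++;
	}
	return ncanceled;
}
```

In this model, movedb would call the function with the database OID and
the old tablespace OID, leaving requests for the new tablespace
untouched.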

I will test the performance scenario suggested by John next week.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v7-0004-Extend-bufmgr-interfaces.patch (text/x-patch; charset=US-ASCII)

From 58a93d2b37a99d00d4879a6ecec6be9a9900cb00 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 24 Sep 2021 18:23:39 +0530
Subject: [PATCH v7 4/7] Extend bufmgr interfaces

Extend the ReadBufferWithoutRelcache interface to take relpersistence as
an input, and extend DropDatabaseBuffers to take a tablespace oid as
input.
---
 src/backend/access/transam/xlogutils.c |  9 ++++++---
 src/backend/commands/dbcommands.c      |  9 +++------
 src/backend/storage/buffer/bufmgr.c    | 24 +++++++++++-------------
 src/include/storage/bufmgr.h           |  5 +++--
 4 files changed, 23 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index b33e053..81c192f 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -484,7 +484,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	{
 		/* page exists in file */
 		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-										   mode, NULL);
+										   mode, NULL,
+										   RELPERSISTENCE_PERMANENT);
 	}
 	else
 	{
@@ -509,7 +510,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				ReleaseBuffer(buffer);
 			}
 			buffer = ReadBufferWithoutRelcache(rnode, forknum,
-											   P_NEW, mode, NULL);
+											   P_NEW, mode, NULL,
+											   RELPERSISTENCE_PERMANENT);
 		}
 		while (BufferGetBlockNumber(buffer) < blkno);
 		/* Handle the corner case that P_NEW returns non-consecutive pages */
@@ -519,7 +521,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			ReleaseBuffer(buffer);
 			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-											   mode, NULL);
+											   mode, NULL,
+											   RELPERSISTENCE_PERMANENT);
 		}
 	}
 
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 029fab4..1d963d8 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -938,7 +938,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
 	 * is important to ensure that no remaining backend tries to write out a
 	 * dirty buffer to the dead database later...
 	 */
-	DropDatabaseBuffers(db_id);
+	DropDatabaseBuffers(db_id, InvalidOid);
 
 	/*
 	 * Tell the stats collector to forget it immediately, too.
@@ -1220,11 +1220,8 @@ movedb(const char *dbname, const char *tblspcname)
 	 * contain valid data again --- but they'd be missing any changes made in
 	 * the database while it was in the new tablespace.  In any case, freeing
 	 * buffers that should never be used again seems worth the cycles.
-	 *
-	 * Note: it'd be sufficient to get rid of buffers matching db_id and
-	 * src_tblspcoid, but bufmgr.c presently provides no API for that.
 	 */
-	DropDatabaseBuffers(db_id);
+	DropDatabaseBuffers(db_id, src_tblspcoid);
 
 	/*
 	 * Check for existence of files in the target directory, i.e., objects of
@@ -2201,7 +2198,7 @@ dbase_redo(XLogReaderState *record)
 		ReplicationSlotsDropDBSlots(xlrec->db_id);
 
 		/* Drop pages for this database that are in the shared buffer cache */
-		DropDatabaseBuffers(xlrec->db_id);
+		DropDatabaseBuffers(xlrec->db_id, InvalidOid);
 
 		/* Also, clean out any fsync requests that might be pending in md.c */
 		ForgetDatabaseSyncRequests(xlrec->db_id);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 08ebabf..ea3ebcc 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -770,24 +770,17 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 /*
  * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
  *		a relcache entry for the relation.
- *
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay.  If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
  */
 Buffer
 ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
-						  BufferAccessStrategy strategy)
+						  BufferAccessStrategy strategy, char relpersistence)
 {
 	bool		hit;
 
 	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
 
-	Assert(InRecovery);
-
-	return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
+	return ReadBuffer_common(smgr, relpersistence, forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -797,7 +790,7 @@ ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
  *
  * *hit is set to true if the request was satisfied from shared buffer cache.
  */
-static Buffer
+Buffer
 ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 				  BlockNumber blockNum, ReadBufferMode mode,
 				  BufferAccessStrategy strategy, bool *hit)
@@ -3402,10 +3395,13 @@ FindAndDropRelFileNodeBuffers(RelFileNode rnode, ForkNumber forkNum,
  *		database, to avoid trying to flush data to disk when the directory
  *		tree no longer exists.  Implementation is pretty similar to
  *		DropRelFileNodeBuffers() which is for destroying just one relation.
+ *
+ *		If a valid tablespace oid is passed, a buffer must match both the db oid
+ *		and the tablespace oid; otherwise only the db oid is compared.
  * --------------------------------------------------------------------
  */
 void
-DropDatabaseBuffers(Oid dbid)
+DropDatabaseBuffers(Oid dbid, Oid tbsid)
 {
 	int			i;
 
@@ -3423,11 +3419,13 @@ DropDatabaseBuffers(Oid dbid)
 		 * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
 		 * and saves some cycles.
 		 */
-		if (bufHdr->tag.rnode.dbNode != dbid)
+		if (bufHdr->tag.rnode.dbNode != dbid ||
+			(OidIsValid(tbsid) && bufHdr->tag.rnode.spcNode != tbsid))
 			continue;
 
 		buf_state = LockBufHdr(bufHdr);
-		if (bufHdr->tag.rnode.dbNode == dbid)
+		if (bufHdr->tag.rnode.dbNode == dbid &&
+			(!OidIsValid(tbsid) || bufHdr->tag.rnode.spcNode == tbsid))
 			InvalidateBuffer(bufHdr);	/* releases spinlock */
 		else
 			UnlockBufHdr(bufHdr, buf_state);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index cfce23e..237c6a9 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -184,7 +184,8 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BufferAccessStrategy strategy);
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 										ForkNumber forkNum, BlockNumber blockNum,
-										ReadBufferMode mode, BufferAccessStrategy strategy);
+										ReadBufferMode mode, BufferAccessStrategy strategy,
+										char relpersistence);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
@@ -207,7 +208,7 @@ extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
 								   int nforks, BlockNumber *firstDelBlock);
 extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes);
-extern void DropDatabaseBuffers(Oid dbid);
+extern void DropDatabaseBuffers(Oid dbid, Oid tbsid);
 
 #define RelationGetNumberOfBlocks(reln) \
 	RelationGetNumberOfBlocksInFork(reln, MAIN_FORKNUM)
-- 
1.8.3.1
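
In the DropDatabaseBuffers hunk above, the unlocked precheck and the
recheck under the buffer header lock are meant to be exact logical
complements of each other. The sketch below restates both conditions
over a simplified tag so that this can be checked in isolation.
DemoBufTag and the two demo_ functions are hypothetical; Oid, InvalidOid
and OidIsValid are redefined locally only to keep the example
self-contained.

```c
#include <stdbool.h>

typedef unsigned int Oid;
#define InvalidOid		((Oid) 0)
#define OidIsValid(oid)	((oid) != InvalidOid)

/* Simplified buffer tag: only the two fields the filter looks at. */
typedef struct DemoBufTag
{
	Oid			spcNode;	/* tablespace of the buffered page */
	Oid			dbNode;		/* database of the buffered page */
} DemoBufTag;

/* Precheck condition from the hunk: true means "skip this buffer". */
static bool
demo_skip_buffer(const DemoBufTag *tag, Oid dbid, Oid tbsid)
{
	return tag->dbNode != dbid ||
		(OidIsValid(tbsid) && tag->spcNode != tbsid);
}

/* Recheck condition from the hunk: true means "invalidate this buffer". */
static bool
demo_drop_buffer(const DemoBufTag *tag, Oid dbid, Oid tbsid)
{
	return tag->dbNode == dbid &&
		(!OidIsValid(tbsid) || tag->spcNode == tbsid);
}
```

Passing InvalidOid as tbsid keeps the old behavior (match on database
only), which is what the dropdb() and dbase_redo() call sites in the
patch rely on.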

v7-0003-Refactor-index_copy_data.patch (text/x-patch; charset=US-ASCII)

From 91acf75aef203d2201ab462e21f26d36d12dad67 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 24 Sep 2021 18:13:25 +0530
Subject: [PATCH v7 3/7] Refactor index_copy_data

Make a separate interface for copying relation storage; this will
be used by a later patch for copying the database relations.
---
 src/backend/commands/tablecmds.c | 61 ++++++++++++++++++++++++----------------
 src/include/commands/tablecmds.h |  5 ++++
 src/tools/pgindent/typedefs.list |  1 +
 3 files changed, 43 insertions(+), 24 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 5e9cae2..25f897f 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14237,21 +14237,15 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
 	return new_tablespaceoid;
 }
 
-static void
-index_copy_data(Relation rel, RelFileNode newrnode)
+/*
+ * Copy the data of all forks of the source smgr relation to the destination.
+ *
+ * copy_storage - storage copy function, which is passed by the caller.
+ */
+void
+RelationCopyAllFork(SMgrRelation src_smgr, SMgrRelation	dst_smgr,
+					char relpersistence, copy_relation_storage copy_storage)
 {
-	SMgrRelation dstrel;
-
-	dstrel = smgropen(newrnode, rel->rd_backend);
-
-	/*
-	 * Since we copy the file directly without looking at the shared buffers,
-	 * we'd better first flush out any pages of the source relation that are
-	 * in shared buffers.  We assume no new changes will be made while we are
-	 * holding exclusive lock on the rel.
-	 */
-	FlushRelationBuffers(rel);
-
 	/*
 	 * Create and copy all forks of the relation, and schedule unlinking of
 	 * old physical files.
@@ -14259,32 +14253,51 @@ index_copy_data(Relation rel, RelFileNode newrnode)
 	 * NOTE: any conflict in relfilenode value will be caught in
 	 * RelationCreateStorage().
 	 */
-	RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
+	RelationCreateStorage(dst_smgr->smgr_rnode.node, relpersistence);
 
 	/* copy main fork */
-	RelationCopyStorage(RelationGetSmgr(rel), dstrel, MAIN_FORKNUM,
-						rel->rd_rel->relpersistence);
+	copy_storage(src_smgr, dst_smgr, MAIN_FORKNUM, relpersistence);
 
 	/* copy those extra forks that exist */
 	for (ForkNumber forkNum = MAIN_FORKNUM + 1;
 		 forkNum <= MAX_FORKNUM; forkNum++)
 	{
-		if (smgrexists(RelationGetSmgr(rel), forkNum))
+		if (smgrexists(src_smgr, forkNum))
 		{
-			smgrcreate(dstrel, forkNum, false);
+			smgrcreate(dst_smgr, forkNum, false);
 
 			/*
 			 * WAL log creation if the relation is persistent, or this is the
 			 * init fork of an unlogged relation.
 			 */
-			if (RelationIsPermanent(rel) ||
-				(rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
+			if (relpersistence == RELPERSISTENCE_PERMANENT ||
+				(relpersistence == RELPERSISTENCE_UNLOGGED &&
 				 forkNum == INIT_FORKNUM))
-				log_smgrcreate(&newrnode, forkNum);
-			RelationCopyStorage(RelationGetSmgr(rel), dstrel, forkNum,
-								rel->rd_rel->relpersistence);
+				log_smgrcreate(&dst_smgr->smgr_rnode.node, forkNum);
+
+			/* Copy a fork's data, block by block. */
+			copy_storage(src_smgr, dst_smgr, forkNum, relpersistence);
 		}
 	}
+}
+
+static void
+index_copy_data(Relation rel, RelFileNode newrnode)
+{
+	SMgrRelation dstrel;
+
+	dstrel = smgropen(newrnode, rel->rd_backend);
+
+	/*
+	 * Since we copy the file directly without looking at the shared buffers,
+	 * we'd better first flush out any pages of the source relation that are
+	 * in shared buffers.  We assume no new changes will be made while we are
+	 * holding exclusive lock on the rel.
+	 */
+	FlushRelationBuffers(rel);
+
+	RelationCopyAllFork(RelationGetSmgr(rel), dstrel,
+						rel->rd_rel->relpersistence, RelationCopyStorage);
 
 	/* drop old relation, and close new one */
 	RelationDropStorage(rel);
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 336549c..e0e0aa5 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -19,10 +19,13 @@
 #include "catalog/objectaddress.h"
 #include "nodes/parsenodes.h"
 #include "storage/lock.h"
+#include "storage/smgr.h"
 #include "utils/relcache.h"
 
 struct AlterTableUtilityContext;	/* avoid including tcop/utility.h here */
 
+typedef void (*copy_relation_storage) (SMgrRelation src, SMgrRelation dst,
+									  ForkNumber forkNum, char relpersistence);
 
 extern ObjectAddress DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 									ObjectAddress *typaddress, const char *queryString);
@@ -42,6 +45,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
 
 extern Oid	AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
 
+extern void RelationCopyAllFork(SMgrRelation src_smgr, SMgrRelation	dst_smgr,
+								char relpersistence, copy_relation_storage copy_storage);
 extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
 										 Oid *oldschema);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index da6ac8e..bb3097f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3050,6 +3050,7 @@ config_var_value
 contain_aggs_of_level_context
 convert_testexpr_context
 copy_data_source_cb
+copy_relation_storage
 core_YYSTYPE
 core_yy_extra_type
 core_yyscan_t
-- 
1.8.3.1
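
The refactoring above turns the per-fork copy routine into a callback,
so that a caller can reuse the fork iteration with a different copy
strategy. Below is a self-contained sketch of that pattern, with a toy
relation type in place of SMgrRelation. DemoRel, demo_copy_storage,
demo_copy_plain and demo_copy_all_forks are hypothetical names, and
storage creation and WAL logging are deliberately left out.

```c
#include <stdbool.h>
#include <string.h>

/* PostgreSQL has four forks (main, fsm, vm, init); modeled as 0..3. */
#define DEMO_NFORKS 4

/* Toy relation: which forks exist, and a small payload per fork. */
typedef struct DemoRel
{
	bool		exists[DEMO_NFORKS];
	char		data[DEMO_NFORKS][32];
} DemoRel;

/* Callback type: copy one fork from src to dst. */
typedef void (*demo_copy_storage) (const DemoRel *src, DemoRel *dst,
								   int forknum);

/* One possible strategy: a plain byte copy of the fork's payload. */
static void
demo_copy_plain(const DemoRel *src, DemoRel *dst, int forknum)
{
	memcpy(dst->data[forknum], src->data[forknum],
		   sizeof(dst->data[forknum]));
}

/*
 * Copy the main fork unconditionally, then every extra fork that exists
 * in the source, delegating the actual copy to the caller's callback.
 */
static void
demo_copy_all_forks(const DemoRel *src, DemoRel *dst,
					demo_copy_storage copy_storage)
{
	/* Fork 0 (main) is always copied. */
	dst->exists[0] = true;
	copy_storage(src, dst, 0);

	for (int forknum = 1; forknum < DEMO_NFORKS; forknum++)
	{
		if (!src->exists[forknum])
			continue;
		dst->exists[forknum] = true;
		copy_storage(src, dst, forknum);
	}
}
```

In the patch, index_copy_data() passes RelationCopyStorage as the
callback; a later patch in the series can pass its own routine without
duplicating the loop.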

v7-0001-Refactor-relmap-load-and-relmap-write-functions.patch (text/x-patch; charset=US-ASCII)

From eb27d159bc2aa41011b6352a514c7981a19a64dc Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Wed, 1 Sep 2021 14:06:29 +0530
Subject: [PATCH v7 1/7] Refactor relmap load and relmap write functions

Currently, write_relmap_file and load_relmap_file are tightly
coupled with shared_map and local_map.  As part of the higher
level patch set we need relmap read/write interfaces that are
not dependent upon shared_map and local_map, and we should be
able to pass map memory as an external parameter instead.
---
 src/backend/utils/cache/relmapper.c | 163 ++++++++++++++++++++++--------------
 1 file changed, 99 insertions(+), 64 deletions(-)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index a6e38ad..bb39632 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -136,6 +136,12 @@ static void apply_map_update(RelMapFile *map, Oid relationId, Oid fileNode,
 							 bool add_okay);
 static void merge_map_updates(RelMapFile *map, const RelMapFile *updates,
 							  bool add_okay);
+static void read_relmap_file(char *mapfilename, RelMapFile *map,
+							 bool lock_held);
+static void write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
+									   bool write_wal, bool send_sinval,
+									   bool preserve_files, Oid dbid, Oid tsid,
+									   const char *dbpath);
 static void load_relmap_file(bool shared, bool lock_held);
 static void write_relmap_file(bool shared, RelMapFile *newmap,
 							  bool write_wal, bool send_sinval, bool preserve_files,
@@ -687,36 +693,19 @@ RestoreRelationMap(char *startAddress)
 }
 
 /*
- * load_relmap_file -- load data from the shared or local map file
+ * read_relmap_file -- read data from given mapfilename file.
  *
  * Because the map file is essential for access to core system catalogs,
  * failure to read it is a fatal error.
- *
- * Note that the local case requires DatabasePath to be set up.
  */
 static void
-load_relmap_file(bool shared, bool lock_held)
+read_relmap_file(char *mapfilename, RelMapFile *map, bool lock_held)
 {
-	RelMapFile *map;
-	char		mapfilename[MAXPGPATH];
 	pg_crc32c	crc;
 	int			fd;
 	int			r;
 
-	if (shared)
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
-				 RELMAPPER_FILENAME);
-		map = &shared_map;
-	}
-	else
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 DatabasePath, RELMAPPER_FILENAME);
-		map = &local_map;
-	}
-
-	/* Read data ... */
+	/* Open the relmap file for reading. */
 	fd = OpenTransientFile(mapfilename, O_RDONLY | PG_BINARY);
 	if (fd < 0)
 		ereport(FATAL,
@@ -779,62 +768,50 @@ load_relmap_file(bool shared, bool lock_held)
 }
 
 /*
- * Write out a new shared or local map file with the given contents.
- *
- * The magic number and CRC are automatically updated in *newmap.  On
- * success, we copy the data to the appropriate permanent static variable.
- *
- * If write_wal is true then an appropriate WAL message is emitted.
- * (It will be false for bootstrap and WAL replay cases.)
- *
- * If send_sinval is true then a SI invalidation message is sent.
- * (This should be true except in bootstrap case.)
- *
- * If preserve_files is true then the storage manager is warned not to
- * delete the files listed in the map.
+ * load_relmap_file -- load data from the shared or local map file
  *
- * Because this may be called during WAL replay when MyDatabaseId,
- * DatabasePath, etc aren't valid, we require the caller to pass in suitable
- * values.  The caller is also responsible for being sure no concurrent
- * map update could be happening.
+ * Note that the local case requires DatabasePath to be set up.
  */
 static void
-write_relmap_file(bool shared, RelMapFile *newmap,
-				  bool write_wal, bool send_sinval, bool preserve_files,
-				  Oid dbid, Oid tsid, const char *dbpath)
+load_relmap_file(bool shared, bool lock_held)
 {
-	int			fd;
-	RelMapFile *realmap;
+	RelMapFile *map;
 	char		mapfilename[MAXPGPATH];
 
-	/*
-	 * Fill in the overhead fields and update CRC.
-	 */
-	newmap->magic = RELMAPPER_FILEMAGIC;
-	if (newmap->num_mappings < 0 || newmap->num_mappings > MAX_MAPPINGS)
-		elog(ERROR, "attempt to write bogus relation mapping");
-
-	INIT_CRC32C(newmap->crc);
-	COMP_CRC32C(newmap->crc, (char *) newmap, offsetof(RelMapFile, crc));
-	FIN_CRC32C(newmap->crc);
-
-	/*
-	 * Open the target file.  We prefer to do this before entering the
-	 * critical section, so that an open() failure need not force PANIC.
-	 */
 	if (shared)
 	{
 		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
 				 RELMAPPER_FILENAME);
-		realmap = &shared_map;
+		map = &shared_map;
 	}
 	else
 	{
 		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 dbpath, RELMAPPER_FILENAME);
-		realmap = &local_map;
+				 DatabasePath, RELMAPPER_FILENAME);
+		map = &local_map;
 	}
 
+	/* Read data ... */
+	read_relmap_file(mapfilename, map, lock_held);
+}
+
+/*
+ * Helper function for write_relmap_file, Read comments atop write_relmap_file
+ * for more details.  The CRC should be computed by the caller and stored in
+ * the newmap.
+ */
+static void
+write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
+						   bool write_wal, bool send_sinval,
+						   bool preserve_files, Oid dbid, Oid tsid,
+						   const char *dbpath)
+{
+	int			fd;
+
+	/*
+	 * Open the target file.  We prefer to do this before entering the
+	 * critical section, so that an open() failure need not force PANIC.
+	 */
 	fd = OpenTransientFile(mapfilename, O_WRONLY | O_CREAT | PG_BINARY);
 	if (fd < 0)
 		ereport(ERROR,
@@ -934,6 +911,68 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		}
 	}
 
+	/* Critical section done */
+	if (write_wal)
+		END_CRIT_SECTION();
+}
+
+/*
+ * Write out a new shared or local map file with the given contents.
+ *
+ * The magic number and CRC are automatically updated in *newmap.  On
+ * success, we copy the data to the appropriate permanent static variable.
+ *
+ * If write_wal is true then an appropriate WAL message is emitted.
+ * (It will be false for bootstrap and WAL replay cases.)
+ *
+ * If send_sinval is true then a SI invalidation message is sent.
+ * (This should be true except in bootstrap case.)
+ *
+ * If preserve_files is true then the storage manager is warned not to
+ * delete the files listed in the map.
+ *
+ * Because this may be called during WAL replay when MyDatabaseId,
+ * DatabasePath, etc aren't valid, we require the caller to pass in suitable
+ * values.  The caller is also responsible for being sure no concurrent
+ * map update could be happening.
+ */
+static void
+write_relmap_file(bool shared, RelMapFile *newmap,
+				  bool write_wal, bool send_sinval, bool preserve_files,
+				  Oid dbid, Oid tsid, const char *dbpath)
+{
+	RelMapFile *realmap;
+	char		mapfilename[MAXPGPATH];
+
+	/*
+	 * Fill in the overhead fields and update CRC.
+	 */
+	newmap->magic = RELMAPPER_FILEMAGIC;
+	if (newmap->num_mappings < 0 || newmap->num_mappings > MAX_MAPPINGS)
+		elog(ERROR, "attempt to write bogus relation mapping");
+
+	INIT_CRC32C(newmap->crc);
+	COMP_CRC32C(newmap->crc, (char *) newmap, offsetof(RelMapFile, crc));
+	FIN_CRC32C(newmap->crc);
+
+	if (shared)
+	{
+		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
+				 RELMAPPER_FILENAME);
+		realmap = &shared_map;
+	}
+	else
+	{
+		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+				 dbpath, RELMAPPER_FILENAME);
+		realmap = &local_map;
+	}
+
+	/* Write the map to the relmap file. */
+	write_relmap_file_internal(mapfilename, newmap, write_wal,
+							   send_sinval, preserve_files, dbid, tsid,
+							   dbpath);
+
 	/*
 	 * Success, update permanent copy.  During bootstrap, we might be working
 	 * on the permanent copy itself, in which case skip the memcpy() to avoid
@@ -943,10 +982,6 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		memcpy(realmap, newmap, sizeof(RelMapFile));
 	else
 		Assert(!send_sinval);	/* must be bootstrapping */
-
-	/* Critical section done */
-	if (write_wal)
-		END_CRIT_SECTION();
 }
 
 /*
-- 
1.8.3.1

v7-0002-Extend-relmap-interfaces.patch (text/x-patch; charset=US-ASCII)

From 88dc7c9e96d3f74844c3fd32bdbf1b8d58ce911c Mon Sep 17 00:00:00 2001
From: dilipkumar <dilipbalaut@gmail.com>
Date: Mon, 4 Oct 2021 13:50:44 +0530
Subject: [PATCH v7 2/7] Extend relmap interfaces

Add two new interfaces to the relmapper:

1) CopyRelationMap, which copies the relmap file from one database
path to another database path.
2) RelationMapOidToFilenodeForDatabase, which works like
RelationMapOidToFilenode but, instead of looking up the mapping for
the database we are connected to, looks it up for the input database
path.

These interfaces are required by the next patch, which supports the
WAL-logged CREATE DATABASE.
---
 src/backend/utils/cache/relmapper.c | 122 +++++++++++++++++++++++++++++++-----
 src/include/utils/relmapper.h       |   6 +-
 2 files changed, 112 insertions(+), 16 deletions(-)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index bb39632..51f361c 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -141,7 +141,7 @@ static void read_relmap_file(char *mapfilename, RelMapFile *map,
 static void write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
 									   bool write_wal, bool send_sinval,
 									   bool preserve_files, Oid dbid, Oid tsid,
-									   const char *dbpath);
+									   const char *dbpath, bool create);
 static void load_relmap_file(bool shared, bool lock_held);
 static void write_relmap_file(bool shared, RelMapFile *newmap,
 							  bool write_wal, bool send_sinval, bool preserve_files,
@@ -256,6 +256,36 @@ RelationMapFilenodeToOid(Oid filenode, bool shared)
 }
 
 /*
+ * RelationMapOidToFilenodeForDatabase
+ *
+ * Find relfilenode for the given relation id in the dbpath.  Returns
+ * InvalidOid if the relationId is not found in the relmap.
+ */
+Oid
+RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId)
+{
+	RelMapFile	map;
+	int			i;
+	char		mapfilename[MAXPGPATH];
+
+	/* Relmap file path for the given dbpath. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 dbpath, RELMAPPER_FILENAME);
+
+	/* Read the relmap file from the source database. */
+	read_relmap_file(mapfilename, &map, false);
+
+	/* Iterate over the relmap entries to find the input relation oid. */
+	for (i = 0; i < map.num_mappings; i++)
+	{
+		if (relationId == map.mappings[i].mapoid)
+			return map.mappings[i].mapfilenode;
+	}
+
+	return InvalidOid;
+}
+
+/*
  * RelationMapUpdateMap
  *
  * Install a new relfilenode mapping for the specified relation.
@@ -693,7 +723,43 @@ RestoreRelationMap(char *startAddress)
 }
 
 /*
- * read_relmap_file -- read data from given mapfilename file.
+ * CopyRelationMap
+ *
+ * Copy the relmap file from the source db path to the destination db path and
+ * WAL-log the operation.  This function is only called during CREATE DATABASE,
+ * so the destination database is not yet visible to anyone else, and thus we
+ * don't need to acquire the relmap lock while updating the destination relmap.
+ */
+void
+CopyRelationMap(Oid dbid, Oid tsid, char *srcdbpath, char *dstdbpath)
+{
+	RelMapFile map;
+	char mapfilename[MAXPGPATH];
+
+	/* Relmap file path of the source database. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 srcdbpath, RELMAPPER_FILENAME);
+
+	/* Read the relmap file from the source database. */
+	read_relmap_file(mapfilename, &map, false);
+
+	/* Relmap file path of the destination database. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 dstdbpath, RELMAPPER_FILENAME);
+
+	/*
+	 * Write map contents into the destination database's relmap file.
+	 * write_relmap_file_internal expects the CRC to have been computed and
+	 * stored in the input map.  Since we read this map from the source
+	 * database and write it to the destination file unmodified, the stored
+	 * CRC is still valid and we don't need to recompute it.
+	 */
+	write_relmap_file_internal(mapfilename, &map, true, false, true, dbid,
+							   tsid, dstdbpath, true);
+}
+
+/*
+ * read_relmap_file - read data from given mapfilename file.
  *
  * Because the map file is essential for access to core system catalogs,
  * failure to read it is a fatal error.
@@ -796,15 +862,18 @@ load_relmap_file(bool shared, bool lock_held)
 }
 
 /*
- * Helper function for write_relmap_file, Read comments atop write_relmap_file
- * for more details.  The CRC should be computed by the caller and stored in
- * the newmap.
+ * Helper function for write_relmap_file and CopyRelationMap.  See the comments
+ * atop write_relmap_file for more details.  The CRC should be computed by the
+ * caller and stored in the newmap.
+ *
+ * Pass create = true if we are copying the relmap file during the CREATE
+ * DATABASE command.
  */
 static void
 write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
 						   bool write_wal, bool send_sinval,
 						   bool preserve_files, Oid dbid, Oid tsid,
-						   const char *dbpath)
+						   const char *dbpath, bool create)
 {
 	int			fd;
 
@@ -830,6 +899,7 @@ write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
 		xlrec.dbid = dbid;
 		xlrec.tsid = tsid;
 		xlrec.nbytes = sizeof(RelMapFile);
+		xlrec.create = create;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&xlrec), MinSizeOfRelmapUpdate);
@@ -971,7 +1041,7 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 	/* Write the map to the relmap file. */
 	write_relmap_file_internal(mapfilename, newmap, write_wal,
 							   send_sinval, preserve_files, dbid, tsid,
-							   dbpath);
+							   dbpath, false);
 
 	/*
 	 * Success, update permanent copy.  During bootstrap, we might be working
@@ -1063,15 +1133,37 @@ relmap_redo(XLogReaderState *record)
 		 * Write out the new map and send sinval, but of course don't write a
 		 * new WAL entry.  There's no surrounding transaction to tell to
 		 * preserve files, either.
-		 *
-		 * There shouldn't be anyone else updating relmaps during WAL replay,
-		 * but grab the lock to interlock against load_relmap_file().
 		 */
-		LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
-		write_relmap_file((xlrec->dbid == InvalidOid), &newmap,
-						  false, true, false,
-						  xlrec->dbid, xlrec->tsid, dbpath);
-		LWLockRelease(RelationMappingLock);
+		if (!xlrec->create)
+		{
+			/*
+			 * There shouldn't be anyone else updating relmaps during WAL
+			 * replay, but grab the lock to interlock against
+			 * load_relmap_file().
+			 */
+			LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
+			write_relmap_file((xlrec->dbid == InvalidOid), &newmap,
+							false, true, false,
+							xlrec->dbid, xlrec->tsid, dbpath);
+			LWLockRelease(RelationMappingLock);
+		}
+		else
+		{
+			char		mapfilename[MAXPGPATH];
+
+			/* Construct the mapfilename. */
+			snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+					 dbpath, RELMAPPER_FILENAME);
+
+			/*
+			 * We don't need to take the relmap lock because this WAL record
+			 * is logged while creating a new database, so no one else can be
+			 * reading or writing the relmap file.
+			 */
+			write_relmap_file_internal(mapfilename, &newmap, false, false,
+									   false, xlrec->dbid, xlrec->tsid, dbpath,
+									   true);
+		}
 
 		pfree(dbpath);
 	}
diff --git a/src/include/utils/relmapper.h b/src/include/utils/relmapper.h
index c0d14da..4165f09 100644
--- a/src/include/utils/relmapper.h
+++ b/src/include/utils/relmapper.h
@@ -29,6 +29,7 @@ typedef struct xl_relmap_update
 	Oid			dbid;			/* database ID, or 0 for shared map */
 	Oid			tsid;			/* database's tablespace, or pg_global */
 	int32		nbytes;			/* size of relmap data */
+	bool		create;			/* true if creating new relmap */
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } xl_relmap_update;
 
@@ -39,6 +40,8 @@ extern Oid	RelationMapOidToFilenode(Oid relationId, bool shared);
 
 extern Oid	RelationMapFilenodeToOid(Oid relationId, bool shared);
 
+extern Oid RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId);
+
 extern void RelationMapUpdateMap(Oid relationId, Oid fileNode, bool shared,
 								 bool immediate);
 
@@ -62,7 +65,8 @@ extern void RelationMapInitializePhase3(void);
 extern Size EstimateRelationMapSpace(void);
 extern void SerializeRelationMap(Size maxSize, char *startAddress);
 extern void RestoreRelationMap(char *startAddress);
-
+extern void CopyRelationMap(Oid dbid, Oid tsid, char *srcdbpath,
+							char *dstdbpath);
 extern void relmap_redo(XLogReaderState *record);
 extern void relmap_desc(StringInfo buf, XLogReaderState *record);
 extern const char *relmap_identify(uint8 info);
-- 
1.8.3.1
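
For readers following the relmapper changes, here is a minimal standalone sketch (not PostgreSQL code) of the two steps that RelationMapOidToFilenodeForDatabase combines: building the relmap path for an arbitrary database directory, and scanning the mappings for a relation OID. The Sketch* types, the sketch_* functions and the capacity constant are hypothetical stand-ins for RelMapFile and its helpers; reading and CRC-checking the file is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t Oid;
#define InvalidOid ((Oid) 0)

/* Hypothetical stand-ins for the real relmapper definitions. */
#define SKETCH_MAX_MAPPINGS 62
#define SKETCH_RELMAP_FILENAME "pg_filenode.map"

typedef struct SketchMapping
{
	Oid			mapoid;			/* OID of a mapped catalog */
	Oid			mapfilenode;	/* its current relfilenode */
} SketchMapping;

typedef struct SketchRelMap
{
	int			num_mappings;
	SketchMapping mappings[SKETCH_MAX_MAPPINGS];
} SketchRelMap;

/*
 * Build "<dbpath>/<relmap file name>" for any database directory, not just
 * the one we are connected to.  Returns 0 on success, -1 if buf is too small.
 */
static int
sketch_relmap_path(char *buf, size_t buflen, const char *dbpath)
{
	int			n = snprintf(buf, buflen, "%s/%s", dbpath,
							 SKETCH_RELMAP_FILENAME);

	return (n < 0 || (size_t) n >= buflen) ? -1 : 0;
}

/*
 * Linear search over the mappings; returns InvalidOid when the relation is
 * not a mapped relation.
 */
static Oid
sketch_oid_to_filenode(const SketchRelMap *map, Oid relationId)
{
	assert(map->num_mappings >= 0 &&
		   map->num_mappings <= SKETCH_MAX_MAPPINGS);

	for (int i = 0; i < map->num_mappings; i++)
	{
		if (map->mappings[i].mapoid == relationId)
			return map->mappings[i].mapfilenode;
	}

	return InvalidOid;
}
```

A caller would build the path with the source database's directory, load the map from that file, and then look up RelationRelationId to find where pg_class lives. A linear scan seems adequate here because the map holds only a few dozen entries.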

v7-0005-New-interface-to-lock-relation-id.patch (text/x-patch; charset=US-ASCII)

From 54dd5f7d89907daf35e06239e2066fc23b598eb2 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 24 Sep 2021 18:29:17 +0530
Subject: [PATCH v7 5/7] New interface to lock relation id

Same as LockRelationOid, but takes a LockRelId object as input instead
of a relation OID.  It uses the dboid passed in the LockRelId object
rather than MyDatabaseId, which makes it possible to lock a relation
even if we are not connected to its database.
---
 src/backend/storage/lmgr/lmgr.c | 28 ++++++++++++++++++++++++++++
 src/include/storage/lmgr.h      |  1 +
 2 files changed, 29 insertions(+)

diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 2db0424..89d3ecb 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -176,6 +176,34 @@ ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode)
 }
 
 /*
+ *		LockRelationId
+ *
+ * Lock, given a LockRelId.  Same as LockRelationOid but takes a LockRelId as
+ * input.
+ */
+void
+LockRelationId(LockRelId *relid, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+	LOCALLOCK  *locallock;
+	LockAcquireResult res;
+
+	SET_LOCKTAG_RELATION(tag, relid->dbId, relid->relId);
+
+	res = LockAcquireExtended(&tag, lockmode, false, false, true, &locallock);
+
+	/*
+	 * Now that we have the lock, check for invalidation messages; see notes
+	 * in LockRelationOid.
+	 */
+	if (res != LOCKACQUIRE_ALREADY_CLEAR)
+	{
+		AcceptInvalidationMessages();
+		MarkLockClear(locallock);
+	}
+}
+
+/*
  *		UnlockRelationId
  *
  * Unlock, given a LockRelId.  This is preferred over UnlockRelationOid
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index b009559..092ee93 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -38,6 +38,7 @@ extern void RelationInitLockInfo(Relation relation);
 
 /* Lock a relation */
 extern void LockRelationOid(Oid relid, LOCKMODE lockmode);
+extern void LockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern bool ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode);
 extern void UnlockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern void UnlockRelationOid(Oid relid, LOCKMODE lockmode);
-- 
1.8.3.1
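
To illustrate why the database OID has to come from the caller, here is a small standalone sketch (not PostgreSQL code) of a relation lock tag keyed by both database OID and relation OID. SketchLockTag and the sketch_* functions are hypothetical; the real LOCKTAG carries more fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Oid;
#define InvalidOid ((Oid) 0)

/* Hypothetical, reduced version of a relation lock tag. */
typedef struct SketchLockTag
{
	Oid			dbOid;			/* database containing the relation */
	Oid			relOid;			/* relation OID within that database */
} SketchLockTag;

/* Build a tag from an explicit database OID rather than "my" database. */
static SketchLockTag
sketch_relation_tag(Oid dbOid, Oid relOid)
{
	SketchLockTag tag;

	assert(relOid != InvalidOid);
	tag.dbOid = dbOid;
	tag.relOid = relOid;
	return tag;
}

/* Two tags name the same lockable object only if both fields match. */
static bool
sketch_same_lock(const SketchLockTag *a, const SketchLockTag *b)
{
	return a->dbOid == b->dbOid && a->relOid == b->relOid;
}
```

pg_class has the same OID in every database, so a tag built with the connected database's OID would name the wrong object when scanning the source database's pg_class. Only the database OID distinguishes the two.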

v7-0006-Extend-ForgetDatabaseSyncRequests-interface.patch (text/x-patch; charset=US-ASCII)

From ed61f7ac3fad60313010345935dba76a1df40f8d Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Thu, 2 Dec 2021 17:34:10 +0530
Subject: [PATCH v7 6/7] Extend ForgetDatabaseSyncRequests interface

Extend the interface so that it can forget the database's sync requests
for one specific tablespace only.
---
 src/backend/commands/dbcommands.c | 4 ++--
 src/backend/storage/smgr/md.c     | 4 ++--
 src/include/storage/md.h          | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 1d963d8..85fe598 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -951,7 +951,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
 	 * worse, it will delete files that belong to a newly created database
 	 * with the same OID.
 	 */
-	ForgetDatabaseSyncRequests(db_id);
+	ForgetDatabaseSyncRequests(db_id, InvalidOid);
 
 	/*
 	 * Force a checkpoint to make sure the checkpointer has received the
@@ -2201,7 +2201,7 @@ dbase_redo(XLogReaderState *record)
 		DropDatabaseBuffers(xlrec->db_id, InvalidOid);
 
 		/* Also, clean out any fsync requests that might be pending in md.c */
-		ForgetDatabaseSyncRequests(xlrec->db_id);
+		ForgetDatabaseSyncRequests(xlrec->db_id, InvalidOid);
 
 		/* Clean out the xlog relcache too */
 		XLogDropDatabase(xlrec->db_id);
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index b4bca7e..3b5ae1c 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1029,13 +1029,13 @@ register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
  * ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB
  */
 void
-ForgetDatabaseSyncRequests(Oid dbid)
+ForgetDatabaseSyncRequests(Oid dbid, Oid tbsid)
 {
 	FileTag		tag;
 	RelFileNode rnode;
 
 	rnode.dbNode = dbid;
-	rnode.spcNode = 0;
+	rnode.spcNode = tbsid;
 	rnode.relNode = 0;
 
 	INIT_MD_FILETAG(tag, rnode, InvalidForkNumber, InvalidBlockNumber);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 752b440..9502330 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -41,7 +41,7 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
 					   BlockNumber nblocks);
 extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
 
-extern void ForgetDatabaseSyncRequests(Oid dbid);
+extern void ForgetDatabaseSyncRequests(Oid dbid, Oid tbsid);
 extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
 
 /* md sync callbacks */
-- 
1.8.3.1
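
The hunks above only store tbsid in the file tag; the code that matches pending requests against that tag is not part of this patch. As one plausible reading of the intended semantics, here is a standalone sketch (not PostgreSQL code) in which InvalidOid for the tablespace means "every tablespace of this database", which is what the existing callers pass. SketchTag and sketch_forget_matches are hypothetical names, and the matching rule itself is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Oid;
#define InvalidOid ((Oid) 0)

/* Hypothetical, reduced file tag: just tablespace and database. */
typedef struct SketchTag
{
	Oid			spcNode;		/* tablespace OID, or InvalidOid for "any" */
	Oid			dbNode;			/* database OID */
} SketchTag;

/*
 * Assumed matching rule: a pending request is forgotten if it belongs to the
 * filter's database and the filter either names no tablespace or names the
 * request's tablespace.
 */
static bool
sketch_forget_matches(const SketchTag *filter, const SketchTag *request)
{
	assert(filter->dbNode != InvalidOid);

	if (filter->dbNode != request->dbNode)
		return false;

	return filter->spcNode == InvalidOid ||
		filter->spcNode == request->spcNode;
}
```

Under this reading, passing InvalidOid as in dropdb() and dbase_redo() keeps the old whole-database behavior, while a caller that moves a database between tablespaces could restrict the forget request to a single tablespace.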

v7-0007-WAL-logged-CREATE-DATABASE.patch (text/x-patch; charset=US-ASCII)

From fe0e122033bd0d7604a07d921308cf4cd700980b Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Tue, 5 Oct 2021 11:45:02 +0530
Subject: [PATCH v7 7/7] WAL logged CREATE DATABASE

Currently, CREATE DATABASE forces a checkpoint, then copies all the files,
then forces another checkpoint. The comments in the createdb() function
explain the reasons for this. This patch fixes this problem by making
CREATE DATABASE completely WAL-logged so that we can avoid the checkpoints.

This can also be useful for supporting TDE. For example, if the source and
target databases need different encryption, we cannot re-encrypt the page
data when we copy the whole directory.  But with this patch we copy page by
page, so we have an opportunity to re-encrypt each page before copying it to
the target database.
---
 src/backend/access/rmgrdesc/dbasedesc.c |   3 +-
 src/backend/commands/dbcommands.c       | 686 ++++++++++++++++++++++----------
 src/include/commands/dbcommands_xlog.h  |   3 -
 src/tools/pgindent/typedefs.list        |   1 +
 4 files changed, 469 insertions(+), 224 deletions(-)

diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index 2660984..5010f72 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -28,8 +28,7 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;
 
-		appendStringInfo(buf, "copy dir %u/%u to %u/%u",
-						 xlrec->src_tablespace_id, xlrec->src_db_id,
+		appendStringInfo(buf, "create dir %u/%u",
 						 xlrec->tablespace_id, xlrec->db_id);
 	}
 	else if (info == XLOG_DBASE_DROP)
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 85fe598..d3c3c7a 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -45,13 +45,13 @@
 #include "commands/dbcommands_xlog.h"
 #include "commands/defrem.h"
 #include "commands/seclabel.h"
+#include "commands/tablecmds.h"
 #include "commands/tablespace.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/bgwriter.h"
 #include "replication/slot.h"
-#include "storage/copydir.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
@@ -62,6 +62,7 @@
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/pg_locale.h"
+#include "utils/relmapper.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
@@ -77,6 +78,19 @@ typedef struct
 	Oid			dest_tsoid;		/* tablespace we are trying to move to */
 } movedb_failure_params;
 
+/*
+ * When creating a database, we scan the pg_class of the source database to
+ * identify all the relations to be copied.  The structure is used for storing
+ * information about each relation of the source database.
+ */
+typedef struct CreateDBRelInfo
+{
+	RelFileNode		rnode;				/* physical relation identifier */
+	Oid				reloid;				/* relation oid */
+	char			relpersistence;		/* relation's persistence level */
+} CreateDBRelInfo;
+
+
 /* non-export function prototypes */
 static void createdb_failure_callback(int code, Datum arg);
 static void movedb(const char *dbname, const char *tblspcname);
@@ -91,6 +105,426 @@ static bool have_createdb_privilege(void);
 static void remove_dbtablespaces(Oid db_id);
 static bool check_db_file_conflict(Oid db_id);
 static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
+static void CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid,
+									bool isRedo);
+static List *GetDatabaseRelationList(Oid srctbid, Oid srcdbid, char *srcpath);
+static void RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+									ForkNumber forkNum, char relpersistence);
+static void CopyDatabase(Oid src_dboid, Oid dboid, Oid src_tsid, Oid dst_tsid);
+
+/*
+ * CreateDirAndVersionFile - Create database directory and write out the
+ *							 PG_VERSION file in the database path.
+ *
+ * If isRedo is true, it's okay for the database directory to exist already.
+ *
+ * We can directly write PG_MAJORVERSION in the version file instead of copying
+ * from the source database file because these two must be the same.
+ */
+static void
+CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
+{
+	int		fd;
+	int		nbytes;
+	char	versionfile[MAXPGPATH];
+	char	buf[16];
+
+	/* Prepare version data before starting a critical section. */
+	sprintf(buf, "%s\n", PG_MAJORVERSION);
+	nbytes = strlen(PG_MAJORVERSION) + 1;
+
+	/* If we are not in WAL replay then write the WAL. */
+	if (!isRedo)
+	{
+		xl_dbase_create_rec xlrec;
+		XLogRecPtr	lsn;
+
+		/* Now errors are fatal ... */
+		START_CRIT_SECTION();
+
+		xlrec.db_id = dbid;
+		xlrec.tablespace_id = tsid;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), sizeof(xl_dbase_create_rec));
+
+		lsn = XLogInsert(RM_DBASE_ID, XLOG_DBASE_CREATE);
+
+		/* As always, WAL must hit the disk before the data update does. */
+		XLogFlush(lsn);
+	}
+
+	/* Create database directory. */
+	if (MakePGDirectory(dbpath) < 0)
+	{
+		/* Fail unless the directory already exists during WAL replay. */
+		if (errno != EEXIST || !isRedo)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create directory \"%s\": %m", dbpath)));
+	}
+
+	/*
+	 * Create PG_VERSION file in the database path.  If the file already exists
+	 * and we are in WAL replay then try again to open it in write mode.
+	 */
+	snprintf(versionfile, sizeof(versionfile), "%s/%s", dbpath, "PG_VERSION");
+
+	fd = OpenTransientFile(versionfile, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	if (fd < 0 && errno == EEXIST && isRedo)
+		fd = OpenTransientFile(versionfile, O_WRONLY | O_TRUNC | PG_BINARY);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m", versionfile)));
+
+	/* Write PG_MAJORVERSION in the PG_VERSION file. */
+	pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_WRITE);
+	errno = 0;
+	if ((int) write(fd, buf, nbytes) != nbytes)
+	{
+		/* If write didn't set errno, assume problem is no disk space. */
+		if (errno == 0)
+			errno = ENOSPC;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", versionfile)));
+	}
+	pgstat_report_wait_end();
+
+	/* Close the version file. */
+	CloseTransientFile(fd);
+
+	/* Critical section done. */
+	if (!isRedo)
+		END_CRIT_SECTION();
+}
+
+/*
+ * GetDatabaseRelationList - Get relfilenode list to be copied.
+ *
+ * Iterate over each block of the pg_class relation.  From there, we will check
+ * all the visible tuples in order to get a list of all the valid relfilenodes
+ * in the source database that should be copied to the target database.
+ */
+static List *
+GetDatabaseRelationList(Oid tbid, Oid dbid, char *srcpath)
+{
+	SMgrRelation	rd_smgr;
+	RelFileNode		rnode;
+	BlockNumber		nblocks;
+	BlockNumber		blkno;
+	OffsetNumber	offnum;
+	OffsetNumber	maxoff;
+	Buffer			buf;
+	Oid				relfilenode;
+	Page			page;
+	List		   *rnodelist = NIL;
+	HeapTupleData	tuple;
+	Form_pg_class	classForm;
+	LockRelId		relid;
+	BufferAccessStrategy bstrategy;
+
+	/* Get pg_class relfilenode. */
+	relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+													  RelationRelationId);
+	/*
+	 * We are going to read the buffers associated with the pg_class relation.
+	 * Thus, acquire the relation-level lock before starting the scan.  As we
+	 * are not connected to the database, we cannot use relation_open directly,
+	 * so we have to lock using the relation id.
+	 */
+	relid.dbId = dbid;
+	relid.relId = RelationRelationId;
+	LockRelationId(&relid, AccessShareLock);
+
+	/* Prepare a relnode for pg_class relation. */
+	rnode.spcNode = tbid;
+	rnode.dbNode = dbid;
+	rnode.relNode = relfilenode;
+
+	/*
+	 * We are not connected to the source database so open the pg_class
+	 * relation at the smgr level and get the block count.
+	 */
+	rd_smgr = smgropen(rnode, InvalidBackendId);
+	nblocks = smgrnblocks(rd_smgr, MAIN_FORKNUM);
+
+	/*
+	 * We're going to read the whole of pg_class, so it is better to use the
+	 * bulk-read buffer access strategy.
+	 */
+	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	/* Iterate over each block on the pg_class relation. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		/*
+		 * We are not connected to the source database so directly use the lower
+		 * level bufmgr interface which operates on the rnode.
+		 */
+		buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
+										RBM_NORMAL, bstrategy,
+										RELPERSISTENCE_PERMANENT);
+
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buf);
+		if (PageIsNew(page) || PageIsEmpty(page))
+		{
+			UnlockReleaseBuffer(buf);
+			continue;
+		}
+
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		/* Iterate over each tuple on the page. */
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid;
+
+			itemid = PageGetItemId(page, offnum);
+
+			/* Nothing to do if slot is empty or already dead. */
+			if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid) ||
+				ItemIdIsRedirected(itemid))
+				continue;
+
+			Assert(ItemIdIsNormal(itemid));
+			ItemPointerSet(&(tuple.t_self), blkno, offnum);
+
+			tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+			tuple.t_len = ItemIdGetLength(itemid);
+			tuple.t_tableOid = RelationRelationId;
+
+			/*
+			 * If the tuple is visible then add its relfilenode info to the
+			 * list.
+			 */
+			if (HeapTupleSatisfiesVisibility(&tuple, GetActiveSnapshot(), buf))
+			{
+				Oid				relfilenode = InvalidOid;
+				CreateDBRelInfo   *relinfo;
+
+				classForm = (Form_pg_class) GETSTRUCT(&tuple);
+
+				/* We don't need to copy the shared objects to the target. */
+				if (classForm->reltablespace == GLOBALTABLESPACE_OID)
+					continue;
+
+				/*
+				 * If the object doesn't have storage then there is nothing to
+				 * copy for it, so just ignore it.
+				 */
+				if (!RELKIND_HAS_STORAGE(classForm->relkind))
+					continue;
+
+				/*
+				 * If relfilenode is valid then directly use it.  Otherwise,
+				 * consult the relmapper for the mapped relation.
+				 */
+				if (OidIsValid(classForm->relfilenode))
+					relfilenode = classForm->relfilenode;
+				else
+					relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+													classForm->oid);
+
+				/* We must have a valid relfilenode oid. */
+				Assert(OidIsValid(relfilenode));
+
+				/* Prepare a rel info element and add it to the list. */
+				relinfo = (CreateDBRelInfo *) palloc(sizeof(CreateDBRelInfo));
+				if (OidIsValid(classForm->reltablespace))
+					relinfo->rnode.spcNode = classForm->reltablespace;
+				else
+					relinfo->rnode.spcNode = tbid;
+
+				relinfo->rnode.dbNode = dbid;
+				relinfo->rnode.relNode = relfilenode;
+				relinfo->reloid = classForm->oid;
+				relinfo->relpersistence = classForm->relpersistence;
+
+				/* Add it to the list. */
+				rnodelist = lappend(rnodelist, relinfo);
+			}
+		}
+
+		/* Release the buffer lock. */
+		UnlockReleaseBuffer(buf);
+	}
+
+	/* Release the lock. */
+	UnlockRelationId(&relid, AccessShareLock);
+
+	return rnodelist;
+}
+
+/*
+ * RelationCopyStorageUsingBuffer - Copy fork's data using bufmgr.
+ *
+ * Same as RelationCopyStorage but instead of using smgrread and smgrextend
+ * this will copy using bufmgr APIs.
+ */
+static void
+RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+							   ForkNumber forkNum, char relpersistence)
+{
+	Buffer		srcBuf;
+	Buffer		dstBuf;
+	Page		srcPage;
+	Page		dstPage;
+	bool		use_wal;
+	bool		copying_initfork;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	BufferAccessStrategy bstrategy_src;
+	BufferAccessStrategy bstrategy_dst;
+
+	/* See the comments in RelationCopyStorage. */
+	copying_initfork = relpersistence == RELPERSISTENCE_UNLOGGED &&
+		forkNum == INIT_FORKNUM;
+	use_wal = XLogIsNeeded() &&
+		(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
+
+	/* Get number of blocks in the source relation. */
+	nblocks = smgrnblocks(src, forkNum);
+
+	/*
+	 * We are going to copy the whole relation from the source to the
+	 * destination, so use the BAS_BULKREAD strategy for the source relation
+	 * and the BAS_BULKWRITE strategy for the destination relation.
+	 */
+	bstrategy_src = GetAccessStrategy(BAS_BULKREAD);
+	bstrategy_dst = GetAccessStrategy(BAS_BULKWRITE);
+
+	/* Iterate over each block of the source relation file. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		/* If we got a cancel signal during the copy of the data, quit */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Read block from source relation. */
+		srcBuf = ReadBufferWithoutRelcache(src->smgr_rnode.node, forkNum,
+										   blkno, RBM_NORMAL, bstrategy_src,
+										   relpersistence);
+		srcPage = BufferGetPage(srcBuf);
+		if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
+		{
+			ReleaseBuffer(srcBuf);
+			continue;
+		}
+
+		/* Use P_NEW to extend the relation. */
+		dstBuf = ReadBufferWithoutRelcache(dst->smgr_rnode.node, forkNum,
+										   P_NEW, RBM_NORMAL, bstrategy_dst,
+										   relpersistence);
+		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
+
+		START_CRIT_SECTION();
+
+		/* Initialize the page and write the data. */
+		dstPage = BufferGetPage(dstBuf);
+		PageInit(dstPage, BufferGetPageSize(dstBuf), 0);
+		memcpy(dstPage, srcPage, BLCKSZ);
+		MarkBufferDirty(dstBuf);
+
+		/* WAL-log the copied page. */
+		if (use_wal)
+			log_newpage_buffer(dstBuf, true);
+
+		END_CRIT_SECTION();
+
+		UnlockReleaseBuffer(dstBuf);
+		ReleaseBuffer(srcBuf);
+	}
+}
+
+/*
+ * CopyDatabase - Copy source database to the target database.
+ *
+ * Create target database directory and copy data files from the source database
+ * to the target database, block by block and WAL log all the operations.
+ */
+static void
+CopyDatabase(Oid src_dboid, Oid dst_dboid, Oid src_tsid, Oid dst_tsid)
+{
+	char	   *srcpath;
+	char	   *dstpath;
+	List	   *rnodelist = NULL;
+	ListCell   *cell;
+	LockRelId	relid;
+	RelFileNode	srcrnode;
+	RelFileNode	dstrnode;
+	CreateDBRelInfo	*relinfo;
+
+	/* Get the source database path. */
+	srcpath = GetDatabasePath(src_dboid, src_tsid);
+
+	/* Get the destination database path. */
+	dstpath = GetDatabasePath(dst_dboid, dst_tsid);
+
+	/* Create database directory and write PG_VERSION file. */
+	CreateDirAndVersionFile(dstpath, dst_dboid, dst_tsid, false);
+
+	/* Copy relmap file from source database to the destination database. */
+	CopyRelationMap(dst_dboid, dst_tsid, srcpath, dstpath);
+
+	/* Get list of all valid relnode from the source database. */
+	rnodelist = GetDatabaseRelationList(src_tsid, src_dboid, srcpath);
+	Assert(rnodelist != NIL);
+
+	/*
+	 * The database ID is the same for all the relations, so set it before
+	 * entering the loop.
+	 */
+	relid.dbId = src_dboid;
+
+	/*
+	 * Iterate over each relfilenode and copy the relation data block by block
+	 * from source database to the destination database.
+	 */
+	foreach(cell, rnodelist)
+	{
+		SMgrRelation	src_smgr;
+		SMgrRelation	dst_smgr;
+
+		relinfo = lfirst(cell);
+		srcrnode = relinfo->rnode;
+
+		/*
+		 * If the relation is in the source database's default tablespace,
+		 * create it in the destination database's default tablespace.
+		 * Otherwise, create it in the same tablespace as in the source
+		 * database.
+		 */
+		if (srcrnode.spcNode != src_tsid)
+			dstrnode.spcNode = srcrnode.spcNode;
+		else
+			dstrnode.spcNode = dst_tsid;
+
+		dstrnode.dbNode = dst_dboid;
+		dstrnode.relNode = srcrnode.relNode;
+
+		/* Acquire the lock on relation before start copying. */
+		relid.relId = relinfo->reloid;
+		LockRelationId(&relid, AccessShareLock);
+
+		/* Open the source and the destination relation at smgr level. */
+		src_smgr = smgropen(srcrnode, InvalidBackendId);
+		dst_smgr = smgropen(dstrnode, InvalidBackendId);
+
+		/* Copy relation storage from source to the destination. */
+		RelationCopyAllFork(src_smgr, dst_smgr, relinfo->relpersistence,
+							RelationCopyStorageUsingBuffer);
+
+		/* Release the lock. */
+		UnlockRelationId(&relid, AccessShareLock);
+	}
+
+	list_free_deep(rnodelist);
+}
 
 
 /*
@@ -99,8 +533,6 @@ static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
 Oid
 createdb(ParseState *pstate, const CreatedbStmt *stmt)
 {
-	TableScanDesc scan;
-	Relation	rel;
 	Oid			src_dboid;
 	Oid			src_owner;
 	int			src_encoding = -1;
@@ -563,19 +995,6 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	InvokeObjectPostCreateHook(DatabaseRelationId, dboid, 0);
 
 	/*
-	 * Force a checkpoint before starting the copy. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, if a checkpoint
-	 * happened while we're copying files, a file might be deleted just when
-	 * we're about to copy it, causing the lstat() call in copydir() to fail
-	 * with ENOENT.
-	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
-
-	/*
 	 * Once we start copying subdirectories, we need to be able to clean 'em
 	 * up if we fail.  Use an ENSURE block to make sure this happens.  (This
 	 * is not a 100% solution, because of the possibility of failure during
@@ -587,115 +1006,16 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	PG_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
-		/*
-		 * Iterate through all tablespaces of the template database, and copy
-		 * each one to the new database.
-		 */
-		rel = table_open(TableSpaceRelationId, AccessShareLock);
-		scan = table_beginscan_catalog(rel, 0, NULL);
-		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
-		{
-			Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
-			Oid			srctablespace = spaceform->oid;
-			Oid			dsttablespace;
-			char	   *srcpath;
-			char	   *dstpath;
-			struct stat st;
-
-			/* No need to copy global tablespace */
-			if (srctablespace == GLOBALTABLESPACE_OID)
-				continue;
-
-			srcpath = GetDatabasePath(src_dboid, srctablespace);
-
-			if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
-				directory_is_empty(srcpath))
-			{
-				/* Assume we can ignore it */
-				pfree(srcpath);
-				continue;
-			}
-
-			if (srctablespace == src_deftablespace)
-				dsttablespace = dst_deftablespace;
-			else
-				dsttablespace = srctablespace;
-
-			dstpath = GetDatabasePath(dboid, dsttablespace);
-
-			/*
-			 * Copy this subdirectory to the new location
-			 *
-			 * We don't need to copy subdirectories
-			 */
-			copydir(srcpath, dstpath, false);
-
-			/* Record the filesystem change in XLOG */
-			{
-				xl_dbase_create_rec xlrec;
-
-				xlrec.db_id = dboid;
-				xlrec.tablespace_id = dsttablespace;
-				xlrec.src_db_id = src_dboid;
-				xlrec.src_tablespace_id = srctablespace;
-
-				XLogBeginInsert();
-				XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-				(void) XLogInsert(RM_DBASE_ID,
-								  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-			}
-		}
-		table_endscan(scan);
-		table_close(rel, AccessShareLock);
-
-		/*
-		 * We force a checkpoint before committing.  This effectively means
-		 * that committed XLOG_DBASE_CREATE operations will never need to be
-		 * replayed (at least not in ordinary crash recovery; we still have to
-		 * make the XLOG entry for the benefit of PITR operations). This
-		 * avoids two nasty scenarios:
-		 *
-		 * #1: When PITR is off, we don't XLOG the contents of newly created
-		 * indexes; therefore the drop-and-recreate-whole-directory behavior
-		 * of DBASE_CREATE replay would lose such indexes.
-		 *
-		 * #2: Since we have to recopy the source database during DBASE_CREATE
-		 * replay, we run the risk of copying changes in it that were
-		 * committed after the original CREATE DATABASE command but before the
-		 * system crash that led to the replay.  This is at least unexpected
-		 * and at worst could lead to inconsistencies, eg duplicate table
-		 * names.
-		 *
-		 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
-		 *
-		 * In PITR replay, the first of these isn't an issue, and the second
-		 * is only a risk if the CREATE DATABASE and subsequent template
-		 * database change both occur while a base backup is being taken.
-		 * There doesn't seem to be much we can do about that except document
-		 * it as a limitation.
-		 *
-		 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
-		 * we can avoid this.
-		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
-
-		/*
-		 * Close pg_database, but keep lock till commit.
-		 */
-		table_close(pg_database_rel, NoLock);
-
-		/*
-		 * Force synchronous commit, thus minimizing the window between
-		 * creation of the database files and committal of the transaction. If
-		 * we crash before committing, we'll have a DB that's taking up disk
-		 * space but is not in pg_database, which is not good.
-		 */
-		ForceSyncCommit();
+		CopyDatabase(src_dboid, dboid, src_deftablespace, dst_deftablespace);
 	}
 	PG_END_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 								PointerGetDatum(&fparms));
 
+	/*
+	 * Close pg_database, but keep lock till commit.
+	 */
+	table_close(pg_database_rel, NoLock);
+
 	return dboid;
 }
 
@@ -1196,34 +1516,6 @@ movedb(const char *dbname, const char *tblspcname)
 	dst_dbpath = GetDatabasePath(db_id, dst_tblspcoid);
 
 	/*
-	 * Force a checkpoint before proceeding. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, the check for existing
-	 * files in the target directory might fail unnecessarily, not to mention
-	 * that the copy might fail due to source files getting deleted under it.
-	 * On Windows, this also ensures that background procs don't hold any open
-	 * files, which would cause rmdir() to fail.
-	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
-
-	/*
-	 * Now drop all buffers holding data of the target database; they should
-	 * no longer be dirty so DropDatabaseBuffers is safe.
-	 *
-	 * It might seem that we could just let these buffers age out of shared
-	 * buffers naturally, since they should not get referenced anymore.  The
-	 * problem with that is that if the user later moves the database back to
-	 * its original tablespace, any still-surviving buffers would appear to
-	 * contain valid data again --- but they'd be missing any changes made in
-	 * the database while it was in the new tablespace.  In any case, freeing
-	 * buffers that should never be used again seems worth the cycles.
-	 */
-	DropDatabaseBuffers(db_id, src_tblspcoid);
-
-	/*
 	 * Check for existence of files in the target directory, i.e., objects of
 	 * this database that are already in the target tablespace.  We can't
 	 * allow the move in such a case, because we would need to change those
@@ -1268,28 +1560,7 @@ movedb(const char *dbname, const char *tblspcname)
 	PG_ENSURE_ERROR_CLEANUP(movedb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
-		/*
-		 * Copy files from the old tablespace to the new one
-		 */
-		copydir(src_dbpath, dst_dbpath, false);
-
-		/*
-		 * Record the filesystem change in XLOG
-		 */
-		{
-			xl_dbase_create_rec xlrec;
-
-			xlrec.db_id = db_id;
-			xlrec.tablespace_id = dst_tblspcoid;
-			xlrec.src_db_id = db_id;
-			xlrec.src_tablespace_id = src_tblspcoid;
-
-			XLogBeginInsert();
-			XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-			(void) XLogInsert(RM_DBASE_ID,
-							  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-		}
+		CopyDatabase(db_id, db_id, src_tblspcoid, dst_tblspcoid);
 
 		/*
 		 * Update the database's pg_database tuple
@@ -1323,22 +1594,6 @@ movedb(const char *dbname, const char *tblspcname)
 		systable_endscan(sysscan);
 
 		/*
-		 * Force another checkpoint here.  As in CREATE DATABASE, this is to
-		 * ensure that we don't have to replay a committed XLOG_DBASE_CREATE
-		 * operation, which would cause us to lose any unlogged operations
-		 * done in the new DB tablespace before the next checkpoint.
-		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
-
-		/*
-		 * Force synchronous commit, thus minimizing the window between
-		 * copying the database files and committal of the transaction. If we
-		 * crash before committing, we'll leave an orphaned set of files on
-		 * disk, which is not fatal but not good either.
-		 */
-		ForceSyncCommit();
-
-		/*
 		 * Close pg_database, but keep lock till commit.
 		 */
 		table_close(pgdbrel, NoLock);
@@ -1347,6 +1602,27 @@ movedb(const char *dbname, const char *tblspcname)
 								PointerGetDatum(&fparms));
 
 	/*
+	 * Now drop all buffers holding data of the target database for the old
+	 * tablespace OID.  We have already copied all the data to the new
+	 * tablespace, so we no longer require the old buffers.
+	 *
+	 * It might seem that we could just let these buffers age out of shared
+	 * buffers naturally, since they should not get referenced anymore.  The
+	 * problem with that is that if the user later moves the database back to
+	 * its original tablespace, any still-surviving buffers would appear to
+	 * contain valid data again --- but they'd be missing any changes made in
+	 * the database while it was in the new tablespace.  In any case, freeing
+	 * buffers that should never be used again seems worth the cycles.
+	 */
+	DropDatabaseBuffers(db_id, src_tblspcoid);
+
+	/*
+	 * Also, clean out any fsync requests w.r.t. the old tablespace that might
+	 * be pending in md.c.
+	 */
+	ForgetDatabaseSyncRequests(db_id, src_tblspcoid);
+
+	/*
 	 * Commit the transaction so that the pg_database update is committed. If
 	 * we crash while removing files, the database won't be corrupt, we'll
 	 * just leave some orphaned files in the old directory.
@@ -2138,39 +2414,11 @@ dbase_redo(XLogReaderState *record)
 	if (info == XLOG_DBASE_CREATE)
 	{
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
-		char	   *src_path;
-		char	   *dst_path;
-		struct stat st;
-
-		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
-		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		char	   *dbpath;
 
-		/*
-		 * Our theory for replaying a CREATE is to forcibly drop the target
-		 * subdirectory if present, then re-copy the source data. This may be
-		 * more work than needed, but it is simple to implement.
-		 */
-		if (stat(dst_path, &st) == 0 && S_ISDIR(st.st_mode))
-		{
-			if (!rmtree(dst_path, true))
-				/* If this failed, copydir() below is going to error. */
-				ereport(WARNING,
-						(errmsg("some useless files may be left behind in old database directory \"%s\"",
-								dst_path)));
-		}
-
-		/*
-		 * Force dirty buffers out to disk, to ensure source database is
-		 * up-to-date for the copy.
-		 */
-		FlushDatabaseBuffers(xlrec->src_db_id);
-
-		/*
-		 * Copy this subdirectory to the new location
-		 *
-		 * We don't need to copy subdirectories
-		 */
-		copydir(src_path, dst_path, false);
+		dbpath = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		CreateDirAndVersionFile(dbpath, xlrec->db_id, xlrec->tablespace_id,
+								true);
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{
diff --git a/src/include/commands/dbcommands_xlog.h b/src/include/commands/dbcommands_xlog.h
index f5ed762..21dc58e 100644
--- a/src/include/commands/dbcommands_xlog.h
+++ b/src/include/commands/dbcommands_xlog.h
@@ -23,11 +23,8 @@
 
 typedef struct xl_dbase_create_rec
 {
-	/* Records copying of a single subdirectory incl. contents */
 	Oid			db_id;
 	Oid			tablespace_id;
-	Oid			src_db_id;
-	Oid			src_tablespace_id;
 } xl_dbase_create_rec;
 
 typedef struct xl_dbase_drop_rec
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index bb3097f..7a5f6b5 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -460,6 +460,7 @@ CoverPos
 CreateAmStmt
 CreateCastStmt
 CreateConversionStmt
+CreateDBRelInfo
 CreateDomainStmt
 CreateEnumStmt
 CreateEventTrigStmt
-- 
1.8.3.1

#49

Ashutosh Sharma

ashu.coek88@gmail.com

about 4 years ago

In reply to: Dilip Kumar (#48)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

I see that this patch is reducing the database creation time by almost 3-4
times provided that the template database has some user data in it.
However, there are a couple of points to be noted:

1) It makes the crash recovery a bit slower than before if the crash has
occurred after the execution of a create database statement. Moreover, if
the template database size is big, it might even generate a lot of WAL
files which the user needs to be aware of.

2) This will put a lot of load on the first checkpoint that will occur
after the create database statement. I will experiment around this to see
if this has any side effects.

Further, the code changes in the patch look good. I just have a few comments:

+void
+LockRelationId(LockRelId *relid, LOCKMODE lockmode)
+{
+   LOCKTAG     tag;
+   LOCALLOCK  *locallock;
+   LockAcquireResult res;
+
+   SET_LOCKTAG_RELATION(tag, relid->dbId, relid->relId);

Should there be an assertion statement here to ensure that relid->dbId
and relid->relId are valid?

    if (info == XLOG_DBASE_CREATE)
    {
        xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
-       char       *src_path;
-       char       *dst_path;
-       struct stat st;
-
-       src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
-       dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+       char       *dbpath;

-       /*
-        * Our theory for replaying a CREATE is to forcibly drop the target
-        * subdirectory if present, then re-copy the source data. This may be
-        * more work than needed, but it is simple to implement.
-        */
-       if (stat(dst_path, &st) == 0 && S_ISDIR(st.st_mode))
-       {
-           if (!rmtree(dst_path, true))
-               /* If this failed, copydir() below is going to error. */
-               ereport(WARNING,
-                       (errmsg("some useless files may be left behind in old database directory \"%s\"",
-                               dst_path)));
-       }

I think this is a significant change and probably needs some kind of
explanation/comments as to why we are just creating a directory and copying
the version file when replaying the create database operation. Earlier, this
meant replaying the complete create database operation; that doesn't seem to
be the case now.

Have you intentionally skipped the pg_internal.init file from being copied to
the target database?

--
With Regards,
Ashutosh Sharma.

On Thu, Dec 2, 2021 at 7:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:


On Wed, Dec 1, 2021 at 6:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Thanks a lot for testing this. From the error, it seems that some of
the old buffers for the previous tablespace are not dropped after
movedb. Actually, we are calling DropDatabaseBuffers() after copying
to the new tablespace, dropping all the buffers of this database
for the old tablespace. But it seems something is missing; I will
reproduce this and try to fix it by tomorrow. I will also fix the
other review comments you raised in the previous mail.

Okay, I found the issue: we are dropping the database buffers
but not unregistering the existing sync requests for database buffers
in the old tablespace. The attached patch fixes that. I also had to extend
ForgetDatabaseSyncRequests so that we can delete the sync requests of
the database for a particular tablespace, so I added another patch for
that (0006).

I will test the performance scenario suggested by John next week.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#50

Dilip Kumar

dilipbalaut@gmail.com

about 4 years ago

In reply to: Ashutosh Sharma (#49)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Fri, Dec 3, 2021 at 7:38 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

I see that this patch is reducing the database creation time by almost 3-4 times provided that the template database has some user data in it. However, there are a couple of points to be noted:

Thanks a lot for looking into the patches.

1) It makes the crash recovery a bit slower than before if the crash has occurred after the execution of a create database statement. Moreover, if the template database size is big, it might even generate a lot of WAL files which the user needs to be aware of.

Yes, it will, but that is actually the only correct way to do it. Currently
the WAL record just says to copy the source directory to the destination
directory, without noting exactly what we wanted to copy, so we are forced
to checkpoint right after CREATE DATABASE because in crash recovery we
cannot really replay that WAL. Since the record only says "copy the source
to the destination", it is quite possible that the source directory has
different content at REDO time than it had at DO time, which would create
inconsistencies during crash recovery. To avoid this bug the current code
forces a checkpoint. With the patch, if you force a checkpoint yourself,
crash recovery will again be equally fast. So I would not say that we have
made crash recovery slow; rather, we have removed some bugs, and with that
we no longer need to force the checkpoint. Also note that in the current
code, even with the forced checkpoint, the bug is not completely avoided in
all cases; see the comments from the code quoted below [1].
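
The replay hazard described above can be illustrated with a toy model. This is only a sketch under stated assumptions: ToyDb, ToyPageRecord, and the toy_* functions are hypothetical names invented for illustration and are not PostgreSQL APIs. It contrasts a redo routine that re-copies whatever the source holds at REDO time with one that replays page images captured at DO time.

```c
#include <assert.h>
#include <string.h>

#define TOY_NPAGES 2
#define TOY_PAGESZ 16

/* A toy "database": a fixed number of small pages (hypothetical type). */
typedef struct ToyDb
{
    char pages[TOY_NPAGES][TOY_PAGESZ];
} ToyDb;

/* Toy page-image record: carries the page content captured at DO time. */
typedef struct ToyPageRecord
{
    int  blkno;
    char image[TOY_PAGESZ];
} ToyPageRecord;

/* DO time for the page-level scheme: capture one record per source page. */
void
toy_log_pages(const ToyDb *src, ToyPageRecord *wal)
{
    for (int i = 0; i < TOY_NPAGES; i++)
    {
        wal[i].blkno = i;
        memcpy(wal[i].image, src->pages[i], TOY_PAGESZ);
    }
}

/*
 * REDO for a "copy the directory" style record: the record names only the
 * source, so replay copies whatever the source contains right now.
 */
void
toy_redo_copy_directory(const ToyDb *src_now, ToyDb *dst)
{
    *dst = *src_now;
}

/* REDO for page-image records: the result depends only on the records. */
void
toy_redo_page_images(const ToyPageRecord *wal, ToyDb *dst)
{
    for (int i = 0; i < TOY_NPAGES; i++)
    {
        assert(wal[i].blkno >= 0 && wal[i].blkno < TOY_NPAGES);
        memcpy(dst->pages[wal[i].blkno], wal[i].image, TOY_PAGESZ);
    }
}
```

If the template is modified between the call to toy_log_pages() and replay, toy_redo_copy_directory() picks up the modified page while toy_redo_page_images() reproduces the content as it was when the copy was made, which is the difference between the two logging schemes that matters for crash recovery.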

2) This will put a lot of load on the first checkpoint that will occur after the create database statement. I will experiment around this to see if this has any side effects.

But now a checkpoint can happen whenever it is needed, and there is no
need to force one as there was before the patch.

So the major goals of this patch are: 1) correctly WAL-log CREATE
DATABASE, which is a hack in the current system; 2) avoid forced
checkpoints; 3) copy page by page, which will support TDE, because if
the source and destination databases have different encryption we can
re-encrypt each page before copying it to the destination database, which
is not possible in the current system since we copy the directory; 4) give
the new database's pages the latest LSN, which is the correct thing to
do, whereas earlier the pages were copied directly with their old LSNs.
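
To make the TDE point concrete, here is a minimal sketch of a page-level copy hook. It is purely illustrative: the toy_* names and the single-byte keys are assumptions, and XOR stands in for a real cipher only so the example stays self-contained. It is not how any actual TDE implementation encrypts pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLCKSZ 8192         /* same size as the default BLCKSZ */

/* Placeholder "cipher": XOR every byte with a key.  Not real encryption. */
void
toy_xor_page(uint8_t *page, uint8_t key)
{
    assert(page != NULL);
    for (size_t i = 0; i < TOY_BLCKSZ; i++)
        page[i] ^= key;
}

/*
 * Copy one page from source to destination, decrypting with the source
 * database's key and re-encrypting with the destination database's key.
 * A file-level directory copy has no point at which to do this.
 */
void
toy_copy_page_reencrypt(const uint8_t *src, uint8_t *dst,
                        uint8_t src_key, uint8_t dst_key)
{
    assert(src != NULL && dst != NULL);
    memcpy(dst, src, TOY_BLCKSZ);
    toy_xor_page(dst, src_key);     /* undo the source "encryption" */
    toy_xor_page(dst, dst_key);     /* apply the destination "encryption" */
}
```

The design point is only that a block-by-block copy passes every page through code that can transform it, whereas copying the directory moves opaque files.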

Further, the code changes in the patch look good. I just have a few comments:

I will look into the other comments and get back to you, thanks.

[1]:
* In PITR replay, the first of these isn't an issue, and the second
* is only a risk if the CREATE DATABASE and subsequent template
* database change both occur while a base backup is being taken.
* There doesn't seem to be much we can do about that except document
* it as a limitation.
*
* Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
* we can avoid this.
*/
RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#51

Ashutosh Sharma

ashu.coek88@gmail.com

about 4 years ago

In reply to: Dilip Kumar (#50)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Fri, Dec 3, 2021 at 8:28 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Dec 3, 2021 at 7:38 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

I see that this patch is reducing the database creation time by almost 3-4
times provided that the template database has some user data in it.
However, there are a couple of points to be noted:

Thanks a lot for looking into the patches.

1) It makes the crash recovery a bit slower than before if the crash has
occurred after the execution of a create database statement. Moreover, if
the template database size is big, it might even generate a lot of WAL
files which the user needs to be aware of.

Yes, it will, but that is actually the only correct way to do it. Currently
the WAL record just says to copy the source directory to the destination
directory, without noting exactly what we wanted to copy, so we are forced
to checkpoint right after CREATE DATABASE because in crash recovery we
cannot really replay that WAL. Since the record only says "copy the source
to the destination", it is quite possible that the source directory has
different content at REDO time than it had at DO time, which would create
inconsistencies during crash recovery. To avoid this bug the current code
forces a checkpoint. With the patch, if you force a checkpoint yourself,
crash recovery will again be equally fast. So I would not say that we have
made crash recovery slow; rather, we have removed some bugs, and with that
we no longer need to force the checkpoint. Also note that in the current
code, even with the forced checkpoint, the bug is not completely avoided in
all cases; see the comments from the code quoted below [1].

2) This will put a lot of load on the first checkpoint that will occur
after the create database statement. I will experiment around this to see
if this has any side effects.

But now a checkpoint can happen whenever it is needed, and there is no
need to force one as there was before the patch.

So the major goals of this patch are: 1) correctly WAL-log CREATE
DATABASE, which is a hack in the current system; 2) avoid forced
checkpoints; 3) copy page by page, which will support TDE, because if
the source and destination databases have different encryption we can
re-encrypt each page before copying it to the destination database, which
is not possible in the current system since we copy the directory; 4) give
the new database's pages the latest LSN, which is the correct thing to
do, whereas earlier the pages were copied directly with their old LSNs.

OK. Understood, thanks!

--
With Regards,
Ashutosh Sharma.

#52

Ashutosh Sharma

ashu.coek88@gmail.com

about 4 years ago

In reply to: Ashutosh Sharma (#51)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Here are a few more review comments:

1) It seems that we are not freeing the memory allocated for buf.data in
CreateDirAndVersionFile().

+ */
+static void
+CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
+{

2) Do we need to pass dbpath here? I mean, why not reconstruct it from dbid
and tsid?

3) Not sure if this point has already been discussed: will we be able to
recover the data when wal_level is set to minimal? The following
condition would be false with this WAL level.

+   use_wal = XLogIsNeeded() &&
+       (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);

--
With Regards,
Ashutosh Sharma.


#53

Dilip Kumar

dilipbalaut@gmail.com

about 4 years ago

In reply to: Ashutosh Sharma (#52)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Mon, Dec 6, 2021 at 9:17 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Here are a few more review comments:

Thanks for reviewing it.

1) It seems that we are not freeing the memory allocated for buf.data in CreateDirAndVersionFile().

Yeah, this was a problem in v6, but I have fixed it in v7; can you check that?

+ */
+static void
+CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
+{
2) Do we need to pass dbpath here? I mean, why not reconstruct it from dbid and tsid?

Yeah, we can do that, but I thought computing dbpath has some cost, and
since the caller already has it, why not pass it?

3) Not sure if this point has already been discussed: will we be able to recover the data when wal_level is set to minimal? The following condition would be false with this WAL level.
+   use_wal = XLogIsNeeded() &&
+       (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);

Since we are creating a new relfilenode, this is fine; refer to "Skipping
WAL for New RelFileNode" in src/backend/access/transam/README.
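
The condition under discussion can be read as a small truth table. The following is a self-contained sketch, not the patch's code: the Toy* enums and toy_* functions are stand-ins defined here so the example compiles outside the PostgreSQL tree, with the persistence characters and the "wal_level at least replica" test chosen to mirror the PostgreSQL definitions as I understand them.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the pg_class relpersistence characters. */
#define TOY_RELPERSISTENCE_PERMANENT 'p'
#define TOY_RELPERSISTENCE_UNLOGGED  'u'
#define TOY_RELPERSISTENCE_TEMP      't'

typedef enum ToyWalLevel
{
    TOY_WAL_LEVEL_MINIMAL = 0,
    TOY_WAL_LEVEL_REPLICA,
    TOY_WAL_LEVEL_LOGICAL
} ToyWalLevel;

typedef enum ToyForkNumber
{
    TOY_MAIN_FORKNUM = 0,
    TOY_FSM_FORKNUM,
    TOY_VISIBILITYMAP_FORKNUM,
    TOY_INIT_FORKNUM
} ToyForkNumber;

/* Stand-in for XLogIsNeeded(): true for wal_level of replica or higher. */
bool
toy_xlog_is_needed(ToyWalLevel wal_level)
{
    return wal_level >= TOY_WAL_LEVEL_REPLICA;
}

/* Same shape as the use_wal computation quoted from the patch. */
bool
toy_copy_uses_wal(ToyWalLevel wal_level, char relpersistence,
                  ToyForkNumber forknum)
{
    bool copying_initfork;

    assert(relpersistence == TOY_RELPERSISTENCE_PERMANENT ||
           relpersistence == TOY_RELPERSISTENCE_UNLOGGED ||
           relpersistence == TOY_RELPERSISTENCE_TEMP);

    copying_initfork = relpersistence == TOY_RELPERSISTENCE_UNLOGGED &&
        forknum == TOY_INIT_FORKNUM;

    return toy_xlog_is_needed(wal_level) &&
        (relpersistence == TOY_RELPERSISTENCE_PERMANENT || copying_initfork);
}
```

With TOY_WAL_LEVEL_MINIMAL the function returns false for every relation, which is the case the question raises; the answer above is that for a newly created relfilenode the README section cited covers durability without per-page WAL.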

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#54

Ashutosh Sharma

ashu.coek88@gmail.com

about 4 years ago

In reply to: Dilip Kumar (#53)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Thank you, Dilip, for the quick response. I am okay with the changes done in
the v7 patch.

One last point: if we try to clone a huge database, as expected CREATE
DATABASE emits a lot of WAL, causing a lot of intermediate checkpoints,
which seem to affect the performance slightly.

--
With Regards,
Ashutosh Sharma.


#55

Robert Haas

robertmhaas@gmail.com

about 4 years ago

In reply to: Ashutosh Sharma (#54)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Mon, Dec 6, 2021 at 9:23 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

One last point - If we try to clone a huge database, as expected CREATE DATABASE emits a lot of WALs, causing a lot of intermediate checkpoints which seems to be affecting the performance slightly.

Yes, I think this needs to be characterized better. If you have a big
shared buffers setting and a lot of those buffers are dirty and the
template database is small, all of which is fairly normal, then this
new approach should be much quicker. On the other hand, what if the
situation is reversed? Perhaps you have a small shared buffers and not
much of it is dirty and the template database is gigantic. Then maybe
this new approach will be slower. But right now I think we don't know
where the crossover point is, and I think we should try to figure that
out.

So for example, imagine tests with 1GB of shared_buffers, 8GB, and
64GB. And template databases with sizes of whatever the default is,
1GB, 10GB, 100GB. Repeatedly make 75% of the pages dirty and then
create a new database from one of the templates. And then just measure
the performance. Maybe for large databases this approach is just
really the pits -- and if your max_wal_size is too small, it
definitely will be. But, I don't know, maybe with reasonable settings
it's not that bad. Writing everything to disk twice - once to WAL and
once to the target directory - has to be more expensive than doing it
once. But on the other hand, it's all sequential I/O and the data
pages don't need to be fsync'd, so perhaps the overhead is relatively
mild. I don't know.
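One way to frame the crossover question before running those tests is a crude write-volume model. The sketch below is only that. It assumes the checkpoint-based copy writes the dirty part of shared_buffers plus one copy of the template, and that the WAL-logged copy writes every template page twice (WAL plus target directory). It ignores the sequential-versus-random I/O and fsync differences mentioned above, so real measurements may land elsewhere. All function names are hypothetical.

```c
#include <assert.h>

/*
 * Bytes written (in GB) by the checkpoint-based copy under this model:
 * flush the dirty share of shared_buffers, then copy the template once.
 */
double
checkpoint_copy_gb(double shared_buffers_gb, double dirty_fraction,
                   double template_gb)
{
    assert(dirty_fraction >= 0.0 && dirty_fraction <= 1.0);
    return shared_buffers_gb * dirty_fraction + template_gb;
}

/*
 * Bytes written (in GB) by the WAL-logged copy under this model:
 * every page goes to WAL and to the target directory.
 */
double
wal_logged_copy_gb(double template_gb)
{
    return 2.0 * template_gb;
}

/* Returns 1 when the WAL-logged copy writes less under this model. */
int
wal_logged_writes_less(double shared_buffers_gb, double dirty_fraction,
                       double template_gb)
{
    return wal_logged_copy_gb(template_gb) <
        checkpoint_copy_gb(shared_buffers_gb, dirty_fraction, template_gb);
}
```

Under these assumptions the crossover sits where the template size equals the amount of dirty data in shared_buffers. For 8GB of shared_buffers at 75% dirty, that would be a 6GB template.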

--
Robert Haas
EDB: http://www.enterprisedb.com

#56

Ashutosh Sharma

ashu.coek88@gmail.com

about 4 years ago

In reply to: Robert Haas (#55)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Thanks Robert for sharing your thoughts.

On Mon, Dec 6, 2021 at 11:16 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Dec 6, 2021 at 9:23 AM Ashutosh Sharma <ashu.coek88@gmail.com>
wrote:

One last point - If we try to clone a huge database, as expected CREATE

DATABASE emits a lot of WALs, causing a lot of intermediate checkpoints
which seems to be affecting the performance slightly.

Yes, I think this needs to be characterized better. If you have a big
shared buffers setting and a lot of those buffers are dirty and the
template database is small, all of which is fairly normal, then this
new approach should be much quicker. On the other hand, what if the
situation is reversed? Perhaps you have a small shared buffers and not
much of it is dirty and the template database is gigantic. Then maybe
this new approach will be slower. But right now I think we don't know
where the crossover point is, and I think we should try to figure that
out.

Yes I think so too.

So for example, imagine tests with 1GB of shared_buffers, 8GB, and
64GB. And template databases with sizes of whatever the default is,
1GB, 10GB, 100GB. Repeatedly make 75% of the pages dirty and then
create a new database from one of the templates. And then just measure
the performance. Maybe for large databases this approach is just
really the pits -- and if your max_wal_size is too small, it
definitely will be. But, I don't know, maybe with reasonable settings
it's not that bad. Writing everything to disk twice - once to WAL and
once to the target directory - has to be more expensive than doing it
once. But on the other hand, it's all sequential I/O and the data
pages don't need to be fsync'd, so perhaps the overhead is relatively
mild. I don't know.

So far, I haven't found much performance overhead with a few GB of data in
the template database. There is just a little with the default settings;
perhaps a higher max_wal_size would reduce it.

--
With Regards,
Ashutosh Sharma.

#57

Dilip Kumar

dilipbalaut@gmail.com

about 4 years ago

In reply to: Ashutosh Sharma (#54)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Mon, Dec 6, 2021 at 7:53 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Thank you, Dilip for the quick response. I am okay with the changes done in the v7 patch.

One last point - If we try to clone a huge database, as expected CREATE DATABASE emits a lot of WALs, causing a lot of intermediate checkpoints which seems to be affecting the performance slightly.

Yeah, that is a valid point. Instead of a single WAL record for
createdb, we will now generate WAL for each page in the database, so I
agree that if max_wal_size is not large enough for that WAL, we
might have to pay the cost of multiple checkpoints. However, with the
current mechanism the checkpoint is forced and there is no way to
avoid it, whereas with the new approach users can set a large enough
max_wal_size and avoid it. In other words, the checkpoint is now
driven by the amount of WAL generated, as for any other operation,
e.g. ALTER TABLE SET TABLESPACE. That is more in sync with the rest
of the system; without the patch, it was a special-purpose forced
checkpoint only for createdb.
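As a rough back-of-the-envelope aid for sizing max_wal_size, the sketch below estimates the WAL volume of a page-by-page copy. It assumes one full page image per 8kB block plus a fixed per-record overhead. `per_record_overhead` is a hypothetical parameter, and dividing by max_wal_size is only a proxy, since max_wal_size is a soft limit and checkpoints are requested before it is reached.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE 8192         /* PostgreSQL's default block size */

/* Number of pages in a database of the given size, rounding up. */
uint64_t
pages_in(uint64_t db_bytes)
{
    return (db_bytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
}

/* Rough WAL volume: one page image plus overhead for every page. */
uint64_t
estimated_wal_bytes(uint64_t db_bytes, uint64_t per_record_overhead)
{
    return pages_in(db_bytes) * (BLOCK_SIZE + per_record_overhead);
}

/*
 * How many times the WAL volume fills max_wal_size. This is a crude
 * proxy for the number of WAL-triggered checkpoints, not a prediction.
 */
uint64_t
estimated_wal_checkpoints(uint64_t wal_bytes, uint64_t max_wal_size_bytes)
{
    assert(max_wal_size_bytes > 0);
    return wal_bytes / max_wal_size_bytes;
}
```

For a 10GB template and the default max_wal_size of 1GB, this proxy gives about ten WAL-triggered checkpoints; with max_wal_size raised above the estimated WAL volume it gives zero, which is the tuning option described above.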

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#58

Neha Sharma

neha.sharma@enterprisedb.com

about 4 years ago

In reply to: Dilip Kumar (#57)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hello Dilip,

While testing the v7 patches, I am observing a crash with the below test
case.

Test case:
create tablespace tab location '<dir_path>/test_dir';
create tablespace tab1 location '<dir_path>/test_dir1';
create database test tablespace tab;
\c test
create table t( a int PRIMARY KEY,b text);
CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS 'select
array_agg(md5(g::text))::text from generate_series(1, 256) g';
insert into t values (generate_series(1,2000000), large_val());
alter table t set tablespace tab1 ;
\c postgres
create database test1 template test;
alter database test set tablespace pg_default;
alter database test set tablespace tab;
\c test1
alter table t set tablespace tab;

Logfile says:
2021-12-08 23:31:58.855 +04 [134252] PANIC: could not fsync file
"base/16386/4152": No such file or directory
2021-12-08 23:31:59.398 +04 [134251] LOG: checkpointer process (PID
134252) was terminated by signal 6: Aborted

Thanks.
--
Regards,
Neha Sharma


#59

Greg Nancarrow

gregn4422@gmail.com

about 4 years ago

In reply to: Neha Sharma (#58)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Dec 9, 2021 at 6:57 AM Neha Sharma <neha.sharma@enterprisedb.com> wrote:

While testing the v7 patches, I am observing a crash with the below test case.

Test case:
create tablespace tab location '<dir_path>/test_dir';
create tablespace tab1 location '<dir_path>/test_dir1';
create database test tablespace tab;
\c test
create table t( a int PRIMARY KEY,b text);
CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS 'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
insert into t values (generate_series(1,2000000), large_val());
alter table t set tablespace tab1 ;
\c postgres
create database test1 template test;
alter database test set tablespace pg_default;
alter database test set tablespace tab;
\c test1
alter table t set tablespace tab;

Logfile says:
2021-12-08 23:31:58.855 +04 [134252] PANIC: could not fsync file "base/16386/4152": No such file or directory
2021-12-08 23:31:59.398 +04 [134251] LOG: checkpointer process (PID 134252) was terminated by signal 6: Aborted

I tried to reproduce the issue using your test scenario, but I needed
to reduce the amount of inserted data (so reduced 2000000 to 20000)
due to disk space.
I then consistently get an error like the following:

postgres=# alter database test set tablespace pg_default;
ERROR: could not create file
"pg_tblspc/16385/PG_15_202111301/16386/36395": File exists

(this only happens when the patch is used)

Regards,
Greg Nancarrow
Fujitsu Australia

#60

Neha Sharma

neha.sharma@enterprisedb.com

about 4 years ago

In reply to: Greg Nancarrow (#59)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Dec 9, 2021 at 4:26 AM Greg Nancarrow <gregn4422@gmail.com> wrote:

On Thu, Dec 9, 2021 at 6:57 AM Neha Sharma <neha.sharma@enterprisedb.com>
wrote:

While testing the v7 patches, I am observing a crash with the below test

case.

Test case:
create tablespace tab location '<dir_path>/test_dir';
create tablespace tab1 location '<dir_path>/test_dir1';
create database test tablespace tab;
\c test
create table t( a int PRIMARY KEY,b text);
CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS

'select array_agg(md5(g::text))::text from generate_series(1, 256) g';

insert into t values (generate_series(1,2000000), large_val());
alter table t set tablespace tab1 ;
\c postgres
create database test1 template test;
alter database test set tablespace pg_default;
alter database test set tablespace tab;
\c test1
alter table t set tablespace tab;

Logfile says:
2021-12-08 23:31:58.855 +04 [134252] PANIC: could not fsync file

"base/16386/4152": No such file or directory

2021-12-08 23:31:59.398 +04 [134251] LOG: checkpointer process (PID

134252) was terminated by signal 6: Aborted

I tried to reproduce the issue using your test scenario, but I needed
to reduce the amount of inserted data (so reduced 2000000 to 20000)
due to disk space.
I then consistently get an error like the following:

postgres=# alter database test set tablespace pg_default;
ERROR: could not create file
"pg_tblspc/16385/PG_15_202111301/16386/36395": File exists

(this only happens when the patch is used)

Yes, I was also getting this, and continuing further we get a crash when
we alter the table in database test1.
Below is the output of the test at my end.

postgres=# create tablespace tab1 location
'/home/edb/PGsources/postgresql/inst/bin/rep_test1';
CREATE TABLESPACE
postgres=# create tablespace tab location
'/home/edb/PGsources/postgresql/inst/bin/rep_test';
CREATE TABLESPACE
postgres=# create database test tablespace tab;
CREATE DATABASE
postgres=# \c test
You are now connected to database "test" as user "edb".
test=# create table t( a int PRIMARY KEY,b text);
CREATE TABLE
test=# CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS
'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
CREATE FUNCTION
test=# insert into t values (generate_series(1,2000000), large_val());
INSERT 0 2000000
test=# alter table t set tablespace tab1 ;
ALTER TABLE
test=# \c postgres
You are now connected to database "postgres" as user "edb".
postgres=# create database test1 template test;
CREATE DATABASE
postgres=# alter database test set tablespace pg_default;
ERROR: could not create file
"pg_tblspc/16384/PG_15_202111301/16386/2016395": File exists
postgres=# alter database test set tablespace tab;
ALTER DATABASE
postgres=# \c test1
You are now connected to database "test1" as user "edb".
test1=# alter table t set tablespace tab;
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!?>


#61

Ashutosh Sharma

ashu.coek88@gmail.com

about 4 years ago

In reply to: Neha Sharma (#60)

1 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hi,

The issue here is that, when doing ALTER DATABASE, we are trying to create
a table that already exists in a non-default tablespace. I think this should
be skipped; otherwise we hit an error like the one shown below:

ashu@postgres=# alter database test set tablespace pg_default;
ERROR: 58P02: could not create file
"pg_tblspc/16385/PG_15_202111301/16386/16390": File exists

I have taken the above from Neha's test-case.

The attached patch fixes this. I am passing a new boolean flag named *movedb*
to CopyDatabase() so that it can skip the creation of tables that exist in a
non-default tablespace when doing ALTER DATABASE. Alternatively, we could
rename the flag from movedb to createdb and pass its value accordingly from
movedb() or createdb(). Either way looks fine to me.
Kindly check the attached patch for the changes.

Dilip, Could you please check the attached patch and let me know if it
looks fine or not?

Neha, can you please re-run the test-cases with the attached patch.

Thanks,

--
With Regards,
Ashutosh Sharma.


Attachments:

skip-table-creation-for-alter-database.patch (application/octet-stream)

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index d3c3c7aba0..9e96db06a3 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -110,7 +110,7 @@ static void CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid,
 static List *GetDatabaseRelationList(Oid srctbid, Oid srcdbid, char *srcpath);
 static void RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
 									ForkNumber forkNum, char relpersistence);
-static void CopyDatabase(Oid src_dboid, Oid dboid, Oid src_tsid, Oid dst_tsid);
+static void CopyDatabase(Oid src_dboid, Oid dboid, Oid src_tsid, Oid dst_tsid, bool movedb);
 
 /*
  * CreateDirAndVersionFile - Create database directory and write out the
@@ -448,7 +448,7 @@ RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
  * to the target database, block by block and WAL log all the operations.
  */
 static void
-CopyDatabase(Oid src_dboid, Oid dst_dboid, Oid src_tsid, Oid dst_tsid)
+CopyDatabase(Oid src_dboid, Oid dst_dboid, Oid src_tsid, Oid dst_tsid, bool movedb)
 {
 	char	   *srcpath;
 	char	   *dstpath;
@@ -500,7 +500,12 @@ CopyDatabase(Oid src_dboid, Oid dst_dboid, Oid src_tsid, Oid dst_tsid)
 		 * database.
 		 */
 		if (srcrnode.spcNode != src_tsid)
-			dstrnode.spcNode = srcrnode.spcNode;
+		{
+			if (!movedb)
+				dstrnode.spcNode = srcrnode.spcNode;
+			else
+				continue;
+		}
 		else
 			dstrnode.spcNode = dst_tsid;
 
@@ -1006,7 +1011,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	PG_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
-		CopyDatabase(src_dboid, dboid, src_deftablespace, dst_deftablespace);
+		CopyDatabase(src_dboid, dboid, src_deftablespace, dst_deftablespace, false);
 	}
 	PG_END_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 								PointerGetDatum(&fparms));
@@ -1560,7 +1565,7 @@ movedb(const char *dbname, const char *tblspcname)
 	PG_ENSURE_ERROR_CLEANUP(movedb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
-		CopyDatabase(db_id, db_id, src_tblspcoid, dst_tblspcoid);
+		CopyDatabase(db_id, db_id, src_tblspcoid, dst_tblspcoid, true);
 
 		/*
 		 * Update the database's pg_database tuple

#62

Dilip Kumar

dilipbalaut@gmail.com

about 4 years ago

In reply to: Ashutosh Sharma (#61)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Dec 9, 2021 at 12:42 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Hi,

The issue here is that we are trying to create a table that exists inside a non-default tablespace when doing ALTER DATABASE. I think this should be skipped otherwise we will come across the error like shown below:

ashu@postgres=# alter database test set tablespace pg_default;
ERROR: 58P02: could not create file "pg_tblspc/16385/PG_15_202111301/16386/16390": File exists

I have taken the above from Neha's test-case.

--

Attached patch fixes this. I am passing a new boolean flag named *movedb* to CopyDatabase() so that it could skip the creation of tables existing in non-default tablespace when doing alter database. Alternatively, we can also rename the boolean flag movedb to createdb and pass its value accordingly from movedb() or createdb(). Either way looks fine to me. Kindly check the attached patch for the changes.

Dilip, Could you please check the attached patch and let me know if it looks fine or not?

Neha, can you please re-run the test-cases with the attached patch.

Thanks Ashutosh. Yeah, I have observed the same. Earlier we were
directly copying the whole directory, so this was not an issue; now, if
some tables of the database are already in the destination tablespace,
we should skip them while copying. I will review your patch and
merge it into the main patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#63

Neha Sharma

neha.sharma@enterprisedb.com

about 4 years ago

In reply to: Ashutosh Sharma (#61)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Dec 9, 2021 at 11:12 AM Ashutosh Sharma <ashu.coek88@gmail.com>
wrote:

Hi,

The issue here is that we are trying to create a table that exists inside
a non-default tablespace when doing ALTER DATABASE. I think this should be
skipped otherwise we will come across the error like shown below:

ashu@postgres=# alter database test set tablespace pg_default;
ERROR: 58P02: could not create file
"pg_tblspc/16385/PG_15_202111301/16386/16390": File exists

Thanks Ashutosh for the patch, the mentioned issue has been resolved with
the patch.

But I am still able to reproduce the crash consistently on top of this
patch + v7 patches, with a slightly modified test case.

create tablespace tab1 location '<dir_path>/test1';
create tablespace tab location '<dir_path>/test';
create database test tablespace tab;
\c test
create table t( a int PRIMARY KEY,b text);
CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS 'select
array_agg(md5(g::text))::text from generate_series(1, 256) g';
insert into t values (generate_series(1,100000), large_val());
alter table t set tablespace tab1 ;
\c postgres
create database test1 template test;
\c test1
alter table t set tablespace tab;
\c postgres
alter database test1 set tablespace tab1;

--Cancel the below command after few seconds
alter database test1 set tablespace pg_default;

\c test1
alter table t set tablespace tab1;

Logfile Snippet:
2021-12-09 17:49:18.110 +04 [18151] PANIC: could not fsync file
"base/116398/116400": No such file or directory
2021-12-09 17:49:19.105 +04 [18150] LOG: checkpointer process (PID 18151)
was terminated by signal 6: Aborted
2021-12-09 17:49:19.105 +04 [18150] LOG: terminating any other active
server processes

#64

Dilip Kumar

dilipbalaut@gmail.com

about 4 years ago

In reply to: Neha Sharma (#63)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Dec 9, 2021 at 7:23 PM Neha Sharma <neha.sharma@enterprisedb.com> wrote:

On Thu, Dec 9, 2021 at 11:12 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

\c postgres
alter database test1 set tablespace tab1;

--Cancel the below command after few seconds
alter database test1 set tablespace pg_default;

\c test1
alter table t set tablespace tab1;

Logfile Snippet:
2021-12-09 17:49:18.110 +04 [18151] PANIC: could not fsync file "base/116398/116400": No such file or directory
2021-12-09 17:49:19.105 +04 [18150] LOG: checkpointer process (PID 18151) was terminated by signal 6: Aborted
2021-12-09 17:49:19.105 +04 [18150] LOG: terminating any other active server processes

Yeah, it seems like the fsync requests produced while copying database
objects to the new tablespace are not unregistered. This seems like a
different issue than previously raised. I will work on this next
week, thanks for testing.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#65

Ashutosh Sharma

ashu.coek88@gmail.com

about 4 years ago

In reply to: Neha Sharma (#63)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Dec 9, 2021 at 7:23 PM Neha Sharma <neha.sharma@enterprisedb.com>
wrote:

On Thu, Dec 9, 2021 at 11:12 AM Ashutosh Sharma <ashu.coek88@gmail.com>
wrote:

Hi,

The issue here is that we are trying to create a table that exists inside
a non-default tablespace when doing ALTER DATABASE. I think this should be
skipped otherwise we will come across the error like shown below:

ashu@postgres=# alter database test set tablespace pg_default;
ERROR: 58P02: could not create file
"pg_tblspc/16385/PG_15_202111301/16386/16390": File exists

Thanks Ashutosh for the patch, the mentioned issue has been resolved with
the patch.

But I am still able to reproduce the crash consistently on top of this
patch + v7 patches,just the test case has been modified.

create tablespace tab1 location '<dir_path>/test1';
create tablespace tab location '<dir_path>/test';
create database test tablespace tab;
\c test
create table t( a int PRIMARY KEY,b text);
CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS
'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
insert into t values (generate_series(1,100000), large_val());
alter table t set tablespace tab1 ;
\c postgres
create database test1 template test;
\c test1
alter table t set tablespace tab;
\c postgres
alter database test1 set tablespace tab1;

--Cancel the below command after few seconds
alter database test1 set tablespace pg_default;

\c test1
alter table t set tablespace tab1;

Logfile Snippet:
2021-12-09 17:49:18.110 +04 [18151] PANIC: could not fsync file
"base/116398/116400": No such file or directory
2021-12-09 17:49:19.105 +04 [18150] LOG: checkpointer process (PID 18151)
was terminated by signal 6: Aborted
2021-12-09 17:49:19.105 +04 [18150] LOG: terminating any other active
server processes

This is different from the issue you raised earlier. As Dilip said, we need
to unregister sync requests for files that got successfully copied to the
target database, but the overall alter database statement failed. We are
doing this when the database is created successfully, but not when it fails.
Probably doing the same inside the cleanup function
movedb_failure_callback() should fix the problem.
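The failure mode can be illustrated with a toy model of a pending-fsync queue. This is not PostgreSQL's sync machinery, and every name in it is hypothetical. It only shows why removing copied files without forgetting their queued fsync requests leaves the checkpointer with a request for a file that no longer exists. It also assumes the failure path removes the copied files, which the "No such file or directory" message suggests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 16

typedef struct
{
    unsigned    tsid;           /* tablespace the file lives in */
    unsigned    dbid;           /* database the file belongs to */
    unsigned    relfilenode;    /* file name within the database dir */
    bool        file_exists;    /* false once the file is unlinked */
    bool        cancelled;      /* true once the request is forgotten */
} PendingSync;

typedef struct
{
    PendingSync entries[MAX_PENDING];
    size_t      count;
} SyncQueue;

/* Called for each file written while copying the database. */
void
register_sync(SyncQueue *q, unsigned tsid, unsigned dbid, unsigned relfilenode)
{
    assert(q->count < MAX_PENDING);
    q->entries[q->count++] = (PendingSync) {tsid, dbid, relfilenode, true, false};
}

/* The cleanup step discussed above: forget requests for the target dir. */
void
forget_syncs(SyncQueue *q, unsigned tsid, unsigned dbid)
{
    for (size_t i = 0; i < q->count; i++)
        if (q->entries[i].tsid == tsid && q->entries[i].dbid == dbid)
            q->entries[i].cancelled = true;
}

/* Assumed failure-path behaviour: the copied files are removed. */
void
unlink_files(SyncQueue *q, unsigned tsid, unsigned dbid)
{
    for (size_t i = 0; i < q->count; i++)
        if (q->entries[i].tsid == tsid && q->entries[i].dbid == dbid)
            q->entries[i].file_exists = false;
}

/* Toy checkpointer: returns false where the real one would PANIC. */
bool
checkpoint_syncs_ok(const SyncQueue *q)
{
    for (size_t i = 0; i < q->count; i++)
        if (!q->entries[i].cancelled && !q->entries[i].file_exists)
            return false;
    return true;
}
```

In this model, calling `unlink_files()` alone makes `checkpoint_syncs_ok()` return false, while calling `forget_syncs()` first keeps it true. That ordering is the behaviour the failure callback needs to provide.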

--
With Regards,
Ashutosh Sharma.

#66

Dilip Kumar

dilipbalaut@gmail.com

about 4 years ago

In reply to: Ashutosh Sharma (#65)

7 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Fri, Dec 10, 2021 at 7:39 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Logfile Snippet:
2021-12-09 17:49:18.110 +04 [18151] PANIC: could not fsync file "base/116398/116400": No such file or directory
2021-12-09 17:49:19.105 +04 [18150] LOG: checkpointer process (PID 18151) was terminated by signal 6: Aborted
2021-12-09 17:49:19.105 +04 [18150] LOG: terminating any other active server processes

This is different from the issue you raised earlier. As Dilip said, we need to unregister sync requests for files that got successfully copied to the target database, but the overall alter database statement failed. We are doing this when the database is created successfully, but not when it fails.
Probably doing the same inside the cleanup function movedb_failure_callback() should fix the problem.

Correct, I have done this cleanup. Apart from this, we now drop the
fsync requests in the CREATE DATABASE failure case as well, and it is
also necessary to drop the buffers in the error case of createdb as well
as movedb. I have also fixed the other issue for which you gave the
patch, a bit differently: in the movedb case the source and destination
dboid are the same, so we don't need an additional parameter. I also
readjusted the conditions to avoid the nested if.
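A minimal sketch of what such a flag-free decision could look like follows. It is not the v8 code. It relies only on the fact, visible in the earlier patch, that movedb() passes the same database OID as both source and destination. The function name and the `Oid` typedef are stand-ins for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int Oid;       /* stand-in for PostgreSQL's Oid */

/*
 * Decide where a relation goes when copying a database.
 * Returns true and sets *dst_spc when the relation must be copied,
 * or false when it should be skipped.
 */
bool
choose_target_tablespace(Oid src_dboid, Oid dst_dboid,
                         Oid src_tsid, Oid dst_tsid,
                         Oid rel_spc, Oid *dst_spc)
{
    assert(dst_spc != NULL);

    /* Relation in the source default tablespace: goes to the new default. */
    if (rel_spc == src_tsid)
    {
        *dst_spc = dst_tsid;
        return true;
    }

    /*
     * Same database OID on both sides means ALTER DATABASE SET TABLESPACE:
     * relations outside the default tablespace stay where they are.
     */
    if (src_dboid == dst_dboid)
        return false;

    /* CREATE DATABASE: the copy keeps the relation's own tablespace. */
    *dst_spc = rel_spc;
    return true;
}
```

The three early returns replace the nested if from the earlier patch and give the same outcomes as passing `movedb = (src_dboid == dst_dboid)`.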

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v8-0001-Refactor-relmap-load-and-relmap-write-functions.patch (text/x-patch; charset=US-ASCII)

From eb27d159bc2aa41011b6352a514c7981a19a64dc Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Wed, 1 Sep 2021 14:06:29 +0530
Subject: [PATCH v8 1/7] Refactor relmap load and relmap write functions

Currently, write_relmap_file and load_relmap_file are tightly
coupled with shared_map and local_map.  As part of the higher
level patch set we need relmap read/write interfaces that are
not dependent upon shared_map and local_map, and we should be
able to pass map memory as an external parameter instead.
---
 src/backend/utils/cache/relmapper.c | 163 ++++++++++++++++++++++--------------
 1 file changed, 99 insertions(+), 64 deletions(-)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index a6e38ad..bb39632 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -136,6 +136,12 @@ static void apply_map_update(RelMapFile *map, Oid relationId, Oid fileNode,
 							 bool add_okay);
 static void merge_map_updates(RelMapFile *map, const RelMapFile *updates,
 							  bool add_okay);
+static void read_relmap_file(char *mapfilename, RelMapFile *map,
+							 bool lock_held);
+static void write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
+									   bool write_wal, bool send_sinval,
+									   bool preserve_files, Oid dbid, Oid tsid,
+									   const char *dbpath);
 static void load_relmap_file(bool shared, bool lock_held);
 static void write_relmap_file(bool shared, RelMapFile *newmap,
 							  bool write_wal, bool send_sinval, bool preserve_files,
@@ -687,36 +693,19 @@ RestoreRelationMap(char *startAddress)
 }
 
 /*
- * load_relmap_file -- load data from the shared or local map file
+ * read_relmap_file -- read data from given mapfilename file.
  *
  * Because the map file is essential for access to core system catalogs,
  * failure to read it is a fatal error.
- *
- * Note that the local case requires DatabasePath to be set up.
  */
 static void
-load_relmap_file(bool shared, bool lock_held)
+read_relmap_file(char *mapfilename, RelMapFile *map, bool lock_held)
 {
-	RelMapFile *map;
-	char		mapfilename[MAXPGPATH];
 	pg_crc32c	crc;
 	int			fd;
 	int			r;
 
-	if (shared)
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
-				 RELMAPPER_FILENAME);
-		map = &shared_map;
-	}
-	else
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 DatabasePath, RELMAPPER_FILENAME);
-		map = &local_map;
-	}
-
-	/* Read data ... */
+	/* Open the relmap file for reading. */
 	fd = OpenTransientFile(mapfilename, O_RDONLY | PG_BINARY);
 	if (fd < 0)
 		ereport(FATAL,
@@ -779,62 +768,50 @@ load_relmap_file(bool shared, bool lock_held)
 }
 
 /*
- * Write out a new shared or local map file with the given contents.
- *
- * The magic number and CRC are automatically updated in *newmap.  On
- * success, we copy the data to the appropriate permanent static variable.
- *
- * If write_wal is true then an appropriate WAL message is emitted.
- * (It will be false for bootstrap and WAL replay cases.)
- *
- * If send_sinval is true then a SI invalidation message is sent.
- * (This should be true except in bootstrap case.)
- *
- * If preserve_files is true then the storage manager is warned not to
- * delete the files listed in the map.
+ * load_relmap_file -- load data from the shared or local map file
  *
- * Because this may be called during WAL replay when MyDatabaseId,
- * DatabasePath, etc aren't valid, we require the caller to pass in suitable
- * values.  The caller is also responsible for being sure no concurrent
- * map update could be happening.
+ * Note that the local case requires DatabasePath to be set up.
  */
 static void
-write_relmap_file(bool shared, RelMapFile *newmap,
-				  bool write_wal, bool send_sinval, bool preserve_files,
-				  Oid dbid, Oid tsid, const char *dbpath)
+load_relmap_file(bool shared, bool lock_held)
 {
-	int			fd;
-	RelMapFile *realmap;
+	RelMapFile *map;
 	char		mapfilename[MAXPGPATH];
 
-	/*
-	 * Fill in the overhead fields and update CRC.
-	 */
-	newmap->magic = RELMAPPER_FILEMAGIC;
-	if (newmap->num_mappings < 0 || newmap->num_mappings > MAX_MAPPINGS)
-		elog(ERROR, "attempt to write bogus relation mapping");
-
-	INIT_CRC32C(newmap->crc);
-	COMP_CRC32C(newmap->crc, (char *) newmap, offsetof(RelMapFile, crc));
-	FIN_CRC32C(newmap->crc);
-
-	/*
-	 * Open the target file.  We prefer to do this before entering the
-	 * critical section, so that an open() failure need not force PANIC.
-	 */
 	if (shared)
 	{
 		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
 				 RELMAPPER_FILENAME);
-		realmap = &shared_map;
+		map = &shared_map;
 	}
 	else
 	{
 		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 dbpath, RELMAPPER_FILENAME);
-		realmap = &local_map;
+				 DatabasePath, RELMAPPER_FILENAME);
+		map = &local_map;
 	}
 
+	/* Read data ... */
+	read_relmap_file(mapfilename, map, lock_held);
+}
+
+/*
+ * Helper function for write_relmap_file, Read comments atop write_relmap_file
+ * for more details.  The CRC should be computed by the caller and stored in
+ * the newmap.
+ */
+static void
+write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
+						   bool write_wal, bool send_sinval,
+						   bool preserve_files, Oid dbid, Oid tsid,
+						   const char *dbpath)
+{
+	int			fd;
+
+	/*
+	 * Open the target file.  We prefer to do this before entering the
+	 * critical section, so that an open() failure need not force PANIC.
+	 */
 	fd = OpenTransientFile(mapfilename, O_WRONLY | O_CREAT | PG_BINARY);
 	if (fd < 0)
 		ereport(ERROR,
@@ -934,6 +911,68 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		}
 	}
 
+	/* Critical section done */
+	if (write_wal)
+		END_CRIT_SECTION();
+}
+
+/*
+ * Write out a new shared or local map file with the given contents.
+ *
+ * The magic number and CRC are automatically updated in *newmap.  On
+ * success, we copy the data to the appropriate permanent static variable.
+ *
+ * If write_wal is true then an appropriate WAL message is emitted.
+ * (It will be false for bootstrap and WAL replay cases.)
+ *
+ * If send_sinval is true then a SI invalidation message is sent.
+ * (This should be true except in bootstrap case.)
+ *
+ * If preserve_files is true then the storage manager is warned not to
+ * delete the files listed in the map.
+ *
+ * Because this may be called during WAL replay when MyDatabaseId,
+ * DatabasePath, etc aren't valid, we require the caller to pass in suitable
+ * values.  The caller is also responsible for being sure no concurrent
+ * map update could be happening.
+ */
+static void
+write_relmap_file(bool shared, RelMapFile *newmap,
+				  bool write_wal, bool send_sinval, bool preserve_files,
+				  Oid dbid, Oid tsid, const char *dbpath)
+{
+	RelMapFile *realmap;
+	char		mapfilename[MAXPGPATH];
+
+	/*
+	 * Fill in the overhead fields and update CRC.
+	 */
+	newmap->magic = RELMAPPER_FILEMAGIC;
+	if (newmap->num_mappings < 0 || newmap->num_mappings > MAX_MAPPINGS)
+		elog(ERROR, "attempt to write bogus relation mapping");
+
+	INIT_CRC32C(newmap->crc);
+	COMP_CRC32C(newmap->crc, (char *) newmap, offsetof(RelMapFile, crc));
+	FIN_CRC32C(newmap->crc);
+
+	if (shared)
+	{
+		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
+				 RELMAPPER_FILENAME);
+		realmap = &shared_map;
+	}
+	else
+	{
+		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+				 dbpath, RELMAPPER_FILENAME);
+		realmap = &local_map;
+	}
+
+	/* Write the map to the relmap file. */
+	write_relmap_file_internal(mapfilename, newmap, write_wal,
+							   send_sinval, preserve_files, dbid, tsid,
+							   dbpath);
+
 	/*
 	 * Success, update permanent copy.  During bootstrap, we might be working
 	 * on the permanent copy itself, in which case skip the memcpy() to avoid
@@ -943,10 +982,6 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		memcpy(realmap, newmap, sizeof(RelMapFile));
 	else
 		Assert(!send_sinval);	/* must be bootstrapping */
-
-	/* Critical section done */
-	if (write_wal)
-		END_CRIT_SECTION();
 }
 
 /*
-- 
1.8.3.1
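
The split in patch 1/7 leaves load_relmap_file and write_relmap_file responsible only for choosing the path and the in-memory map, while the new read/write helpers work on whatever path they are given. A minimal standalone sketch of the path-selection half follows. The names sketch_relmap_path and SKETCH_RELMAP_FILENAME are hypothetical stand-ins, not part of the patch, and the sketch does no file I/O.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for RELMAPPER_FILENAME (assumption: hypothetical macro name). */
#define SKETCH_RELMAP_FILENAME "pg_filenode.map"

/*
 * Hypothetical helper: build the map file path the same way the patch's
 * load_relmap_file and write_relmap_file do before delegating to the
 * path-based helpers.  Returns the snprintf() result.
 */
static int
sketch_relmap_path(char *buf, size_t buflen, bool shared, const char *dbpath)
{
	assert(buf != NULL && buflen > 0);

	/* The shared map always lives under global/. */
	if (shared)
		return snprintf(buf, buflen, "global/%s", SKETCH_RELMAP_FILENAME);

	/* The local map lives under the caller-supplied database path. */
	assert(dbpath != NULL);
	return snprintf(buf, buflen, "%s/%s", dbpath, SKETCH_RELMAP_FILENAME);
}
```

Because the path is an explicit argument, the same reader can later be pointed at a database other than the one the backend is connected to, which is what patch 2/7 relies on.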

v8-0002-Extend-relmap-interfaces.patch (text/x-patch; charset=US-ASCII)

From 88dc7c9e96d3f74844c3fd32bdbf1b8d58ce911c Mon Sep 17 00:00:00 2001
From: dilipkumar <dilipbalaut@gmail.com>
Date: Mon, 4 Oct 2021 13:50:44 +0530
Subject: [PATCH v8 2/7] Extend relmap interfaces

Add two new interfaces to the relmapper:
1) CopyRelationMap, which copies the relmap file from one database
path to another database path.
2) RelationMapOidToFilenodeForDatabase, which works like
RelationMapOidToFilenode but, instead of looking in the database
we are connected to, looks in the relmap file under the given
database path.

These interfaces are required by a later patch in this series that
adds the WAL-logged CREATE DATABASE.
---
 src/backend/utils/cache/relmapper.c | 122 +++++++++++++++++++++++++++++++-----
 src/include/utils/relmapper.h       |   6 +-
 2 files changed, 112 insertions(+), 16 deletions(-)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index bb39632..51f361c 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -141,7 +141,7 @@ static void read_relmap_file(char *mapfilename, RelMapFile *map,
 static void write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
 									   bool write_wal, bool send_sinval,
 									   bool preserve_files, Oid dbid, Oid tsid,
-									   const char *dbpath);
+									   const char *dbpath, bool create);
 static void load_relmap_file(bool shared, bool lock_held);
 static void write_relmap_file(bool shared, RelMapFile *newmap,
 							  bool write_wal, bool send_sinval, bool preserve_files,
@@ -256,6 +256,36 @@ RelationMapFilenodeToOid(Oid filenode, bool shared)
 }
 
 /*
+ * RelationMapOidToFilenodeForDatabase
+ *
+ * Find relfilenode for the given relation id in the dbpath.  Returns
+ * InvalidOid if the relationId is not found in the relmap.
+ */
+Oid
+RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId)
+{
+	RelMapFile	map;
+	int			i;
+	char		mapfilename[MAXPGPATH];
+
+	/* Relmap file path for the given dbpath. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 dbpath, RELMAPPER_FILENAME);
+
+	/* Read the relmap file from the source database. */
+	read_relmap_file(mapfilename, &map, false);
+
+	/* Iterate over the relmap entries to find the input relation oid. */
+	for (i = 0; i < map.num_mappings; i++)
+	{
+		if (relationId == map.mappings[i].mapoid)
+			return map.mappings[i].mapfilenode;
+	}
+
+	return InvalidOid;
+}
+
+/*
  * RelationMapUpdateMap
  *
  * Install a new relfilenode mapping for the specified relation.
@@ -693,7 +723,43 @@ RestoreRelationMap(char *startAddress)
 }
 
 /*
- * read_relmap_file -- read data from given mapfilename file.
+ * CopyRelationMap
+ *
+ * Copy relmapfile from source db path to the destination db path and WAL log
+ * the operation.  This function is only called during the create database, so
+ * the destination database is not yet visible to anyone else, thus we don't
+ * need to acquire the relmap lock while updating the destination relmap.
+ */
+void
+CopyRelationMap(Oid dbid, Oid tsid, char *srcdbpath, char *dstdbpath)
+{
+	RelMapFile map;
+	char mapfilename[MAXPGPATH];
+
+	/* Relmap file path of the source database. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 srcdbpath, RELMAPPER_FILENAME);
+
+	/* Read the relmap file from the source database. */
+	read_relmap_file(mapfilename, &map, false);
+
+	/* Relmap file path of the destination database. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 dstdbpath, RELMAPPER_FILENAME);
+
+	/*
+	 * Write map contents into the destination database's relmap file.
+	 * write_relmap_file_internal, expects that the CRC should have been
+	 * computed and stored in the input map.  But, since we have read this map
+	 * from the source database and directly writing to the destination file
+	 * without updating it so we don't need to recompute it.
+	 */
+	write_relmap_file_internal(mapfilename, &map, true, false, true, dbid,
+							   tsid, dstdbpath, true);
+}
+
+/*
+ * read_relmap_file - read data from given mapfilename file.
  *
  * Because the map file is essential for access to core system catalogs,
  * failure to read it is a fatal error.
@@ -796,15 +862,18 @@ load_relmap_file(bool shared, bool lock_held)
 }
 
 /*
- * Helper function for write_relmap_file, Read comments atop write_relmap_file
- * for more details.  The CRC should be computed by the caller and stored in
- * the newmap.
+ * Helper function for write_relmap_file and CopyRelationMap, Read comments
+ * atop write_relmap_file for more details.  The CRC should be computed by the
+ * caller and stored in the newmap.
+ *
+ * Pass the create = true, if we are copying the relmap file during CREATE
+ * DATABASE command.
  */
 static void
 write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
 						   bool write_wal, bool send_sinval,
 						   bool preserve_files, Oid dbid, Oid tsid,
-						   const char *dbpath)
+						   const char *dbpath, bool create)
 {
 	int			fd;
 
@@ -830,6 +899,7 @@ write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
 		xlrec.dbid = dbid;
 		xlrec.tsid = tsid;
 		xlrec.nbytes = sizeof(RelMapFile);
+		xlrec.create = create;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&xlrec), MinSizeOfRelmapUpdate);
@@ -971,7 +1041,7 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 	/* Write the map to the relmap file. */
 	write_relmap_file_internal(mapfilename, newmap, write_wal,
 							   send_sinval, preserve_files, dbid, tsid,
-							   dbpath);
+							   dbpath, false);
 
 	/*
 	 * Success, update permanent copy.  During bootstrap, we might be working
@@ -1063,15 +1133,37 @@ relmap_redo(XLogReaderState *record)
 		 * Write out the new map and send sinval, but of course don't write a
 		 * new WAL entry.  There's no surrounding transaction to tell to
 		 * preserve files, either.
-		 *
-		 * There shouldn't be anyone else updating relmaps during WAL replay,
-		 * but grab the lock to interlock against load_relmap_file().
 		 */
-		LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
-		write_relmap_file((xlrec->dbid == InvalidOid), &newmap,
-						  false, true, false,
-						  xlrec->dbid, xlrec->tsid, dbpath);
-		LWLockRelease(RelationMappingLock);
+		if (!xlrec->create)
+		{
+			/*
+			 * There shouldn't be anyone else updating relmaps during WAL
+			 * replay, but grab the lock to interlock against
+			 * load_relmap_file().
+			 */
+			LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
+			write_relmap_file((xlrec->dbid == InvalidOid), &newmap,
+							false, true, false,
+							xlrec->dbid, xlrec->tsid, dbpath);
+			LWLockRelease(RelationMappingLock);
+		}
+		else
+		{
+			char		mapfilename[MAXPGPATH];
+
+			/* Construct the mapfilename. */
+			snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+					 dbpath, RELMAPPER_FILENAME);
+
+			/*
+			 * We don't need to take relmap lock because this wal is logged
+			 * while creating a new database, so there could be no one else
+			 * reading/writing the relmap file.
+			 */
+			write_relmap_file_internal(mapfilename, &newmap, false, false,
+									   false, xlrec->dbid, xlrec->tsid, dbpath,
+									   true);
+		}
 
 		pfree(dbpath);
 	}
diff --git a/src/include/utils/relmapper.h b/src/include/utils/relmapper.h
index c0d14da..4165f09 100644
--- a/src/include/utils/relmapper.h
+++ b/src/include/utils/relmapper.h
@@ -29,6 +29,7 @@ typedef struct xl_relmap_update
 	Oid			dbid;			/* database ID, or 0 for shared map */
 	Oid			tsid;			/* database's tablespace, or pg_global */
 	int32		nbytes;			/* size of relmap data */
+	bool		create;			/* true if creating new relmap */
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } xl_relmap_update;
 
@@ -39,6 +40,8 @@ extern Oid	RelationMapOidToFilenode(Oid relationId, bool shared);
 
 extern Oid	RelationMapFilenodeToOid(Oid relationId, bool shared);
 
+extern Oid RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId);
+
 extern void RelationMapUpdateMap(Oid relationId, Oid fileNode, bool shared,
 								 bool immediate);
 
@@ -62,7 +65,8 @@ extern void RelationMapInitializePhase3(void);
 extern Size EstimateRelationMapSpace(void);
 extern void SerializeRelationMap(Size maxSize, char *startAddress);
 extern void RestoreRelationMap(char *startAddress);
-
+extern void CopyRelationMap(Oid dbid, Oid tsid, char *srcdbpath,
+							char *dstdbpath);
 extern void relmap_redo(XLogReaderState *record);
 extern void relmap_desc(StringInfo buf, XLogReaderState *record);
 extern const char *relmap_identify(uint8 info);
-- 
1.8.3.1
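
The lookup that RelationMapOidToFilenodeForDatabase performs after reading the map is a linear scan over (mapoid, mapfilenode) pairs. The following standalone sketch shows that scan on a simplified in-memory map. All Sketch* names and the capacity constant are assumptions made for illustration; the real RelMapFile also carries a magic number and CRC that are omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t SketchOid;

#define SKETCH_INVALID_OID	((SketchOid) 0)
#define SKETCH_MAX_MAPPINGS 64	/* illustrative capacity only */

/* One catalog OID -> relfilenode pair, modelled on the relmap entries. */
typedef struct SketchMapping
{
	SketchOid	mapoid;
	SketchOid	mapfilenode;
} SketchMapping;

/* Simplified stand-in for RelMapFile (no magic number, no CRC). */
typedef struct SketchRelMap
{
	int32_t		num_mappings;
	SketchMapping mappings[SKETCH_MAX_MAPPINGS];
} SketchRelMap;

/*
 * Return the relfilenode mapped to relid, or SKETCH_INVALID_OID when the
 * relation is not a mapped one.
 */
static SketchOid
sketch_oid_to_filenode(const SketchRelMap *map, SketchOid relid)
{
	assert(map != NULL);
	assert(map->num_mappings >= 0 && map->num_mappings <= SKETCH_MAX_MAPPINGS);

	for (int32_t i = 0; i < map->num_mappings; i++)
	{
		if (map->mappings[i].mapoid == relid)
			return map->mappings[i].mapfilenode;
	}

	return SKETCH_INVALID_OID;
}
```

In the patch, this is how the relfilenode of the source database's pg_class (OID 1259) can be found without a relcache entry for that database.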

v8-0005-New-interface-to-lock-relation-id.patch (text/x-patch; charset=US-ASCII)

From 61f4b13524ea33200b28c87a8eb3242065067f83 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 24 Sep 2021 18:29:17 +0530
Subject: [PATCH v8 5/7] New interface to lock relation id

Same as LockRelationOid, but takes a LockRelId object as input
instead of a relation OID.  Instead of using MyDatabaseId, it uses
the dbId passed in the LockRelId object, which makes it possible to
lock a relation even when we are not connected to its database.
---
 src/backend/storage/lmgr/lmgr.c | 28 ++++++++++++++++++++++++++++
 src/include/storage/lmgr.h      |  1 +
 2 files changed, 29 insertions(+)

diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 2db0424..89d3ecb 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -176,6 +176,34 @@ ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode)
 }
 
 /*
+ *		LockRelationId
+ *
+ * Lock, given a LockRelId.  Same as LockRelationOid but take LockRelId as an
+ * input.
+ */
+void
+LockRelationId(LockRelId *relid, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+	LOCALLOCK  *locallock;
+	LockAcquireResult res;
+
+	SET_LOCKTAG_RELATION(tag, relid->dbId, relid->relId);
+
+	res = LockAcquireExtended(&tag, lockmode, false, false, true, &locallock);
+
+	/*
+	 * Now that we have the lock, check for invalidation messages; see notes
+	 * in LockRelationOid.
+	 */
+	if (res != LOCKACQUIRE_ALREADY_CLEAR)
+	{
+		AcceptInvalidationMessages();
+		MarkLockClear(locallock);
+	}
+}
+
+/*
  *		UnlockRelationId
  *
  * Unlock, given a LockRelId.  This is preferred over UnlockRelationOid
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index b009559..092ee93 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -38,6 +38,7 @@ extern void RelationInitLockInfo(Relation relation);
 
 /* Lock a relation */
 extern void LockRelationOid(Oid relid, LOCKMODE lockmode);
+extern void LockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern bool ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode);
 extern void UnlockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern void UnlockRelationOid(Oid relid, LOCKMODE lockmode);
-- 
1.8.3.1

v8-0004-Extend-bufmgr-interfaces.patch (text/x-patch; charset=US-ASCII)

From f3f1adea6febf987d906366482d2133bf1583f76 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 24 Sep 2021 18:23:39 +0530
Subject: [PATCH v8 4/7] Extend bufmgr interfaces

Extend the ReadBufferWithoutRelcache interface to take relpersistence
as an input, and extend DropDatabaseBuffers to take a tablespace OID
as an input.
---
 src/backend/access/transam/xlogutils.c |  9 ++++++---
 src/backend/commands/dbcommands.c      |  9 +++------
 src/backend/storage/buffer/bufmgr.c    | 24 +++++++++++-------------
 src/include/storage/bufmgr.h           |  5 +++--
 4 files changed, 23 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index b33e053..81c192f 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -484,7 +484,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	{
 		/* page exists in file */
 		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-										   mode, NULL);
+										   mode, NULL,
+										   RELPERSISTENCE_PERMANENT);
 	}
 	else
 	{
@@ -509,7 +510,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				ReleaseBuffer(buffer);
 			}
 			buffer = ReadBufferWithoutRelcache(rnode, forknum,
-											   P_NEW, mode, NULL);
+											   P_NEW, mode, NULL,
+											   RELPERSISTENCE_PERMANENT);
 		}
 		while (BufferGetBlockNumber(buffer) < blkno);
 		/* Handle the corner case that P_NEW returns non-consecutive pages */
@@ -519,7 +521,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			ReleaseBuffer(buffer);
 			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-											   mode, NULL);
+											   mode, NULL,
+											   RELPERSISTENCE_PERMANENT);
 		}
 	}
 
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 029fab4..1d963d8 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -938,7 +938,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
 	 * is important to ensure that no remaining backend tries to write out a
 	 * dirty buffer to the dead database later...
 	 */
-	DropDatabaseBuffers(db_id);
+	DropDatabaseBuffers(db_id, InvalidOid);
 
 	/*
 	 * Tell the stats collector to forget it immediately, too.
@@ -1220,11 +1220,8 @@ movedb(const char *dbname, const char *tblspcname)
 	 * contain valid data again --- but they'd be missing any changes made in
 	 * the database while it was in the new tablespace.  In any case, freeing
 	 * buffers that should never be used again seems worth the cycles.
-	 *
-	 * Note: it'd be sufficient to get rid of buffers matching db_id and
-	 * src_tblspcoid, but bufmgr.c presently provides no API for that.
 	 */
-	DropDatabaseBuffers(db_id);
+	DropDatabaseBuffers(db_id, src_tblspcoid);
 
 	/*
 	 * Check for existence of files in the target directory, i.e., objects of
@@ -2201,7 +2198,7 @@ dbase_redo(XLogReaderState *record)
 		ReplicationSlotsDropDBSlots(xlrec->db_id);
 
 		/* Drop pages for this database that are in the shared buffer cache */
-		DropDatabaseBuffers(xlrec->db_id);
+		DropDatabaseBuffers(xlrec->db_id, InvalidOid);
 
 		/* Also, clean out any fsync requests that might be pending in md.c */
 		ForgetDatabaseSyncRequests(xlrec->db_id);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 08ebabf..ea3ebcc 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -770,24 +770,17 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 /*
  * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
  *		a relcache entry for the relation.
- *
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay.  If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
  */
 Buffer
 ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
-						  BufferAccessStrategy strategy)
+						  BufferAccessStrategy strategy, char relpersistence)
 {
 	bool		hit;
 
 	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
 
-	Assert(InRecovery);
-
-	return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
+	return ReadBuffer_common(smgr, relpersistence, forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -797,7 +790,7 @@ ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
  *
  * *hit is set to true if the request was satisfied from shared buffer cache.
  */
-static Buffer
+Buffer
 ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 				  BlockNumber blockNum, ReadBufferMode mode,
 				  BufferAccessStrategy strategy, bool *hit)
@@ -3402,10 +3395,13 @@ FindAndDropRelFileNodeBuffers(RelFileNode rnode, ForkNumber forkNum,
  *		database, to avoid trying to flush data to disk when the directory
  *		tree no longer exists.  Implementation is pretty similar to
  *		DropRelFileNodeBuffers() which is for destroying just one relation.
+ *
+ *		If a valid tablespace oid is passed then it will compare the tablespace
+ *		oid as well otherwise just the db oid.
  * --------------------------------------------------------------------
  */
 void
-DropDatabaseBuffers(Oid dbid)
+DropDatabaseBuffers(Oid dbid, Oid tbsid)
 {
 	int			i;
 
@@ -3423,11 +3419,13 @@ DropDatabaseBuffers(Oid dbid)
 		 * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
 		 * and saves some cycles.
 		 */
-		if (bufHdr->tag.rnode.dbNode != dbid)
+		if (bufHdr->tag.rnode.dbNode != dbid ||
+			(OidIsValid(tbsid) && bufHdr->tag.rnode.spcNode != tbsid))
 			continue;
 
 		buf_state = LockBufHdr(bufHdr);
-		if (bufHdr->tag.rnode.dbNode == dbid)
+		if (bufHdr->tag.rnode.dbNode == dbid &&
+			(!OidIsValid(tbsid) || bufHdr->tag.rnode.spcNode == tbsid))
 			InvalidateBuffer(bufHdr);	/* releases spinlock */
 		else
 			UnlockBufHdr(bufHdr, buf_state);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index cfce23e..237c6a9 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -184,7 +184,8 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BufferAccessStrategy strategy);
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 										ForkNumber forkNum, BlockNumber blockNum,
-										ReadBufferMode mode, BufferAccessStrategy strategy);
+										ReadBufferMode mode, BufferAccessStrategy strategy,
+										char relpersistence);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
@@ -207,7 +208,7 @@ extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
 								   int nforks, BlockNumber *firstDelBlock);
 extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes);
-extern void DropDatabaseBuffers(Oid dbid);
+extern void DropDatabaseBuffers(Oid dbid, Oid tbsid);
 
 #define RelationGetNumberOfBlocks(reln) \
 	RelationGetNumberOfBlocksInFork(reln, MAIN_FORKNUM)
-- 
1.8.3.1

v8-0003-Refactor-index_copy_data.patch (text/x-patch; charset=US-ASCII)

From 9357c150f564a7e78a026567d78d239d8bc506fe Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 24 Sep 2021 18:13:25 +0530
Subject: [PATCH v8 3/7] Refactor index_copy_data

Add a separate interface for copying relation storage.  It will be
used by a later patch to copy the database relations.
---
 src/backend/commands/tablecmds.c | 68 +++++++++++++++++++++++++---------------
 src/include/commands/tablecmds.h |  5 +++
 src/tools/pgindent/typedefs.list |  1 +
 3 files changed, 48 insertions(+), 26 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 5e9cae2..160a2a1 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14237,54 +14237,70 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
 	return new_tablespaceoid;
 }
 
-static void
-index_copy_data(Relation rel, RelFileNode newrnode)
+/*
+ * Copy source smgr relation's all fork's data to the destination.
+ *
+ * copy_storage - storage copy function, which is passed by the caller.
+ */
+void
+RelationCopyAllFork(SMgrRelation src_smgr, SMgrRelation	dst_smgr,
+					char relpersistence, copy_relation_storage copy_storage)
 {
-	SMgrRelation dstrel;
-
-	dstrel = smgropen(newrnode, rel->rd_backend);
-
 	/*
-	 * Since we copy the file directly without looking at the shared buffers,
-	 * we'd better first flush out any pages of the source relation that are
-	 * in shared buffers.  We assume no new changes will be made while we are
-	 * holding exclusive lock on the rel.
-	 */
-	FlushRelationBuffers(rel);
-
-	/*
-	 * Create and copy all forks of the relation, and schedule unlinking of
-	 * old physical files.
+	 * Create and copy all forks of the relation.
 	 *
 	 * NOTE: any conflict in relfilenode value will be caught in
 	 * RelationCreateStorage().
 	 */
-	RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
+	RelationCreateStorage(dst_smgr->smgr_rnode.node, relpersistence);
 
 	/* copy main fork */
-	RelationCopyStorage(RelationGetSmgr(rel), dstrel, MAIN_FORKNUM,
-						rel->rd_rel->relpersistence);
+	copy_storage(src_smgr, dst_smgr, MAIN_FORKNUM, relpersistence);
 
 	/* copy those extra forks that exist */
 	for (ForkNumber forkNum = MAIN_FORKNUM + 1;
 		 forkNum <= MAX_FORKNUM; forkNum++)
 	{
-		if (smgrexists(RelationGetSmgr(rel), forkNum))
+		if (smgrexists(src_smgr, forkNum))
 		{
-			smgrcreate(dstrel, forkNum, false);
+			smgrcreate(dst_smgr, forkNum, false);
 
 			/*
 			 * WAL log creation if the relation is persistent, or this is the
 			 * init fork of an unlogged relation.
 			 */
-			if (RelationIsPermanent(rel) ||
-				(rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
+			if (relpersistence == RELPERSISTENCE_PERMANENT ||
+				(relpersistence == RELPERSISTENCE_UNLOGGED &&
 				 forkNum == INIT_FORKNUM))
-				log_smgrcreate(&newrnode, forkNum);
-			RelationCopyStorage(RelationGetSmgr(rel), dstrel, forkNum,
-								rel->rd_rel->relpersistence);
+				log_smgrcreate(&dst_smgr->smgr_rnode.node, forkNum);
+
+			/* Copy a fork's data, block by block. */
+			copy_storage(src_smgr, dst_smgr, forkNum, relpersistence);
 		}
 	}
+}
+
+static void
+index_copy_data(Relation rel, RelFileNode newrnode)
+{
+	SMgrRelation dstrel;
+
+	dstrel = smgropen(newrnode, rel->rd_backend);
+
+	/*
+	 * Since we copy the file directly without looking at the shared buffers,
+	 * we'd better first flush out any pages of the source relation that are
+	 * in shared buffers.  We assume no new changes will be made while we are
+	 * holding exclusive lock on the rel.
+	 */
+	FlushRelationBuffers(rel);
+
+	/*
+	 * Create and copy all forks of the relation, and schedule unlinking of
+	 * old physical files.
+	 */
+	RelationCopyAllFork(RelationGetSmgr(rel), dstrel,
+						rel->rd_rel->relpersistence, RelationCopyStorage);
 
 	/* drop old relation, and close new one */
 	RelationDropStorage(rel);
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 336549c..e0e0aa5 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -19,10 +19,13 @@
 #include "catalog/objectaddress.h"
 #include "nodes/parsenodes.h"
 #include "storage/lock.h"
+#include "storage/smgr.h"
 #include "utils/relcache.h"
 
 struct AlterTableUtilityContext;	/* avoid including tcop/utility.h here */
 
+typedef void (*copy_relation_storage) (SMgrRelation src, SMgrRelation dst,
+									  ForkNumber forkNum, char relpersistence);
 
 extern ObjectAddress DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 									ObjectAddress *typaddress, const char *queryString);
@@ -42,6 +45,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
 
 extern Oid	AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
 
+extern void RelationCopyAllFork(SMgrRelation src_smgr, SMgrRelation	dst_smgr,
+								char relpersistence, copy_relation_storage copy_storage);
 extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
 										 Oid *oldschema);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index da6ac8e..bb3097f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3050,6 +3050,7 @@ config_var_value
 contain_aggs_of_level_context
 convert_testexpr_context
 copy_data_source_cb
+copy_relation_storage
 core_YYSTYPE
 core_yy_extra_type
 core_yyscan_t
-- 
1.8.3.1
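
RelationCopyAllFork separates "which forks to copy" from "how to copy one fork" by taking the copy routine as a function pointer. The standalone sketch below mirrors that control flow on toy data: the main fork is always copied, and each extra fork is copied only if it exists in the source. All Sketch* and sketch_* names are assumptions for illustration. Storage creation and WAL logging are reduced to setting a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Fork numbers, modelled on MAIN_FORKNUM .. MAX_FORKNUM. */
enum
{
	SKETCH_MAIN_FORK = 0,
	SKETCH_FSM_FORK,
	SKETCH_VM_FORK,
	SKETCH_INIT_FORK,
	SKETCH_MAX_FORK = SKETCH_INIT_FORK
};

/* Toy relation: which forks exist, and how often each was copied into. */
typedef struct SketchRel
{
	bool		exists[SKETCH_MAX_FORK + 1];
	int			copied[SKETCH_MAX_FORK + 1];
} SketchRel;

/* Per-fork copy callback, analogous to the copy_relation_storage typedef. */
typedef void (*sketch_copy_fn) (const SketchRel *src, SketchRel *dst, int fork);

/* Copy the main fork unconditionally, then every extra fork that exists. */
static void
sketch_copy_all_forks(const SketchRel *src, SketchRel *dst,
					  sketch_copy_fn copy_storage)
{
	assert(src != NULL && dst != NULL && copy_storage != NULL);

	dst->exists[SKETCH_MAIN_FORK] = true;	/* stands in for storage creation */
	copy_storage(src, dst, SKETCH_MAIN_FORK);

	for (int fork = SKETCH_MAIN_FORK + 1; fork <= SKETCH_MAX_FORK; fork++)
	{
		if (src->exists[fork])
		{
			dst->exists[fork] = true;		/* stands in for smgrcreate */
			copy_storage(src, dst, fork);
		}
	}
}

/* Example callback: only records that the fork was copied. */
static void
sketch_mark_copied(const SketchRel *src, SketchRel *dst, int fork)
{
	(void) src;
	dst->copied[fork]++;
}
```

In the patch, index_copy_data passes RelationCopyStorage as the callback, and dbcommands.c declares RelationCopyStorageUsingBuffer with the same signature, presumably so that it can be passed in the same way.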

v8-0007-WAL-logged-CREATE-DATABASE.patch (text/x-patch; charset=US-ASCII)

From bb9088b08b29aac441ff61e6edf0b79541247593 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Tue, 5 Oct 2021 11:45:02 +0530
Subject: [PATCH v8 7/7] WAL logged CREATE DATABASE

Currently, CREATE DATABASE forces a checkpoint, then copies all the files,
then forces another checkpoint. The comments in the createdb() function
explain the reasons for this. This patch avoids both checkpoints by making
CREATE DATABASE completely WAL-logged.

This can also be useful for supporting TDE. For example, if the source and
target databases need different encryption, we cannot re-encrypt the page
data when we copy the whole directory.  Because this patch copies page by
page, we have an opportunity to re-encrypt each page before copying it to
the target database.
---
 src/backend/access/rmgrdesc/dbasedesc.c |   3 +-
 src/backend/commands/dbcommands.c       | 715 ++++++++++++++++++++++----------
 src/include/commands/dbcommands_xlog.h  |   3 -
 src/tools/pgindent/typedefs.list        |   1 +
 4 files changed, 498 insertions(+), 224 deletions(-)

diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index 2660984..5010f72 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -28,8 +28,7 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;
 
-		appendStringInfo(buf, "copy dir %u/%u to %u/%u",
-						 xlrec->src_tablespace_id, xlrec->src_db_id,
+		appendStringInfo(buf, "create dir %u/%u",
 						 xlrec->tablespace_id, xlrec->db_id);
 	}
 	else if (info == XLOG_DBASE_DROP)
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 85fe598..c254b62 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -45,13 +45,13 @@
 #include "commands/dbcommands_xlog.h"
 #include "commands/defrem.h"
 #include "commands/seclabel.h"
+#include "commands/tablecmds.h"
 #include "commands/tablespace.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/bgwriter.h"
 #include "replication/slot.h"
-#include "storage/copydir.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
@@ -62,6 +62,7 @@
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/pg_locale.h"
+#include "utils/relmapper.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
@@ -77,6 +78,19 @@ typedef struct
 	Oid			dest_tsoid;		/* tablespace we are trying to move to */
 } movedb_failure_params;
 
+/*
+ * When creating a database, we scan the pg_class of the source database to
+ * identify all the relations to be copied.  The structure is used for storing
+ * information about each relation of the source database.
+ */
+typedef struct CreateDBRelInfo
+{
+	RelFileNode		rnode;				/* physical relation identifier */
+	Oid				reloid;				/* relation oid */
+	char			relpersistence;		/* relation's persistence level */
+} CreateDBRelInfo;
+
+
 /* non-export function prototypes */
 static void createdb_failure_callback(int code, Datum arg);
 static void movedb(const char *dbname, const char *tblspcname);
@@ -91,6 +105,434 @@ static bool have_createdb_privilege(void);
 static void remove_dbtablespaces(Oid db_id);
 static bool check_db_file_conflict(Oid db_id);
 static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
+static void CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid,
+									bool isRedo);
+static List *GetDatabaseRelationList(Oid srctbid, Oid srcdbid, char *srcpath);
+static void RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+									ForkNumber forkNum, char relpersistence);
+static void CopyDatabase(Oid src_dboid, Oid dboid, Oid src_tsid, Oid dst_tsid);
+
+/*
+ * CreateDirAndVersionFile - Create database directory and write out the
+ *							 PG_VERSION file in the database path.
+ *
+ * If isRedo is true, it's okay for the database directory to exist already.
+ *
+ * We can directly write PG_MAJORVERSION in the version file instead of copying
+ * from the source database file because these two must be the same.
+ */
+static void
+CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
+{
+	int		fd;
+	int		nbytes;
+	char	versionfile[MAXPGPATH];
+	char	buf[16];
+
+	/* Prepare version data before starting a critical section. */
+	sprintf(buf, "%s\n", PG_MAJORVERSION);
+	nbytes = strlen(PG_MAJORVERSION) + 1;
+
+	/* If we are not in WAL replay then write the WAL. */
+	if (!isRedo)
+	{
+		xl_dbase_create_rec xlrec;
+		XLogRecPtr	lsn;
+
+		/* Now errors are fatal ... */
+		START_CRIT_SECTION();
+
+		xlrec.db_id = dbid;
+		xlrec.tablespace_id = tsid;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), sizeof(xl_dbase_create_rec));
+
+		lsn = XLogInsert(RM_DBASE_ID, XLOG_DBASE_CREATE);
+
+		/* As always, WAL must hit the disk before the data update does. */
+		XLogFlush(lsn);
+	}
+
+	/* Create database directory. */
+	if (MakePGDirectory(dbpath) < 0)
+	{
+		/* Failure other than already exists or not in WAL replay? */
+		if (errno != EEXIST || !isRedo)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create directory \"%s\": %m", dbpath)));
+	}
+
+	/*
+	 * Create PG_VERSION file in the database path.  If the file already exists
+	 * and we are in WAL replay then try again to open it in write mode.
+	 */
+	snprintf(versionfile, sizeof(versionfile), "%s/%s", dbpath, "PG_VERSION");
+
+	fd = OpenTransientFile(versionfile, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	if (fd < 0 && errno == EEXIST && isRedo)
+		fd = OpenTransientFile(versionfile, O_WRONLY | O_TRUNC | PG_BINARY);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m", versionfile)));
+
+	/* Write PG_MAJORVERSION in the PG_VERSION file. */
+	pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_WRITE);
+	errno = 0;
+	if ((int) write(fd, buf, nbytes) != nbytes)
+	{
+		/* If write didn't set errno, assume problem is no disk space. */
+		if (errno == 0)
+			errno = ENOSPC;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", versionfile)));
+	}
+	pgstat_report_wait_end();
+
+	/* Close the version file. */
+	CloseTransientFile(fd);
+
+	/* Critical section done. */
+	if (!isRedo)
+		END_CRIT_SECTION();
+}
+
+/*
+ * GetDatabaseRelationList - Get relfilenode list to be copied.
+ *
+ * Iterate over each block of the pg_class relation.  From there, we will check
+ * all the visible tuples in order to get a list of all the valid relfilenodes
+ * in the source database that should be copied to the target database.
+ */
+static List *
+GetDatabaseRelationList(Oid tbid, Oid dbid, char *srcpath)
+{
+	SMgrRelation	rd_smgr;
+	RelFileNode		rnode;
+	BlockNumber		nblocks;
+	BlockNumber		blkno;
+	OffsetNumber	offnum;
+	OffsetNumber	maxoff;
+	Buffer			buf;
+	Oid				relfilenode;
+	Page			page;
+	List		   *rnodelist = NIL;
+	HeapTupleData	tuple;
+	Form_pg_class	classForm;
+	LockRelId		relid;
+	BufferAccessStrategy bstrategy;
+
+	/* Get pg_class relfilenode. */
+	relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+													  RelationRelationId);
+	/*
+	 * We are going to read the buffers associated with the pg_class relation.
+	 * Thus, acquire the relation level lock before start scanning.  As we are
+	 * not connected to the database, we cannot use relation_open directly, so
+	 * we have to lock using relation id.
+	 */
+	relid.dbId = dbid;
+	relid.relId = RelationRelationId;
+	LockRelationId(&relid, AccessShareLock);
+
+	/* Prepare a relnode for pg_class relation. */
+	rnode.spcNode = tbid;
+	rnode.dbNode = dbid;
+	rnode.relNode = relfilenode;
+
+	/*
+	 * We are not connected to the source database so open the pg_class
+	 * relation at the smgr level and get the block count.
+	 */
+	rd_smgr = smgropen(rnode, InvalidBackendId);
+	nblocks = smgrnblocks(rd_smgr, MAIN_FORKNUM);
+
+	/*
+	 * We're going to read the whole pg_class so better to use bulk-read buffer
+	 * access strategy.
+	 */
+	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	/* Iterate over each block on the pg_class relation. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		/*
+		 * We are not connected to the source database so directly use the lower
+		 * level bufmgr interface which operates on the rnode.
+		 */
+		buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
+										RBM_NORMAL, bstrategy,
+										RELPERSISTENCE_PERMANENT);
+
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buf);
+		if (PageIsNew(page) || PageIsEmpty(page))
+		{
+			UnlockReleaseBuffer(buf);
+			continue;
+		}
+
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		/* Iterate over each tuple on the page. */
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid;
+
+			itemid = PageGetItemId(page, offnum);
+
+			/* Nothing to do if slot is empty or already dead. */
+			if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid) ||
+				ItemIdIsRedirected(itemid))
+				continue;
+
+			Assert(ItemIdIsNormal(itemid));
+			ItemPointerSet(&(tuple.t_self), blkno, offnum);
+
+			tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+			tuple.t_len = ItemIdGetLength(itemid);
+			tuple.t_tableOid = RelationRelationId;
+
+			/*
+			 * If the tuple is visible then add its relfilenode info to the
+			 * list.
+			 */
+			if (HeapTupleSatisfiesVisibility(&tuple, GetActiveSnapshot(), buf))
+			{
+				Oid				relfilenode = InvalidOid;
+				CreateDBRelInfo   *relinfo;
+
+				classForm = (Form_pg_class) GETSTRUCT(&tuple);
+
+				/* We don't need to copy the shared objects to the target. */
+				if (classForm->reltablespace == GLOBALTABLESPACE_OID)
+					continue;
+
+				/*
+				 * If the object doesn't have the storage then nothing to be
+				 * done for that object so just ignore it.
+				 */
+				if (!RELKIND_HAS_STORAGE(classForm->relkind))
+					continue;
+
+				/*
+				 * If relfilenode is valid then directly use it.  Otherwise,
+				 * consult the relmapper for the mapped relation.
+				 */
+				if (OidIsValid(classForm->relfilenode))
+					relfilenode = classForm->relfilenode;
+				else
+					relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+													classForm->oid);
+
+				/* We must have a valid relfilenode oid. */
+				Assert(OidIsValid(relfilenode));
+
+				/* Prepare a rel info element and add it to the list. */
+				relinfo = (CreateDBRelInfo *) palloc(sizeof(CreateDBRelInfo));
+				if (OidIsValid(classForm->reltablespace))
+					relinfo->rnode.spcNode = classForm->reltablespace;
+				else
+					relinfo->rnode.spcNode = tbid;
+
+				relinfo->rnode.dbNode = dbid;
+				relinfo->rnode.relNode = relfilenode;
+				relinfo->reloid = classForm->oid;
+				relinfo->relpersistence = classForm->relpersistence;
+
+				/* Add it to the list. */
+				rnodelist = lappend(rnodelist, relinfo);
+			}
+		}
+
+		/* Release the buffer lock. */
+		UnlockReleaseBuffer(buf);
+	}
+
+	/* Release the lock. */
+	UnlockRelationId(&relid, AccessShareLock);
+
+	return rnodelist;
+}
+
+/*
+ * RelationCopyStorageUsingBuffer - Copy fork's data using bufmgr.
+ *
+ * Same as RelationCopyStorage but instead of using smgrread and smgrextend
+ * this will copy using bufmgr APIs.
+ */
+static void
+RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+							   ForkNumber forkNum, char relpersistence)
+{
+	Buffer		srcBuf;
+	Buffer		dstBuf;
+	Page		srcPage;
+	Page		dstPage;
+	bool		use_wal;
+	bool		copying_initfork;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	BufferAccessStrategy bstrategy_src;
+	BufferAccessStrategy bstrategy_dst;
+
+	/* Refer comments in RelationCopyStorage. */
+	copying_initfork = relpersistence == RELPERSISTENCE_UNLOGGED &&
+		forkNum == INIT_FORKNUM;
+	use_wal = XLogIsNeeded() &&
+		(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
+
+	/* Get number of blocks in the source relation. */
+	nblocks = smgrnblocks(src, forkNum);
+
+	/*
+	 * We are going to copy whole relation from the source to the destination
+	 * so use BAS_BULKREAD strategy for the source relation and BAS_BULKWRITE
+	 * strategy for the destination relation.
+	 */
+	bstrategy_src = GetAccessStrategy(BAS_BULKREAD);
+	bstrategy_dst = GetAccessStrategy(BAS_BULKWRITE);
+
+	/* Iterate over each block of the source relation file. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		/* If we got a cancel signal during the copy of the data, quit */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Read block from source relation. */
+		srcBuf = ReadBufferWithoutRelcache(src->smgr_rnode.node, forkNum,
+										   blkno, RBM_NORMAL, bstrategy_src,
+										   relpersistence);
+		srcPage = BufferGetPage(srcBuf);
+		if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
+		{
+			ReleaseBuffer(srcBuf);
+			continue;
+		}
+
+		/* Use P_NEW to extend the relation. */
+		dstBuf = ReadBufferWithoutRelcache(dst->smgr_rnode.node, forkNum,
+										   P_NEW, RBM_NORMAL, bstrategy_dst,
+										   relpersistence);
+		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
+
+		START_CRIT_SECTION();
+
+		/* Initialize the page and write the data. */
+		dstPage = BufferGetPage(dstBuf);
+		PageInit(dstPage, BufferGetPageSize(dstBuf), 0);
+		memcpy(dstPage, srcPage, BLCKSZ);
+		MarkBufferDirty(dstBuf);
+
+		/* WAL-log the copied page. */
+		if (use_wal)
+			log_newpage_buffer(dstBuf, true);
+
+		END_CRIT_SECTION();
+
+		UnlockReleaseBuffer(dstBuf);
+		ReleaseBuffer(srcBuf);
+	}
+}
+
+/*
+ * CopyDatabase - Copy source database to the target database.
+ *
+ * Create target database directory and copy data files from the source database
+ * to the target database, block by block and WAL log all the operations.
+ */
+static void
+CopyDatabase(Oid src_dboid, Oid dst_dboid, Oid src_tsid, Oid dst_tsid)
+{
+	char	   *srcpath;
+	char	   *dstpath;
+	List	   *rnodelist = NULL;
+	ListCell   *cell;
+	LockRelId	relid;
+	RelFileNode	srcrnode;
+	RelFileNode	dstrnode;
+	CreateDBRelInfo	*relinfo;
+
+	/* Get the source database path. */
+	srcpath = GetDatabasePath(src_dboid, src_tsid);
+
+	/* Get the destination database path. */
+	dstpath = GetDatabasePath(dst_dboid, dst_tsid);
+
+	/* Create database directory and write PG_VERSION file. */
+	CreateDirAndVersionFile(dstpath, dst_dboid, dst_tsid, false);
+
+	/* Copy relmap file from source database to the destination database. */
+	CopyRelationMap(dst_dboid, dst_tsid, srcpath, dstpath);
+
+	/* Get list of all valid relnode from the source database. */
+	rnodelist = GetDatabaseRelationList(src_tsid, src_dboid, srcpath);
+	Assert(rnodelist != NIL);
+
+	/*
+	 * Database id is common for all the relation so set it before entering to
+	 * the loop.
+	 */
+	relid.dbId = src_dboid;
+
+	/*
+	 * Iterate over each relfilenode and copy the relation data block by block
+	 * from source database to the destination database.
+	 */
+	foreach(cell, rnodelist)
+	{
+		SMgrRelation	src_smgr;
+		SMgrRelation	dst_smgr;
+
+		relinfo = lfirst(cell);
+		srcrnode = relinfo->rnode;
+
+		/*
+		 * If the relation is from the default tablespace then we need to
+		 * create it in the destinations db's default tablespace.  Otherwise,
+		 * we need to create in the same tablespace as it is in the source
+		 * database.
+		 */
+		if (srcrnode.spcNode == src_tsid)
+			dstrnode.spcNode = dst_tsid;
+		/*
+		 * In case of ALTER DATABASE SET TABLESPACE we don't need to do
+		 * anything for the object which are not in the source db's default
+		 * tablespace.  The source and destination dboid will be same in
+		 * case of ALTER DATABASE SET TABLESPACE.
+		 */
+		else if (src_dboid == dst_dboid)
+			continue;
+		else
+			dstrnode.spcNode = srcrnode.spcNode;;
+
+		dstrnode.dbNode = dst_dboid;
+		dstrnode.relNode = srcrnode.relNode;
+
+		/* Acquire the lock on relation before start copying. */
+		relid.relId = relinfo->reloid;
+		LockRelationId(&relid, AccessShareLock);
+
+		/* Open the source and the destination relation at smgr level. */
+		src_smgr = smgropen(srcrnode, InvalidBackendId);
+		dst_smgr = smgropen(dstrnode, InvalidBackendId);
+
+		/* Copy relation storage from source to the destination. */
+		RelationCopyAllFork(src_smgr, dst_smgr, relinfo->relpersistence,
+							RelationCopyStorageUsingBuffer);
+
+		/* Release the lock. */
+		UnlockRelationId(&relid, AccessShareLock);
+	}
+
+	list_free_deep(rnodelist);
+}
 
 
 /*
@@ -99,8 +541,6 @@ static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
 Oid
 createdb(ParseState *pstate, const CreatedbStmt *stmt)
 {
-	TableScanDesc scan;
-	Relation	rel;
 	Oid			src_dboid;
 	Oid			src_owner;
 	int			src_encoding = -1;
@@ -563,19 +1003,6 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	InvokeObjectPostCreateHook(DatabaseRelationId, dboid, 0);
 
 	/*
-	 * Force a checkpoint before starting the copy. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, if a checkpoint
-	 * happened while we're copying files, a file might be deleted just when
-	 * we're about to copy it, causing the lstat() call in copydir() to fail
-	 * with ENOENT.
-	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
-
-	/*
 	 * Once we start copying subdirectories, we need to be able to clean 'em
 	 * up if we fail.  Use an ENSURE block to make sure this happens.  (This
 	 * is not a 100% solution, because of the possibility of failure during
@@ -587,115 +1014,16 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	PG_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
-		/*
-		 * Iterate through all tablespaces of the template database, and copy
-		 * each one to the new database.
-		 */
-		rel = table_open(TableSpaceRelationId, AccessShareLock);
-		scan = table_beginscan_catalog(rel, 0, NULL);
-		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
-		{
-			Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
-			Oid			srctablespace = spaceform->oid;
-			Oid			dsttablespace;
-			char	   *srcpath;
-			char	   *dstpath;
-			struct stat st;
-
-			/* No need to copy global tablespace */
-			if (srctablespace == GLOBALTABLESPACE_OID)
-				continue;
-
-			srcpath = GetDatabasePath(src_dboid, srctablespace);
-
-			if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
-				directory_is_empty(srcpath))
-			{
-				/* Assume we can ignore it */
-				pfree(srcpath);
-				continue;
-			}
-
-			if (srctablespace == src_deftablespace)
-				dsttablespace = dst_deftablespace;
-			else
-				dsttablespace = srctablespace;
-
-			dstpath = GetDatabasePath(dboid, dsttablespace);
-
-			/*
-			 * Copy this subdirectory to the new location
-			 *
-			 * We don't need to copy subdirectories
-			 */
-			copydir(srcpath, dstpath, false);
-
-			/* Record the filesystem change in XLOG */
-			{
-				xl_dbase_create_rec xlrec;
-
-				xlrec.db_id = dboid;
-				xlrec.tablespace_id = dsttablespace;
-				xlrec.src_db_id = src_dboid;
-				xlrec.src_tablespace_id = srctablespace;
-
-				XLogBeginInsert();
-				XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-				(void) XLogInsert(RM_DBASE_ID,
-								  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-			}
-		}
-		table_endscan(scan);
-		table_close(rel, AccessShareLock);
-
-		/*
-		 * We force a checkpoint before committing.  This effectively means
-		 * that committed XLOG_DBASE_CREATE operations will never need to be
-		 * replayed (at least not in ordinary crash recovery; we still have to
-		 * make the XLOG entry for the benefit of PITR operations). This
-		 * avoids two nasty scenarios:
-		 *
-		 * #1: When PITR is off, we don't XLOG the contents of newly created
-		 * indexes; therefore the drop-and-recreate-whole-directory behavior
-		 * of DBASE_CREATE replay would lose such indexes.
-		 *
-		 * #2: Since we have to recopy the source database during DBASE_CREATE
-		 * replay, we run the risk of copying changes in it that were
-		 * committed after the original CREATE DATABASE command but before the
-		 * system crash that led to the replay.  This is at least unexpected
-		 * and at worst could lead to inconsistencies, eg duplicate table
-		 * names.
-		 *
-		 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
-		 *
-		 * In PITR replay, the first of these isn't an issue, and the second
-		 * is only a risk if the CREATE DATABASE and subsequent template
-		 * database change both occur while a base backup is being taken.
-		 * There doesn't seem to be much we can do about that except document
-		 * it as a limitation.
-		 *
-		 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
-		 * we can avoid this.
-		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
-
-		/*
-		 * Close pg_database, but keep lock till commit.
-		 */
-		table_close(pg_database_rel, NoLock);
-
-		/*
-		 * Force synchronous commit, thus minimizing the window between
-		 * creation of the database files and committal of the transaction. If
-		 * we crash before committing, we'll have a DB that's taking up disk
-		 * space but is not in pg_database, which is not good.
-		 */
-		ForceSyncCommit();
+		CopyDatabase(src_dboid, dboid, src_deftablespace, dst_deftablespace);
 	}
 	PG_END_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 								PointerGetDatum(&fparms));
 
+	/*
+	 * Close pg_database, but keep lock till commit.
+	 */
+	table_close(pg_database_rel, NoLock);
+
 	return dboid;
 }
 
@@ -764,6 +1092,15 @@ createdb_failure_callback(int code, Datum arg)
 {
 	createdb_failure_params *fparms = (createdb_failure_params *) DatumGetPointer(arg);
 
+	/* Drop pages for this database that are in the shared buffer cache. */
+	DropDatabaseBuffers(fparms->dest_dboid, InvalidOid);
+
+	/*
+	 * Clean out any fsync requests w.r.t. the new database that might be
+	 * pending in md.c.
+	 */
+	ForgetDatabaseSyncRequests(fparms->dest_dboid, InvalidOid);
+
 	/*
 	 * Release lock on source database before doing recursive remove. This is
 	 * not essential but it seems desirable to release the lock as soon as
@@ -1196,34 +1533,6 @@ movedb(const char *dbname, const char *tblspcname)
 	dst_dbpath = GetDatabasePath(db_id, dst_tblspcoid);
 
 	/*
-	 * Force a checkpoint before proceeding. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, the check for existing
-	 * files in the target directory might fail unnecessarily, not to mention
-	 * that the copy might fail due to source files getting deleted under it.
-	 * On Windows, this also ensures that background procs don't hold any open
-	 * files, which would cause rmdir() to fail.
-	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
-
-	/*
-	 * Now drop all buffers holding data of the target database; they should
-	 * no longer be dirty so DropDatabaseBuffers is safe.
-	 *
-	 * It might seem that we could just let these buffers age out of shared
-	 * buffers naturally, since they should not get referenced anymore.  The
-	 * problem with that is that if the user later moves the database back to
-	 * its original tablespace, any still-surviving buffers would appear to
-	 * contain valid data again --- but they'd be missing any changes made in
-	 * the database while it was in the new tablespace.  In any case, freeing
-	 * buffers that should never be used again seems worth the cycles.
-	 */
-	DropDatabaseBuffers(db_id, src_tblspcoid);
-
-	/*
 	 * Check for existence of files in the target directory, i.e., objects of
 	 * this database that are already in the target tablespace.  We can't
 	 * allow the move in such a case, because we would need to change those
@@ -1268,28 +1577,7 @@ movedb(const char *dbname, const char *tblspcname)
 	PG_ENSURE_ERROR_CLEANUP(movedb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
-		/*
-		 * Copy files from the old tablespace to the new one
-		 */
-		copydir(src_dbpath, dst_dbpath, false);
-
-		/*
-		 * Record the filesystem change in XLOG
-		 */
-		{
-			xl_dbase_create_rec xlrec;
-
-			xlrec.db_id = db_id;
-			xlrec.tablespace_id = dst_tblspcoid;
-			xlrec.src_db_id = db_id;
-			xlrec.src_tablespace_id = src_tblspcoid;
-
-			XLogBeginInsert();
-			XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-			(void) XLogInsert(RM_DBASE_ID,
-							  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-		}
+		CopyDatabase(db_id, db_id, src_tblspcoid, dst_tblspcoid);
 
 		/*
 		 * Update the database's pg_database tuple
@@ -1323,22 +1611,6 @@ movedb(const char *dbname, const char *tblspcname)
 		systable_endscan(sysscan);
 
 		/*
-		 * Force another checkpoint here.  As in CREATE DATABASE, this is to
-		 * ensure that we don't have to replay a committed XLOG_DBASE_CREATE
-		 * operation, which would cause us to lose any unlogged operations
-		 * done in the new DB tablespace before the next checkpoint.
-		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
-
-		/*
-		 * Force synchronous commit, thus minimizing the window between
-		 * copying the database files and committal of the transaction. If we
-		 * crash before committing, we'll leave an orphaned set of files on
-		 * disk, which is not fatal but not good either.
-		 */
-		ForceSyncCommit();
-
-		/*
 		 * Close pg_database, but keep lock till commit.
 		 */
 		table_close(pgdbrel, NoLock);
@@ -1347,6 +1619,27 @@ movedb(const char *dbname, const char *tblspcname)
 								PointerGetDatum(&fparms));
 
 	/*
+	 * Now drop all buffers holding data of the target database for the old
+	 * tablespace oid; We have already copied all the data to the new
+	 * tablespace so we no longer required the old buffers.
+	 *
+	 * It might seem that we could just let these buffers age out of shared
+	 * buffers naturally, since they should not get referenced anymore.  The
+	 * problem with that is that if the user later moves the database back to
+	 * its original tablespace, any still-surviving buffers would appear to
+	 * contain valid data again --- but they'd be missing any changes made in
+	 * the database while it was in the new tablespace.  In any case, freeing
+	 * buffers that should never be used again seems worth the cycles.
+	 */
+	DropDatabaseBuffers(db_id, src_tblspcoid);
+
+	/*
+	 * Also, clean out any fsync requests w.r.t. the old tablespace that might
+	 * be pending in md.c.
+	 */
+	ForgetDatabaseSyncRequests(db_id, src_tblspcoid);
+
+	/*
 	 * Commit the transaction so that the pg_database update is committed. If
 	 * we crash while removing files, the database won't be corrupt, we'll
 	 * just leave some orphaned files in the old directory.
@@ -1400,6 +1693,18 @@ movedb_failure_callback(int code, Datum arg)
 	movedb_failure_params *fparms = (movedb_failure_params *) DatumGetPointer(arg);
 	char	   *dstpath;
 
+	/*
+	 * Drop pages for database in destination tablespace that are in the shared
+	 * buffer cache.
+	 */
+	DropDatabaseBuffers(fparms->dest_dboid, fparms->dest_tsoid);
+
+	/*
+	 * Clean out any fsync requests w.r.t. the new tablespace that might
+	 * be pending in md.c.
+	 */
+	ForgetDatabaseSyncRequests(fparms->dest_dboid, fparms->dest_tsoid);
+
 	/* Get rid of anything we managed to copy to the target directory */
 	dstpath = GetDatabasePath(fparms->dest_dboid, fparms->dest_tsoid);
 
@@ -2138,39 +2443,11 @@ dbase_redo(XLogReaderState *record)
 	if (info == XLOG_DBASE_CREATE)
 	{
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
-		char	   *src_path;
-		char	   *dst_path;
-		struct stat st;
-
-		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
-		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		char	   *dbpath;
 
-		/*
-		 * Our theory for replaying a CREATE is to forcibly drop the target
-		 * subdirectory if present, then re-copy the source data. This may be
-		 * more work than needed, but it is simple to implement.
-		 */
-		if (stat(dst_path, &st) == 0 && S_ISDIR(st.st_mode))
-		{
-			if (!rmtree(dst_path, true))
-				/* If this failed, copydir() below is going to error. */
-				ereport(WARNING,
-						(errmsg("some useless files may be left behind in old database directory \"%s\"",
-								dst_path)));
-		}
-
-		/*
-		 * Force dirty buffers out to disk, to ensure source database is
-		 * up-to-date for the copy.
-		 */
-		FlushDatabaseBuffers(xlrec->src_db_id);
-
-		/*
-		 * Copy this subdirectory to the new location
-		 *
-		 * We don't need to copy subdirectories
-		 */
-		copydir(src_path, dst_path, false);
+		dbpath = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		CreateDirAndVersionFile(dbpath, xlrec->db_id, xlrec->tablespace_id,
+								true);
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{
diff --git a/src/include/commands/dbcommands_xlog.h b/src/include/commands/dbcommands_xlog.h
index f5ed762..21dc58e 100644
--- a/src/include/commands/dbcommands_xlog.h
+++ b/src/include/commands/dbcommands_xlog.h
@@ -23,11 +23,8 @@
 
 typedef struct xl_dbase_create_rec
 {
-	/* Records copying of a single subdirectory incl. contents */
 	Oid			db_id;
 	Oid			tablespace_id;
-	Oid			src_db_id;
-	Oid			src_tablespace_id;
 } xl_dbase_create_rec;
 
 typedef struct xl_dbase_drop_rec
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index bb3097f..7a5f6b5 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -460,6 +460,7 @@ CoverPos
 CreateAmStmt
 CreateCastStmt
 CreateConversionStmt
+CreateDBRelInfo
 CreateDomainStmt
 CreateEnumStmt
 CreateEventTrigStmt
-- 
1.8.3.1

v8-0006-Extend-ForgetDatabaseSyncRequests-interface.patch (text/x-patch; charset=US-ASCII)

From 32e67fcbe22f1407dc9b7c8860202dfe24f79d04 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Thu, 2 Dec 2021 17:34:10 +0530
Subject: [PATCH v8 6/7] Extend ForgetDatabaseSyncRequests interface

Extend the interface such that it can forget the database sync request
only for the specific tablespace.
---
 src/backend/commands/dbcommands.c | 4 ++--
 src/backend/storage/smgr/md.c     | 4 ++--
 src/include/storage/md.h          | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 1d963d8..85fe598 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -951,7 +951,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
 	 * worse, it will delete files that belong to a newly created database
 	 * with the same OID.
 	 */
-	ForgetDatabaseSyncRequests(db_id);
+	ForgetDatabaseSyncRequests(db_id, InvalidOid);
 
 	/*
 	 * Force a checkpoint to make sure the checkpointer has received the
@@ -2201,7 +2201,7 @@ dbase_redo(XLogReaderState *record)
 		DropDatabaseBuffers(xlrec->db_id, InvalidOid);
 
 		/* Also, clean out any fsync requests that might be pending in md.c */
-		ForgetDatabaseSyncRequests(xlrec->db_id);
+		ForgetDatabaseSyncRequests(xlrec->db_id, InvalidOid);
 
 		/* Clean out the xlog relcache too */
 		XLogDropDatabase(xlrec->db_id);
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index b4bca7e..3b5ae1c 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1029,13 +1029,13 @@ register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
  * ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB
  */
 void
-ForgetDatabaseSyncRequests(Oid dbid)
+ForgetDatabaseSyncRequests(Oid dbid, Oid tbsid)
 {
 	FileTag		tag;
 	RelFileNode rnode;
 
 	rnode.dbNode = dbid;
-	rnode.spcNode = 0;
+	rnode.spcNode = tbsid;
 	rnode.relNode = 0;
 
 	INIT_MD_FILETAG(tag, rnode, InvalidForkNumber, InvalidBlockNumber);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 752b440..9502330 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -41,7 +41,7 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
 					   BlockNumber nblocks);
 extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
 
-extern void ForgetDatabaseSyncRequests(Oid dbid);
+extern void ForgetDatabaseSyncRequests(Oid dbid, Oid tbsid);
 extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
 
 /* md sync callbacks */
-- 
1.8.3.1

#67

Ashutosh Sharma

ashu.coek88@gmail.com

about 4 years ago

In reply to: Dilip Kumar (#66)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

+       /*
+        * If the relation is from the default tablespace then we need to
+        * create it in the destinations db's default tablespace.  Otherwise,
+        * we need to create in the same tablespace as it is in the source
+        * database.
+        */

This comment looks a bit confusing to me, especially because when we say
"destination db's default tablespace", people may think of the pg_default
tablespace (at least I did). Basically, what you are trying to say here is:
"If the relation exists in the same tablespace as the source database, then
in the destination database it should also be in the same tablespace as that
database", or something like that. So why not put it that way instead of
referring to it as the default tablespace? It's just my view; if you
disagree, you can ignore it.

+       else if (src_dboid == dst_dboid)
+           continue;
+       else
+           dstrnode.spcNode = srcrnode.spcNode;;

There is an extra semicolon here.
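
For what it's worth, here is how I read the rule that the comment is trying to describe, written as a small standalone function. This is only a sketch under assumptions: Oid is redefined locally, choose_dst_tablespace() and all of its parameter names are hypothetical, and the first branch is inferred from the comment rather than copied from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int Oid;	/* local stand-in for PostgreSQL's Oid type */

/*
 * Hypothetical helper: decide which tablespace a copied relation goes into.
 * Returns false when the relation should be skipped (the "continue" case).
 */
bool
choose_dst_tablespace(Oid rel_spc, Oid src_db_spc, Oid dst_db_spc,
					  Oid src_dboid, Oid dst_dboid, Oid *dst_rel_spc)
{
	assert(dst_rel_spc != NULL);

	/* Relation lives in the source database's own tablespace. */
	if (rel_spc == src_db_spc)
	{
		*dst_rel_spc = dst_db_spc;
		return true;
	}

	/* Same database (the movedb case): the relation stays where it is. */
	if (src_dboid == dst_dboid)
		return false;

	/* Otherwise keep the tablespace it has in the source database. */
	*dst_rel_spc = rel_spc;
	return true;
}
```

Written this way, the comment could simply say "the source database's tablespace" and "the destination database's tablespace", which avoids any confusion with pg_default.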

--
With Regards,
Ashutosh Sharma.

On Sun, Dec 12, 2021 at 1:39 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Dec 10, 2021 at 7:39 AM Ashutosh Sharma <ashu.coek88@gmail.com>
wrote:

Logfile Snippet:
2021-12-09 17:49:18.110 +04 [18151] PANIC: could not fsync file "base/116398/116400": No such file or directory
2021-12-09 17:49:19.105 +04 [18150] LOG: checkpointer process (PID 18151) was terminated by signal 6: Aborted
2021-12-09 17:49:19.105 +04 [18150] LOG: terminating any other active server processes

This is different from the issue you raised earlier. As Dilip said, we
need to unregister sync requests for files that got successfully copied to
the target database, but the overall alter database statement failed. We
are doing this when the database is created successfully, but not when it
fails.

Probably doing the same inside the cleanup function
movedb_failure_callback() should fix the problem.

Correct, I have done this cleanup. Apart from this, we have dropped the
fsync request in the create database failure case as well, and we also need
to drop buffers in the error case of createdb as well as movedb. I have also
fixed the other issue for which you gave the patch (a bit differently):
basically, in the case of movedb the source and destination dboid are the
same, so we don't need an additional parameter. I also readjusted the
conditions to avoid nested ifs.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#68

Dilip Kumar

dilipbalaut@gmail.com

about 4 years ago

In reply to: Ashutosh Sharma (#67)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Mon, Dec 13, 2021 at 8:34 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

+       /*
+        * If the relation is from the default tablespace then we need to
+        * create it in the destinations db's default tablespace.  Otherwise,
+        * we need to create in the same tablespace as it is in the source
+        * database.
+        */
This comment looks a bit confusing to me especially because when we say destination db's default tablespace people may think of pg_default tablespace (at least I think so). Basically what you are trying to say here - "If the relation exists in the same tablespace as the src database, then in the destination db also it should be the same or something like that.. " So, why not put it that way instead of referring to it as the default tablespace. It's just my view. If you disagree you can ignore it.

--
+       else if (src_dboid == dst_dboid)
+           continue;
+       else
+           dstrnode.spcNode = srcrnode.spcNode;;
There is an extra semicolon here.

Noted. I will fix them in the next version.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#69

Bruce Momjian

bruce@momjian.us

about 4 years ago

In reply to: Dilip Kumar (#48)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Dec 2, 2021 at 07:19:50PM +0530, Dilip Kumar wrote:
From the patch:

Currently, CREATE DATABASE forces a checkpoint, then copies all the files,
then forces another checkpoint. The comments in the createdb() function
explain the reasons for this. The attached patch fixes this problem by making
create database completely WAL logged so that we can avoid the checkpoints.

This can also be useful for supporting the TDE. For example, if we need different
encryption for the source and the target database then we can not re-encrypt the
page data if we copy the whole directory. But with this patch, we are copying
page by page so we have an opportunity to re-encrypt the page before copying that
to the target database.

Uh, why is this true? Why can't we just copy the heap/index files 8k at
a time and reencrypt them during the file copy, rather than using shared
buffers?

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

#70

Dilip Kumar

dilipbalaut@gmail.com

about 4 years ago

In reply to: Bruce Momjian (#69)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Dec 16, 2021 at 12:15 AM Bruce Momjian <bruce@momjian.us> wrote:

On Thu, Dec 2, 2021 at 07:19:50PM +0530, Dilip Kumar wrote:
From the patch:

Currently, CREATE DATABASE forces a checkpoint, then copies all the files,
then forces another checkpoint. The comments in the createdb() function
explain the reasons for this. The attached patch fixes this problem by making
create database completely WAL logged so that we can avoid the checkpoints.

This can also be useful for supporting the TDE. For example, if we need different
encryption for the source and the target database then we can not re-encrypt the
page data if we copy the whole directory. But with this patch, we are copying
page by page so we have an opportunity to re-encrypt the page before copying that
to the target database.

Uh, why is this true? Why can't we just copy the heap/index files 8k at
a time and reencrypt them during the file copy, rather than using shared
buffers?

Hi Bruce,

Yeah, you are right that if we copy in 8k blocks then we can re-encrypt
each page, but in the current system we are not copying block by
block. The main effort for this patch is not only for TDE but to
get rid of the checkpoints we are forced to do before and after create
database. So my point is only that, since this patch copies page
by page, we get an opportunity to re-encrypt each page. I agree that if
re-encryption were the main goal of this patch, then
we could copy the files in 8k blocks and re-encrypt those blocks; even
if we had to access some page data for re-encryption (like the
nonce), we could do that too. But that is not the main objective.
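
To make the "copy in 8k blocks and transform each block" idea concrete, here is a minimal sketch over in-memory buffers. Everything in it is an assumption for illustration: copy_blocks(), the block_transform callback type, and xor_transform() are hypothetical names, and the XOR step is only a toy stand-in for re-encryption, not a real cipher and not anything from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 8192			/* same as PostgreSQL's default BLCKSZ */

/* Hypothetical per-block hook, e.g. "re-encrypt with the target's key". */
typedef void (*block_transform) (unsigned char *block, size_t len, void *arg);

/* Copy nblocks blocks from src to dst, transforming each block on the way. */
void
copy_blocks(const unsigned char *src, unsigned char *dst, size_t nblocks,
			block_transform transform, void *arg)
{
	unsigned char buf[BLOCK_SIZE];

	assert(src != NULL && dst != NULL);

	for (size_t i = 0; i < nblocks; i++)
	{
		memcpy(buf, src + i * BLOCK_SIZE, BLOCK_SIZE);
		if (transform != NULL)
			transform(buf, BLOCK_SIZE, arg);
		memcpy(dst + i * BLOCK_SIZE, buf, BLOCK_SIZE);
	}
}

/* Toy transform: XOR every byte with a one-byte key. NOT real encryption. */
void
xor_transform(unsigned char *block, size_t len, void *arg)
{
	unsigned char key = *(unsigned char *) arg;

	for (size_t i = 0; i < len; i++)
		block[i] ^= key;
}
```

The point of the sketch is just that the hook only needs a block-at-a-time copy loop; it works the same whether the blocks come from a plain file copy or from shared buffers.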

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#71

Neha Sharma

neha.sharma@enterprisedb.com

about 4 years ago

In reply to: Dilip Kumar (#70)

1 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hi,

While testing the v8 patches in a hot-standby setup, it was observed that the
master crashes with the error below:

2021-12-16 19:32:47.757 +04 [101483] PANIC: could not fsync file
"pg_tblspc/16385/PG_15_202112111/16386/16391": No such file or directory
2021-12-16 19:32:48.917 +04 [101482] LOG: checkpointer process (PID
101483) was terminated by signal 6: Aborted

Parameters configured at master:
wal_level = hot_standby
max_wal_senders = 3
hot_standby = on
max_standby_streaming_delay= -1
wal_consistency_checking='all'
max_wal_size= 10GB
checkpoint_timeout= 1d
log_min_messages=debug1

Test Case:
create tablespace tab1 location
'/home/edb/PGsources/postgresql/inst/bin/test1';
create tablespace tab location
'/home/edb/PGsources/postgresql/inst/bin/test';
create database test tablespace tab;
\c test
create table t( a int PRIMARY KEY,b text);
CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS 'select
array_agg(md5(g::text))::text from generate_series(1, 256) g';
insert into t values (generate_series(1,100000), large_val());
alter table t set tablespace tab1 ;
\c postgres
create database test1 template test;
\c test1
alter table t set tablespace tab;
\c postgres
alter database test1 set tablespace tab1;

--cancel the below command
alter database test1 set tablespace pg_default; --press ctrl+c
\c test1
alter table t set tablespace tab1;

Log file attached for reference.

Thanks.
--
Regards,
Neha Sharma

#72

Ashutosh Sharma

ashu.coek88@gmail.com

about 4 years ago

In reply to: Neha Sharma (#71)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

I am getting the below error when running the same test-case that Neha
shared in her previous email.

ERROR: 55000: some relations of database "test1" are already in tablespace "tab1"
HINT: You must move them back to the database's default tablespace before using this command.
LOCATION: movedb, dbcommands.c:1555

test-case:
========
create tablespace tab1 location '/home/ashu/test1';
create tablespace tab location '/home/ashu/test';

create database test tablespace tab;
\c test

create table t(a int primary key, b text);

create or replace function large_val() returns text language sql as 'select
array_agg(md5(g::text))::text from generate_series(1, 256) g';

insert into t values (generate_series(1,100000), large_val());

alter table t set tablespace tab1 ;

\c postgres
create database test1 template test;

\c test1
alter table t set tablespace tab;

\c postgres
alter database test1 set tablespace tab1; -- this fails with the given error.

Observations:
===========
Please note that before running the above alter database statement, the table
't' is moved from tablespace 'tab1' to 'tab', so I am not sure why ReadDir() is
returning a non-NULL entry when searching for table 't' in tablespace 'tab1'.
It should have returned NULL here:

while ((xlde = ReadDir(dstdir, dst_dbpath)) != NULL)
{
    if (strcmp(xlde->d_name, ".") == 0 ||
        strcmp(xlde->d_name, "..") == 0)
        continue;

    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
             errmsg("some relations of database \"%s\" are already in tablespace \"%s\"",
                    dbname, tblspcname),
             errhint("You must move them back to the database's default tablespace before using this command.")));
}

Also, if I run the checkpoint explicitly before executing the above alter
database statement, this error doesn't appear which means it only happens
with the new changes because earlier we were doing the force checkpoint at
the end of createdb statement.
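
To see what that check is sensitive to outside the server, here is a standalone sketch of the same "is there anything in this directory" test. It uses plain POSIX opendir()/readdir() instead of PostgreSQL's AllocateDir()/ReadDir() wrappers, and dir_has_entries() is a hypothetical name of mine, so treat it as an illustration only.

```c
#include <assert.h>
#include <dirent.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if the directory contains any entry other than "." and "..".
 * A directory that cannot be opened is reported as having no entries.
 */
bool
dir_has_entries(const char *path)
{
	DIR		   *dir;
	struct dirent *de;
	bool		found = false;

	assert(path != NULL);

	dir = opendir(path);
	if (dir == NULL)
		return false;

	while ((de = readdir(dir)) != NULL)
	{
		if (strcmp(de->d_name, ".") == 0 ||
			strcmp(de->d_name, "..") == 0)
			continue;

		/* Any leftover file counts, even a zero-length one. */
		found = true;
		break;
	}

	closedir(dir);
	return found;
}
```

Note that the test only looks at directory entries, not file sizes, so a relation file that has been emptied but not yet removed would still make it return true.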

--
With Regards,
Ashutosh Sharma.

#73

Dilip Kumar

dilipbalaut@gmail.com

about 4 years ago

In reply to: Neha Sharma (#71)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Dec 16, 2021 at 9:26 PM Neha Sharma
<neha.sharma@enterprisedb.com> wrote:

Hi,

While testing the v8 patches in a hot-standby setup, it was observed the master is crashing with the below error;

2021-12-16 19:32:47.757 +04 [101483] PANIC: could not fsync file "pg_tblspc/16385/PG_15_202112111/16386/16391": No such file or directory
2021-12-16 19:32:48.917 +04 [101482] LOG: checkpointer process (PID 101483) was terminated by signal 6: Aborted

Parameters configured at master:
wal_level = hot_standby
max_wal_senders = 3
hot_standby = on
max_standby_streaming_delay= -1
wal_consistency_checking='all'
max_wal_size= 10GB
checkpoint_timeout= 1d
log_min_messages=debug1

Test Case:
create tablespace tab1 location '/home/edb/PGsources/postgresql/inst/bin/test1';
create tablespace tab location '/home/edb/PGsources/postgresql/inst/bin/test';
create database test tablespace tab;
\c test
create table t( a int PRIMARY KEY,b text);
CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS 'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
insert into t values (generate_series(1,100000), large_val());
alter table t set tablespace tab1 ;
\c postgres
create database test1 template test;
\c test1
alter table t set tablespace tab;
\c postgres
alter database test1 set tablespace tab1;

--cancel the below command
alter database test1 set tablespace pg_default; --press ctrl+c
\c test1
alter table t set tablespace tab1;

Log file attached for reference.

Seems like this is an existing issue, and I am able to reproduce it on
PostgreSQL head as well [1].

[1]: /messages/by-id/CAFiTN-szX=ayO80EnSWonBu1YMZrpOr9V0R3BzHBSjMrMPAeMg@mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#74

Ashutosh Sharma

ashu.coek88@gmail.com

about 4 years ago

In reply to: Ashutosh Sharma (#72)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hi Dilip,

On Tue, Dec 21, 2021 at 11:10 AM Ashutosh Sharma <ashu.coek88@gmail.com>
wrote:

Also, if I run the checkpoint explicitly before executing the above alter
database statement, this error doesn't appear which means it only happens
with the new changes because earlier we were doing the force checkpoint at
the end of createdb statement.

Is this expected? I think it is not.

--
With Regards,
Ashutosh Sharma.

#75

Dilip Kumar

dilipbalaut@gmail.com

about 4 years ago

In reply to: Ashutosh Sharma (#72)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Tue, Dec 21, 2021 at 11:10 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

I am getting the below error when running the same test-case that Neha shared in her previous email.

ERROR: 55000: some relations of database "test1" are already in tablespace "tab1"
HINT: You must move them back to the database's default tablespace before using this command.
LOCATION: movedb, dbcommands.c:1555

test-case:
========
create tablespace tab1 location '/home/ashu/test1';
create tablespace tab location '/home/ashu/test';

create database test tablespace tab;
\c test

create table t(a int primary key, b text);

create or replace function large_val() returns text language sql as 'select array_agg(md5(g::text))::text from generate_series(1, 256) g';

insert into t values (generate_series(1,100000), large_val());

alter table t set tablespace tab1 ;

\c postgres
create database test1 template test;

\c test1
alter table t set tablespace tab;

\c postgres
alter database test1 set tablespace tab1; -- this fails with the given error.

Observations:
===========
Please note that before running above alter database statement, the table 't' is moved to tablespace 'tab' from 'tab1' so not sure why ReadDir() is returning true when searching for table 't' in tablespace 'tab1'. It should have returned NULL here:

while ((xlde = ReadDir(dstdir, dst_dbpath)) != NULL)
{
if (strcmp(xlde->d_name, ".") == 0 ||
strcmp(xlde->d_name, "..") == 0)
continue;

ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("some relations of database \"%s\" are already in tablespace \"%s\"",
dbname, tblspcname),
errhint("You must move them back to the database's default tablespace before using this command.")));
}

Also, if I run the checkpoint explicitly before executing the above alter database statement, this error doesn't appear which means it only happens with the new changes because earlier we were doing the force checkpoint at the end of createdb statement.

Basically, ALTER TABLE SET TABLESPACE will register a
SYNC_UNLINK_REQUEST for the table files in the old tablespace, but
those files only get unlinked during the next checkpoint. Although the
files will have been truncated at commit itself, the unlink might not have
been processed until the next checkpoint. This is the explanation for
the behavior you found during your investigation, but I haven't looked
into the issue yet, so I will do that by tomorrow at the latest and send my
analysis.
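
As a toy model of that sequence (truncate at commit, unlink only at the next checkpoint), consider the sketch below. It is purely illustrative: ToyDir, ToyFile and the toy_* functions are made-up names and do not correspond to anything in md.c or sync.c.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_FILES 8

typedef struct ToyFile
{
	bool		exists;			/* directory entry still present? */
	long		size;			/* file length in bytes */
	bool		unlink_pending; /* queued for removal at next checkpoint */
} ToyFile;

typedef struct ToyDir
{
	ToyFile		files[MAX_FILES];
	int			nfiles;
} ToyDir;

/* Create a file in the toy directory and return its slot number. */
int
toy_create(ToyDir *dir, long size)
{
	int			idx;

	assert(dir->nfiles < MAX_FILES);
	idx = dir->nfiles++;
	dir->files[idx].exists = true;
	dir->files[idx].size = size;
	dir->files[idx].unlink_pending = false;
	return idx;
}

/* At commit: truncate immediately, but only queue the unlink. */
void
toy_drop_at_commit(ToyDir *dir, int idx)
{
	dir->files[idx].size = 0;
	dir->files[idx].unlink_pending = true;
}

/* At checkpoint: process all queued unlinks. */
void
toy_checkpoint(ToyDir *dir)
{
	for (int i = 0; i < dir->nfiles; i++)
	{
		if (dir->files[i].unlink_pending)
		{
			dir->files[i].exists = false;
			dir->files[i].unlink_pending = false;
		}
	}
}

/* The movedb-style check: is any entry left in the directory? */
bool
toy_dir_has_entries(const ToyDir *dir)
{
	for (int i = 0; i < dir->nfiles; i++)
		if (dir->files[i].exists)
			return true;
	return false;
}
```

In this model the directory check keeps reporting an entry after the commit and stops doing so only once the checkpoint has run, which matches what you observed when issuing an explicit checkpoint.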

Thanks for working on this.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#76

Ashutosh Sharma

ashu.coek88@gmail.com

about 4 years ago

In reply to: Dilip Kumar (#75)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Dec 22, 2021 at 2:44 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Basically, ALTER TABLE SET TABLESPACE, will register the
SYNC_UNLINK_REQUEST for the table files w.r.t the old tablespace, but
those will get unlinked during the next checkpoint. Although the
files must be truncated during commit itself but unlink might not have
been processed until the next checkpoint. This is the explanation for
the behavior you found during your investigation, but I haven't looked
into the issue so I will do it latest by tomorrow and send my
analysis.

Thanks for working on this.

Yeah, the problem here is that the old rel file that needs to be unlinked
still exists in the old tablespace. Earlier, without your changes, we were
forcing a checkpoint before starting the actual work for the alter
database, which unlinked the rel file from the old tablespace, but
that is not the case here. Now that we have removed the forced checkpoint
from movedb(), the old rel file will remain in the old tablespace until an
automatic checkpoint happens, thereby creating this problem.

--
With Regards,
Ashutosh Sharma.

#77

Dilip Kumar

dilipbalaut@gmail.com

about 4 years ago

In reply to: Ashutosh Sharma (#76)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Dec 22, 2021 at 4:26 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Basically, ALTER TABLE SET TABLESPACE, will register the
SYNC_UNLINK_REQUEST for the table files w.r.t the old tablespace, but
those will get unlinked during the next checkpoint. Although the
files must be truncated during commit itself but unlink might not have
been processed until the next checkpoint. This is the explanation for
the behavior you found during your investigation, but I haven't looked
into the issue so I will do it latest by tomorrow and send my
analysis.

Thanks for working on this.

Yeah the problem here is that the old rel file that needs to be unlinked still exists in the old tablespace. Earlier, without your changes we were doing force checkpoint before starting with the actual work for the alter database which unlinked/deleted the rel file from the old tablespace, but that is not the case here. Now we have removed the force checkpoint from movedb() which means until the auto checkpoint happens the old rel file will remain in the old tablespace thereby creating this problem.

One solution to this problem could be that, similar to mdpostckpt(),
we invent one more function which takes the dboid and the destination
tablespace oid as input and processes all the pending unlink requests for
that dboid and tablespace oid. Before doing that we should also call
ForgetDatabaseSyncRequests(), so that the next checkpoint does not flush
some old request.
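
Roughly, the filtering part of that idea could look like the sketch below. It is a standalone illustration, not code against sync.c: PendingUnlink, its fields, and process_unlinks_for() are hypothetical, Oid is redefined locally, and the real unlink() call is replaced by setting a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int Oid;		/* local stand-in for PostgreSQL's Oid type */

/* Hypothetical pending-unlink entry, keyed like a relfilenode. */
typedef struct PendingUnlink
{
	Oid			spc;			/* tablespace oid */
	Oid			db;				/* database oid */
	Oid			rel;			/* relfilenode */
	bool		done;			/* already unlinked? */
} PendingUnlink;

/*
 * Process every outstanding entry that belongs to the given database and
 * tablespace, leaving all other entries for the next checkpoint.
 * Returns the number of entries processed.
 */
int
process_unlinks_for(PendingUnlink *list, int n, Oid dboid, Oid spcoid)
{
	int			processed = 0;

	assert(list != NULL || n == 0);

	for (int i = 0; i < n; i++)
	{
		if (list[i].done)
			continue;
		if (list[i].db != dboid || list[i].spc != spcoid)
			continue;

		/* Real code would unlink the file here; we only mark it. */
		list[i].done = true;
		processed++;
	}
	return processed;
}
```

Marking the entry as done stands in for cancelling the request, so that a later pass (the next checkpoint, in the real system) does not try to handle the same file again.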

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#78

Ashutosh Sharma

ashu.coek88@gmail.com

about 4 years ago

In reply to: Dilip Kumar (#77)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Dec 22, 2021 at 5:06 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Dec 22, 2021 at 4:26 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Basically, ALTER TABLE SET TABLESPACE, will register the
SYNC_UNLINK_REQUEST for the table files w.r.t the old tablespace, but
those will get unlinked during the next checkpoint. Although the
files must be truncated during commit itself but unlink might not have
been processed until the next checkpoint. This is the explanation for
the behavior you found during your investigation, but I haven't looked
into the issue so I will do it latest by tomorrow and send my
analysis.

Thanks for working on this.

Yeah the problem here is that the old rel file that needs to be unlinked still exists in the old tablespace. Earlier, without your changes we were doing force checkpoint before starting with the actual work for the alter database which unlinked/deleted the rel file from the old tablespace, but that is not the case here. Now we have removed the force checkpoint from movedb() which means until the auto checkpoint happens the old rel file will remain in the old tablespace thereby creating this problem.

One solution to this problem could be that, similar to mdpostckpt(),
we invent one more function which takes dboid and dsttblspc oid as
input and it will unlink all the requests which are w.r.t. the dboid
and tablespaceoid, and before doing it we should also do
ForgetDatabaseSyncRequests(), so that next checkpoint does not flush
some old request.

I couldn't find the mdpostchkpt() function. Are you talking about
SyncPostCheckpoint() ? Anyway, as you have rightly said, we need to
unlink all the files available inside the dst_tablespaceoid/dst_dboid/
directory by scanning the pendingUnlinks list. And finally we don't
want the next checkpoint to unlink this file again and PANIC so for
that we have to update the entry for this unlinked rel file in the
hash table i.e. cancel the sync request for it.

--
With Regards,
Ashutosh Sharma.

#79

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Robert Haas (#55)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Mon, Dec 6, 2021 at 12:45 PM Robert Haas <robertmhaas@gmail.com> wrote:

So for example, imagine tests with 1GB of shared_buffers, 8GB, and
64GB. And template databases with sizes of whatever the default is,
1GB, 10GB, 100GB. Repeatedly make 75% of the pages dirty and then
create a new database from one of the templates. And then just measure
the performance. Maybe for large databases this approach is just
really the pits -- and if your max_wal_size is too small, it
definitely will be. But, I don't know, maybe with reasonable settings
it's not that bad. Writing everything to disk twice - once to WAL and
once to the target directory - has to be more expensive than doing it
once. But on the other hand, it's all sequential I/O and the data
pages don't need to be fsync'd, so perhaps the overhead is relatively
mild. I don't know.

I have been tied up with other things for a bit now and have not had
time to look at this thread; sorry about that. I have a little more
time available now so I thought I would take a look at this again and
see where things stand.

Sadly, it doesn't appear to me that anyone has done any performance
testing of this patch, along the lines suggested above or otherwise,
and I think it's a crucial question for the patch. My reading of this
thread is that nobody really likes the idea of maintaining two methods
for performing CREATE DATABASE, but nobody wants to hose people who
are using it to clone large databases, either. To some extent those
things are inexorably in conflict. If we postulate that the 10TB
template database is on a local RAID array with 40 spindles, while
pg_wal is on an iSCSI volume that we access via a 128kB ISDN link,
then the new system is going to be infinitely worse. But real
situations aren't likely to be that bad, and it would be useful in my
opinion to have an idea how bad they actually are.

I'm somewhat inclined to propose that we keep the existing method
around along with the new method. Even though nobody really likes
that, we don't necessarily have to maintain both methods forever. If,
say, we use the new method by default in all cases, but add an option
to get the old method back if you need it, we could leave it that way
for a few years and then propose removing the old method (and the
switch to activate it) and see if anyone complains. That way, if the
new method turns out to suck in certain cases, users have a way out.
However, I still think doing some performance testing would be a
really good idea. It's not a great plan to make decisions about this
kind of thing in an information vacuum.

--
Robert Haas
EDB: http://www.enterprisedb.com

#80

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Ashutosh Sharma (#78)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Dec 22, 2021 at 9:32 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

I couldn't find the mdpostchkpt() function. Are you talking about
SyncPostCheckpoint() ? Anyway, as you have rightly said, we need to
unlink all the files available inside the dst_tablespaceoid/dst_dboid/
directory by scanning the pendingUnlinks list. And finally we don't
want the next checkpoint to unlink this file again and PANIC so for
that we have to update the entry for this unlinked rel file in the
hash table i.e. cancel the sync request for it.

Until commit 3eb77eba5a51780d5cf52cd66a9844cd4d26feb0 in April 2019,
there was an mdpostckpt function, which is probably what was meant
here.

--
Robert Haas
EDB: http://www.enterprisedb.com

#81

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#66)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Sun, Dec 12, 2021 at 3:09 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Correct, I have done this cleanup, apart from this we have dropped the
fsync request in create database failure case as well and also need to
drop buffer in error case of createdb as well as movedb. I have also
fixed the other issue for which you gave the patch (a bit differently)
basically, in case of movedb the source and destination dboid are same
so we don't need an additional parameter and also readjusted the
conditions to avoid nested if.

Amazingly to me given how much time has passed, these patches still
apply, although I think there are a few outstanding issues that you
promised to fix in the next version and haven't yet addressed.

In 0007, I think you will need to work a bit harder. I don't think
that you can just add a second argument to
ForgetDatabaseSyncRequests() that makes it do something other than
what the name of the function suggests but without renaming the
function or updating any comments. Elsewhere we have things like
TablespaceCreateDbspace and ResetUnloggedRelationsInDbspaceDir so
perhaps we ought to just add a new function with a name inspired by
those precedents alongside the existing one, rather than doing it this
way.

In 0008, this is a bit confusing:

+               PageInit(dstPage, BufferGetPageSize(dstBuf), 0);
+               memcpy(dstPage, srcPage, BLCKSZ);

After a minute, I figured out that the point here was to force
log_newpage() to actually set the LSN, but how about a comment?

I kind of wonder whether GetDatabaseRelationList should be broken into
two functions so that we don't have quite such deep nesting. And I wonder
if maybe the return value of GetActiveSnapshot() should be cached in a
local variable.

On the whole I think there aren't huge code-level issues here, even if
things need to be tweaked here and there and bugs fixed. The real key
is arriving at a set of design trade-offs that doesn't make anyone too
upset.

--
Robert Haas
EDB: http://www.enterprisedb.com

#82

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Robert Haas (#79)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Tue, Feb 8, 2022 at 10:39 PM Robert Haas <robertmhaas@gmail.com> wrote:

I have been tied up with other things for a bit now and have not had
time to look at this thread; sorry about that. I have a little more
time available now so I thought I would take a look at this again and
see where things stand.

Thanks for looking into this.

Sadly, it doesn't appear to me that anyone has done any performance
testing of this patch, along the lines suggested above or otherwise,
and I think it's a crucial question for the patch.

Yeah, actually some performance testing was started, as shared by Ashutosh
[1], but we then ran into another issue [2] that
we thought had to be fixed before we proceed with this feature.

My reading of this
thread is that nobody really likes the idea of maintaining two methods
for performing CREATE DATABASE, but nobody wants to hose people who
are using it to clone large databases, either. To some extent those
things are inexorably in conflict. If we postulate that the 10TB
template database is on a local RAID array with 40 spindles, while
pg_wal is on an iSCSI volume that we access via a 128kB ISDN link,
then the new system is going to be infinitely worse. But real
situations aren't likely to be that bad, and it would be useful in my
opinion to have an idea how bad they actually are.

Yeah that makes sense, I will work on performance testing in this line
and also on previous ideas you suggested.

I'm somewhat inclined to propose that we keep the existing method
around along with the new method. Even though nobody really likes
that, we don't necessarily have to maintain both methods forever. If,
say, we use the new method by default in all cases, but add an option
to get the old method back if you need it, we could leave it that way
for a few years and then propose removing the old method (and the
switch to activate it) and see if anyone complains. That way, if the
new method turns out to suck in certain cases, users have a way out.
However, I still think doing some performance testing would be a
really good idea. It's not a great plan to make decisions about this
kind of thing in an information vacuum.

Yeah that makes sense to me.

Now, one bigger question is can we proceed with this patch without
fixing [2], IMHO, if we are deciding to keep the old method also
intact then one option could be that for now only change CREATE
DATABASE to support both old and new way of creating database and for
time being leave the ALTER DATABASE SET TABLESPACE alone and let it
work only with the old method? And another option is that we first
fix the issue related to the tombstone file and then come back to
this?

IMHO, the first option could be better in a way that we have already
made better progress in this patch and this is in better shape than
the other patch we are trying to make for removing the tombstone
files.

[1]: /messages/by-id/CAE9k0Pkg20tHq8oiJ+xXa9=af3QZCSYTw99aBaPthA1UMKhnTg@mail.gmail.com
[2]: /messages/by-id/CA+TgmobM5FN5x0u3tSpoNvk_TZPFCdbcHxsXCoY1ytn1dXROvg@mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#83

Bruce Momjian

bruce@momjian.us

almost 4 years ago

In reply to: Robert Haas (#79)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Tue, Feb 8, 2022 at 12:09:08PM -0500, Robert Haas wrote:

Sadly, it doesn't appear to me that anyone has done any performance
testing of this patch, along the lines suggested above or otherwise,
and I think it's a crucial question for the patch. My reading of this
thread is that nobody really likes the idea of maintaining two methods
for performing CREATE DATABASE, but nobody wants to hose people who
are using it to clone large databases, either. To some extent those
things are inexorably in conflict. If we postulate that the 10TB
template database is on a local RAID array with 40 spindles, while
pg_wal is on an iSCSI volume that we access via a 128kB ISDN link,
then the new system is going to be infinitely worse. But real
situations aren't likely to be that bad, and it would be useful in my
opinion to have an idea how bad they actually are.

Honestly, I never understood why the checkpoint during CREATE DATABASE
was a problem --- we checkpoint by default every five minutes anyway,
so why is an additional two a problem --- it just means the next
checkpoint will do less work. It is hard to see how avoiding
checkpoints to add WAL writes, fsyncs, and replication traffic could be
a win.

I see the patch justification outlined here:

/messages/by-id/CAFiTN-sP6yLVTfjR42mEfvFwJ-SZ2iEtG1t0j=QX09X=BM+KWQ@mail.gmail.com

TDE is mentioned as a value for this patch, but I don't see why it is
needed --- TDE can easily decrypt/encrypt the pages while they are
copied.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

#84

Andrew Dunstan

andrew@dunslane.net

almost 4 years ago

In reply to: Dilip Kumar (#10)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On 6/16/21 03:52, Dilip Kumar wrote:

On Tue, Jun 15, 2021 at 7:01 PM Andrew Dunstan <andrew@dunslane.net> wrote:

Rather than use size, I'd be inclined to say use this if the source
database is marked as a template, and use the copydir approach for
anything that isn't.

Yeah, that is possible, on the other thought wouldn't it be good to
provide control to the user by providing two different commands, e.g.
COPY DATABASE for the existing method (copydir) and CREATE DATABASE
for the new method (fully wal logged)?

This proposal seems to have gotten lost.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#85

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Bruce Momjian (#83)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Feb 9, 2022 at 7:49 PM Bruce Momjian <bruce@momjian.us> wrote:

Honestly, I never understood why the checkpoint during CREATE DATABASE
was a problem --- we checkpoint by default every five minutes anyway,
so why is an additional two a problem --- it just means the next
checkpoint will do less work. It is hard to see how avoiding
checkpoints to add WAL writes, fsyncs, and replication traffic could be
a win.

But don't you think that the current way of WAL logging the CREATE
DATABASE is a bit hacky? I mean we are just logically WAL logging the
source and destination directory paths without actually WAL logging
what content we want to copy. IMHO this is against the basic
principle of WAL and that's the reason we are forcefully checkpointing
to avoid replaying that WAL during crash recovery. Even after this
some of the code comments say that we have limitations during PITR [1]
and we want to avoid it sometime in the future.

[1]
     * In PITR replay, the first of these isn't an issue, and the second
     * is only a risk if the CREATE DATABASE and subsequent template
     * database change both occur while a base backup is being taken.
     * There doesn't seem to be much we can do about that except document
     * it as a limitation.
     *
     * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
     * we can avoid this.
     */
    RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#86

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Andrew Dunstan (#84)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Feb 9, 2022 at 9:25 PM Andrew Dunstan <andrew@dunslane.net> wrote:

On 6/16/21 03:52, Dilip Kumar wrote:

On Tue, Jun 15, 2021 at 7:01 PM Andrew Dunstan <andrew@dunslane.net> wrote:

Rather than use size, I'd be inclined to say use this if the source
database is marked as a template, and use the copydir approach for
anything that isn't.

Yeah, that is possible, on the other thought wouldn't it be good to
provide control to the user by providing two different commands, e.g.
COPY DATABASE for the existing method (copydir) and CREATE DATABASE
for the new method (fully wal logged)?

This proposal seems to have gotten lost.

Yeah, I am planning to work on this part so that we can support both methods.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#87

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Bruce Momjian (#83)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Feb 9, 2022 at 9:19 AM Bruce Momjian <bruce@momjian.us> wrote:

Honestly, I never understood why the checkpoint during CREATE DATABASE
was a problem --- we checkpoint by default every five minutes anyway,
so why is an additional two a problem --- it just means the next
checkpoint will do less work. It is hard to see how avoiding
checkpoints to add WAL writes, fsyncs, and replication traffic could be
a win.

Try running pgbench with the --progress option and enough concurrent
jobs to keep a moderately large system busy and watching what happens
to the tps each time a checkpoint occurs. It's extremely dramatic, or
at least it was the last time I ran such tests. I think that
performance will sometimes drop by a factor of five or more when the
checkpoint hits, and take multiple minutes to recover.

I think your statement that doing an extra checkpoint "just means the
next checkpoint will do less work" is kind of misleading. That's
certainly true in some situations. But when the same pages are being
dirtied over and over again, an extra checkpoint often means that the
system will do MUCH MORE work, because every checkpoint triggers a new
set of full-page writes over the actively-updated portion of the
database.

I think that very few people run systems with heavy write workloads
with checkpoint_timeout=5m, precisely because of this issue. Almost
every system I see has had that raised to at least 10m and sometimes
30m or more. It can make a massive difference.
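As a rough illustration of the full-page-write effect described above, here is a toy model in Python. It is only a sketch under stated assumptions: updates are spread uniformly over a fixed set of hot pages, the first update to a page after a checkpoint logs a full 8kB page image plus a small record, and every later update logs only the small record. The 100-byte record size and the function name wal_bytes are made up for the example and are not taken from PostgreSQL.

```python
import random

PAGE_SIZE = 8192        # bytes in a full-page image (default block size)
DELTA_SIZE = 100        # assumed size of an ordinary WAL record, in bytes


def wal_bytes(num_updates, hot_pages, updates_per_checkpoint, seed=0):
    """Return WAL bytes written when updates hit a small hot set of pages.

    The first update to a page after each checkpoint logs a full-page
    image; later updates to that page log only a small delta record.
    """
    rng = random.Random(seed)
    logged_since_checkpoint = set()
    total = 0
    for i in range(num_updates):
        if i % updates_per_checkpoint == 0:
            logged_since_checkpoint.clear()   # a checkpoint just completed
        page = rng.randrange(hot_pages)
        if page not in logged_since_checkpoint:
            logged_since_checkpoint.add(page)
            total += PAGE_SIZE + DELTA_SIZE   # full-page image + record
        else:
            total += DELTA_SIZE               # record only
    return total


rare = wal_bytes(1_000_000, hot_pages=10_000, updates_per_checkpoint=500_000)
frequent = wal_bytes(1_000_000, hot_pages=10_000, updates_per_checkpoint=50_000)
print(f"checkpoint every 500k updates: {rare / 1e6:.0f} MB of WAL")
print(f"checkpoint every 50k updates:  {frequent / 1e6:.0f} MB of WAL")
```

In this model the same million updates generate several times more WAL when checkpoints are ten times more frequent, because almost every hot page is re-logged in full after each one. Real systems differ in many ways (record sizes, page holes, compression), so the numbers are only indicative.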

I see the patch justification outlined here:

/messages/by-id/CAFiTN-sP6yLVTfjR42mEfvFwJ-SZ2iEtG1t0j=QX09X=BM+KWQ@mail.gmail.com

TDE is mentioned as a value for this patch, but I don't see why it is
needed --- TDE can easily decrypt/encrypt the pages while they are
copied.

That's true, but depending on what other design decisions we make,
WAL-logging it might be a problem.

Right now, when someone creates a new database, we log a single record
that basically says "go copy the directory". That's very different
than what we normally do, which is to log changes to individual pages,
or where required, small groups of pages (e.g. a single WAL record is
written for an UPDATE even though it may touch two pages). The fact
that in this case we only log a single WAL record for an operation
that could touch an unbounded amount of data is why this needs special
handling around checkpoints. It also introduces a certain amount of
fragility into the system, because if for some reason the source
directory on the standby doesn't exactly match the source directory on
the primary, the new databases won't match either. Any errors that
creep into the process can be propagated around to other places by a
system like this. However, ordinarily that doesn't happen, which is
why we've been able to use this system successfully for so many years.
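The fragility described above can be sketched with a toy replay model in Python. Everything here is hypothetical: directories are plain dictionaries mapping block numbers to bytes, and the two replay functions stand in for "replay a single copy-directory record" and "replay one record per block". It is not how PostgreSQL's redo code is structured.

```python
def replay_copy_directory(standby_dirs, src, dst):
    """Replay a single 'copy src to dst' record using the standby's own files."""
    standby_dirs[dst] = dict(standby_dirs[src])


def replay_block_records(standby_dirs, dst, records):
    """Replay one record per block; each record carries the block contents."""
    standby_dirs[dst] = {}
    for block_no, contents in records:
        standby_dirs[dst][block_no] = contents


primary = {"template": {0: b"aaaa", 1: b"bbbb"}}
# Suppose the standby's copy of block 1 has silently diverged.
standby = {"template": {0: b"aaaa", 1: b"XXXX"}}

# The primary creates the new database from its template.
primary["newdb"] = dict(primary["template"])

# Method 1: the WAL holds only the directory names.
standby_1 = {name: dict(blocks) for name, blocks in standby.items()}
replay_copy_directory(standby_1, "template", "newdb")

# Method 2: the WAL holds every block that was copied.
records = sorted(primary["template"].items())
standby_2 = {name: dict(blocks) for name, blocks in standby.items()}
replay_block_records(standby_2, "newdb", records)

print(standby_1["newdb"] == primary["newdb"])  # False: divergence propagated
print(standby_2["newdb"] == primary["newdb"])  # True: matches the primary
```

With the single-record method the new database inherits whatever the standby happens to have in its source directory; with per-block records the new database is rebuilt from what the primary actually copied.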

The other reason we've been able to use this successfully is that
we're confident that we can perform exactly the same operation on the
standby as we do on the primary knowing only the relevant directory
names. If we say "copy this directory to there" we believe we'll be
able to do that exactly the same way on the standby. Is that still
true with TDE? Well, it depends. If the encryption can be performed
knowing only the key and the identity of the block (database OID,
tablespace OID, relfilenode, fork, block number) then it's true. But
if the encryption needs to, for example, generate a random nonce for
each block, then it's false. If you want the standby to be an exact
copy of the master in a system where new blocks get random nonces,
then you need to replicate the copy block-by-block, not as one
gigantic operation, so that you can log the nonce you picked for each
block. On the other hand, maybe you DON'T want the standby to be an
exact copy of the master. If, for example, you imagine a system where
the master and standby aren't even using the same key, then this is a
lot less relevant.
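The distinction between the two encryption cases above can be shown with a small Python sketch. This is not a proposed TDE design and not real encryption: it only contrasts a nonce derived from the key and the block identity with a nonce chosen at random. The key, the identifiers and the function names are all invented for the example.

```python
import hashlib
import secrets


def deterministic_nonce(key, db_oid, ts_oid, relfilenode, fork, block_no):
    """Derive a nonce from the key and the block's identity only."""
    ident = f"{db_oid}/{ts_oid}/{relfilenode}/{fork}/{block_no}".encode()
    return hashlib.blake2b(ident, key=key, digest_size=16).digest()


def random_nonce():
    """Pick a fresh random nonce, as a per-write-nonce scheme might."""
    return secrets.token_bytes(16)


key = b"shared-cluster-key"
block = (16384, 1663, 24576, 0, 7)   # hypothetical identifiers

# Primary and standby each run the same "copy this directory" operation.
primary_det = deterministic_nonce(key, *block)
standby_det = deterministic_nonce(key, *block)
primary_rand = random_nonce()
standby_rand = random_nonce()

print(primary_det == standby_det)    # same inputs, same nonce on both sides
print(primary_rand == standby_rand)  # independent random choices differ
```

In the deterministic case both servers can redo the copy independently and arrive at identical pages. In the random case the standby can only match the primary if the nonce chosen for each block is itself shipped in the WAL, which is what block-by-block logging allows.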

I can't predict whether PostgreSQL will get TDE in the future, and if
it does, I can't predict what form it will take. Therefore any strong
statement about whether this will benefit TDE or not seems to me to be
pretty questionable - we don't know that it will be useful, and we
don't know that it won't. But, like Dilip, I think the way we're
WAL-logging CREATE DATABASE right now is a hack, and I *know* it can
cause massive performance drops on busy systems.

--
Robert Haas
EDB: http://www.enterprisedb.com

#88

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#86)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Feb 9, 2022 at 10:59 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Feb 9, 2022 at 9:25 PM Andrew Dunstan <andrew@dunslane.net> wrote:

On 6/16/21 03:52, Dilip Kumar wrote:

On Tue, Jun 15, 2021 at 7:01 PM Andrew Dunstan <andrew@dunslane.net> wrote:

Rather than use size, I'd be inclined to say use this if the source
database is marked as a template, and use the copydir approach for
anything that isn't.

Yeah, that is possible, on the other thought wouldn't it be good to
provide control to the user by providing two different commands, e.g.
COPY DATABASE for the existing method (copydir) and CREATE DATABASE
for the new method (fully wal logged)?

This proposal seems to have gotten lost.

Yeah, I am planning to work on this part so that we can support both methods.

But can we pick a different syntax? In my view this should be an
option to CREATE DATABASE rather than a whole new command.

--
Robert Haas
EDB: http://www.enterprisedb.com

#89

Andrew Dunstan

andrew@dunslane.net

almost 4 years ago

In reply to: Dilip Kumar (#86)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On 2/9/22 10:58, Dilip Kumar wrote:

On Wed, Feb 9, 2022 at 9:25 PM Andrew Dunstan <andrew@dunslane.net> wrote:

On 6/16/21 03:52, Dilip Kumar wrote:

On Tue, Jun 15, 2021 at 7:01 PM Andrew Dunstan <andrew@dunslane.net> wrote:

Rather than use size, I'd be inclined to say use this if the source
database is marked as a template, and use the copydir approach for
anything that isn't.

Yeah, that is possible, on the other thought wouldn't it be good to
provide control to the user by providing two different commands, e.g.
COPY DATABASE for the existing method (copydir) and CREATE DATABASE
for the new method (fully wal logged)?

This proposal seems to have gotten lost.

Yeah, I am planning to work on this part so that we can support both methods.

OK, many thanks.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#90

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#82)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Tue, Feb 8, 2022 at 11:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Now, one bigger question is can we proceed with this patch without
fixing [2], IMHO, if we are deciding to keep the old method also
intact then one option could be that for now only change CREATE
DATABASE to support both old and new way of creating database and for
time being leave the ALTER DATABASE SET TABLESPACE alone and let it
work only with the old method? And another option is that we first
fix the issue related to the tombstone file and then come back to
this?

IMHO, the first option could be better in a way that we have already
made better progress in this patch and this is in better shape than
the other patch we are trying to make for removing the tombstone
files.

Yeah, it's getting quite close to the end of this release cycle. I'm
not sure whether we can get anything committed here at all in the time
we have remaining, but I agree with you that this patch seems like a
better prospect than that one.

--
Robert Haas
EDB: http://www.enterprisedb.com

#91

Bruce Momjian

bruce@momjian.us

almost 4 years ago

In reply to: Robert Haas (#87)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Feb 9, 2022 at 11:00:06AM -0500, Robert Haas wrote:

Try running pgbench with the --progress option and enough concurrent
jobs to keep a moderately large system busy and watching what happens
to the tps each time a checkpoint occurs. It's extremely dramatic, or
at least it was the last time I ran such tests. I think that
performance will sometimes drop by a factor of five or more when the
checkpoint hits, and take multiple minutes to recover.

I think your statement that doing an extra checkpoint "just means the
next checkpoint will do less work" is kind of misleading. That's
certainly true in some situations. But when the same pages are being
dirtied over and over again, an extra checkpoint often means that the
system will do MUCH MORE work, because every checkpoint triggers a new
set of full-page writes over the actively-updated portion of the
database.

I think that very few people run systems with heavy write workloads
with checkpoint_timeout=5m, precisely because of this issue. Almost
every system I see has had that raised to at least 10m and sometimes
30m or more. It can make a massive difference.

Well, I think the worst case is that the checkpoint happens exactly
between two checkpoints, so you are checkpointing twice as often, but if
it happens just before or after a checkpoint, I assume the effect would
be minimal.

So, it seems we are weighing having a checkpoint happen in the middle of
a checkpoint interval vs writing more WAL. If the WAL traffic, without
CREATE DATABASE, is high, and the template database is small, writing
more WAL and skipping the checkpoint will be a win, but if the WAL traffic
is small and the template database is big, the extra WAL will be a loss.
Is this accurate?

I can't predict whether PostgreSQL will get TDE in the future, and if
it does, I can't predict what form it will take. Therefore any strong
statement about whether this will benefit TDE or not seems to me to be
pretty questionable - we don't know that it will be useful, and we

Agreed. We would want to have a different heap/index key on the standby
so we can rotate the heap/index key.

don't know that it won't. But, like Dilip, I think the way we're
WAL-logging CREATE DATABASE right now is a hack, and I *know* it can

Yes, it is a hack, but it seems to be a clever one that we might have
chosen if it had not been part of the original system.

cause massive performance drops on busy systems.

See above.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

#92

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Bruce Momjian (#91)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Feb 9, 2022 at 1:34 PM Bruce Momjian <bruce@momjian.us> wrote:

Well, I think the worst case is that the checkpoint happens exactly
between two checkpoints, so you are checkpointing twice as often, but if
it happens just before or after a checkpoint, I assume the effect would
be minimal.

I agree for the most part. I think that if checkpoints happen every 8
minutes normally and the extra checkpoint happens 2 minutes after the
previous checkpoint, the impact may be almost as bad as if it had
happened right in the middle. If it happens 5 seconds after the
previous checkpoint, it should be low impact.
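The timing argument above can be made concrete with a simple estimate in Python. It is a sketch under assumptions that are not from the thread: updates land uniformly at random on a fixed hot set, so the expected number of distinct pages touched after k updates over N pages is about N * (1 - exp(-k / N)), and each distinct page touched between two checkpoints costs one full-page image. The hot-set size and update rate are arbitrary.

```python
import math


def distinct_pages_touched(seconds, hot_pages, updates_per_second):
    """Expected number of distinct pages hit by uniformly random updates."""
    updates = seconds * updates_per_second
    return hot_pages * (1.0 - math.exp(-updates / hot_pages))


def extra_full_page_images(offset, interval, hot_pages, updates_per_second):
    """Extra full-page images caused by one forced checkpoint.

    `offset` is how many seconds after the previous regular checkpoint the
    forced checkpoint happens; `interval` is the regular checkpoint spacing.
    """
    def touched(t):
        return distinct_pages_touched(t, hot_pages, updates_per_second)

    # Two shorter windows instead of one long one.
    return touched(offset) + touched(interval - offset) - touched(interval)


INTERVAL = 8 * 60   # regular checkpoints every 8 minutes
for offset in (5, 120, 240):
    extra = extra_full_page_images(offset, INTERVAL, 100_000, 2_000)
    print(f"forced checkpoint {offset:>3}s after the last one: "
          f"about {extra:,.0f} extra full-page images")
```

Because the set of touched pages saturates quickly, a forced checkpoint 2 minutes into an 8-minute interval costs nearly as many extra page images in this model as one at the 4-minute midpoint, while one 5 seconds in costs far fewer.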

So, it seems we are weighing having a checkpoint happen in the middle of
a checkpoint interval vs writing more WAL. If the WAL traffic, without
CREATE DATABASE, is high, and the template database is small, writing
more WAL and skipping the checkpoint will be a win, but if the WAL traffic
is small and the template database is big, the extra WAL will be a loss.
Is this accurate?

I think that's basically correct. I would expect that the worry about
big template databases is mostly about template databases that are
REALLY big. I think if your template database is 10GB you probably
shouldn't be worried about this feature. 10GB of extra WAL isn't
nothing, but if you've got reasonably capable hardware, it's not
overloaded, and max_wal_size is big enough, it's probably not going to
have a huge impact. Also, most of the impact will probably be on the
CREATE DATABASE command itself, and other things running on the system
at the same time will be impacted to a lesser degree. I think it's
even possible that you will be happier with this feature than without,
because you may like the idea that CREATE DATABASE itself is slow more
than you like the idea of it making everything else on the system
slow. On the other hand, if your template database is 1TB, the extra
WAL is probably going to be a fairly big problem.

Basically I think for most people this should be neutral or a win. For
people with really large template databases, it's a loss. Hence the
discussion about having a way for people who prefer the current
behavior to keep it.
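For a sense of scale, the WAL volume for the two template sizes mentioned above can be estimated in Python. This is back-of-the-envelope arithmetic under assumptions: every 8kB block is logged once in full, the per-block record overhead is a guessed 64 bytes, WAL segments are the default 16MB, and "GB"/"TB" are read as binary units. Page holes and WAL compression, which would reduce the volume, are ignored.

```python
BLOCK_SIZE = 8192                 # default PostgreSQL block size
WAL_SEGMENT_SIZE = 16 * 1024**2   # default WAL segment size
PER_BLOCK_OVERHEAD = 64           # assumed record-header overhead per block


def wal_for_template(template_bytes):
    """Estimate blocks, WAL bytes and WAL segments if every block is logged once."""
    blocks = -(-template_bytes // BLOCK_SIZE)          # ceiling division
    wal = blocks * (BLOCK_SIZE + PER_BLOCK_OVERHEAD)
    segments = -(-wal // WAL_SEGMENT_SIZE)
    return blocks, wal, segments


for label, size in (("10GB", 10 * 1024**3), ("1TB", 1024**4)):
    blocks, wal, segments = wal_for_template(size)
    print(f"{label} template: {blocks:,} blocks, "
          f"{wal / 1024**3:.1f} GiB of WAL, {segments:,} segments")
```

The WAL volume is essentially the template size plus a small per-block overhead, so whether it matters comes down to how that compares with max_wal_size and the available WAL bandwidth.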

Agreed. We would want to have a different heap/index key on the standby
so we can rotate the heap/index key.

I don't like that design, and I don't think that's what we should do,
but I understand that you feel differently. IMHO, this thread is not
the place to hash that out.

--
Robert Haas
EDB: http://www.enterprisedb.com

#93

Julien Rouhaud

rjuju123@gmail.com

almost 4 years ago

In reply to: Robert Haas (#92)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Feb 09, 2022 at 02:30:08PM -0500, Robert Haas wrote:

On Wed, Feb 9, 2022 at 1:34 PM Bruce Momjian <bruce@momjian.us> wrote:

Well, I think the worst case is that the checkpoint happens exactly
between two checkpoints, so you are checkpointing twice as often, but if
it happens just before or after a checkpoint, I assume the effect would
be minimal.

I agree for the most part. I think that if checkpoints happen every 8
minutes normally and the extra checkpoint happens 2 minutes after the
previous checkpoint, the impact may be almost as bad as if it had
happened right in the middle. If it happens 5 seconds after the
previous checkpoint, it should be low impact.

But the extra checkpoints will be immediate, while on a properly configured
system checkpoints should be spread out. That will add some more overhead.

So, it seems we are weighing having a checkpoint happen in the middle of
a checkpoint interval vs writing more WAL. If the WAL traffic, without
CREATE DATABASE, is high, and the template database is small, writing
more WAL and skipping the checkpoint will be a win, but if the WAL traffic
is small and the template database is big, the extra WAL will be a loss.
Is this accurate?

I think that's basically correct. I would expect that the worry about
big template databases is mostly about template databases that are
REALLY big. I think if your template database is 10GB you probably
shouldn't be worried about this feature. 10GB of extra WAL isn't
nothing, but if you've got reasonably capable hardware, it's not
overloaded, and max_wal_size is big enough, it's probably not going to
have a huge impact. Also, most of the impact will probably be on the
CREATE DATABASE command itself, and other things running on the system
at the same time will be impacted to a lesser degree. I think it's
even possible that you will be happier with this feature than without,
because you may like the idea that CREATE DATABASE itself is slow more
than you like the idea of it making everything else on the system
slow. On the other hand, if your template database is 1TB, the extra
WAL is probably going to be a fairly big problem.

Basically I think for most people this should be neutral or a win. For
people with really large template databases, it's a loss. Hence the
discussion about having a way for people who prefer the current
behavior to keep it.

Those extra WALs will also impact backups and replication. You could have
fancy hardware, a read-mostly workload and the need to replicate over a slow
WAN, and in that case the 10GB could be much more problematic.

#94

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Robert Haas (#88)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Feb 9, 2022 at 9:31 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Feb 9, 2022 at 10:59 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Feb 9, 2022 at 9:25 PM Andrew Dunstan <andrew@dunslane.net> wrote:

On 6/16/21 03:52, Dilip Kumar wrote:

On Tue, Jun 15, 2021 at 7:01 PM Andrew Dunstan <andrew@dunslane.net> wrote:

Rather than use size, I'd be inclined to say use this if the source
database is marked as a template, and use the copydir approach for
anything that isn't.

Yeah, that is possible, on the other thought wouldn't it be good to
provide control to the user by providing two different commands, e.g.
COPY DATABASE for the existing method (copydir) and CREATE DATABASE
for the new method (fully wal logged)?

This proposal seems to have gotten lost.

Yeah, I am planning to work on this part so that we can support both methods.

But can we pick a different syntax? In my view this should be an
option to CREATE DATABASE rather than a whole new command.

Maybe we can provide something like

CREATE DATABASE..WITH WAL_LOG=true/false ? OR
CREATE DATABASE..WITH WAL_LOG_DATA_PAGE=true/false ? OR
CREATE DATABASE..WITH CHECKPOINT=true/false ? OR

And then we can explain in documentation about these options? I think
default should be new method?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#95

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Julien Rouhaud (#93)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Feb 10, 2022 at 2:52 AM Julien Rouhaud <rjuju123@gmail.com> wrote:

Those extra WALs will also impact backups and replication. You could have
fancy hardware, a read-mostly workload and the need to replicate over a slow
WAN, and in that case the 10GB could be much more problematic.

True, I guess, but how bad does your WAN have to be for that to be an
issue? On a 1 gigabit/second link, that's a little over 2 minutes of
transfer time. That's not nothing, but it's not extreme, either,
especially because there's no sense in querying an empty database.
You're going to have to put some stuff in that database before you can
do anything meaningful with it, and that's going to have to be
replicated with or without this feature.

I am not saying it couldn't be a problem, and that's why I'm endorsing
making the behavior optional. But I think that it's a niche scenario.
You need a bigger-than-normal template database, a slow WAN link, AND
you need the amount of data loaded into the databases you create from
the template to be small enough to make the cost of logging the
template pages material. If you create a 10GB database from a template
and then load 200GB of data into it, the WAL-logging overhead of
creating the template is only 5%.
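The two figures in this argument are simple to recompute. The Python sketch below assumes decimal gigabytes and an ideal link running at full line rate with no protocol overhead, so it gives a lower bound on transfer time; a real link will be slower, which is presumably why the estimate above is somewhat higher. The function names are made up for the example.

```python
def transfer_seconds(payload_bytes, link_bits_per_second):
    """Time to move a payload over a link at full line rate, no overhead."""
    return payload_bytes * 8 / link_bits_per_second


def template_share(template_gb, loaded_gb):
    """WAL for the template copy relative to the data loaded afterwards."""
    return template_gb / loaded_gb


GB = 10**9
for name, rate in (("1 Gbit/s", 10**9), ("100 Mbit/s", 10**8), ("10 Mbit/s", 10**7)):
    minutes = transfer_seconds(10 * GB, rate) / 60
    print(f"10GB over {name}: {minutes:.1f} minutes")

print(f"template share of a 200GB load: {template_share(10, 200):.0%}")
```

The transfer time scales inversely with link speed, so the same 10GB that is a minute or two on a gigabit link becomes hours on a 10 Mbit/s WAN, which is the niche case the optional behavior is meant to cover.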

I won't really be surprised if we hear that someone has a 10GB
template database and likes to make a ton of copies and only change
500 rows in each one while replicating the whole thing over a slow
WAN. That can definitely happen, and I'm sure whoever is doing that
has reasons for it which they consider good and sufficient. However, I
don't think there are likely to be a ton of people doing stuff like
that - just a few.

--
Robert Haas
EDB: http://www.enterprisedb.com

#96

Andres Freund

andres@anarazel.de

almost 4 years ago

In reply to: Robert Haas (#95)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hi,

On 2022-02-10 10:32:42 -0500, Robert Haas wrote:

I won't really be surprised if we hear that someone has a 10GB
template database and likes to make a ton of copies and only change
500 rows in each one while replicating the whole thing over a slow
WAN. That can definitely happen, and I'm sure whoever is doing that
has reasons for it which they consider good and sufficient. However, I
don't think there are likely to be a ton of people doing stuff like
that - just a few.

Yea. I would be a bit more concerned if we made creating template databases
very cheap, e.g. by using file copy-on-write functionality like we have for
pg_upgrade. But right now it's a fairly hefty operation anyway.

Greetings,

Andres Freund

#97

Andrew Dunstan

andrew@dunslane.net

almost 4 years ago

In reply to: Dilip Kumar (#94)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On 2/10/22 07:32, Dilip Kumar wrote:

On Wed, Feb 9, 2022 at 9:31 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Feb 9, 2022 at 10:59 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Feb 9, 2022 at 9:25 PM Andrew Dunstan <andrew@dunslane.net> wrote:

On 6/16/21 03:52, Dilip Kumar wrote:

On Tue, Jun 15, 2021 at 7:01 PM Andrew Dunstan <andrew@dunslane.net> wrote:

Rather than use size, I'd be inclined to say use this if the source
database is marked as a template, and use the copydir approach for
anything that isn't.

Yeah, that is possible, on the other thought wouldn't it be good to
provide control to the user by providing two different commands, e.g.
COPY DATABASE for the existing method (copydir) and CREATE DATABASE
for the new method (fully wal logged)?

This proposal seems to have gotten lost.

Yeah, I am planning to work on this part so that we can support both methods.

But can we pick a different syntax? In my view this should be an
option to CREATE DATABASE rather than a whole new command.

Maybe we can provide something like

CREATE DATABASE..WITH WAL_LOG=true/false ? OR
CREATE DATABASE..WITH WAL_LOG_DATA_PAGE=true/false ? OR
CREATE DATABASE..WITH CHECKPOINT=true/false ?

And then we can explain in documentation about these options? I think
default should be new method?

The last one at least has the advantage that it doesn't invent yet
another keyword.

I can live with the new method being the default. I'm sure it would be
highlighted in the release notes too.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#98

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Andrew Dunstan (#97)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Fri, Feb 11, 2022 at 12:11 PM Andrew Dunstan <andrew@dunslane.net> wrote:

The last one at least has the advantage that it doesn't invent yet
another keyword.

We don't need a new keyword for this as long as it lexes as one token,
because createdb_opt_name accepts IDENT. So I think we should focus on
trying to come up with something that is as clear as we know how to
make it.

What I find difficult about doing that is that this is all a bunch of
technical details that users may have difficulty understanding. If we
say WAL_LOG or WAL_LOG_DATA, a reasonably but not incredibly
well-informed user will assume that skipping WAL is not really an
option. If we say CHECKPOINT, a reasonably but not incredibly
well-informed user will presume they don't want one (I think).
CHECKPOINT also seems like it's naming the switch by the unwanted side
effect, which doesn't seem too flattering to the existing method.

How about something like LOG_AS_CLONE? That makes it clear, I hope,
that we're logging it a different way, but that method of logging it
is different in each case. You'd still have to read the documentation
to find out what it really means, but at least it seems like it points
you more in the right direction. To me, anyway.

I can live with the new method being the default. I'm sure it would be
highlighted in the release notes too.

That would make sense.

--
Robert Haas
EDB: http://www.enterprisedb.com

#99

Bruce Momjian

bruce@momjian.us

almost 4 years ago

In reply to: Robert Haas (#98)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Fri, Feb 11, 2022 at 12:35:50PM -0500, Robert Haas wrote:

How about something like LOG_AS_CLONE? That makes it clear, I hope,
that we're logging it a different way, but that method of logging it
is different in each case. You'd still have to read the documentation
to find out what it really means, but at least it seems like it points
you more in the right direction. To me, anyway.

I think CLONE would be confusing since we don't use that term often,
maybe LOG_DB_COPY or LOG_FILE_COPY?

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

#100

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Bruce Momjian (#99)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Fri, Feb 11, 2022 at 12:50 PM Bruce Momjian <bruce@momjian.us> wrote:

On Fri, Feb 11, 2022 at 12:35:50PM -0500, Robert Haas wrote:

How about something like LOG_AS_CLONE? That makes it clear, I hope,
that we're logging it a different way, but that method of logging it
is different in each case. You'd still have to read the documentation
to find out what it really means, but at least it seems like it points
you more in the right direction. To me, anyway.

I think CLONE would be confusing since we don't use that term often,
maybe LOG_DB_COPY or LOG_FILE_COPY?

Yeah, maybe. But it's not clear to me with that kind of naming whether
TRUE or FALSE would be the existing behavior? One version logs a
single record for the whole database, and the other logs a record per
database block. Neither version logs per file. LOG_COPIED_BLOCKS,
maybe?

--
Robert Haas
EDB: http://www.enterprisedb.com

#101

Bruce Momjian

bruce@momjian.us

almost 4 years ago

In reply to: Robert Haas (#100)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Fri, Feb 11, 2022 at 01:18:58PM -0500, Robert Haas wrote:

On Fri, Feb 11, 2022 at 12:50 PM Bruce Momjian <bruce@momjian.us> wrote:

On Fri, Feb 11, 2022 at 12:35:50PM -0500, Robert Haas wrote:

How about something like LOG_AS_CLONE? That makes it clear, I hope,
that we're logging it a different way, but that method of logging it
is different in each case. You'd still have to read the documentation
to find out what it really means, but at least it seems like it points
you more in the right direction. To me, anyway.

I think CLONE would be confusing since we don't use that term often,
maybe LOG_DB_COPY or LOG_FILE_COPY?

Yeah, maybe. But it's not clear to me with that kind of naming whether
TRUE or FALSE would be the existing behavior? One version logs a
single record for the whole database, and the other logs a record per
database block. Neither version logs per file. LOG_COPIED_BLOCKS,
maybe?

Yes, I like BLOCKS more than FILE.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

#102

Andrew Dunstan

andrew@dunslane.net

almost 4 years ago

In reply to: Bruce Momjian (#101)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On 2/11/22 13:32, Bruce Momjian wrote:

On Fri, Feb 11, 2022 at 01:18:58PM -0500, Robert Haas wrote:

On Fri, Feb 11, 2022 at 12:50 PM Bruce Momjian <bruce@momjian.us> wrote:

On Fri, Feb 11, 2022 at 12:35:50PM -0500, Robert Haas wrote:

How about something like LOG_AS_CLONE? That makes it clear, I hope,
that we're logging it a different way, but that method of logging it
is different in each case. You'd still have to read the documentation
to find out what it really means, but at least it seems like it points
you more in the right direction. To me, anyway.

I think CLONE would be confusing since we don't use that term often,
maybe LOG_DB_COPY or LOG_FILE_COPY?

Yeah, maybe. But it's not clear to me with that kind of naming whether
TRUE or FALSE would be the existing behavior? One version logs a
single record for the whole database, and the other logs a record per
database block. Neither version logs per file. LOG_COPIED_BLOCKS,
maybe?

Yes, I like BLOCKS more than FILE.

I'm not really sure any single parameter name is going to capture the
subtlety involved here.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#103

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Andrew Dunstan (#102)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Fri, Feb 11, 2022 at 3:40 PM Andrew Dunstan <andrew@dunslane.net> wrote:

I'm not really sure any single parameter name is going to capture the
subtlety involved here.

I mean to some extent that's inevitable, but it's not a reason not to
do the best we can.

--
Robert Haas
EDB: http://www.enterprisedb.com

#104

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Bruce Momjian (#101)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Fri, Feb 11, 2022 at 1:32 PM Bruce Momjian <bruce@momjian.us> wrote:

Yeah, maybe. But it's not clear to me with that kind of naming whether
TRUE or FALSE would be the existing behavior? One version logs a
single record for the whole database, and the other logs a record per
database block. Neither version logs per file. LOG_COPIED_BLOCKS,
maybe?

Yes, I like BLOCKS more than FILE.

Cool.

--
Robert Haas
EDB: http://www.enterprisedb.com

#105

Alvaro Herrera

alvherre@alvh.no-ip.org

almost 4 years ago

In reply to: Robert Haas (#98)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On 2022-Feb-11, Robert Haas wrote:

What I find difficult about doing that is that this is all a bunch of
technical details that users may have difficulty understanding. If we
say WAL_LOG or WAL_LOG_DATA, a reasonably but not incredibly
well-informed user will assume that skipping WAL is not really an
option. If we say CHECKPOINT, a reasonably but not incredibly
well-informed user will presume they don't want one (I think).
CHECKPOINT also seems like it's naming the switch by the unwanted side
effect, which doesn't seem too flattering to the existing method.

It seems you're thinking of deciding what to do based on an option that
takes a boolean argument. But what about making the argument an enum?
For example

CREATE DATABASE ... WITH (STRATEGY = LOG); -- default if option is omitted
CREATE DATABASE ... WITH (STRATEGY = CHECKPOINT);

So the user has to think about it in terms of some strategy to choose,
rather than enabling or disabling some flag with nontrivial
implications.
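
To illustrate the difference, here is a minimal sketch of an enum-valued option with a default. It is not PostgreSQL code; CreateDbStrategy and parse_strategy are hypothetical names, and the values LOG and CHECKPOINT are just the ones proposed above.

```python
from enum import Enum


class CreateDbStrategy(Enum):
    # Values as proposed in this message; the final names were still
    # under discussion in the thread.
    LOG = "log"
    CHECKPOINT = "checkpoint"


# Assumption: LOG is the default when the option is omitted.
DEFAULT_STRATEGY = CreateDbStrategy.LOG


def parse_strategy(value=None):
    """Map the text of a STRATEGY option to an enum member."""
    if value is None:
        return DEFAULT_STRATEGY
    try:
        return CreateDbStrategy(value.strip().lower())
    except ValueError:
        valid = ", ".join(s.name for s in CreateDbStrategy)
        raise ValueError(
            f"invalid strategy {value!r}; expected one of: {valid}"
        ) from None


print(parse_strategy())              # option omitted
print(parse_strategy("CHECKPOINT"))  # explicit choice
```

Unlike a boolean flag, adding a third strategy later would only mean adding one more enum member.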

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"[PostgreSQL] is a great group; in my opinion it is THE best open source
development communities in existence anywhere." (Lamar Owen)

#106

Andrew Dunstan

andrew@dunslane.net

almost 4 years ago

In reply to: Robert Haas (#103)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On 2/11/22 15:47, Robert Haas wrote:

On Fri, Feb 11, 2022 at 3:40 PM Andrew Dunstan <andrew@dunslane.net> wrote:

I'm not really sure any single parameter name is going to capture the
subtlety involved here.

I mean to some extent that's inevitable, but it's not a reason not to
do the best we can.

True.

I do think we should be wary of any name starting with "LOG", though.
Long experience tells us that's something that confuses users when it
refers to the WAL.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#107

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Alvaro Herrera (#105)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Fri, Feb 11, 2022 at 4:08 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

It seems you're thinking of deciding what to do based on an option that
takes a boolean argument. But what about making the argument an enum?
For example

CREATE DATABASE ... WITH (STRATEGY = LOG); -- default if option is omitted
CREATE DATABASE ... WITH (STRATEGY = CHECKPOINT);

So the user has to think about it in terms of some strategy to choose,
rather than enabling or disabling some flag with nontrivial
implications.

I don't like those particular strategy names very much, but in general
I think that could be a way to go, too. I somewhat hope we never end
up with THREE strategies for creating a new database, but now that I
think about it, we might. Somebody might want to use a fancy FS
primitive that clones a directory at the FS level, or something.

--
Robert Haas
EDB: http://www.enterprisedb.com

#108

Andres Freund

andres@anarazel.de

almost 4 years ago

In reply to: Robert Haas (#107)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hi,

On 2022-02-11 16:19:12 -0500, Robert Haas wrote:

I somewhat hope we never end up with THREE strategies for creating a new
database, but now that I think about it, we might. Somebody might want to
use a fancy FS primitive that clones a directory at the FS level, or
something.

I think that'd be a great, and pretty easy to implement, feature. But it seems
like it'd be mostly orthogonal to the "WAL log data" vs "checkpoint data"
question? On the primary / single node system using "WAL log data" with "COW
file copy" would work well.

I bet using COW file copies would speed up our own regression tests noticeably
- on slower systems we spend a fair bit of time and space creating template0
and postgres, with the bulk of the data never changing.

Template databases are also fairly commonly used by application developers to
avoid the cost of rerunning all the setup DDL & initial data loading for
different tests. Making that measurably cheaper would be a significant win.

Greetings,

Andres Freund

#109

Justin Pryzby

pryzby@telsasoft.com

almost 4 years ago

In reply to: Andres Freund (#108)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Sat, Feb 12, 2022 at 06:00:44PM -0800, Andres Freund wrote:

Hi,

On 2022-02-11 16:19:12 -0500, Robert Haas wrote:

I somewhat hope we never end up with THREE strategies for creating a new
database, but now that I think about it, we might. Somebody might want to
use a fancy FS primitive that clones a directory at the FS level, or
something.

I think that'd be a great, and pretty easy to implement, feature. But it seems
like it'd be mostly orthogonal to the "WAL log data" vs "checkpoint data"
question? On the primary / single node system using "WAL log data" with "COW
file copy" would work well.

I bet using COW file copies would speed up our own regression tests noticeably
- on slower systems we spend a fair bit of time and space creating template0
and postgres, with the bulk of the data never changing.

Template databases are also fairly commonly used by application developers to
avoid the cost of rerunning all the setup DDL & initial data loading for
different tests. Making that measurably cheaper would be a significant win.

I ran into this last week and was still thinking about proposing it.

Would this help CI or any significant fraction of buildfarm ?
Or just tests run locally on supporting filesystems.

Note that pg_upgrade already supports copy/link/clone. (Obviously, link
wouldn't do anything desirable for CREATE DATABASE).

--
Justin

#110

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Alvaro Herrera (#105)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Sat, Feb 12, 2022 at 2:38 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2022-Feb-11, Robert Haas wrote:

What I find difficult about doing that is that this is all a bunch of
technical details that users may have difficulty understanding. If we
say WAL_LOG or WAL_LOG_DATA, a reasonably but not incredibly
well-informed user will assume that skipping WAL is not really an
option. If we say CHECKPOINT, a reasonably but not incredibly
well-informed user will presume they don't want one (I think).
CHECKPOINT also seems like it's naming the switch by the unwanted side
effect, which doesn't seem too flattering to the existing method.

It seems you're thinking of deciding what to do based on an option that
takes a boolean argument. But what about making the argument an enum?
For example

CREATE DATABASE ... WITH (STRATEGY = LOG); -- default if option is omitted
CREATE DATABASE ... WITH (STRATEGY = CHECKPOINT);

So the user has to think about it in terms of some strategy to choose,
rather than enabling or disabling some flag with nontrivial
implications.

Yeah I think being explicit about giving the strategy to the user
looks like a better option. Now they can choose whether they want it
to create using WAL log or using CHECKPOINT. Otherwise, if we give a
flag then we will have to give an explanation that if they choose not
to WAL log then we will have to do a checkpoint internally. So I
think giving LOG vs CHECKPOINT as an explicit option looks better to
me.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#111

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#110)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Sun, Feb 13, 2022 at 10:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have done performance testing with different template DB sizes and
different amounts of dirty shared buffers. As expected, the more dirty
shared buffers there are, the costlier the checkpoint approach becomes,
and OTOH the larger the template DB, the more time the WAL log approach
takes.

I think it is very common to have larger shared buffers and of course,
if somebody has configured such a large shared buffer then a good % of
it will be dirty most of the time. So IMHO in the future, the WAL log
approach is going to be more usable in general. However, this is just
my opinion, and others may have completely different thoughts and
anyhow we are keeping options for both the approaches so no worry.

Next, I am planning to do some more tests where pgbench is running and
we concurrently run CREATEDB, maybe once every minute, to see both the
CREATEDB time and the impact on pgbench performance. So far I have only
measured CREATEDB time, but we also need to know the impact of createdb
on the rest of the system.

Test setup:
max_wal_size=64GB
checkpoint_timeout=15min
- CREATE base TABLE of size of Shared Buffers
- CREATE template database and table in it of varying sizes (as per test)
- CHECKPOINT (write out dirty buffers)
- UPDATE 70% of tuple in base table (dirty 70% of shared buffers)
- CREATE database using template db. (Actual test target)

test1:
1 GB shared buffers, template DB size = 6MB, dirty shared buffer=70%
Head: 2341.665 ms
Patch: 85.229 ms

test2:
1 GB shared buffers, template DB size = 1GB, dirty shared buffer=70%
Head: 4044 ms
Patch: 8376 ms

test3:
8 GB shared buffers, template DB size = 1GB, dirty shared buffer=70%
Head: 21398 ms
Patch: 9834 ms

test4:
8 GB shared buffers, template DB size = 10GB, dirty shared buffer=95%
Head: 38574 ms
Patch: 77160 ms

test4:
32 GB shared buffers, template DB size = 10GB, dirty shared buffer=70%
Head: 47656 ms
Patch: 79767 ms

test5:
64 GB shared buffers, template DB size = 1GB, dirty shared buffer=70%
Head: 59151 ms
Patch: 8742 ms

test6:
64 GB shared buffers, template DB size = 50GB, dirty shared buffer=50%
Head: 171614 ms
Patch: 406040 ms
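
To make the comparison easier to read, here is a small sketch that computes the head/patch ratio for each run. The numbers are copied from the results above; the two runs labelled test4 are told apart by their shared buffers size, and the helper names are only illustrative.

```python
# (label, shared_buffers GB, template size, dirty %, head ms, patch ms)
results = [
    ("test1", 1, "6MB", 70, 2341.665, 85.229),
    ("test2", 1, "1GB", 70, 4044, 8376),
    ("test3", 8, "1GB", 70, 21398, 9834),
    ("test4 (8GB)", 8, "10GB", 95, 38574, 77160),
    ("test4 (32GB)", 32, "10GB", 70, 47656, 79767),
    ("test5", 64, "1GB", 70, 59151, 8742),
    ("test6", 64, "50GB", 50, 171614, 406040),
]


def faster_method(head_ms, patch_ms):
    """Return which method finished first for one run."""
    return "patch" if patch_ms < head_ms else "head"


for label, sb_gb, tmpl, dirty, head_ms, patch_ms in results:
    ratio = head_ms / patch_ms  # > 1 means the patch was faster
    print(f"{label:13s} sb={sb_gb:>2}GB tmpl={tmpl:>4s} "
          f"head/patch={ratio:6.2f} -> {faster_method(head_ms, patch_ms)}")

# Runs where the WAL-logged approach beat the checkpoint approach.
patch_wins = [r[0] for r in results if r[5] < r[4]]
print("patch faster in:", ", ".join(patch_wins))
```

In these runs the patch wins when the template is small relative to the dirty shared buffers, and head wins when the template is 10GB or larger.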

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#112

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#111)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Sun, Feb 13, 2022 at 1:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

test4:
32 GB shared buffers, template DB size = 10GB, dirty shared buffer=70%
Head: 47656 ms
Patch: 79767 ms

This seems like the most surprising result of the bunch. Here, the
template DB is both small enough to fit in shared_buffers and small
enough not to trigger a checkpoint all by itself, and yet the patch
loses.

Did you checkpoint between one test and the next, or might this test
have been done after a bunch of WAL had already been written since the
last checkpoint so that the 10GB pushed it over the edge?

BTW, you have test4 twice in your list of results.

--
Robert Haas
EDB: http://www.enterprisedb.com

#113

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Robert Haas (#112)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Sun, Feb 13, 2022 at 9:56 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, Feb 13, 2022 at 1:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

test4:
32 GB shared buffers, template DB size = 10GB, dirty shared buffer=70%
Head: 47656 ms
Patch: 79767 ms

This seems like the most surprising result of the bunch. Here, the
template DB is both small enough to fit in shared_buffers and small
enough not to trigger a checkpoint all by itself, and yet the patch
loses.

Well this is not really surprising to me because what I have noticed
is that with the new approach the createdb time is completely
dependent upon the template db size. So if the source db size is 10GB
it is taking around 80sec and the shared buffers size does not have a
major impact. Maybe a very small shared buffer can have more impact
so I will test that as well.
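
A quick way to see this in the posted numbers is to divide each patch timing by the template size. This is only a sketch over the results from the earlier message; test1 is left out on the assumption that a 6MB template is too small for a per-GB rate to mean much.

```python
# (shared_buffers in GB, template size in GB, patch time in ms),
# copied from the results posted earlier in the thread.
patch_runs = [
    (1, 1, 8376),
    (8, 1, 9834),
    (8, 10, 77160),
    (32, 10, 79767),
    (64, 1, 8742),
    (64, 50, 406040),
]

# Time per GB of template data for each run.
per_gb_ms = [ms / size_gb for _, size_gb, ms in patch_runs]

for (sb_gb, size_gb, ms), rate in zip(patch_runs, per_gb_ms):
    print(f"sb={sb_gb:>2}GB template={size_gb:>2}GB -> {rate:7.0f} ms/GB")

print(f"range: {min(per_gb_ms):.0f} to {max(per_gb_ms):.0f} ms per GB")
```

The per-GB rate stays within a fairly narrow band across shared buffers sizes from 1GB to 64GB, which fits the observation that the time tracks the template size.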

Did you checkpoint between one test and the next, or might this test
have been done after a bunch of WAL had already been written since the
last checkpoint so that the 10GB pushed it over the edge?

Not really, I am running each test with a new initdb so that could
not be an issue.

BTW, you have test4 twice in your list of results.

My bad, those are different tests.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#114

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#111)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Sun, Feb 13, 2022 at 12:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sun, Feb 13, 2022 at 10:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Next, I am planning to do some more tests where pgbench is running and
we concurrently run CREATEDB, maybe once every minute, to see both the
CREATEDB time and the impact on pgbench performance. So far I have only
measured CREATEDB time, but we also need to know the impact of createdb
on the rest of the system.

I have done tests with pgbench as well. I did not notice any
significant difference in TPS. I was expecting some difference due to
the checkpoint on head, so maybe I need to test with more backends. For
createdb time there is a huge difference. I think this is because the
template1 db is very small, so the patch completes in no time, whereas
head takes a long time because of the large number of dirty shared
buffers (due to the concurrent pgbench).

config:
echo "logging_collector=on" >> data/postgresql.conf
echo "port = 5432" >> data/postgresql.conf
echo "max_wal_size=64GB" >> data/postgresql.conf
echo "checkpoint_timeout=15min" >> data/postgresql.conf
echo "shared_buffers=32GB" >> data/postgresql.conf

Test:
./pgbench -i -s 1000 postgres
./pgbench -c 32 -j 32 -T 1200 -M prepared postgres >> result.txt
-- Concurrently run below script every 1 mins
CREATE DATABASE mydb log_copied_blocks=true/false;

Results:
- Pgbench TPS: Did not observe any difference head vs patch
- Create db time(very small template):
head: 21000 ms to 42000 ms (at different time)
patch: 80 ms

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#115

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#113)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Mon, Feb 14, 2022 at 10:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sun, Feb 13, 2022 at 9:56 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, Feb 13, 2022 at 1:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

test4:
32 GB shared buffers, template DB size = 10GB, dirty shared buffer=70%
Head: 47656 ms
Patch: 79767 ms

This seems like the most surprising result of the bunch. Here, the
template DB is both small enough to fit in shared_buffers and small
enough not to trigger a checkpoint all by itself, and yet the patch
loses.

Well this is not really surprising to me because what I have noticed
is that with the new approach the createdb time is completely
dependent upon the template db size. So if the source db size is 10GB
it is taking around 80sec and the shared buffers size does not have a
major impact. Maybe a very small shared buffer can have more impact
so I will test that as well.

I have done some more experiments just to understand where we are
spending most of the time. First I tried with synchronous commit and
fsync off, and the creation time dropped from 80s to 70s. Then I also
removed the log_newpage and the time dropped further to 50s. I have
also tried different shared buffer sizes and observed that reducing or
increasing the shared buffer size does not have much impact on the
createdb time with the new approach.
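
Put as a rough breakdown (a sketch that assumes the three costs are simply additive, which the measurements above do not prove; the variable names are illustrative):

```python
# Approximate timings from the experiments above, in seconds.
baseline_s = 80        # patch as-is, 10GB template
no_sync_s = 70         # synchronous commit and fsync off
no_sync_no_wal_s = 50  # additionally without log_newpage

# Assumption: each step removes one independent cost.
sync_cost = baseline_s - no_sync_s
wal_cost = no_sync_s - no_sync_no_wal_s
remaining = no_sync_no_wal_s

for name, secs in [
    ("sync commit + fsync", sync_cost),
    ("log_newpage", wal_cost),
    ("everything else", remaining),
]:
    print(f"{name:20s} {secs:3d}s ({secs / baseline_s:.1%})")
```

Under that assumption most of the time is still spent outside WAL logging and syncing.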

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#116

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#113)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Mon, Feb 14, 2022 at 12:01 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Well this is not really surprising to me because what I have noticed
is that with the new approach the createdb time is completely
dependent upon the template db size. So if the source db size is 10GB
it is taking around 80sec and the shared buffers size does not have a
major impact. Maybe a very small shared buffer can have more impact
so I will test that as well.

OK. Well, then this approach is somewhat worse than I expected for
moderately large template databases. But it seems very good for small
template databases, especially when there is other work in progress on
the system.

--
Robert Haas
EDB: http://www.enterprisedb.com

#117

Ashutosh Sharma

ashu.coek88@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#111)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hi Dilip,

On Sun, Feb 13, 2022 at 12:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sun, Feb 13, 2022 at 10:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have done performance testing with different template DB sizes and
different amounts of dirty shared buffers. As expected, the more dirty
shared buffers there are, the costlier the checkpoint approach becomes,
and OTOH the larger the template DB, the more time the WAL log approach
takes.

I think it is very common to have larger shared buffers and of course,
if somebody has configured such a large shared buffer then a good % of
it will be dirty most of the time. So IMHO in the future, the WAL log
approach is going to be more usable in general. However, this is just
my opinion, and others may have completely different thoughts and
anyhow we are keeping options for both the approaches so no worry.

Next, I am planning to do some more tests where pgbench is running and
we concurrently run CREATEDB, maybe once every minute, to see both the
CREATEDB time and the impact on pgbench performance. So far I have only
measured CREATEDB time, but we also need to know the impact of createdb
on the rest of the system.

Test setup:
max_wal_size=64GB
checkpoint_timeout=15min
- CREATE base TABLE of size of Shared Buffers
- CREATE template database and table in it of varying sizes (as per test)
- CHECKPOINT (write out dirty buffers)
- UPDATE 70% of tuple in base table (dirty 70% of shared buffers)
- CREATE database using template db. (Actual test target)

test1:
1 GB shared buffers, template DB size = 6MB, dirty shared buffer=70%
Head: 2341.665 ms
Patch: 85.229 ms

test2:
1 GB shared buffers, template DB size = 1GB, dirty shared buffer=70%
Head: 4044 ms
Patch: 8376 ms

test3:
8 GB shared buffers, template DB size = 1GB, dirty shared buffer=70%
Head: 21398 ms
Patch: 9834 ms

test4:
8 GB shared buffers, template DB size = 10GB, dirty shared buffer=95%
Head: 38574 ms
Patch: 77160 ms

test4:
32 GB shared buffers, template DB size = 10GB, dirty shared buffer=70%
Head: 47656 ms
Patch: 79767 ms

Is it possible to see the WAL size generated by these two statements:
UPDATE 70% of the tuple in the base table (dirty 70% of the shared
buffers) && CREATE database using template DB (Actual test target).
Just wanted to know if it can exceed the max_wal_size of 64GB. Also,
is it possible to try with minimal wal_level? Sorry for asking you
this, I could try it myself but I don't have any high level system to
try it.

--
With Regards,
Ashutosh Sharma.

#118

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Ashutosh Sharma (#117)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Mon, Feb 14, 2022 at 9:17 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Is it possible to see the WAL size generated by these two statements:
UPDATE 70% of the tuple in the base table (dirty 70% of the shared
buffers) && CREATE database using template DB (Actual test target).
Just wanted to know if it can exceed the max_wal_size of 64GB.

I think we already know the WAL size generated by creating a db with
the old and new approaches. With the old approach it is just one WAL
record, and with the new approach it logs every single block of the
database. Yes, updating 70% of the database could have some impact,
but for verification purposes I tested without the update and the
create db with WAL logging still takes almost the same time. Anyway,
when I test next time I will verify again that no forced checkpoint is
triggered.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#119

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#110)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Sun, Feb 13, 2022 at 10:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sat, Feb 12, 2022 at 2:38 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

It seems you're thinking of deciding what to do based on an option that
takes a boolean argument. But what about making the argument an enum?
For example

CREATE DATABASE ... WITH (STRATEGY = LOG); -- default if option is omitted
CREATE DATABASE ... WITH (STRATEGY = CHECKPOINT);

So the user has to think about it in terms of some strategy to choose,
rather than enabling or disabling some flag with nontrivial
implications.

Yeah I think being explicit about giving the strategy to the user
looks like a better option. Now they can choose whether they want it
to create using WAL log or using CHECKPOINT. Otherwise, if we give a
flag then we will have to give an explanation that if they choose not
to WAL log then we will have to do a checkpoint internally. So I
think giving LOG vs CHECKPOINT as an explicit option looks better to
me.

So do we have consensus to use (STRATEGY = LOG/CHECKPOINT), or do we
think that keeping it a bool, i.e. LOG_COPIED_BLOCKS, is a better
option? Once we have consensus on this I will make this change and
update the documentation as well, along with the other changes
suggested by Robert.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#120

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#119)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Mon, Feb 14, 2022 at 11:26 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

So do we have consensus to use (STRATEGY = LOG/CHECKPOINT), or do we
think that keeping it a bool, i.e. LOG_COPIED_BLOCKS, is a better
option? Once we have consensus on this I will make this change and
update the documentation as well, along with the other changes
suggested by Robert.

I think we have consensus on STRATEGY. I'm not sure if we have
consensus on what the option values should be. If we had an option to
use fs-based cloning, that would also need to issue a checkpoint,
which makes me think that CHECKPOINT is not the best name.

--
Robert Haas
EDB: http://www.enterprisedb.com

#121

Bruce Momjian

bruce@momjian.us

almost 4 years ago

In reply to: Robert Haas (#120)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Mon, Feb 14, 2022 at 12:27:10PM -0500, Robert Haas wrote:

On Mon, Feb 14, 2022 at 11:26 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

So do we have consensus to use (STRATEGY = LOG/CHECKPOINT), or do we
think that keeping it a bool, i.e. LOG_COPIED_BLOCKS, is a better
option? Once we have consensus on this I will make this change and
update the documentation as well, along with the other changes
suggested by Robert.

I think we have consensus on STRATEGY. I'm not sure if we have
consensus on what the option values should be. If we had an option to
use fs-based cloning, that would also need to issue a checkpoint,
which makes me think that CHECKPOINT is not the best name.

I think if we want LOG, it has to be WAL_LOG instead of just LOG. Was
there discussion that the user _has_ to specify an option instead of
using a default? That doesn't seem good.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

#122

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Bruce Momjian (#121)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Mon, Feb 14, 2022 at 1:58 PM Bruce Momjian <bruce@momjian.us> wrote:

I think we have consensus on STRATEGY. I'm not sure if we have
consensus on what the option values should be. If we had an option to
use fs-based cloning, that would also need to issue a checkpoint,
which makes me think that CHECKPOINT is not the best name.

I think if we want LOG, it has to be WAL_LOG instead of just LOG. Was
there discussion that the user _has_ to specify an option instead of
using a default? That doesn't seem good.

I agree. I think we can set a default, which can be either whatever we
think will be best on average, or maybe it can be conditional based on
the database size or something.

--
Robert Haas
EDB: http://www.enterprisedb.com

#123

Maciek Sakrejda

m.sakrejda@gmail.com

almost 4 years ago

In reply to: Robert Haas (#122)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Andrew made a good case above for avoiding LOG:

I do think we should be wary of any name starting with "LOG", though.
Long experience tells us that's something that confuses users when it
refers to the WAL.

#124

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Maciek Sakrejda (#123)

6 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Tue, Feb 15, 2022 at 2:01 AM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:

Here is the updated version of the patch. The changes are: 1) Fixed
review comments given by Robert and one open comment from Ashutosh.
2) Preserved the old create db method. 3) As agreed upthread, for now
we are using the new strategy only for createdb, not for movedb, so I
have removed the changes in ForgetDatabaseSyncRequests() and
DropDatabaseBuffers(). 4) Provided a database creation strategy
option; as of now I have kept it as below.

CREATE DATABASE ... WITH (STRATEGY = WAL_LOG);    -- default if option is omitted
CREATE DATABASE ... WITH (STRATEGY = FILE_COPY);

I have updated the documentation, but I was not sure how much internal
information should be exposed to the user, so I will work on that based
on feedback from others.
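
To make the option semantics above concrete, here is a minimal sketch of how a STRATEGY value could be mapped to an internal choice, with WAL_LOG used when the option is omitted. The enum, the function names, and the case-insensitive matching are all illustrative assumptions; none of them are taken from the attached patches.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical names for illustration only; not from the patch. */
typedef enum SketchCreateDbStrategy
{
    SKETCH_STRATEGY_WAL_LOG,    /* copy block by block and WAL-log the blocks */
    SKETCH_STRATEGY_FILE_COPY,  /* old method: checkpoint, copy files, checkpoint */
    SKETCH_STRATEGY_INVALID     /* unrecognized value */
} SketchCreateDbStrategy;

/* Case-insensitive equality test for two NUL-terminated ASCII strings. */
static int
sketch_streq_ci(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    while (*a && *b)
    {
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return 0;
        a++;
        b++;
    }
    /* Equal only if both strings ended at the same position. */
    return *a == *b;
}

/* A NULL value stands for "STRATEGY was not specified". */
static SketchCreateDbStrategy
sketch_parse_strategy(const char *value)
{
    if (value == NULL)
        return SKETCH_STRATEGY_WAL_LOG;     /* default when omitted */
    if (sketch_streq_ci(value, "wal_log"))
        return SKETCH_STRATEGY_WAL_LOG;
    if (sketch_streq_ci(value, "file_copy"))
        return SKETCH_STRATEGY_FILE_COPY;
    return SKETCH_STRATEGY_INVALID;
}
```

With this sketch, sketch_parse_strategy(NULL) and sketch_parse_strategy("WAL_LOG") both select the WAL-logged method, while a value such as "checkpoint" is rejected as invalid.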

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v9-0001-Refactor-relmap-load-and-relmap-write-functions.patch (text/x-patch; charset=US-ASCII)

From 203e77f9d08d7bef1add7aed63ba042cf88697fe Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Wed, 1 Sep 2021 14:06:29 +0530
Subject: [PATCH v9 1/6] Refactor relmap load and relmap write functions

Currently, the relmap reading and writing interfaces are tightly
coupled with the shared_map and local_map of the database the
backend is connected to.  But the later patches in this set need
interfaces that can read a relmap into any caller-supplied memory
and, when writing, accept the map to be written.

So this patch refactors the existing code to expose read and
write interfaces that are independent of shared_map and
local_map, without changing any logic.

XXX For code simplicity, write_relmap_file updates the
permanent in-memory copy outside the critical section.  The
disk changes are already done by then and this is only a
memory change, so there is no reason for it to be in the
critical section.
---
 src/backend/utils/cache/relmapper.c | 163 ++++++++++++++++++++++--------------
 1 file changed, 99 insertions(+), 64 deletions(-)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 4f6811f..56495f0 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -136,6 +136,12 @@ static void apply_map_update(RelMapFile *map, Oid relationId, Oid fileNode,
 							 bool add_okay);
 static void merge_map_updates(RelMapFile *map, const RelMapFile *updates,
 							  bool add_okay);
+static void read_relmap_file(char *mapfilename, RelMapFile *map,
+							 bool lock_held);
+static void write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
+									   bool write_wal, bool send_sinval,
+									   bool preserve_files, Oid dbid, Oid tsid,
+									   const char *dbpath);
 static void load_relmap_file(bool shared, bool lock_held);
 static void write_relmap_file(bool shared, RelMapFile *newmap,
 							  bool write_wal, bool send_sinval, bool preserve_files,
@@ -687,36 +693,19 @@ RestoreRelationMap(char *startAddress)
 }
 
 /*
- * load_relmap_file -- load data from the shared or local map file
+ * read_relmap_file -- read data from given mapfilename file.
  *
  * Because the map file is essential for access to core system catalogs,
  * failure to read it is a fatal error.
- *
- * Note that the local case requires DatabasePath to be set up.
  */
 static void
-load_relmap_file(bool shared, bool lock_held)
+read_relmap_file(char *mapfilename, RelMapFile *map, bool lock_held)
 {
-	RelMapFile *map;
-	char		mapfilename[MAXPGPATH];
 	pg_crc32c	crc;
 	int			fd;
 	int			r;
 
-	if (shared)
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
-				 RELMAPPER_FILENAME);
-		map = &shared_map;
-	}
-	else
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 DatabasePath, RELMAPPER_FILENAME);
-		map = &local_map;
-	}
-
-	/* Read data ... */
+	/* Open the relmap file for reading. */
 	fd = OpenTransientFile(mapfilename, O_RDONLY | PG_BINARY);
 	if (fd < 0)
 		ereport(FATAL,
@@ -779,62 +768,50 @@ load_relmap_file(bool shared, bool lock_held)
 }
 
 /*
- * Write out a new shared or local map file with the given contents.
- *
- * The magic number and CRC are automatically updated in *newmap.  On
- * success, we copy the data to the appropriate permanent static variable.
- *
- * If write_wal is true then an appropriate WAL message is emitted.
- * (It will be false for bootstrap and WAL replay cases.)
- *
- * If send_sinval is true then a SI invalidation message is sent.
- * (This should be true except in bootstrap case.)
- *
- * If preserve_files is true then the storage manager is warned not to
- * delete the files listed in the map.
+ * load_relmap_file -- load data from the shared or local map file
  *
- * Because this may be called during WAL replay when MyDatabaseId,
- * DatabasePath, etc aren't valid, we require the caller to pass in suitable
- * values.  The caller is also responsible for being sure no concurrent
- * map update could be happening.
+ * Note that the local case requires DatabasePath to be set up.
  */
 static void
-write_relmap_file(bool shared, RelMapFile *newmap,
-				  bool write_wal, bool send_sinval, bool preserve_files,
-				  Oid dbid, Oid tsid, const char *dbpath)
+load_relmap_file(bool shared, bool lock_held)
 {
-	int			fd;
-	RelMapFile *realmap;
+	RelMapFile *map;
 	char		mapfilename[MAXPGPATH];
 
-	/*
-	 * Fill in the overhead fields and update CRC.
-	 */
-	newmap->magic = RELMAPPER_FILEMAGIC;
-	if (newmap->num_mappings < 0 || newmap->num_mappings > MAX_MAPPINGS)
-		elog(ERROR, "attempt to write bogus relation mapping");
-
-	INIT_CRC32C(newmap->crc);
-	COMP_CRC32C(newmap->crc, (char *) newmap, offsetof(RelMapFile, crc));
-	FIN_CRC32C(newmap->crc);
-
-	/*
-	 * Open the target file.  We prefer to do this before entering the
-	 * critical section, so that an open() failure need not force PANIC.
-	 */
 	if (shared)
 	{
 		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
 				 RELMAPPER_FILENAME);
-		realmap = &shared_map;
+		map = &shared_map;
 	}
 	else
 	{
 		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 dbpath, RELMAPPER_FILENAME);
-		realmap = &local_map;
+				 DatabasePath, RELMAPPER_FILENAME);
+		map = &local_map;
 	}
 
+	/* Read data ... */
+	read_relmap_file(mapfilename, map, lock_held);
+}
+
+/*
+ * Helper function for write_relmap_file, Read comments atop write_relmap_file
+ * for more details.  The CRC should be computed by the caller and stored in
+ * the newmap.
+ */
+static void
+write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
+						   bool write_wal, bool send_sinval,
+						   bool preserve_files, Oid dbid, Oid tsid,
+						   const char *dbpath)
+{
+	int			fd;
+
+	/*
+	 * Open the target file.  We prefer to do this before entering the
+	 * critical section, so that an open() failure need not force PANIC.
+	 */
 	fd = OpenTransientFile(mapfilename, O_WRONLY | O_CREAT | PG_BINARY);
 	if (fd < 0)
 		ereport(ERROR,
@@ -934,6 +911,68 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		}
 	}
 
+	/* Critical section done */
+	if (write_wal)
+		END_CRIT_SECTION();
+}
+
+/*
+ * Write out a new shared or local map file with the given contents.
+ *
+ * The magic number and CRC are automatically updated in *newmap.  On
+ * success, we copy the data to the appropriate permanent static variable.
+ *
+ * If write_wal is true then an appropriate WAL message is emitted.
+ * (It will be false for bootstrap and WAL replay cases.)
+ *
+ * If send_sinval is true then a SI invalidation message is sent.
+ * (This should be true except in bootstrap case.)
+ *
+ * If preserve_files is true then the storage manager is warned not to
+ * delete the files listed in the map.
+ *
+ * Because this may be called during WAL replay when MyDatabaseId,
+ * DatabasePath, etc aren't valid, we require the caller to pass in suitable
+ * values.  The caller is also responsible for being sure no concurrent
+ * map update could be happening.
+ */
+static void
+write_relmap_file(bool shared, RelMapFile *newmap,
+				  bool write_wal, bool send_sinval, bool preserve_files,
+				  Oid dbid, Oid tsid, const char *dbpath)
+{
+	RelMapFile *realmap;
+	char		mapfilename[MAXPGPATH];
+
+	/*
+	 * Fill in the overhead fields and update CRC.
+	 */
+	newmap->magic = RELMAPPER_FILEMAGIC;
+	if (newmap->num_mappings < 0 || newmap->num_mappings > MAX_MAPPINGS)
+		elog(ERROR, "attempt to write bogus relation mapping");
+
+	INIT_CRC32C(newmap->crc);
+	COMP_CRC32C(newmap->crc, (char *) newmap, offsetof(RelMapFile, crc));
+	FIN_CRC32C(newmap->crc);
+
+	if (shared)
+	{
+		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
+				 RELMAPPER_FILENAME);
+		realmap = &shared_map;
+	}
+	else
+	{
+		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+				 dbpath, RELMAPPER_FILENAME);
+		realmap = &local_map;
+	}
+
+	/* Write the map to the relmap file. */
+	write_relmap_file_internal(mapfilename, newmap, write_wal,
+							   send_sinval, preserve_files, dbid, tsid,
+							   dbpath);
+
 	/*
 	 * Success, update permanent copy.  During bootstrap, we might be working
 	 * on the permanent copy itself, in which case skip the memcpy() to avoid
@@ -943,10 +982,6 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		memcpy(realmap, newmap, sizeof(RelMapFile));
 	else
 		Assert(!send_sinval);	/* must be bootstrapping */
-
-	/* Critical section done */
-	if (write_wal)
-		END_CRIT_SECTION();
 }
 
 /*
-- 
1.8.3.1
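
The shape of the refactoring in this patch can be illustrated outside PostgreSQL. The following is a minimal sketch, assuming a toy map-file struct and a stand-in checksum (not CRC-32C): a path-agnostic reader and writer take the file name and the map as arguments, and a thin wrapper fills in the overhead fields and resolves the path. All names here are hypothetical and the real functions additionally handle WAL logging, fsync, sinval, and locking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_MAGIC        0x534B4554  /* toy value, not RELMAPPER_FILEMAGIC */
#define SKETCH_MAX_MAPPINGS 4
#define SKETCH_FILENAME     "sketch_relmap_demo"

typedef struct SketchMapFile
{
    int32_t     magic;
    int32_t     num_mappings;
    uint32_t    mappings[SKETCH_MAX_MAPPINGS][2];   /* {oid, filenode} pairs */
    uint32_t    crc;            /* checksum of all bytes before this field */
} SketchMapFile;

/* Stand-in checksum over the bytes preceding the crc field. */
static uint32_t
sketch_checksum(const SketchMapFile *map)
{
    const unsigned char *p = (const unsigned char *) map;
    uint32_t    sum = 0;

    for (size_t i = 0; i < offsetof(SketchMapFile, crc); i++)
        sum = sum * 31 + p[i];
    return sum;
}

/* Path-agnostic writer: the caller has already set magic and crc. */
static int
sketch_write_map_internal(const char *path, const SketchMapFile *map)
{
    FILE       *fp = fopen(path, "wb");
    size_t      n;

    if (fp == NULL)
        return -1;
    n = fwrite(map, sizeof(*map), 1, fp);
    if (fclose(fp) != 0 || n != 1)
        return -1;
    return 0;
}

/* Path-agnostic reader: fills *map and validates it; touches no globals. */
static int
sketch_read_map(const char *path, SketchMapFile *map)
{
    FILE       *fp = fopen(path, "rb");
    size_t      n;

    assert(map != NULL);
    if (fp == NULL)
        return -1;
    n = fread(map, sizeof(*map), 1, fp);
    fclose(fp);
    if (n != 1 || map->magic != SKETCH_MAGIC ||
        map->num_mappings < 0 || map->num_mappings > SKETCH_MAX_MAPPINGS ||
        map->crc != sketch_checksum(map))
        return -1;
    return 0;
}

/* Wrapper: fill overhead fields, resolve the path, then delegate. */
static int
sketch_write_map(const char *dir, SketchMapFile *map)
{
    char        path[1024];

    if (map->num_mappings < 0 || map->num_mappings > SKETCH_MAX_MAPPINGS)
        return -1;
    map->magic = SKETCH_MAGIC;
    map->crc = sketch_checksum(map);
    snprintf(path, sizeof(path), "%s/%s", dir, SKETCH_FILENAME);
    return sketch_write_map_internal(path, map);
}
```

Because the internal functions take the path and the map explicitly, a caller can read or write the map of a directory other than the one the wrapper would pick, which is the property the later patches rely on.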

v9-0005-New-interface-to-lock-relation-id.patch (text/x-patch; charset=US-ASCII)

From fccc8fc085997867485ed7f776bbeb432d02914c Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 24 Sep 2021 18:29:17 +0530
Subject: [PATCH v9 5/6] New interface to lock relation id

Currently, LockRelationOid provides a mechanism to lock a relation
by OID, but we must be connected to the database to which the
relation belongs.  This patch provides a new interface that can
lock a relation even if we are not connected to the containing
database.
---
 src/backend/storage/lmgr/lmgr.c | 28 ++++++++++++++++++++++++++++
 src/include/storage/lmgr.h      |  1 +
 2 files changed, 29 insertions(+)

diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 5ae52dd..1543da6 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -176,6 +176,34 @@ ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode)
 }
 
 /*
+ *		LockRelationId
+ *
+ * Lock, given a LockRelId.  Same as LockRelationOid but take LockRelId as an
+ * input.
+ */
+void
+LockRelationId(LockRelId *relid, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+	LOCALLOCK  *locallock;
+	LockAcquireResult res;
+
+	SET_LOCKTAG_RELATION(tag, relid->dbId, relid->relId);
+
+	res = LockAcquireExtended(&tag, lockmode, false, false, true, &locallock);
+
+	/*
+	 * Now that we have the lock, check for invalidation messages; see notes
+	 * in LockRelationOid.
+	 */
+	if (res != LOCKACQUIRE_ALREADY_CLEAR)
+	{
+		AcceptInvalidationMessages();
+		MarkLockClear(locallock);
+	}
+}
+
+/*
  *		UnlockRelationId
  *
  * Unlock, given a LockRelId.  This is preferred over UnlockRelationOid
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 49edbcc..be1d2c9 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -38,6 +38,7 @@ extern void RelationInitLockInfo(Relation relation);
 
 /* Lock a relation */
 extern void LockRelationOid(Oid relid, LOCKMODE lockmode);
+extern void LockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern bool ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode);
 extern void UnlockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern void UnlockRelationOid(Oid relid, LOCKMODE lockmode);
-- 
1.8.3.1

v9-0003-Refactor-index_copy_data.patch (text/x-patch; charset=US-ASCII)

From 1218f1b1cae165d441cab1113fc9ea02675d2ec4 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 24 Sep 2021 18:13:25 +0530
Subject: [PATCH v9 3/6] Refactor index_copy_data

Make a separate interface for copying relation storage; this will
be used by a later patch for copying the database relations.
---
 src/backend/commands/tablecmds.c | 68 +++++++++++++++++++++++++---------------
 src/include/commands/tablecmds.h |  5 +++
 src/tools/pgindent/typedefs.list |  1 +
 3 files changed, 48 insertions(+), 26 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 3e83f37..a57d6b0 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14580,54 +14580,70 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
 	return new_tablespaceoid;
 }
 
-static void
-index_copy_data(Relation rel, RelFileNode newrnode)
+/*
+ * Copy source smgr relation's all fork's data to the destination.
+ *
+ * copy_storage - storage copy function, which is passed by the caller.
+ */
+void
+RelationCopyAllFork(SMgrRelation src_smgr, SMgrRelation	dst_smgr,
+					char relpersistence, copy_relation_storage copy_storage)
 {
-	SMgrRelation dstrel;
-
-	dstrel = smgropen(newrnode, rel->rd_backend);
-
 	/*
-	 * Since we copy the file directly without looking at the shared buffers,
-	 * we'd better first flush out any pages of the source relation that are
-	 * in shared buffers.  We assume no new changes will be made while we are
-	 * holding exclusive lock on the rel.
-	 */
-	FlushRelationBuffers(rel);
-
-	/*
-	 * Create and copy all forks of the relation, and schedule unlinking of
-	 * old physical files.
+	 * Create and copy all forks of the relation.
 	 *
 	 * NOTE: any conflict in relfilenode value will be caught in
 	 * RelationCreateStorage().
 	 */
-	RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
+	RelationCreateStorage(dst_smgr->smgr_rnode.node, relpersistence);
 
 	/* copy main fork */
-	RelationCopyStorage(RelationGetSmgr(rel), dstrel, MAIN_FORKNUM,
-						rel->rd_rel->relpersistence);
+	copy_storage(src_smgr, dst_smgr, MAIN_FORKNUM, relpersistence);
 
 	/* copy those extra forks that exist */
 	for (ForkNumber forkNum = MAIN_FORKNUM + 1;
 		 forkNum <= MAX_FORKNUM; forkNum++)
 	{
-		if (smgrexists(RelationGetSmgr(rel), forkNum))
+		if (smgrexists(src_smgr, forkNum))
 		{
-			smgrcreate(dstrel, forkNum, false);
+			smgrcreate(dst_smgr, forkNum, false);
 
 			/*
 			 * WAL log creation if the relation is persistent, or this is the
 			 * init fork of an unlogged relation.
 			 */
-			if (RelationIsPermanent(rel) ||
-				(rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
+			if (relpersistence == RELPERSISTENCE_PERMANENT ||
+				(relpersistence == RELPERSISTENCE_UNLOGGED &&
 				 forkNum == INIT_FORKNUM))
-				log_smgrcreate(&newrnode, forkNum);
-			RelationCopyStorage(RelationGetSmgr(rel), dstrel, forkNum,
-								rel->rd_rel->relpersistence);
+				log_smgrcreate(&dst_smgr->smgr_rnode.node, forkNum);
+
+			/* Copy a fork's data, block by block. */
+			copy_storage(src_smgr, dst_smgr, forkNum, relpersistence);
 		}
 	}
+}
+
+static void
+index_copy_data(Relation rel, RelFileNode newrnode)
+{
+	SMgrRelation dstrel;
+
+	dstrel = smgropen(newrnode, rel->rd_backend);
+
+	/*
+	 * Since we copy the file directly without looking at the shared buffers,
+	 * we'd better first flush out any pages of the source relation that are
+	 * in shared buffers.  We assume no new changes will be made while we are
+	 * holding exclusive lock on the rel.
+	 */
+	FlushRelationBuffers(rel);
+
+	/*
+	 * Create and copy all forks of the relation, and schedule unlinking of
+	 * old physical files.
+	 */
+	RelationCopyAllFork(RelationGetSmgr(rel), dstrel,
+						rel->rd_rel->relpersistence, RelationCopyStorage);
 
 	/* drop old relation, and close new one */
 	RelationDropStorage(rel);
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 5d4037f..cd49471 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -19,10 +19,13 @@
 #include "catalog/objectaddress.h"
 #include "nodes/parsenodes.h"
 #include "storage/lock.h"
+#include "storage/smgr.h"
 #include "utils/relcache.h"
 
 struct AlterTableUtilityContext;	/* avoid including tcop/utility.h here */
 
+typedef void (*copy_relation_storage) (SMgrRelation src, SMgrRelation dst,
+									  ForkNumber forkNum, char relpersistence);
 
 extern ObjectAddress DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 									ObjectAddress *typaddress, const char *queryString);
@@ -42,6 +45,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
 
 extern Oid	AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
 
+extern void RelationCopyAllFork(SMgrRelation src_smgr, SMgrRelation	dst_smgr,
+								char relpersistence, copy_relation_storage copy_storage);
 extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
 										 Oid *oldschema);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index bfb7802..c44b5a9 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3055,6 +3055,7 @@ config_var_value
 contain_aggs_of_level_context
 convert_testexpr_context
 copy_data_source_cb
+copy_relation_storage
 core_YYSTYPE
 core_yy_extra_type
 core_yyscan_t
-- 
1.8.3.1
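
The key design choice in this patch is that the fork-iteration logic is written once and the per-fork copy routine is passed in as a function pointer. The following is a minimal sketch of that pattern using toy in-memory "relations"; the struct, the fork count, and the two copy routines are hypothetical stand-ins, and the counter merely simulates the extra work a WAL-logging copy routine would do.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_NUM_FORKS 4      /* fork 0 plays the role of the main fork */
#define SKETCH_FORK_SIZE 16

/* Toy relation: a few fixed-size forks, each of which may or may not exist. */
typedef struct SketchRelation
{
    bool        exists[SKETCH_NUM_FORKS];
    char        data[SKETCH_NUM_FORKS][SKETCH_FORK_SIZE];
} SketchRelation;

/* Callback type: copy one fork from src to dst. */
typedef void (*sketch_copy_fork_fn) (const SketchRelation *src,
                                     SketchRelation *dst, int fork);

/* Counts forks copied by the "logged" routine (stands in for WAL records). */
static int  sketch_logged_copies = 0;

/* Plain copy, analogous to a raw storage-level copy. */
static void
sketch_copy_plain(const SketchRelation *src, SketchRelation *dst, int fork)
{
    memcpy(dst->data[fork], src->data[fork], SKETCH_FORK_SIZE);
}

/* Copy that also records that it happened. */
static void
sketch_copy_logged(const SketchRelation *src, SketchRelation *dst, int fork)
{
    memcpy(dst->data[fork], src->data[fork], SKETCH_FORK_SIZE);
    sketch_logged_copies++;
}

/* Shared driver: main fork first, then every extra fork that exists. */
static void
sketch_copy_all_forks(const SketchRelation *src, SketchRelation *dst,
                      sketch_copy_fork_fn copy_fork)
{
    assert(src->exists[0]);     /* the main fork is always copied */
    dst->exists[0] = true;
    copy_fork(src, dst, 0);

    for (int fork = 1; fork < SKETCH_NUM_FORKS; fork++)
    {
        if (!src->exists[fork])
            continue;
        dst->exists[fork] = true;
        copy_fork(src, dst, fork);
    }
}
```

A caller picks the behavior by passing sketch_copy_plain or sketch_copy_logged; the driver itself does not change, which mirrors how index_copy_data keeps passing RelationCopyStorage while a later patch can supply a different copy function.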

v9-0002-Extend-relmap-interfaces.patch (text/x-patch; charset=US-ASCII)

From a62e427ba8d0ba2f4adaf722e728c70ec58a3621 Mon Sep 17 00:00:00 2001
From: dilipkumar <dilipbalaut@gmail.com>
Date: Mon, 4 Oct 2021 13:50:44 +0530
Subject: [PATCH v9 2/6] Extend relmap interfaces

Support new interfaces in relmapper: 1) an interface for copying the
relmap file from one database path to another database path, and
2) an interface for getting the filenode from an OID.  We already
have RelationMapOidToFilenode for the same purpose, but that assumes
we are connected to the database for which we want to get the mapping.
The new interface does the same, but gets the mapping for the input
database instead.

These interfaces are required by a later patch, which supports
WAL-logged CREATE DATABASE.
---
 src/backend/utils/cache/relmapper.c | 123 +++++++++++++++++++++++++++++++-----
 src/include/utils/relmapper.h       |   6 +-
 2 files changed, 113 insertions(+), 16 deletions(-)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 56495f0..86a85c8 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -141,7 +141,7 @@ static void read_relmap_file(char *mapfilename, RelMapFile *map,
 static void write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
 									   bool write_wal, bool send_sinval,
 									   bool preserve_files, Oid dbid, Oid tsid,
-									   const char *dbpath);
+									   const char *dbpath, bool create);
 static void load_relmap_file(bool shared, bool lock_held);
 static void write_relmap_file(bool shared, RelMapFile *newmap,
 							  bool write_wal, bool send_sinval, bool preserve_files,
@@ -256,6 +256,37 @@ RelationMapFilenodeToOid(Oid filenode, bool shared)
 }
 
 /*
+ * RelationMapOidToFilenodeForDatabase
+ *
+ * Same as RelationMapOidToFilenode, but instead of reading the mapping from
+ * the database we are connected to it will read the mapping from the input
+ * database.
+ */
+Oid
+RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId)
+{
+	RelMapFile	map;
+	int			i;
+	char		mapfilename[MAXPGPATH];
+
+	/* Relmap file path for the given dbpath. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 dbpath, RELMAPPER_FILENAME);
+
+	/* Read the relmap file from the source database. */
+	read_relmap_file(mapfilename, &map, false);
+
+	/* Iterate over the relmap entries to find the input relation oid. */
+	for (i = 0; i < map.num_mappings; i++)
+	{
+		if (relationId == map.mappings[i].mapoid)
+			return map.mappings[i].mapfilenode;
+	}
+
+	return InvalidOid;
+}
+
+/*
  * RelationMapUpdateMap
  *
  * Install a new relfilenode mapping for the specified relation.
@@ -693,7 +724,43 @@ RestoreRelationMap(char *startAddress)
 }
 
 /*
- * read_relmap_file -- read data from given mapfilename file.
+ * CopyRelationMap
+ *
+ * Copy relmapfile from source db path to the destination db path and WAL log
+ * the operation.  This function is only called during the create database, so
+ * the destination database is not yet visible to anyone else, thus we don't
+ * need to acquire the relmap lock while updating the destination relmap.
+ */
+void
+CopyRelationMap(Oid dbid, Oid tsid, char *srcdbpath, char *dstdbpath)
+{
+	RelMapFile map;
+	char mapfilename[MAXPGPATH];
+
+	/* Relmap file path of the source database. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 srcdbpath, RELMAPPER_FILENAME);
+
+	/* Read the relmap file from the source database. */
+	read_relmap_file(mapfilename, &map, false);
+
+	/* Relmap file path of the destination database. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 dstdbpath, RELMAPPER_FILENAME);
+
+	/*
+	 * Write map contents into the destination database's relmap file.
+	 * write_relmap_file_internal, expects that the CRC should have been
+	 * computed and stored in the input map.  But, since we have read this map
+	 * from the source database and directly writing to the destination file
+	 * without updating it so we don't need to recompute it.
+	 */
+	write_relmap_file_internal(mapfilename, &map, true, false, true, dbid,
+							   tsid, dstdbpath, true);
+}
+
+/*
+ * read_relmap_file - read data from given mapfilename file.
  *
  * Because the map file is essential for access to core system catalogs,
  * failure to read it is a fatal error.
@@ -796,15 +863,18 @@ load_relmap_file(bool shared, bool lock_held)
 }
 
 /*
- * Helper function for write_relmap_file, Read comments atop write_relmap_file
- * for more details.  The CRC should be computed by the caller and stored in
- * the newmap.
+ * Helper function for write_relmap_file and CopyRelationMap, Read comments
+ * atop write_relmap_file for more details.  The CRC should be computed by the
+ * caller and stored in the newmap.
+ *
+ * Pass the create = true, if we are copying the relmap file during CREATE
+ * DATABASE command.
  */
 static void
 write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
 						   bool write_wal, bool send_sinval,
 						   bool preserve_files, Oid dbid, Oid tsid,
-						   const char *dbpath)
+						   const char *dbpath, bool create)
 {
 	int			fd;
 
@@ -830,6 +900,7 @@ write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
 		xlrec.dbid = dbid;
 		xlrec.tsid = tsid;
 		xlrec.nbytes = sizeof(RelMapFile);
+		xlrec.create = create;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&xlrec), MinSizeOfRelmapUpdate);
@@ -971,7 +1042,7 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 	/* Write the map to the relmap file. */
 	write_relmap_file_internal(mapfilename, newmap, write_wal,
 							   send_sinval, preserve_files, dbid, tsid,
-							   dbpath);
+							   dbpath, false);
 
 	/*
 	 * Success, update permanent copy.  During bootstrap, we might be working
@@ -1063,15 +1134,37 @@ relmap_redo(XLogReaderState *record)
 		 * Write out the new map and send sinval, but of course don't write a
 		 * new WAL entry.  There's no surrounding transaction to tell to
 		 * preserve files, either.
-		 *
-		 * There shouldn't be anyone else updating relmaps during WAL replay,
-		 * but grab the lock to interlock against load_relmap_file().
 		 */
-		LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
-		write_relmap_file((xlrec->dbid == InvalidOid), &newmap,
-						  false, true, false,
-						  xlrec->dbid, xlrec->tsid, dbpath);
-		LWLockRelease(RelationMappingLock);
+		if (!xlrec->create)
+		{
+			/*
+			 * There shouldn't be anyone else updating relmaps during WAL
+			 * replay, but grab the lock to interlock against
+			 * load_relmap_file().
+			 */
+			LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
+			write_relmap_file((xlrec->dbid == InvalidOid), &newmap,
+							false, true, false,
+							xlrec->dbid, xlrec->tsid, dbpath);
+			LWLockRelease(RelationMappingLock);
+		}
+		else
+		{
+			char		mapfilename[MAXPGPATH];
+
+			/* Construct the mapfilename. */
+			snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+					 dbpath, RELMAPPER_FILENAME);
+
+			/*
+			 * We don't need to take relmap lock because this wal is logged
+			 * while creating a new database, so there could be no one else
+			 * reading/writing the relmap file.
+			 */
+			write_relmap_file_internal(mapfilename, &newmap, false, false,
+									   false, xlrec->dbid, xlrec->tsid, dbpath,
+									   true);
+		}
 
 		pfree(dbpath);
 	}
diff --git a/src/include/utils/relmapper.h b/src/include/utils/relmapper.h
index 9fbb5a7..e5635bd 100644
--- a/src/include/utils/relmapper.h
+++ b/src/include/utils/relmapper.h
@@ -29,6 +29,7 @@ typedef struct xl_relmap_update
 	Oid			dbid;			/* database ID, or 0 for shared map */
 	Oid			tsid;			/* database's tablespace, or pg_global */
 	int32		nbytes;			/* size of relmap data */
+	bool		create;			/* true if creating new relmap */
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } xl_relmap_update;
 
@@ -39,6 +40,8 @@ extern Oid	RelationMapOidToFilenode(Oid relationId, bool shared);
 
 extern Oid	RelationMapFilenodeToOid(Oid relationId, bool shared);
 
+extern Oid RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId);
+
 extern void RelationMapUpdateMap(Oid relationId, Oid fileNode, bool shared,
 								 bool immediate);
 
@@ -62,7 +65,8 @@ extern void RelationMapInitializePhase3(void);
 extern Size EstimateRelationMapSpace(void);
 extern void SerializeRelationMap(Size maxSize, char *startAddress);
 extern void RestoreRelationMap(char *startAddress);
-
+extern void CopyRelationMap(Oid dbid, Oid tsid, char *srcdbpath,
+							char *dstdbpath);
 extern void relmap_redo(XLogReaderState *record);
 extern void relmap_desc(StringInfo buf, XLogReaderState *record);
 extern const char *relmap_identify(uint8 info);
-- 
1.8.3.1
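
The lookup added by this patch is a linear scan over a map that was read from an arbitrary database path rather than from the backend's own cached map. Below is a minimal sketch of just the scan, with hypothetical type names and a fixed-size toy map; the OID and filenode values used with it are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t SketchOid;

#define SKETCH_INVALID_OID  ((SketchOid) 0)
#define SKETCH_MAX_MAPPINGS 8

typedef struct SketchMapping
{
    SketchOid   mapoid;         /* catalog OID */
    SketchOid   mapfilenode;    /* its on-disk filenode */
} SketchMapping;

typedef struct SketchRelMap
{
    int32_t     num_mappings;
    SketchMapping mappings[SKETCH_MAX_MAPPINGS];
} SketchRelMap;

/*
 * Return the filenode mapped to relid in the caller-supplied map, or
 * SKETCH_INVALID_OID if the map has no entry for it.  The map is an
 * explicit argument, so it can describe any database.
 */
static SketchOid
sketch_oid_to_filenode(const SketchRelMap *map, SketchOid relid)
{
    assert(map->num_mappings >= 0 &&
           map->num_mappings <= SKETCH_MAX_MAPPINGS);

    for (int i = 0; i < map->num_mappings; i++)
    {
        if (map->mappings[i].mapoid == relid)
            return map->mappings[i].mapfilenode;
    }
    return SKETCH_INVALID_OID;
}
```

A caller must treat the invalid-OID result as "not a mapped relation" rather than as an error in the map, which is also how the function in the patch reports a missing entry.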

v9-0004-Extend-bufmgr-interfaces.patch (text/x-patch; charset=US-ASCII)

From 89681dbe5ce911c196b2b3036d666626d81da54c Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Thu, 10 Feb 2022 15:55:33 +0530
Subject: [PATCH v9 4/6] Extend bufmgr interfaces

Extend the ReadBufferWithoutRelcache interface to take relpersistence
as input.  At present, this function may only be used on permanent
relations, because we only use it during XLOG replay.  But as part
of the bigger patch set, we will use it to read buffers from a
database to which we are not connected, so we might now encounter
temporary and unlogged relations as well.
---
 src/backend/access/transam/xlogutils.c |  9 ++++++---
 src/backend/storage/buffer/bufmgr.c    | 13 +++----------
 src/include/storage/bufmgr.h           |  3 ++-
 3 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 90e1c483..f48656e 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -484,7 +484,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	{
 		/* page exists in file */
 		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-										   mode, NULL);
+										   mode, NULL,
+										   RELPERSISTENCE_PERMANENT);
 	}
 	else
 	{
@@ -509,7 +510,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				ReleaseBuffer(buffer);
 			}
 			buffer = ReadBufferWithoutRelcache(rnode, forknum,
-											   P_NEW, mode, NULL);
+											   P_NEW, mode, NULL,
+											   RELPERSISTENCE_PERMANENT);
 		}
 		while (BufferGetBlockNumber(buffer) < blkno);
 		/* Handle the corner case that P_NEW returns non-consecutive pages */
@@ -519,7 +521,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			ReleaseBuffer(buffer);
 			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-											   mode, NULL);
+											   mode, NULL,
+											   RELPERSISTENCE_PERMANENT);
 		}
 	}
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c6..d6d366a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -771,24 +771,17 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 /*
  * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
  *		a relcache entry for the relation.
- *
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay.  If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
  */
 Buffer
 ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
-						  BufferAccessStrategy strategy)
+						  BufferAccessStrategy strategy, char relpersistence)
 {
 	bool		hit;
 
 	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
 
-	Assert(InRecovery);
-
-	return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
+	return ReadBuffer_common(smgr, relpersistence, forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -798,7 +791,7 @@ ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
  *
  * *hit is set to true if the request was satisfied from shared buffer cache.
  */
-static Buffer
+Buffer
 ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 				  BlockNumber blockNum, ReadBufferMode mode,
 				  BufferAccessStrategy strategy, bool *hit)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index dd01841..7b80f58 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -184,7 +184,8 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BufferAccessStrategy strategy);
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 										ForkNumber forkNum, BlockNumber blockNum,
-										ReadBufferMode mode, BufferAccessStrategy strategy);
+										ReadBufferMode mode, BufferAccessStrategy strategy,
+										char relpersistence);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
-- 
1.8.3.1
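
Reading aid for the patch above: now that the caller passes relpersistence to ReadBufferWithoutRelcache, the block-copy code in the 0006 patch also uses that value, together with the fork number, to decide whether a copied page must be WAL-logged. The following is only a minimal standalone sketch of that decision. The function name copy_needs_wal and the xlog_is_needed parameter are hypothetical (the patch calls XLogIsNeeded() directly), and the constants are local stand-ins that mirror PostgreSQL's definitions so the snippet compiles on its own.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins mirroring PostgreSQL's persistence codes. */
#define RELPERSISTENCE_PERMANENT 'p'
#define RELPERSISTENCE_UNLOGGED  'u'
#define RELPERSISTENCE_TEMP      't'

/* Local stand-in mirroring the order of PostgreSQL's fork numbers. */
typedef enum ForkNumber
{
	MAIN_FORKNUM = 0,
	FSM_FORKNUM,
	VISIBILITYMAP_FORKNUM,
	INIT_FORKNUM
} ForkNumber;

/*
 * Hypothetical helper: returns true if a block copied for this relation
 * fork should be WAL-logged.  xlog_is_needed stands in for XLogIsNeeded().
 */
static bool
copy_needs_wal(bool xlog_is_needed, char relpersistence, ForkNumber forknum)
{
	bool		copying_initfork;

	assert(forknum >= MAIN_FORKNUM && forknum <= INIT_FORKNUM);

	/* The init fork of an unlogged relation is logged like permanent data. */
	copying_initfork = (relpersistence == RELPERSISTENCE_UNLOGGED &&
						forknum == INIT_FORKNUM);

	return xlog_is_needed &&
		(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
}
```

Under these assumptions, the main fork of an unlogged table is copied without WAL while its init fork is logged, which is why the persistence can no longer be hard-coded to RELPERSISTENCE_PERMANENT once the function is used outside recovery.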

Attachment: v9-0006-WAL-logged-CREATE-DATABASE.patch (text/x-patch; charset=US-ASCII)

From 8b4769d8720eeb15402231fcbc4738e2e651da19 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 14 Feb 2022 17:48:03 +0530
Subject: [PATCH v9 6/6] WAL logged CREATE DATABASE

Currently, CREATE DATABASE forces a checkpoint, then copies all the files,
then forces another checkpoint. The comments in the createdb() function
explain the reasons for this. The attached patch fixes this problem by making
create database completely WAL logged so that we can avoid the checkpoints.

The old method of creating the database is still available through a new
STRATEGY option.  With STRATEGY = wal_log the database is copied block by
block and every copied block is WAL-logged; with STRATEGY = file_copy the
old checkpoint-based file copy is used.  The default is wal_log.
---
 doc/src/sgml/ref/create_database.sgml  |  22 +
 src/backend/commands/dbcommands.c      | 840 +++++++++++++++++++++++++++------
 src/include/commands/dbcommands_xlog.h |   8 +
 src/tools/pgindent/typedefs.list       |   1 +
 4 files changed, 739 insertions(+), 132 deletions(-)
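
Reading aid (not part of the patch): two small decisions in the diff below are easy to miss. createdb() accepts only the lower-case strategy names "wal_log" and "file_copy", and CopyDatabaseWithWal() chooses a destination tablespace for each relation with a three-way rule. The sketch below restates both as standalone C under stated assumptions. The helper names parse_createdb_strategy and choose_dst_tablespace are hypothetical, and Oid is a local typedef so the snippet compiles without PostgreSQL headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

typedef unsigned int Oid;		/* local stand-in for PostgreSQL's Oid */

typedef enum CreateDBStrategy
{
	CREATEDB_WAL_LOG = 0,
	CREATEDB_FILE_COPY = 1
} CreateDBStrategy;

/*
 * Hypothetical helper: map a strategy name to the enum.  Returns false for
 * an unknown name, where the patch itself raises an error.
 */
static bool
parse_createdb_strategy(const char *name, CreateDBStrategy *out)
{
	assert(name != NULL && out != NULL);

	if (strcmp(name, "wal_log") == 0)
		*out = CREATEDB_WAL_LOG;
	else if (strcmp(name, "file_copy") == 0)
		*out = CREATEDB_FILE_COPY;
	else
		return false;
	return true;
}

/*
 * Hypothetical helper: pick the destination tablespace for one relation.
 * Returns false when the relation must be skipped, which happens only in
 * the ALTER DATABASE ... SET TABLESPACE case (source db == destination db)
 * for relations outside the source default tablespace.
 */
static bool
choose_dst_tablespace(Oid rel_spc, Oid src_dboid, Oid dst_dboid,
					  Oid src_tsid, Oid dst_tsid, Oid *dst_spc)
{
	assert(dst_spc != NULL);

	if (rel_spc == src_tsid)
	{
		/* Default tablespace of the source maps to that of the target. */
		*dst_spc = dst_tsid;
		return true;
	}
	if (src_dboid == dst_dboid)
		return false;			/* relation stays where it is */

	/* Relations in any other tablespace keep their tablespace. */
	*dst_spc = rel_spc;
	return true;
}
```

With these stand-ins, a name such as "WAL_LOG_BLOCK" is rejected, so any documentation for the option has to use the names the code actually compares against.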

diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index f70d0c7..7c7bc0e 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -34,6 +34,7 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
            [ CONNECTION LIMIT [=] <replaceable class="parameter">connlimit</replaceable> ]
            [ IS_TEMPLATE [=] <replaceable class="parameter">istemplate</replaceable> ]
            [ OID [=] <replaceable class="parameter">oid</replaceable> ] ]
+           [ STRATEGY [=] <replaceable class="parameter">strategy</replaceable> ] ]
 </synopsis>
  </refsynopsisdiv>
 
@@ -240,6 +241,27 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
        </listitem>
       </varlistentry>
 
+      <varlistentry>
+       <term><replaceable class="parameter">strategy</replaceable></term>
+       <listitem>
+        <para>
+         The strategy used to copy the template database's directory.  Two
+         strategies are currently available: <literal>WAL_LOG</literal> and
+         <literal>FILE_COPY</literal>.  With the <literal>WAL_LOG</literal>
+         strategy, the database is copied block by block and each copied
+         block is also WAL-logged.  With the <literal>FILE_COPY</literal>
+         strategy, a file-system-level copy is performed instead, so
+         individual blocks are not WAL-logged.  The <literal>FILE_COPY</literal>
+         strategy has to issue a checkpoint before and after the copy; if
+         shared buffers are large and contain many dirty buffers, those
+         checkpoints are costly and may affect the performance of the whole
+         system.  On the other hand, because <literal>WAL_LOG</literal> logs
+         every block, creating a database from a large source database may
+         take more time.
+        </para>
+       </listitem>
+      </varlistentry>
+
     </variablelist>
 
   <para>
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index c37e3c9..cf5f475 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -46,6 +46,7 @@
 #include "commands/dbcommands_xlog.h"
 #include "commands/defrem.h"
 #include "commands/seclabel.h"
+#include "commands/tablecmds.h"
 #include "commands/tablespace.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
@@ -63,13 +64,27 @@
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/pg_locale.h"
+#include "utils/relmapper.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
+/*
+ * Create database strategy.  The CREATEDB_WAL_LOG will copy the database at
+ * the block level and WAL log each copied block.  Whereas the
+ * CREATEDB_FILE_COPY will directly copy the database at the file level and no
+ * individual operations will be WAL logged.
+ */
+typedef enum CreateDBStrategy
+{
+	CREATEDB_WAL_LOG = 0,
+	CREATEDB_FILE_COPY = 1
+} CreateDBStrategy;
+
 typedef struct
 {
 	Oid			src_dboid;		/* source (template) DB */
 	Oid			dest_dboid;		/* DB we are trying to create */
+	CreateDBStrategy	strategy;	/* create db strategy */
 } createdb_failure_params;
 
 typedef struct
@@ -78,6 +93,19 @@ typedef struct
 	Oid			dest_tsoid;		/* tablespace we are trying to move to */
 } movedb_failure_params;
 
+/*
+ * When creating a database, we scan the pg_class of the source database to
+ * identify all the relations to be copied.  This structure stores
+ * information about each relation of the source database.
+ */
+typedef struct CreateDBRelInfo
+{
+	RelFileNode		rnode;				/* physical relation identifier */
+	Oid				reloid;				/* relation oid */
+	char			relpersistence;		/* relation's persistence level */
+} CreateDBRelInfo;
+
+
 /* non-export function prototypes */
 static void createdb_failure_callback(int code, Datum arg);
 static void movedb(const char *dbname, const char *tblspcname);
@@ -92,7 +120,588 @@ static bool have_createdb_privilege(void);
 static void remove_dbtablespaces(Oid db_id);
 static bool check_db_file_conflict(Oid db_id);
 static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
+static void CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid,
+									bool isRedo);
+static List *GetRelListFromPage(Page page, Buffer buf, Oid tbid, Oid dbid,
+								char *srcpath, List *rnodelist, Snapshot
+								snapshot);
+static List *GetDatabaseRelationList(Oid srctbid, Oid srcdbid, char *srcpath);
+static void RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+									ForkNumber forkNum, char relpersistence);
+static void CopyDatabaseWithWal(Oid src_dboid, Oid dboid, Oid src_tsid,
+								Oid dst_tsid);
+static void CopyDatabase(Oid src_dboid, Oid dboid, Oid src_tsid, Oid dst_tsid);
+
+/*
+ * CreateDirAndVersionFile - Create database directory and write out the
+ *							 PG_VERSION file in the database path.
+ *
+ * If isRedo is true, it's okay for the database directory to exist already.
+ *
+ * We can directly write PG_MAJORVERSION in the version file instead of copying
+ * from the source database file because these two must be the same.
+ */
+static void
+CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
+{
+	int		fd;
+	int		nbytes;
+	char	versionfile[MAXPGPATH];
+	char	buf[16];
+
+	/* Prepare version data before starting a critical section. */
+	sprintf(buf, "%s\n", PG_MAJORVERSION);
+	nbytes = strlen(PG_MAJORVERSION) + 1;
+
+	/* If we are not in WAL replay then write the WAL. */
+	if (!isRedo)
+	{
+		xl_dbase_create_rec xlrec;
+		XLogRecPtr	lsn;
+
+		/* Now errors are fatal ... */
+		START_CRIT_SECTION();
+
+		xlrec.db_id = dbid;
+		xlrec.tablespace_id = tsid;
+		xlrec.src_db_id = InvalidOid;
+		xlrec.src_tablespace_id = InvalidOid;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), sizeof(xl_dbase_create_rec));
+
+		lsn = XLogInsert(RM_DBASE_ID, XLOG_DBASE_CREATE);
+
+		/* As always, WAL must hit the disk before the data update does. */
+		XLogFlush(lsn);
+	}
+
+	/* Create database directory. */
+	if (MakePGDirectory(dbpath) < 0)
+	{
+		/* Failure other than already exists or not in WAL replay? */
+		if (errno != EEXIST || !isRedo)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create directory \"%s\": %m", dbpath)));
+	}
+
+	/*
+	 * Create PG_VERSION file in the database path.  If the file already exists
+	 * and we are in WAL replay then try again to open it in write mode.
+	 */
+	snprintf(versionfile, sizeof(versionfile), "%s/%s", dbpath, "PG_VERSION");
+
+	fd = OpenTransientFile(versionfile, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	if (fd < 0 && errno == EEXIST && isRedo)
+		fd = OpenTransientFile(versionfile, O_WRONLY | O_TRUNC | PG_BINARY);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m", versionfile)));
+
+	/* Write PG_MAJORVERSION in the PG_VERSION file. */
+	pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_WRITE);
+	errno = 0;
+	if ((int) write(fd, buf, nbytes) != nbytes)
+	{
+		/* If write didn't set errno, assume problem is no disk space. */
+		if (errno == 0)
+			errno = ENOSPC;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", versionfile)));
+	}
+	pgstat_report_wait_end();
+
+	/* Close the version file. */
+	CloseTransientFile(fd);
+
+	/* Critical section done. */
+	if (!isRedo)
+		END_CRIT_SECTION();
+}
+
+/*
+ * GetRelListFromPage - Helper function for GetDatabaseRelationList.
+ *
+ * Iterate over each tuple on the given pg_class page and append the info of
+ * every visible, non-shared relation that has storage to rnodelist.
+ */
+static List *
+GetRelListFromPage(Page page, Buffer buf, Oid tbid, Oid dbid, char *srcpath,
+				  List *rnodelist, Snapshot snapshot)
+{
+	BlockNumber		blkno = BufferGetBlockNumber(buf);
+	OffsetNumber	offnum;
+	OffsetNumber	maxoff;
+	HeapTupleData	tuple;
+	Form_pg_class	classForm;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Iterate over each tuple on the page. */
+	for (offnum = FirstOffsetNumber;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid;
+
+		itemid = PageGetItemId(page, offnum);
+
+		/* Nothing to do if slot is empty or already dead. */
+		if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid) ||
+			ItemIdIsRedirected(itemid))
+			continue;
+
+		Assert(ItemIdIsNormal(itemid));
+		ItemPointerSet(&(tuple.t_self), blkno, offnum);
+
+		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+		tuple.t_len = ItemIdGetLength(itemid);
+		tuple.t_tableOid = RelationRelationId;
+
+		/*
+		 * If the tuple is visible then add its relfilenode info to the
+		 * list.
+		 */
+		if (HeapTupleSatisfiesVisibility(&tuple, snapshot, buf))
+		{
+			Oid				relfilenode = InvalidOid;
+			CreateDBRelInfo   *relinfo;
+
+			classForm = (Form_pg_class) GETSTRUCT(&tuple);
+
+			/* We don't need to copy the shared objects to the target. */
+			if (classForm->reltablespace == GLOBALTABLESPACE_OID)
+				continue;
+
+			/*
+			 * If the object has no storage then there is nothing to
+			 * copy for it, so just skip it.
+			 */
+			if (!RELKIND_HAS_STORAGE(classForm->relkind))
+				continue;
+
+			/*
+			 * If relfilenode is valid then directly use it.  Otherwise,
+			 * consult the relmapper for the mapped relation.
+			 */
+			if (OidIsValid(classForm->relfilenode))
+				relfilenode = classForm->relfilenode;
+			else
+				relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+												classForm->oid);
+
+			/* We must have a valid relfilenode oid. */
+			Assert(OidIsValid(relfilenode));
+
+			/* Prepare a rel info element and add it to the list. */
+			relinfo = (CreateDBRelInfo *) palloc(sizeof(CreateDBRelInfo));
+			if (OidIsValid(classForm->reltablespace))
+				relinfo->rnode.spcNode = classForm->reltablespace;
+			else
+				relinfo->rnode.spcNode = tbid;
+
+			relinfo->rnode.dbNode = dbid;
+			relinfo->rnode.relNode = relfilenode;
+			relinfo->reloid = classForm->oid;
+			relinfo->relpersistence = classForm->relpersistence;
+
+			/* Add it to the list. */
+			rnodelist = lappend(rnodelist, relinfo);
+		}
+	}
+
+	return rnodelist;
+}
+
+/*
+ * GetDatabaseRelationList - Get relfilenode list to be copied.
+ *
+ * Iterate over each block of the pg_class relation.  From there, we will check
+ * all the visible tuples in order to get a list of all the valid relfilenodes
+ * in the source database that should be copied to the target database.
+ */
+static List *
+GetDatabaseRelationList(Oid tbid, Oid dbid, char *srcpath)
+{
+	SMgrRelation	rd_smgr;
+	RelFileNode		rnode;
+	BlockNumber		nblocks;
+	BlockNumber		blkno;
+	Buffer			buf;
+	Oid				relfilenode;
+	Page			page;
+	List		   *rnodelist = NIL;
+	LockRelId		relid;
+	Snapshot		snapshot;
+	BufferAccessStrategy bstrategy;
+
+	/* Get pg_class relfilenode. */
+	relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+													  RelationRelationId);
+	/*
+	 * We are going to read the buffers associated with the pg_class relation.
+	 * Thus, acquire the relation-level lock before starting the scan.  As we
+	 * are not connected to the database, we cannot use relation_open directly,
+	 * so we have to lock using the relation id.
+	 */
+	relid.dbId = dbid;
+	relid.relId = RelationRelationId;
+	LockRelationId(&relid, AccessShareLock);
+
+	/* Prepare a relnode for pg_class relation. */
+	rnode.spcNode = tbid;
+	rnode.dbNode = dbid;
+	rnode.relNode = relfilenode;
+
+	/*
+	 * We are not connected to the source database so open the pg_class
+	 * relation at the smgr level and get the block count.
+	 */
+	rd_smgr = smgropen(rnode, InvalidBackendId);
+	nblocks = smgrnblocks(rd_smgr, MAIN_FORKNUM);
+
+	/*
+	 * We're going to read the whole pg_class so better to use bulk-read buffer
+	 * access strategy.
+	 */
+	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	/* Get current active snapshot for scanning the pg_class. */
+	snapshot = GetActiveSnapshot();
+
+	/* Iterate over each block on the pg_class relation. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		/*
+		 * We are not connected to the source database so directly use the lower
+		 * level bufmgr interface which operates on the rnode.
+		 */
+		buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
+										RBM_NORMAL, bstrategy,
+										RELPERSISTENCE_PERMANENT);
+
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buf);
+		if (PageIsNew(page) || PageIsEmpty(page))
+		{
+			UnlockReleaseBuffer(buf);
+			continue;
+		}
+
+		/*
+		 * Process pg_class tuple for the current page and add all the valid
+		 * relfilenode entries to the rnodelist.
+		 */
+		rnodelist = GetRelListFromPage(page, buf, tbid, dbid, srcpath,
+									   rnodelist, snapshot);
+
+		/* Release the buffer lock. */
+		UnlockReleaseBuffer(buf);
+	}
+
+	/* Release the lock. */
+	UnlockRelationId(&relid, AccessShareLock);
+
+	return rnodelist;
+}
+
+/*
+ * RelationCopyStorageUsingBuffer - Copy fork's data using bufmgr.
+ *
+ * Same as RelationCopyStorage but instead of using smgrread and smgrextend
+ * this will copy using bufmgr APIs.
+ */
+static void
+RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+							   ForkNumber forkNum, char relpersistence)
+{
+	Buffer		srcBuf;
+	Buffer		dstBuf;
+	Page		srcPage;
+	Page		dstPage;
+	bool		use_wal;
+	bool		copying_initfork;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	BufferAccessStrategy bstrategy_src;
+	BufferAccessStrategy bstrategy_dst;
+
+	/* See the comments in RelationCopyStorage. */
+	copying_initfork = relpersistence == RELPERSISTENCE_UNLOGGED &&
+		forkNum == INIT_FORKNUM;
+	use_wal = XLogIsNeeded() &&
+		(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
+
+	/* Get number of blocks in the source relation. */
+	nblocks = smgrnblocks(src, forkNum);
+
+	/*
+	 * We are going to copy the whole relation from the source to the
+	 * destination, so use the BAS_BULKREAD strategy for the source relation
+	 * and the BAS_BULKWRITE strategy for the destination relation.
+	 */
+	bstrategy_src = GetAccessStrategy(BAS_BULKREAD);
+	bstrategy_dst = GetAccessStrategy(BAS_BULKWRITE);
+
+	/* Iterate over each block of the source relation file. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		/* If we got a cancel signal during the copy of the data, quit */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Read block from source relation. */
+		srcBuf = ReadBufferWithoutRelcache(src->smgr_rnode.node, forkNum,
+										   blkno, RBM_NORMAL, bstrategy_src,
+										   relpersistence);
+		srcPage = BufferGetPage(srcBuf);
+		if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
+		{
+			ReleaseBuffer(srcBuf);
+			continue;
+		}
+
+		/* Use P_NEW to extend the relation. */
+		dstBuf = ReadBufferWithoutRelcache(dst->smgr_rnode.node, forkNum,
+										   P_NEW, RBM_NORMAL, bstrategy_dst,
+										   relpersistence);
+		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
+
+		START_CRIT_SECTION();
+
+		/* Initialize the page and write the data. */
+		dstPage = BufferGetPage(dstBuf);
+		PageInit(dstPage, BufferGetPageSize(dstBuf), 0);
+		memcpy(dstPage, srcPage, BLCKSZ);
+		MarkBufferDirty(dstBuf);
+
+		/* WAL-log the copied page. */
+		if (use_wal)
+			log_newpage_buffer(dstBuf, true);
+
+		END_CRIT_SECTION();
+
+		UnlockReleaseBuffer(dstBuf);
+		ReleaseBuffer(srcBuf);
+	}
+}
+
+/*
+ * CopyDatabaseWithWal - Copy source database to the target database with WAL
+ *
+ * Create target database directory and copy data files from the source
+ * database to the target database, block by block and WAL log all the
+ * operations.
+ */
+static void
+CopyDatabaseWithWal(Oid src_dboid, Oid dst_dboid, Oid src_tsid, Oid dst_tsid)
+{
+	char	   *srcpath;
+	char	   *dstpath;
+	List	   *rnodelist = NULL;
+	ListCell   *cell;
+	LockRelId	relid;
+	RelFileNode	srcrnode;
+	RelFileNode	dstrnode;
+	CreateDBRelInfo	*relinfo;
+
+	/* Get the source database path. */
+	srcpath = GetDatabasePath(src_dboid, src_tsid);
+
+	/* Get the destination database path. */
+	dstpath = GetDatabasePath(dst_dboid, dst_tsid);
+
+	/* Create database directory and write PG_VERSION file. */
+	CreateDirAndVersionFile(dstpath, dst_dboid, dst_tsid, false);
+
+	/* Copy relmap file from source database to the destination database. */
+	CopyRelationMap(dst_dboid, dst_tsid, srcpath, dstpath);
+
+	/* Get list of all valid relnode from the source database. */
+	rnodelist = GetDatabaseRelationList(src_tsid, src_dboid, srcpath);
+	Assert(rnodelist != NIL);
+
+	/*
+	 * The database id is the same for all the relations, so set it before
+	 * entering the loop.
+	 */
+	relid.dbId = src_dboid;
+
+	/*
+	 * Iterate over each relfilenode and copy the relation data block by block
+	 * from source database to the destination database.
+	 */
+	foreach(cell, rnodelist)
+	{
+		SMgrRelation	src_smgr;
+		SMgrRelation	dst_smgr;
+
+		relinfo = lfirst(cell);
+		srcrnode = relinfo->rnode;
+
+		/*
+		 * If the relation is in the source db's default tablespace then we
+		 * need to create it in the destination db's default tablespace.
+		 * Otherwise, we need to create it in the same tablespace as in the
+		 * source database.
+		 */
+		if (srcrnode.spcNode == src_tsid)
+			dstrnode.spcNode = dst_tsid;
+		/*
+		 * In case of ALTER DATABASE SET TABLESPACE we don't need to do
+		 * anything for objects that are not in the source db's default
+		 * tablespace.  The source and destination dboid will be the same in
+		 * case of ALTER DATABASE SET TABLESPACE.
+		 */
+		else if (src_dboid == dst_dboid)
+			continue;
+		else
+			dstrnode.spcNode = srcrnode.spcNode;
+
+		dstrnode.dbNode = dst_dboid;
+		dstrnode.relNode = srcrnode.relNode;
+
+		/* Acquire the lock on the relation before starting the copy. */
+		relid.relId = relinfo->reloid;
+		LockRelationId(&relid, AccessShareLock);
 
+		/* Open the source and the destination relation at smgr level. */
+		src_smgr = smgropen(srcrnode, InvalidBackendId);
+		dst_smgr = smgropen(dstrnode, InvalidBackendId);
+
+		/* Copy relation storage from source to the destination. */
+		RelationCopyAllFork(src_smgr, dst_smgr, relinfo->relpersistence,
+							RelationCopyStorageUsingBuffer);
+
+		/* Release the lock. */
+		UnlockRelationId(&relid, AccessShareLock);
+	}
+
+	list_free_deep(rnodelist);
+}
+
+/*
+ * CopyDatabase - Copy source database to the target database
+ *
+ * Copy source database directory to the destination directory using copydir
+ * operation.
+ */
+static void
+CopyDatabase(Oid src_dboid, Oid dst_dboid, Oid src_tsid, Oid dst_tsid)
+{
+	TableScanDesc	scan;
+	Relation		rel;
+	HeapTuple		tuple;
+
+	/*
+	 * Force a checkpoint before starting the copy. This will force all
+	 * dirty buffers, including those of unlogged tables, out to disk, to
+	 * ensure source database is up-to-date on disk for the copy.
+	 * FlushDatabaseBuffers() would suffice for that, but we also want to
+	 * process any pending unlink requests. Otherwise, if a checkpoint
+	 * happened while we're copying files, a file might be deleted just
+	 * when we're about to copy it, causing the lstat() call in copydir()
+	 * to fail with ENOENT.
+	 */
+	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
+					  CHECKPOINT_WAIT | CHECKPOINT_FLUSH_ALL);
+
+	/*
+	 * Iterate through all tablespaces of the template database, and copy
+	 * each one to the new database.
+	 */
+	rel = table_open(TableSpaceRelationId, AccessShareLock);
+	scan = table_beginscan_catalog(rel, 0, NULL);
+	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	{
+		Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
+		Oid			srctablespace = spaceform->oid;
+		Oid			dsttablespace;
+		char	   *srcpath;
+		char	   *dstpath;
+		struct stat st;
+
+		/* No need to copy global tablespace */
+		if (srctablespace == GLOBALTABLESPACE_OID)
+			continue;
+
+		srcpath = GetDatabasePath(src_dboid, srctablespace);
+
+		if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
+			directory_is_empty(srcpath))
+		{
+			/* Assume we can ignore it */
+			pfree(srcpath);
+			continue;
+		}
+
+		if (srctablespace == src_tsid)
+			dsttablespace = dst_tsid;
+		else
+			dsttablespace = srctablespace;
+
+		dstpath = GetDatabasePath(dst_dboid, dsttablespace);
+
+		/*
+		 * Copy this subdirectory to the new location
+		 *
+		 * We don't need to copy subdirectories
+		 */
+		copydir(srcpath, dstpath, false);
+
+		/* Record the filesystem change in XLOG */
+		{
+			xl_dbase_create_rec xlrec;
+
+			xlrec.db_id = dst_dboid;
+			xlrec.tablespace_id = dsttablespace;
+			xlrec.src_db_id = src_dboid;
+			xlrec.src_tablespace_id = srctablespace;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
+
+			(void) XLogInsert(RM_DBASE_ID,
+							  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
+		}
+	}
+	table_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	/*
+	 * We force a checkpoint before committing.  This effectively means
+	 * that committed XLOG_DBASE_CREATE operations will never need to be
+	 * replayed (at least not in ordinary crash recovery; we still have to
+	 * make the XLOG entry for the benefit of PITR operations). This
+	 * avoids two nasty scenarios:
+	 *
+	 * #1: When PITR is off, we don't XLOG the contents of newly created
+	 * indexes; therefore the drop-and-recreate-whole-directory behavior
+	 * of DBASE_CREATE replay would lose such indexes.
+	 *
+	 * #2: Since we have to recopy the source database during DBASE_CREATE
+	 * replay, we run the risk of copying changes in it that were
+	 * committed after the original CREATE DATABASE command but before the
+	 * system crash that led to the replay.  This is at least unexpected
+	 * and at worst could lead to inconsistencies, eg duplicate table
+	 * names.
+	 *
+	 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
+	 *
+	 * In PITR replay, the first of these isn't an issue, and the second
+	 * is only a risk if the CREATE DATABASE and subsequent template
+	 * database change both occur while a base backup is being taken.
+	 * There doesn't seem to be much we can do about that except document
+	 * it as a limitation.
+	 *
+	 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
+	 * we can avoid this.
+	 */
+	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+}
 
 /*
  * CREATE DATABASE
@@ -100,8 +709,6 @@ static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
 Oid
 createdb(ParseState *pstate, const CreatedbStmt *stmt)
 {
-	TableScanDesc scan;
-	Relation	rel;
 	Oid			src_dboid;
 	Oid			src_owner;
 	int			src_encoding = -1;
@@ -132,6 +739,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	DefElem    *dallowconnections = NULL;
 	DefElem    *dconnlimit = NULL;
 	DefElem    *dcollversion = NULL;
+	DefElem    *dstrategy = NULL;
 	char	   *dbname = stmt->dbname;
 	char	   *dbowner = NULL;
 	const char *dbtemplate = NULL;
@@ -145,6 +753,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	char	   *dbcollversion = NULL;
 	int			notherbackends;
 	int			npreparedxacts;
+	CreateDBStrategy	dbstrategy = CREATEDB_WAL_LOG;
 	createdb_failure_params fparms;
 
 	/* Extract options from the statement node tree */
@@ -250,6 +859,12 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 						(errcode(ERRCODE_INVALID_PARAMETER_VALUE)),
 						errmsg("OIDs less than %u are reserved for system objects", FirstNormalObjectId));
 		}
+		else if (strcmp(defel->defname, "strategy") == 0)
+		{
+			if (dstrategy)
+				errorConflictingDefElem(defel, pstate);
+			dstrategy = defel;
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -374,6 +989,24 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 							dbtemplate)));
 	}
 
+	/* Validate the database creation strategy. */
+	if (dstrategy && dstrategy->arg)
+	{
+		char	*strategy;
+
+		strategy = defGetString(dstrategy);
+		if (strcmp(strategy, "wal_log") == 0)
+			dbstrategy = CREATEDB_WAL_LOG;
+		else if (strcmp(strategy, "file_copy") == 0)
+			dbstrategy = CREATEDB_FILE_COPY;
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("%s is not a valid create database strategy",
+							strategy),
+					 parser_errposition(pstate, dstrategy->location)));
+	}
+
 	/* If encoding or locales are defaulted, use source's setting */
 	if (encoding < 0)
 		encoding = src_encoding;
@@ -668,19 +1301,6 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	InvokeObjectPostCreateHook(DatabaseRelationId, dboid, 0);
 
 	/*
-	 * Force a checkpoint before starting the copy. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, if a checkpoint
-	 * happened while we're copying files, a file might be deleted just when
-	 * we're about to copy it, causing the lstat() call in copydir() to fail
-	 * with ENOENT.
-	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
-
-	/*
 	 * Once we start copying subdirectories, we need to be able to clean 'em
 	 * up if we fail.  Use an ENSURE block to make sure this happens.  (This
 	 * is not a 100% solution, because of the possibility of failure during
@@ -689,114 +1309,47 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	 */
 	fparms.src_dboid = src_dboid;
 	fparms.dest_dboid = dboid;
+	fparms.strategy = dbstrategy;
+
 	PG_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
 		/*
-		 * Iterate through all tablespaces of the template database, and copy
-		 * each one to the new database.
+		 * If the user has asked to create a database with WAL_LOG strategy
+		 * then call CopyDatabaseWithWal, which will copy the database at the
+		 * block level and it will WAL log each copied block.  Otherwise,
+		 * call CopyDatabase that will copy the database file by file.
 		 */
-		rel = table_open(TableSpaceRelationId, AccessShareLock);
-		scan = table_beginscan_catalog(rel, 0, NULL);
-		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+		if (dbstrategy == CREATEDB_WAL_LOG)
 		{
-			Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
-			Oid			srctablespace = spaceform->oid;
-			Oid			dsttablespace;
-			char	   *srcpath;
-			char	   *dstpath;
-			struct stat st;
-
-			/* No need to copy global tablespace */
-			if (srctablespace == GLOBALTABLESPACE_OID)
-				continue;
-
-			srcpath = GetDatabasePath(src_dboid, srctablespace);
-
-			if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
-				directory_is_empty(srcpath))
-			{
-				/* Assume we can ignore it */
-				pfree(srcpath);
-				continue;
-			}
-
-			if (srctablespace == src_deftablespace)
-				dsttablespace = dst_deftablespace;
-			else
-				dsttablespace = srctablespace;
-
-			dstpath = GetDatabasePath(dboid, dsttablespace);
+			CopyDatabaseWithWal(src_dboid, dboid, src_deftablespace,
+								dst_deftablespace);
 
 			/*
-			 * Copy this subdirectory to the new location
-			 *
-			 * We don't need to copy subdirectories
+			 * Close pg_database, but keep lock till commit.
 			 */
-			copydir(srcpath, dstpath, false);
-
-			/* Record the filesystem change in XLOG */
-			{
-				xl_dbase_create_rec xlrec;
-
-				xlrec.db_id = dboid;
-				xlrec.tablespace_id = dsttablespace;
-				xlrec.src_db_id = src_dboid;
-				xlrec.src_tablespace_id = srctablespace;
-
-				XLogBeginInsert();
-				XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-				(void) XLogInsert(RM_DBASE_ID,
-								  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-			}
+			table_close(pg_database_rel, NoLock);
 		}
-		table_endscan(scan);
-		table_close(rel, AccessShareLock);
+		else
+		{
+			Assert(dbstrategy == CREATEDB_FILE_COPY);
 
-		/*
-		 * We force a checkpoint before committing.  This effectively means
-		 * that committed XLOG_DBASE_CREATE operations will never need to be
-		 * replayed (at least not in ordinary crash recovery; we still have to
-		 * make the XLOG entry for the benefit of PITR operations). This
-		 * avoids two nasty scenarios:
-		 *
-		 * #1: When PITR is off, we don't XLOG the contents of newly created
-		 * indexes; therefore the drop-and-recreate-whole-directory behavior
-		 * of DBASE_CREATE replay would lose such indexes.
-		 *
-		 * #2: Since we have to recopy the source database during DBASE_CREATE
-		 * replay, we run the risk of copying changes in it that were
-		 * committed after the original CREATE DATABASE command but before the
-		 * system crash that led to the replay.  This is at least unexpected
-		 * and at worst could lead to inconsistencies, eg duplicate table
-		 * names.
-		 *
-		 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
-		 *
-		 * In PITR replay, the first of these isn't an issue, and the second
-		 * is only a risk if the CREATE DATABASE and subsequent template
-		 * database change both occur while a base backup is being taken.
-		 * There doesn't seem to be much we can do about that except document
-		 * it as a limitation.
-		 *
-		 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
-		 * we can avoid this.
-		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+			CopyDatabase(src_dboid, dboid, src_deftablespace,
+						 dst_deftablespace);
 
-		/*
-		 * Close pg_database, but keep lock till commit.
-		 */
-		table_close(pg_database_rel, NoLock);
+			/*
+			 * Close pg_database, but keep lock till commit.
+			 */
+			table_close(pg_database_rel, NoLock);
 
-		/*
-		 * Force synchronous commit, thus minimizing the window between
-		 * creation of the database files and committal of the transaction. If
-		 * we crash before committing, we'll have a DB that's taking up disk
-		 * space but is not in pg_database, which is not good.
-		 */
-		ForceSyncCommit();
+			/*
+			 * Force synchronous commit, thus minimizing the window between
+			 * creation of the database files and committal of the transaction.
+			 * If we crash before committing, we'll have a DB that's taking up
+			 * disk space but is not in pg_database, which is not good.
+			 */
+			ForceSyncCommit();
+		}
 	}
 	PG_END_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 								PointerGetDatum(&fparms));
@@ -870,6 +1423,21 @@ createdb_failure_callback(int code, Datum arg)
 	createdb_failure_params *fparms = (createdb_failure_params *) DatumGetPointer(arg);
 
 	/*
+	 * If we were copying database at block levels then drop pages for the
+	 * destination database that are in the shared buffer cache.  And tell
+	 * checkpointer to forget any pending fsync and unlink requests for
+	 * files in the database.  The reasoning behind doing this is same as
+	 * explained in dropdb function.  But unlike dropdb we don't need to call
+	 * pgstat_drop_database because this database is still not created so there
+	 * should not be any stat for this.
+	 */
+	if (fparms->strategy == CREATEDB_WAL_LOG)
+	{
+		DropDatabaseBuffers(fparms->dest_dboid);
+		ForgetDatabaseSyncRequests(fparms->dest_dboid);
+	}
+
+	/*
 	 * Release lock on source database before doing recursive remove. This is
 	 * not essential but it seems desirable to release the lock as soon as
 	 * possible.
@@ -2387,32 +2955,40 @@ dbase_redo(XLogReaderState *record)
 		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
 
-		/*
-		 * Our theory for replaying a CREATE is to forcibly drop the target
-		 * subdirectory if present, then re-copy the source data. This may be
-		 * more work than needed, but it is simple to implement.
-		 */
-		if (stat(dst_path, &st) == 0 && S_ISDIR(st.st_mode))
+		if (!OidIsValid(xlrec->src_db_id))
 		{
-			if (!rmtree(dst_path, true))
-				/* If this failed, copydir() below is going to error. */
-				ereport(WARNING,
-						(errmsg("some useless files may be left behind in old database directory \"%s\"",
-								dst_path)));
+			CreateDirAndVersionFile(dst_path, xlrec->db_id, xlrec->tablespace_id,
+									true);
 		}
+		else
+		{
+			/*
+			* Our theory for replaying a CREATE is to forcibly drop the target
+			* subdirectory if present, then re-copy the source data. This may be
+			* more work than needed, but it is simple to implement.
+			*/
+			if (stat(dst_path, &st) == 0 && S_ISDIR(st.st_mode))
+			{
+				if (!rmtree(dst_path, true))
+					/* If this failed, copydir() below is going to error. */
+					ereport(WARNING,
+							(errmsg("some useless files may be left behind in old database directory \"%s\"",
+									dst_path)));
+			}
 
-		/*
-		 * Force dirty buffers out to disk, to ensure source database is
-		 * up-to-date for the copy.
-		 */
-		FlushDatabaseBuffers(xlrec->src_db_id);
+			/*
+			* Force dirty buffers out to disk, to ensure source database is
+			* up-to-date for the copy.
+			*/
+			FlushDatabaseBuffers(xlrec->src_db_id);
 
-		/*
-		 * Copy this subdirectory to the new location
-		 *
-		 * We don't need to copy subdirectories
-		 */
-		copydir(src_path, dst_path, false);
+			/*
+			* Copy this subdirectory to the new location
+			*
+			* We don't need to copy subdirectories
+			*/
+			copydir(src_path, dst_path, false);
+		}
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{
diff --git a/src/include/commands/dbcommands_xlog.h b/src/include/commands/dbcommands_xlog.h
index 593a857..8f59870 100644
--- a/src/include/commands/dbcommands_xlog.h
+++ b/src/include/commands/dbcommands_xlog.h
@@ -20,6 +20,7 @@
 /* record types */
 #define XLOG_DBASE_CREATE		0x00
 #define XLOG_DBASE_DROP			0x10
+#define XLOG_DBASE_CREATEDIR	0x20
 
 typedef struct xl_dbase_create_rec
 {
@@ -30,6 +31,13 @@ typedef struct xl_dbase_create_rec
 	Oid			src_tablespace_id;
 } xl_dbase_create_rec;
 
+typedef struct xl_dbase_createdir_rec
+{
+	/* Records creating database directory */
+	Oid			db_id;
+	Oid			tablespace_id;
+} xl_dbase_createdir_rec;
+
 typedef struct xl_dbase_drop_rec
 {
 	Oid			db_id;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c44b5a9..5e34ea3 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -460,6 +460,7 @@ CoverPos
 CreateAmStmt
 CreateCastStmt
 CreateConversionStmt
+CreateDBRelInfo
 CreateDomainStmt
 CreateEnumStmt
 CreateEventTrigStmt
-- 
1.8.3.1

#125

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#124)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Tue, Feb 15, 2022 at 6:49 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Here is the updated version of the patch, the changes are 1) Fixed
review comments given by Robert and one open comment from Ashutosh.
2) Preserved the old create db method. 3) As agreed upthread for now
we are using the new strategy only for createdb not for movedb so I
have removed the changes in ForgetDatabaseSyncRequests() and
DropDatabaseBuffers(). 4) Provided a database creation strategy
option; for now I have kept it as below.

CREATE DATABASE ... WITH (STRATEGY = WAL_LOG); -- default if
option is omitted
CREATE DATABASE ... WITH (STRATEGY = FILE_COPY);

All right. I think there have been two design-level objections to this
patch, and this resolves one of them. The other one is trickier,
because AFAICT it's basically an opinion question: is accessing
pg_class in the template database from some backend that is connected
to another database too ugly to be acceptable? Several people have
expressed concerns about that, but it's not clear to me whether they
are essentially saying "that is not what I would do if I were doing
this project" or more like "if you commit something that does it that
way I will be enraged and demand an immediate revert and the removal
of your commit bit." If it's the former, I think it's possible to
clean up various details of these patches to make them look nicer than
they do at present and get something committed for PostgreSQL 15. But
if it is the latter then there's really no point to that kind of
cleanup work and we should probably just give up now. So, Andres,
Heikki, and anybody else with a strong opinion, can you clarify how
vigorously you hate this design, or don't?

My own opinion is that this is actually rather elegant. It just makes
sense to me that the right way to figure out what relations are in a
database is to get that list from the database rather than from the
filesystem. Nobody would accept the idea of making \d work by listing
out the directory contents rather than by walking pg_class, and so the
only reason we ought to accept that in the case of cloning a database
is if the code is too ugly any other way. But is it really? It's true
that this patch set does some refactoring of interfaces in order to
make that work, and there's a few things about how it does that that I
think could be improved, but on the whole, it seems like a
remarkably small amount of code to do something that we've long
considered absolutely taboo. Now, it's nowhere close to being
something that could be used to allow fully general cross-database
access, and there are severe problems with the idea of allowing any
such thing. In particular, there are various places that test for
connections to a database, and aren't going to expect processes
not connected to the database to be touching it. My belief is that a
heavyweight lock on the database is a suitable surrogate, because we
actually take such a lock when connecting to a database, and that
forms part of the regular interlock. Taking such locks routinely for
short periods would be expensive and might create other problems, but
doing it for a maintenance operation seems OK. Also, if we wanted to
actually support full cross-database access, locking wouldn't be the
only problem by far. We'd have to deal with things like the relcache
and the catcache, which would be hard, and might increase the cost of
very common things that we need to be cheap. But none of that is
implicated in this patch, which only generalizes code paths that are
not so commonly taken as to pose a problem, and manages to reuse quite
a bit of code rather than introducing entirely new code to do the same
things.

It does introduce some new code here and there, though: there isn't
zero duplication. The biggest chunk of that FWICS is in 0006, in
GetDatabaseRelationList and GetRelListFromPage. I just can't get
excited about that. It's literally about two screens worth of code.
We're not talking about duplicating src/backend/access/heapam or
something like that. I do think it would be a good idea to split it up
just a bit more: I think the code inside GetRelListFromPage that is
guarded by HeapTupleSatisfiesVisibility() could be moved into a
separate subroutine, and I think that would likely look a bit nicer.
But fundamentally I just don't see a huge issue here. That is not to
say there isn't a huge issue here: just that I don't see it.

Comments?

--
Robert Haas
EDB: http://www.enterprisedb.com

#126

Andres Freund

andres@anarazel.de

almost 4 years ago

In reply to: Robert Haas (#125)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hi,

On 2022-02-17 14:27:09 -0500, Robert Haas wrote:

The other one is trickier, because AFAICT it's basically an opinion
question: is accessing pg_class in the template database from some backend
that is connected to another database too ugly to be acceptable? Several
people have expressed concerns about that, but it's not clear to me whether
they are essentially saying "that is not what I would do if I were doing
this project" or more like "if you commit something that does it that way I
will be enraged and demand an immediate revert and the removal of your
commit bit." If it's the former, I think it's possible to clean up various
details of these patches to make them look nicer than they do at present and
get something committed for PostgreSQL 15.

Could you or Dilip outline how it now works, and what exactly makes it safe
etc (e.g. around locking, invalidation processing, snapshots, xid horizons)?

I just scrolled through the patchset without finding such an explanation, so
it's a bit hard to judge.

But if it is the latter then there's really no point to that kind of cleanup
work and we should probably just give up now.

This thread is long. Could you summarize what led you to consider other
approaches (e.g. looking in the filesystem for relfilenodes) as not feasible /
too ugly / ...?

Greetings,

Andres Freund

#127

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Andres Freund (#126)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Feb 17, 2022 at 4:13 PM Andres Freund <andres@anarazel.de> wrote:

Could you or Dilip outline how it now works, and what exactly makes it safe
etc (e.g. around locking, invalidation processing, snapshots, xid horizons)?

I just scrolled through the patchset without finding such an explanation, so
it's a bit hard to judge.

That's a good question and it's making me think about a few things I
hadn't considered before.

Dilip can add more here, but my impression is that most problems are
prevented by the fact that CREATE DATABASE, with or without this patch,
starts by acquiring a ShareLock on the database, preventing new connections,
and then making sure there are no existing connections. That means nothing
in the target database can be changing, which I think makes a lot of
the stuff on your list a non-issue. Any problems that remain have to
be the result of something that CREATE DATABASE does having a bad
interaction either with something that is completed beforehand or
something that begins afterward. There just can't be overlap, and I
think that rules out most problems.

Now you pointed out earlier one problem that it doesn't fix: unlike
the current method, this method involves reading buffers from the
template database into shared_buffers, and those buffers, once read,
stick around even after the operation finishes. That's not an
intrinsic problem, though, because a connection to the database could
do the same thing. However, again as you pointed out, it is a problem,
though, if we do it with less locking than a real database connection
would have done. It seems to me that if there are other problems here,
they have to be basically of the same sort: they have to leave the
system in some state which is otherwise impossible. Do you see some
other kind of hazard, or more examples of how that could happen? I
think the leftover buffers in shared_buffers have to be basically the
only thing, because apart from that, how is this any different than a
file copy?

The only other kind of hazard I can think of is: could it be unsafe to
try to interpret the contents of a database to which no one else is
connected at present due to any of the issues you mention? But what's
the hazard exactly? It can't be a problem if we've failed to process
sinval messages for the target database, because we're not using any
caches anyway. We can't. It can't be unsafe to test visibility of XIDs
for that database, because in an alternate universe some backend could
have connected to that database and seen the same XIDs. One thing we
COULD be doing wrong is using the wrong snapshot to test the
visibility of XIDs. The patch uses GetActiveSnapshot(), and I'm
thinking that is probably wrong. Shouldn't it be GetLatestSnapshot()?
And do we need to worry about snapshots being database-specific? Maybe
so.

But if it is the latter then there's really no point to that kind of cleanup
work and we should probably just give up now.

This thread is long. Could you summarize what led you to consider other
approaches (e.g. looking in the filesystem for relfilenodes) as not feasible /
too ugly / ...?

I don't think it's infeasible to look at the filesystem for files and
just copy whatever files we find. It's a plausible alternate design. I
just don't like it as well. I think that relying on the filesystem
contents to tell us what's going on is kind of hacky. The only
technical issue I see there is that the WAL logging might require more
kludgery, since that mechanism is kind of intertwined with
shared_buffers. You'd have to get the right block references into the
WAL record, and you have to make sure that checkpoints don't move the
redo pointer at an inopportune moment.

--
Robert Haas
EDB: http://www.enterprisedb.com

#128

Andres Freund

andres@anarazel.de

almost 4 years ago

In reply to: Robert Haas (#127)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hi,

On 2022-02-17 18:00:19 -0500, Robert Haas wrote:

Now you pointed out earlier one problem that it doesn't fix: unlike
the current method, this method involves reading buffers from the
template database into shared_buffers, and those buffers, once read,
stick around even after the operation finishes.

Yea, I don't see a problem with that. A concurrent DROP DATABASE or such would
be problematic, but the locking prevents that.

The only other kind of hazard I can think of is: could it be unsafe to
try to interpret the contents of a database to which no one else is
connected at present due to any of the issues you mention? But what's
the hazard exactly?

I don't quite know. But I don't think it's impossible to run into trouble in
this area. E.g. xid horizons are computed in a database specific way. If the
routine reading pg_class did hot pruning, you could end up removing data
that's actually needed for a logical slot in the other database because the
backend local horizon state was computed for the "local" database?

Could there be problems because other backends wouldn't see the backend
accessing the other database as being connected to that database
(PGPROC->databaseId)?

Or what if somebody optimized snapshots to disregard readonly transactions in
other databases?

It can't be a problem if we've failed to process sinval messages for the
target database, because we're not using any caches anyway.

Could you end up with an outdated relmap entry? Probably not.

We can't. It can't be unsafe to test visibility of XIDs for that database,
because in an alternate universe some backend could have connected to that
database and seen the same XIDs.

That's a weak argument, because in that alternative universe a PGPROC entry
with the PGPROC->databaseId = template_databases_oid would exist.

Greetings,

Andres Freund

#129

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Robert Haas (#127)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Fri, Feb 18, 2022 at 4:30 AM Robert Haas <robertmhaas@gmail.com> wrote:

This thread is long. Could you summarize what led you to consider other
approaches (e.g. looking in the filesystem for relfilenodes) as not feasible /
too ugly / ...?

I don't think it's infeasible to look at the filesystem for files and
just copy whatever files we find. It's a plausible alternate design. I
just don't like it as well. I think that relying on the filesystem
contents to tell us what's going on is kind of hacky. The only
technical issue I see there is that the WAL logging might require more
kludgery, since that mechanism is kind of intertwined with
shared_buffers. You'd have to get the right block references into the
WAL record, and you have to make sure that checkpoints don't move the
redo pointer at an inopportune moment.

Actually, based on the previous discussion, I also tried to write the
POC with the file system scanning approach to identify the relations to
be copied; see patch 0007 in this thread [1]. Later we identified
one issue [2]: while directly scanning the disk files we will only
know the relfilenode, but we cannot identify the relation OID, which
means we cannot lock the relation. Now, I am not saying that there
is no way to work around that issue, but that was also one of the
reasons for not pursuing that approach.
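
To illustrate the limitation, here is a minimal sketch in C of what a
directory scan can recover from a file name alone. It assumes the usual
on-disk naming of "<relfilenode>[_<fork>][.<segno>]"; the type
DemoRelFile and the function demo_parse_relfile_name() are hypothetical
names made up for this example and are not part of the patch set.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical type for this sketch; not a PostgreSQL struct. */
typedef struct DemoRelFile
{
	unsigned int relfilenode;	/* numeric part of the file name */
	char		fork[8];		/* "main", "fsm", "vm" or "init" */
	unsigned int segno;			/* 0 for the first segment */
} DemoRelFile;

/*
 * Parse "<relfilenode>[_<fork>][.<segno>]".  Returns false for any other
 * name, e.g. PG_VERSION or pg_filenode.map.  Note what is missing from the
 * result: there is no relation OID, only the relfilenode.
 */
static bool
demo_parse_relfile_name(const char *name, DemoRelFile *out)
{
	const char *p = name;
	unsigned long long node = 0;
	unsigned long long seg = 0;

	assert(name != NULL && out != NULL);

	/* The name must start with the decimal relfilenode. */
	if (!isdigit((unsigned char) *p))
		return false;
	while (isdigit((unsigned char) *p))
	{
		node = node * 10 + (unsigned long long) (*p - '0');
		if (node > 0xFFFFFFFFULL)
			return false;		/* does not fit in 32 bits */
		p++;
	}

	/* Optional fork suffix; no suffix means the main fork. */
	strcpy(out->fork, "main");
	if (*p == '_')
	{
		size_t		len = strcspn(p + 1, ".");

		if (len == 3 && strncmp(p + 1, "fsm", 3) == 0)
			strcpy(out->fork, "fsm");
		else if (len == 2 && strncmp(p + 1, "vm", 2) == 0)
			strcpy(out->fork, "vm");
		else if (len == 4 && strncmp(p + 1, "init", 4) == 0)
			strcpy(out->fork, "init");
		else
			return false;
		p += 1 + len;
	}

	/* Optional segment number. */
	if (*p == '.')
	{
		p++;
		if (!isdigit((unsigned char) *p))
			return false;
		while (isdigit((unsigned char) *p))
		{
			seg = seg * 10 + (unsigned long long) (*p - '0');
			if (seg > 0xFFFFFFFFULL)
				return false;
			p++;
		}
	}

	if (*p != '\0')
		return false;

	out->relfilenode = (unsigned int) node;
	out->segno = (unsigned int) seg;
	return true;
}
```

Everything the parser returns is derived from the file name, so mapping a
relfilenode back to a relation OID would still need pg_class or the relmap
file, which is the information the scan was trying to avoid reading.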

[1]: /messages/by-id/CAFiTN-v1KYsVAhq_fOWFa27LZiw9uK4n4cz5XmQJxJpsVcfq1w@mail.gmail.com
[2]: /messages/by-id/CAFiTN-v=U58by_BeiZruNhykxk1q9XUxF+qLzD2LZAsEn2EBkg@mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#130

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Andres Freund (#128)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Fri, Feb 18, 2022 at 5:09 AM Andres Freund <andres@anarazel.de> wrote:

Thanks a lot Andres for taking time to read the thread and patch.

The only other kind of hazard I can think of is: could it be unsafe to
try to interpret the contents of a database to which no one else is
connected at present due to any of the issues you mention? But what's
the hazard exactly?

I don't quite know. But I don't think it's impossible to run into trouble in
this area. E.g. xid horizons are computed in a database specific way. If the
routine reading pg_class did hot pruning, you could end up removing data
that's actually needed for a logical slot in the other database because the
backend local horizon state was computed for the "local" database?

I agree that while computing the xid horizon (ComputeXidHorizons()),
we only consider the backends that are connected to the same database
to which we are connected. But we don't need to worry here because we
know that there can be absolutely no backend connected to
the database we are trying to copy, so we don't need to worry about
pruning tuples that are visible to other backends.

Now, if we are worried about the replication slots, we also
consider the xmin horizon from the replication slots when computing
the xid horizon, so I don't think we have any problem here either.
We also consider the walsenders when computing the xid horizon.
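
The argument above can be sketched as a toy model in C. This is not
ComputeXidHorizons(): it ignores XID wraparound, catalog horizons, and
backends without a database, and the names DemoBackend and
demo_data_horizon() are hypothetical. It only shows the shape of the
reasoning: backends are filtered by database, while the slot xmin is
applied regardless of database.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_INVALID_XID 0u		/* "no xmin set" in this sketch */

/* Hypothetical, simplified stand-in for a PGPROC entry. */
typedef struct DemoBackend
{
	uint32_t	database_id;	/* database the backend is connected to */
	uint32_t	xmin;			/* oldest XID it still needs, or invalid */
} DemoBackend;

/*
 * Oldest XID whose effects must be kept for my_database_id.
 * Assumes XIDs compare as plain integers (no wraparound).
 */
static uint32_t
demo_data_horizon(const DemoBackend *backends, size_t nbackends,
				  uint32_t my_database_id, uint32_t slot_xmin,
				  uint32_t next_xid)
{
	uint32_t	horizon = next_xid;
	size_t		i;

	assert(backends != NULL || nbackends == 0);

	/* Only backends connected to the same database hold the horizon back. */
	for (i = 0; i < nbackends; i++)
	{
		if (backends[i].database_id != my_database_id)
			continue;
		if (backends[i].xmin != DEMO_INVALID_XID &&
			backends[i].xmin < horizon)
			horizon = backends[i].xmin;
	}

	/* The replication slot xmin applies no matter which database we use. */
	if (slot_xmin != DEMO_INVALID_XID && slot_xmin < horizon)
		horizon = slot_xmin;

	return horizon;
}
```

In this model, a database with no connected backends gets a horizon of
min(next_xid, slot_xmin), and a horizon computed for any other database
can only be equal or older, so it is never more aggressive than the one
the template database would have had.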

Could there be problems because other backends wouldn't see the backend
accessing the other database as being connected to that database
(PGPROC->databaseId)?

You mean that other backends will not consider this backend (which is
copying the database) as connected to the source database? Yes, that is
correct, but what is the problem with that? Other backends cannot
connect to the source database, so what problem can they create for the
backend that is copying the database?

Or what if somebody optimized snapshots to disregard readonly transactions in
other databases?

Can you elaborate on this point?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#131

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Andres Freund (#128)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Feb 17, 2022 at 6:39 PM Andres Freund <andres@anarazel.de> wrote:

The only other kind of hazard I can think of is: could it be unsafe to
try to interpret the contents of a database to which no one else is
connected at present due to any of the issues you mention? But what's
the hazard exactly?

I don't quite know. But I don't think it's impossible to run into trouble in
this area. E.g. xid horizons are computed in a database specific way. If the
routine reading pg_class did hot pruning, you could end up removing data
that's actually needed for a logical slot in the other database because the
backend local horizon state was computed for the "local" database?

Yeah, but it doesn't -- and shouldn't. There's no HeapScanDesc here,
so we can't accidentally wander into heap_page_prune_opt(). We should
document the things we're thinking about here in the comments to
prevent future mistakes, but I think for the moment we are OK.

Could there be problems because other backends wouldn't see the backend
accessing the other database as being connected to that database
(PGPROC->databaseId)?

I think that if there's any hazard here, it must be related to
snapshots, which brings us to your next point:

Or what if somebody optimized snapshots to disregard readonly transactions in
other databases?

So there are two related questions here. One is whether the snapshot
that we're using to access the template database can be unsafe, and
the other is whether the read-only access that we're performing inside
the template database could mess up somebody else's snapshot. Let's
deal with the second point first: nobody else knows that we're reading
from the template database, and nobody else is reading from the
template database except possibly for someone who is doing exactly
what we're doing. Therefore, I think this hazard can be ruled out.

On the first point, a key point in my opinion is that there can be no
in-flight transactions in the template database, because nobody is
connected to it, and prepared transactions in a template database are
verboten. It therefore can't matter if we include too few XIDs in our
snapshot, or if our xmin is too new. The reverse case can matter,
though: if the xmin of our snapshot were too old, or if we had extra
XIDs in our snapshot, then we might think that a transaction is still
in progress when it isn't. Therefore, I think the patch is wrong to
use GetActiveSnapshot() and must use GetLatestSnapshot() *after* it's
finished making sure that nobody is using the template database. I
don't think there's a hazard beyond that, though. Let's consider the
two ways in which things could go wrong:

1. Extra XIDs in the snapshot. Any current or future optimization of
snapshots would presumably be trying to make them smaller by removing
XIDs from the snapshot, not making them bigger by adding XIDs to the
snapshot. I guess in theory you can imagine an optimization that tests
for the presence of XIDs by some method other than scanning through an
array, and which feels free to add XIDs to the snapshot if they "can't
matter," but I think it's up to the author of that hypothetical future
patch to make sure they don't break anything in so doing -- especially
because it's entirely possible for our session to see XIDs used by a
backend in some other database, because they could show up in shared
catalogs. I think that's why, as far as I can tell, we only use the
database ID when setting pruning thresholds, and not for snapshots.

2. xmin of snapshot too new. There are no in-progress transactions in
the template database, so how can this even happen? If we set the xmin
"in the future," then we could get confused about what's visible due
to wraparound, but that seems crazy. I don't see how there can be a
problem here.
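
As a rough illustration of why extra XIDs or a too-old xmin are the
dangerous direction, here is a toy model of an MVCC visibility test in C.
It is only a sketch: it ignores wraparound, subtransactions, and hint
bits, and DemoSnapshot, demo_xid_in_progress() and demo_tuple_visible()
are hypothetical names rather than PostgreSQL's real structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified snapshot: XIDs compare as plain integers. */
typedef struct DemoSnapshot
{
	uint32_t	xmin;			/* all XIDs below this had finished */
	uint32_t	xmax;			/* all XIDs at or above this are "future" */
	const uint32_t *xip;		/* XIDs in [xmin, xmax) seen as running */
	size_t		xcnt;
} DemoSnapshot;

/* Does the snapshot treat this XID as still in progress? */
static bool
demo_xid_in_progress(uint32_t xid, const DemoSnapshot *snap)
{
	size_t		i;

	assert(snap != NULL);

	if (xid < snap->xmin)
		return false;			/* finished before the snapshot was taken */
	if (xid >= snap->xmax)
		return true;			/* started after the snapshot was taken */
	for (i = 0; i < snap->xcnt; i++)
	{
		if (snap->xip[i] == xid)
			return true;
	}
	return false;
}

/* A tuple is visible if its inserter committed and is not "in progress". */
static bool
demo_tuple_visible(uint32_t tuple_xmin, bool committed,
				   const DemoSnapshot *snap)
{
	return committed && !demo_xid_in_progress(tuple_xmin, snap);
}
```

With a snapshot taken after XID 100 committed (xmin = xmax = 101, empty
xip), a tuple inserted by XID 100 is visible. With an older snapshot that
still lists XID 100 as running, the same committed tuple is treated as
invisible, which is the failure mode a stale active snapshot could cause
when scanning pg_class.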

It can't be a problem if we've failed to process sinval messages for the
target database, because we're not using any caches anyway.

Could you end up with an outdated relmap entry? Probably not.

Again, we're not relying on caching -- we read the file.

We can't. It can't be unsafe to test visibility of XIDs for that database,
because in an alternate universe some backend could have connected to that
database and seen the same XIDs.

That's a weak argument, because in that alternative universe a PGPROC entry
with the PGPROC->databaseId = template_databases_oid would exist.

So what? As I also argue above, I don't think that affects snapshot
generation, and if it did it wouldn't matter anyway, because it could
only remove in-progress transactions from the snapshot, and there
aren't any in the template database anyhow.

Another way of looking at this is: we could just as well use
SnapshotSelf or (if it still existed) SnapshotNow to test visibility.
In a world where there are no transactions in flight, it's the same
thing. In fact, maybe we should do it that way, just to make it
clearer what's happening.

I think these are really good questions you are raising, so I'm not
trying to be dismissive. But after some thought I'm not yet seeing any
problems (other than the use of GetActiveSnapshot).

--
Robert Haas
EDB: http://www.enterprisedb.com

#132

Ashutosh Sharma

ashu.coek88@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#124)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

I'm not sure about the current status, but found it while playing
around with the latest changes a bit, so thought of sharing it here.

+      <varlistentry>
+       <term><replaceable class="parameter">strategy</replaceable></term>
+       <listitem>
+        <para>
+         This is used for copying the database directory.  Currently, we have
+         two strategies the <literal>WAL_LOG_BLOCK</literal> and the

Isn't it wal_log instead of wal_log_block?

I think when a user inputs a wrong strategy with the createdb command, we
should provide a hint message showing the allowed values for the strategy
along with the error message. This will be helpful for the users.
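
Something along these lines is what I have in mind, written as a
standalone sketch in C rather than as backend code. The enum, the
function names, and the exact message wording are all hypothetical; in
the backend this would presumably be an ereport() with an errhint()
instead of formatting into a buffer.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical enum for this sketch; the patch has its own definitions. */
typedef enum DemoStrategy
{
	DEMO_STRATEGY_WAL_LOG,
	DEMO_STRATEGY_FILE_COPY,
	DEMO_STRATEGY_INVALID
} DemoStrategy;

/* Case-insensitive string equality, to avoid non-standard helpers. */
static int
demo_equal_nocase(const char *a, const char *b)
{
	while (*a != '\0' && *b != '\0')
	{
		if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
			return 0;
		a++;
		b++;
	}
	return *a == '\0' && *b == '\0';
}

/*
 * Map the option value to a strategy.  On failure, write an error plus a
 * hint listing the allowed values into msg.
 */
static DemoStrategy
demo_parse_strategy(const char *value, char *msg, size_t msglen)
{
	assert(value != NULL && msg != NULL && msglen > 0);

	msg[0] = '\0';
	if (demo_equal_nocase(value, "wal_log"))
		return DEMO_STRATEGY_WAL_LOG;
	if (demo_equal_nocase(value, "file_copy"))
		return DEMO_STRATEGY_FILE_COPY;

	/* Hypothetical wording, for illustration only. */
	snprintf(msg, msglen,
			 "ERROR:  invalid create database strategy \"%s\"\n"
			 "HINT:  Valid strategies are \"wal_log\" and \"file_copy\".",
			 value);
	return DEMO_STRATEGY_INVALID;
}
```

With this, a value such as wal_log_block would be rejected and the user
would immediately see which spellings are accepted.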

On Tue, Feb 15, 2022 at 5:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Feb 15, 2022 at 2:01 AM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:

Here is the updated version of the patch, the changes are 1) Fixed
review comments given by Robert and one open comment from Ashutosh.
2) Preserved the old create db method. 3) As agreed upthread for now
we are using the new strategy only for createdb not for movedb so I
have removed the changes in ForgetDatabaseSyncRequests() and
DropDatabaseBuffers(). 4) Provided a database creation strategy
option; for now I have kept it as below.

CREATE DATABASE ... WITH (STRATEGY = WAL_LOG); -- default if
option is omitted
CREATE DATABASE ... WITH (STRATEGY = FILE_COPY);

I have updated the document but I was not sure how much internal
information to be exposed to the user so I will work on that based on
feedback from others.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#133

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Ashutosh Sharma (#132)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Tue, Feb 22, 2022 at 8:27 PM Ashutosh Sharma <ashu.coek88@gmail.com>
wrote:

I'm not sure about the current status, but found it while playing
around with the latest changes a bit, so thought of sharing it here.

+      <varlistentry>
+       <term><replaceable class="parameter">strategy</replaceable></term>
+       <listitem>
+        <para>
+         This is used for copying the database directory.  Currently, we have
+         two strategies the <literal>WAL_LOG_BLOCK</literal> and the

Isn't it wal_log instead of wal_log_block?

I think when users input wrong strategy with createdb command, we
should provide a hint message showing allowed values for strategy
types along with an error message. This will be helpful for the users.

I will fix these two comments while posting the next version.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#134

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#133)

6 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Tue, Mar 1, 2022 at 5:15 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Feb 22, 2022 at 8:27 PM Ashutosh Sharma <ashu.coek88@gmail.com>
wrote:

I'm not sure about the current status, but found it while playing
around with the latest changes a bit, so thought of sharing it here.

+      <varlistentry>
+       <term><replaceable class="parameter">strategy</replaceable></term>
+       <listitem>
+        <para>
+         This is used for copying the database directory.  Currently, we have
+         two strategies the <literal>WAL_LOG_BLOCK</literal> and the

Isn't it wal_log instead of wal_log_block?

I think when users input wrong strategy with createdb command, we
should provide a hint message showing allowed values for strategy
types along with an error message. This will be helpful for the users.

I will fix these two comments while posting the next version.

The new version of the patch fixes the two comments pointed out by Ashutosh,
splits the GetRelListFromPage() function as suggested by Robert, and
uses the latest snapshot instead of the active snapshot for scanning
pg_class, as pointed out by Robert. I haven't yet added a test case that
creates a database using the new strategy option. If we are okay with
the two options FILE_COPY and WAL_LOG, I will add test cases for them.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v10-0002-Extend-relmap-interfaces.patch (text/x-patch; charset=US-ASCII)

From ba8e991e9b24ed8986d4504c44ce85f78cc4a598 Mon Sep 17 00:00:00 2001
From: dilipkumar <dilipbalaut@gmail.com>
Date: Mon, 4 Oct 2021 13:50:44 +0530
Subject: [PATCH v10 2/6] Extend relmap interfaces

Support new interfaces in relmapper, 1) Support copying the
relmap file from one database path to the other database path.
2) And another interface for getting filenode from oid.  We already
have RelationMapOidToFilenode for the same purpose but that assumes
we are connected to the database for which we want to get the mapping.
So this new interface will do the same but instead, it will get the
mapping for the input database.

These interfaces are required for next patch, for supporting the
wal logged created database.
---
 src/backend/utils/cache/relmapper.c | 123 +++++++++++++++++++++++++++++++-----
 src/include/utils/relmapper.h       |   6 +-
 2 files changed, 113 insertions(+), 16 deletions(-)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 56495f0..86a85c8 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -141,7 +141,7 @@ static void read_relmap_file(char *mapfilename, RelMapFile *map,
 static void write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
 									   bool write_wal, bool send_sinval,
 									   bool preserve_files, Oid dbid, Oid tsid,
-									   const char *dbpath);
+									   const char *dbpath, bool create);
 static void load_relmap_file(bool shared, bool lock_held);
 static void write_relmap_file(bool shared, RelMapFile *newmap,
 							  bool write_wal, bool send_sinval, bool preserve_files,
@@ -256,6 +256,37 @@ RelationMapFilenodeToOid(Oid filenode, bool shared)
 }
 
 /*
+ * RelationMapOidToFilenodeForDatabase
+ *
+ * Same as RelationMapOidToFilenode, but instead of reading the mapping from
+ * the database we are connected to it will read the mapping from the input
+ * database.
+ */
+Oid
+RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId)
+{
+	RelMapFile	map;
+	int			i;
+	char		mapfilename[MAXPGPATH];
+
+	/* Relmap file path for the given dbpath. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 dbpath, RELMAPPER_FILENAME);
+
+	/* Read the relmap file from the source database. */
+	read_relmap_file(mapfilename, &map, false);
+
+	/* Iterate over the relmap entries to find the input relation oid. */
+	for (i = 0; i < map.num_mappings; i++)
+	{
+		if (relationId == map.mappings[i].mapoid)
+			return map.mappings[i].mapfilenode;
+	}
+
+	return InvalidOid;
+}
+
+/*
  * RelationMapUpdateMap
  *
  * Install a new relfilenode mapping for the specified relation.
@@ -693,7 +724,43 @@ RestoreRelationMap(char *startAddress)
 }
 
 /*
- * read_relmap_file -- read data from given mapfilename file.
+ * CopyRelationMap
+ *
+ * Copy relmapfile from source db path to the destination db path and WAL log
+ * the operation.  This function is only called during the create database, so
+ * the destination database is not yet visible to anyone else, thus we don't
+ * need to acquire the relmap lock while updating the destination relmap.
+ */
+void
+CopyRelationMap(Oid dbid, Oid tsid, char *srcdbpath, char *dstdbpath)
+{
+	RelMapFile map;
+	char mapfilename[MAXPGPATH];
+
+	/* Relmap file path of the source database. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 srcdbpath, RELMAPPER_FILENAME);
+
+	/* Read the relmap file from the source database. */
+	read_relmap_file(mapfilename, &map, false);
+
+	/* Relmap file path of the destination database. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 dstdbpath, RELMAPPER_FILENAME);
+
+	/*
+	 * Write the map contents into the destination database's relmap file.
+	 * write_relmap_file_internal expects the CRC to have been computed and
+	 * stored in the input map.  Since we read this map from the source
+	 * database and write it to the destination file unmodified, the CRC
+	 * is still valid and we don't need to recompute it.
+	 */
+	write_relmap_file_internal(mapfilename, &map, true, false, true, dbid,
+							   tsid, dstdbpath, true);
+}
+
+/*
+ * read_relmap_file - read data from given mapfilename file.
  *
  * Because the map file is essential for access to core system catalogs,
  * failure to read it is a fatal error.
@@ -796,15 +863,18 @@ load_relmap_file(bool shared, bool lock_held)
 }
 
 /*
- * Helper function for write_relmap_file, Read comments atop write_relmap_file
- * for more details.  The CRC should be computed by the caller and stored in
- * the newmap.
+ * Helper function for write_relmap_file and CopyRelationMap.  Read the
+ * comments atop write_relmap_file for more details.  The CRC should be
+ * computed by the caller and stored in the newmap.
+ *
+ * Pass create = true if we are copying the relmap file during the CREATE
+ * DATABASE command.
  */
 static void
 write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
 						   bool write_wal, bool send_sinval,
 						   bool preserve_files, Oid dbid, Oid tsid,
-						   const char *dbpath)
+						   const char *dbpath, bool create)
 {
 	int			fd;
 
@@ -830,6 +900,7 @@ write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
 		xlrec.dbid = dbid;
 		xlrec.tsid = tsid;
 		xlrec.nbytes = sizeof(RelMapFile);
+		xlrec.create = create;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&xlrec), MinSizeOfRelmapUpdate);
@@ -971,7 +1042,7 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 	/* Write the map to the relmap file. */
 	write_relmap_file_internal(mapfilename, newmap, write_wal,
 							   send_sinval, preserve_files, dbid, tsid,
-							   dbpath);
+							   dbpath, false);
 
 	/*
 	 * Success, update permanent copy.  During bootstrap, we might be working
@@ -1063,15 +1134,37 @@ relmap_redo(XLogReaderState *record)
 		 * Write out the new map and send sinval, but of course don't write a
 		 * new WAL entry.  There's no surrounding transaction to tell to
 		 * preserve files, either.
-		 *
-		 * There shouldn't be anyone else updating relmaps during WAL replay,
-		 * but grab the lock to interlock against load_relmap_file().
 		 */
-		LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
-		write_relmap_file((xlrec->dbid == InvalidOid), &newmap,
-						  false, true, false,
-						  xlrec->dbid, xlrec->tsid, dbpath);
-		LWLockRelease(RelationMappingLock);
+		if (!xlrec->create)
+		{
+			/*
+			 * There shouldn't be anyone else updating relmaps during WAL
+			 * replay, but grab the lock to interlock against
+			 * load_relmap_file().
+			 */
+			LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
+			write_relmap_file((xlrec->dbid == InvalidOid), &newmap,
+							false, true, false,
+							xlrec->dbid, xlrec->tsid, dbpath);
+			LWLockRelease(RelationMappingLock);
+		}
+		else
+		{
+			char		mapfilename[MAXPGPATH];
+
+			/* Construct the mapfilename. */
+			snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+					 dbpath, RELMAPPER_FILENAME);
+
+			/*
+			 * We don't need to take the relmap lock because this WAL record
+			 * is logged while creating a new database, so no one else can
+			 * be reading or writing the relmap file.
+			 */
+			write_relmap_file_internal(mapfilename, &newmap, false, false,
+									   false, xlrec->dbid, xlrec->tsid, dbpath,
+									   true);
+		}
 
 		pfree(dbpath);
 	}
diff --git a/src/include/utils/relmapper.h b/src/include/utils/relmapper.h
index 9fbb5a7..e5635bd 100644
--- a/src/include/utils/relmapper.h
+++ b/src/include/utils/relmapper.h
@@ -29,6 +29,7 @@ typedef struct xl_relmap_update
 	Oid			dbid;			/* database ID, or 0 for shared map */
 	Oid			tsid;			/* database's tablespace, or pg_global */
 	int32		nbytes;			/* size of relmap data */
+	bool		create;			/* true if creating new relmap */
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } xl_relmap_update;
 
@@ -39,6 +40,8 @@ extern Oid	RelationMapOidToFilenode(Oid relationId, bool shared);
 
 extern Oid	RelationMapFilenodeToOid(Oid relationId, bool shared);
 
+extern Oid RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId);
+
 extern void RelationMapUpdateMap(Oid relationId, Oid fileNode, bool shared,
 								 bool immediate);
 
@@ -62,7 +65,8 @@ extern void RelationMapInitializePhase3(void);
 extern Size EstimateRelationMapSpace(void);
 extern void SerializeRelationMap(Size maxSize, char *startAddress);
 extern void RestoreRelationMap(char *startAddress);
-
+extern void CopyRelationMap(Oid dbid, Oid tsid, char *srcdbpath,
+							char *dstdbpath);
 extern void relmap_redo(XLogReaderState *record);
 extern void relmap_desc(StringInfo buf, XLogReaderState *record);
 extern const char *relmap_identify(uint8 info);
-- 
1.8.3.1
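
For readers following the relmap changes in 0002: once the map file has
been read, the lookup done by RelationMapOidToFilenodeForDatabase is a
linear scan over a small in-memory array. A minimal standalone sketch,
assuming simplified stand-in types (DemoMapping, DemoRelMapFile,
DEMO_MAX_MAPPINGS and demo_oid_to_filenode are hypothetical names, not
PostgreSQL's definitions) and leaving out the file read and CRC check:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t Oid;				/* stand-in for PostgreSQL's Oid */
#define InvalidOid ((Oid) 0)
#define DEMO_MAX_MAPPINGS 62		/* hypothetical capacity for this sketch */

/* One (relation OID -> filenode) pair, as in the relmap file. */
typedef struct DemoMapping
{
	Oid			mapoid;
	Oid			mapfilenode;
} DemoMapping;

/* Simplified map: only the fields needed for the lookup. */
typedef struct DemoRelMapFile
{
	int32_t		num_mappings;
	DemoMapping mappings[DEMO_MAX_MAPPINGS];
} DemoRelMapFile;

/*
 * Return the filenode mapped to relationId, or InvalidOid if the
 * relation is not a mapped relation in this map.
 */
static Oid
demo_oid_to_filenode(const DemoRelMapFile *map, Oid relationId)
{
	assert(map->num_mappings >= 0 &&
		   map->num_mappings <= DEMO_MAX_MAPPINGS);

	for (int i = 0; i < map->num_mappings; i++)
	{
		if (map->mappings[i].mapoid == relationId)
			return map->mappings[i].mapfilenode;
	}

	return InvalidOid;
}
```

Because the map is passed in rather than taken from the backend's own
shared_map or local_map, the same lookup works for a database the
backend is not connected to, which is the point of the new interface.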

v10-0001-Refactor-relmap-load-and-relmap-write-functions.patch (text/x-patch; charset=US-ASCII)

From 37b67898404982119116cea73eab2bf750472c69 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Wed, 1 Sep 2021 14:06:29 +0530
Subject: [PATCH v10 1/6] Refactor relmap load and relmap write functions

Currently, the relmap read and write interfaces are tightly
coupled to the shared_map and local_map of the database the
backend is connected to.  The later patches in this set need
interfaces that can read a relmap into any caller-supplied memory
and can write out a map passed in by the caller.

So this patch refactors the existing code to expose read and
write interfaces that are independent of shared_map and
local_map, without changing any logic.

XXX For code simplicity, write_relmap_file now updates the
permanent in-memory copy outside the critical section.  By then
the on-disk changes are already done and this is only a memory
change, so there is no reason for it to be in the critical
section.
---
 src/backend/utils/cache/relmapper.c | 163 ++++++++++++++++++++++--------------
 1 file changed, 99 insertions(+), 64 deletions(-)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 4f6811f..56495f0 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -136,6 +136,12 @@ static void apply_map_update(RelMapFile *map, Oid relationId, Oid fileNode,
 							 bool add_okay);
 static void merge_map_updates(RelMapFile *map, const RelMapFile *updates,
 							  bool add_okay);
+static void read_relmap_file(char *mapfilename, RelMapFile *map,
+							 bool lock_held);
+static void write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
+									   bool write_wal, bool send_sinval,
+									   bool preserve_files, Oid dbid, Oid tsid,
+									   const char *dbpath);
 static void load_relmap_file(bool shared, bool lock_held);
 static void write_relmap_file(bool shared, RelMapFile *newmap,
 							  bool write_wal, bool send_sinval, bool preserve_files,
@@ -687,36 +693,19 @@ RestoreRelationMap(char *startAddress)
 }
 
 /*
- * load_relmap_file -- load data from the shared or local map file
+ * read_relmap_file -- read data from given mapfilename file.
  *
  * Because the map file is essential for access to core system catalogs,
  * failure to read it is a fatal error.
- *
- * Note that the local case requires DatabasePath to be set up.
  */
 static void
-load_relmap_file(bool shared, bool lock_held)
+read_relmap_file(char *mapfilename, RelMapFile *map, bool lock_held)
 {
-	RelMapFile *map;
-	char		mapfilename[MAXPGPATH];
 	pg_crc32c	crc;
 	int			fd;
 	int			r;
 
-	if (shared)
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
-				 RELMAPPER_FILENAME);
-		map = &shared_map;
-	}
-	else
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 DatabasePath, RELMAPPER_FILENAME);
-		map = &local_map;
-	}
-
-	/* Read data ... */
+	/* Open the relmap file for reading. */
 	fd = OpenTransientFile(mapfilename, O_RDONLY | PG_BINARY);
 	if (fd < 0)
 		ereport(FATAL,
@@ -779,62 +768,50 @@ load_relmap_file(bool shared, bool lock_held)
 }
 
 /*
- * Write out a new shared or local map file with the given contents.
- *
- * The magic number and CRC are automatically updated in *newmap.  On
- * success, we copy the data to the appropriate permanent static variable.
- *
- * If write_wal is true then an appropriate WAL message is emitted.
- * (It will be false for bootstrap and WAL replay cases.)
- *
- * If send_sinval is true then a SI invalidation message is sent.
- * (This should be true except in bootstrap case.)
- *
- * If preserve_files is true then the storage manager is warned not to
- * delete the files listed in the map.
+ * load_relmap_file -- load data from the shared or local map file
  *
- * Because this may be called during WAL replay when MyDatabaseId,
- * DatabasePath, etc aren't valid, we require the caller to pass in suitable
- * values.  The caller is also responsible for being sure no concurrent
- * map update could be happening.
+ * Note that the local case requires DatabasePath to be set up.
  */
 static void
-write_relmap_file(bool shared, RelMapFile *newmap,
-				  bool write_wal, bool send_sinval, bool preserve_files,
-				  Oid dbid, Oid tsid, const char *dbpath)
+load_relmap_file(bool shared, bool lock_held)
 {
-	int			fd;
-	RelMapFile *realmap;
+	RelMapFile *map;
 	char		mapfilename[MAXPGPATH];
 
-	/*
-	 * Fill in the overhead fields and update CRC.
-	 */
-	newmap->magic = RELMAPPER_FILEMAGIC;
-	if (newmap->num_mappings < 0 || newmap->num_mappings > MAX_MAPPINGS)
-		elog(ERROR, "attempt to write bogus relation mapping");
-
-	INIT_CRC32C(newmap->crc);
-	COMP_CRC32C(newmap->crc, (char *) newmap, offsetof(RelMapFile, crc));
-	FIN_CRC32C(newmap->crc);
-
-	/*
-	 * Open the target file.  We prefer to do this before entering the
-	 * critical section, so that an open() failure need not force PANIC.
-	 */
 	if (shared)
 	{
 		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
 				 RELMAPPER_FILENAME);
-		realmap = &shared_map;
+		map = &shared_map;
 	}
 	else
 	{
 		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 dbpath, RELMAPPER_FILENAME);
-		realmap = &local_map;
+				 DatabasePath, RELMAPPER_FILENAME);
+		map = &local_map;
 	}
 
+	/* Read data ... */
+	read_relmap_file(mapfilename, map, lock_held);
+}
+
+/*
+ * Helper function for write_relmap_file, Read comments atop write_relmap_file
+ * for more details.  The CRC should be computed by the caller and stored in
+ * the newmap.
+ */
+static void
+write_relmap_file_internal(char *mapfilename, RelMapFile *newmap,
+						   bool write_wal, bool send_sinval,
+						   bool preserve_files, Oid dbid, Oid tsid,
+						   const char *dbpath)
+{
+	int			fd;
+
+	/*
+	 * Open the target file.  We prefer to do this before entering the
+	 * critical section, so that an open() failure need not force PANIC.
+	 */
 	fd = OpenTransientFile(mapfilename, O_WRONLY | O_CREAT | PG_BINARY);
 	if (fd < 0)
 		ereport(ERROR,
@@ -934,6 +911,68 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		}
 	}
 
+	/* Critical section done */
+	if (write_wal)
+		END_CRIT_SECTION();
+}
+
+/*
+ * Write out a new shared or local map file with the given contents.
+ *
+ * The magic number and CRC are automatically updated in *newmap.  On
+ * success, we copy the data to the appropriate permanent static variable.
+ *
+ * If write_wal is true then an appropriate WAL message is emitted.
+ * (It will be false for bootstrap and WAL replay cases.)
+ *
+ * If send_sinval is true then a SI invalidation message is sent.
+ * (This should be true except in bootstrap case.)
+ *
+ * If preserve_files is true then the storage manager is warned not to
+ * delete the files listed in the map.
+ *
+ * Because this may be called during WAL replay when MyDatabaseId,
+ * DatabasePath, etc aren't valid, we require the caller to pass in suitable
+ * values.  The caller is also responsible for being sure no concurrent
+ * map update could be happening.
+ */
+static void
+write_relmap_file(bool shared, RelMapFile *newmap,
+				  bool write_wal, bool send_sinval, bool preserve_files,
+				  Oid dbid, Oid tsid, const char *dbpath)
+{
+	RelMapFile *realmap;
+	char		mapfilename[MAXPGPATH];
+
+	/*
+	 * Fill in the overhead fields and update CRC.
+	 */
+	newmap->magic = RELMAPPER_FILEMAGIC;
+	if (newmap->num_mappings < 0 || newmap->num_mappings > MAX_MAPPINGS)
+		elog(ERROR, "attempt to write bogus relation mapping");
+
+	INIT_CRC32C(newmap->crc);
+	COMP_CRC32C(newmap->crc, (char *) newmap, offsetof(RelMapFile, crc));
+	FIN_CRC32C(newmap->crc);
+
+	if (shared)
+	{
+		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
+				 RELMAPPER_FILENAME);
+		realmap = &shared_map;
+	}
+	else
+	{
+		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+				 dbpath, RELMAPPER_FILENAME);
+		realmap = &local_map;
+	}
+
+	/* Write the map to the relmap file. */
+	write_relmap_file_internal(mapfilename, newmap, write_wal,
+							   send_sinval, preserve_files, dbid, tsid,
+							   dbpath);
+
 	/*
 	 * Success, update permanent copy.  During bootstrap, we might be working
 	 * on the permanent copy itself, in which case skip the memcpy() to avoid
@@ -943,10 +982,6 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		memcpy(realmap, newmap, sizeof(RelMapFile));
 	else
 		Assert(!send_sinval);	/* must be bootstrapping */
-
-	/* Critical section done */
-	if (write_wal)
-		END_CRIT_SECTION();
 }
 
 /*
-- 
1.8.3.1

v10-0003-Refactor-index_copy_data.patch (text/x-patch; charset=US-ASCII)

From 44b2398e89fdf4dee1bd072d7d19299e4fac45ae Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 24 Sep 2021 18:13:25 +0530
Subject: [PATCH v10 3/6] Refactor index_copy_data

Add a separate interface for copying relation storage; it will
be used by a later patch for copying the relations of a database.
---
 src/backend/commands/tablecmds.c | 68 +++++++++++++++++++++++++---------------
 src/include/commands/tablecmds.h |  5 +++
 src/tools/pgindent/typedefs.list |  1 +
 3 files changed, 48 insertions(+), 26 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 3e83f37..a57d6b0 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14580,54 +14580,70 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
 	return new_tablespaceoid;
 }
 
-static void
-index_copy_data(Relation rel, RelFileNode newrnode)
+/*
+ * Copy the data of all forks of the source smgr relation to the destination.
+ *
+ * copy_storage - storage copy function, which is passed by the caller.
+ */
+void
+RelationCopyAllFork(SMgrRelation src_smgr, SMgrRelation	dst_smgr,
+					char relpersistence, copy_relation_storage copy_storage)
 {
-	SMgrRelation dstrel;
-
-	dstrel = smgropen(newrnode, rel->rd_backend);
-
 	/*
-	 * Since we copy the file directly without looking at the shared buffers,
-	 * we'd better first flush out any pages of the source relation that are
-	 * in shared buffers.  We assume no new changes will be made while we are
-	 * holding exclusive lock on the rel.
-	 */
-	FlushRelationBuffers(rel);
-
-	/*
-	 * Create and copy all forks of the relation, and schedule unlinking of
-	 * old physical files.
+	 * Create and copy all forks of the relation.
 	 *
 	 * NOTE: any conflict in relfilenode value will be caught in
 	 * RelationCreateStorage().
 	 */
-	RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
+	RelationCreateStorage(dst_smgr->smgr_rnode.node, relpersistence);
 
 	/* copy main fork */
-	RelationCopyStorage(RelationGetSmgr(rel), dstrel, MAIN_FORKNUM,
-						rel->rd_rel->relpersistence);
+	copy_storage(src_smgr, dst_smgr, MAIN_FORKNUM, relpersistence);
 
 	/* copy those extra forks that exist */
 	for (ForkNumber forkNum = MAIN_FORKNUM + 1;
 		 forkNum <= MAX_FORKNUM; forkNum++)
 	{
-		if (smgrexists(RelationGetSmgr(rel), forkNum))
+		if (smgrexists(src_smgr, forkNum))
 		{
-			smgrcreate(dstrel, forkNum, false);
+			smgrcreate(dst_smgr, forkNum, false);
 
 			/*
 			 * WAL log creation if the relation is persistent, or this is the
 			 * init fork of an unlogged relation.
 			 */
-			if (RelationIsPermanent(rel) ||
-				(rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
+			if (relpersistence == RELPERSISTENCE_PERMANENT ||
+				(relpersistence == RELPERSISTENCE_UNLOGGED &&
 				 forkNum == INIT_FORKNUM))
-				log_smgrcreate(&newrnode, forkNum);
-			RelationCopyStorage(RelationGetSmgr(rel), dstrel, forkNum,
-								rel->rd_rel->relpersistence);
+				log_smgrcreate(&dst_smgr->smgr_rnode.node, forkNum);
+
+			/* Copy a fork's data, block by block. */
+			copy_storage(src_smgr, dst_smgr, forkNum, relpersistence);
 		}
 	}
+}
+
+static void
+index_copy_data(Relation rel, RelFileNode newrnode)
+{
+	SMgrRelation dstrel;
+
+	dstrel = smgropen(newrnode, rel->rd_backend);
+
+	/*
+	 * Since we copy the file directly without looking at the shared buffers,
+	 * we'd better first flush out any pages of the source relation that are
+	 * in shared buffers.  We assume no new changes will be made while we are
+	 * holding exclusive lock on the rel.
+	 */
+	FlushRelationBuffers(rel);
+
+	/*
+	 * Create and copy all forks of the relation, and schedule unlinking of
+	 * old physical files.
+	 */
+	RelationCopyAllFork(RelationGetSmgr(rel), dstrel,
+						rel->rd_rel->relpersistence, RelationCopyStorage);
 
 	/* drop old relation, and close new one */
 	RelationDropStorage(rel);
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 5d4037f..cd49471 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -19,10 +19,13 @@
 #include "catalog/objectaddress.h"
 #include "nodes/parsenodes.h"
 #include "storage/lock.h"
+#include "storage/smgr.h"
 #include "utils/relcache.h"
 
 struct AlterTableUtilityContext;	/* avoid including tcop/utility.h here */
 
+typedef void (*copy_relation_storage) (SMgrRelation src, SMgrRelation dst,
+									  ForkNumber forkNum, char relpersistence);
 
 extern ObjectAddress DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 									ObjectAddress *typaddress, const char *queryString);
@@ -42,6 +45,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
 
 extern Oid	AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
 
+extern void RelationCopyAllFork(SMgrRelation src_smgr, SMgrRelation	dst_smgr,
+								char relpersistence, copy_relation_storage copy_storage);
 extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
 										 Oid *oldschema);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d9b83f7..c1400d3 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3059,6 +3059,7 @@ config_var_value
 contain_aggs_of_level_context
 convert_testexpr_context
 copy_data_source_cb
+copy_relation_storage
 core_YYSTYPE
 core_yy_extra_type
 core_yyscan_t
-- 
1.8.3.1
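
The shape of RelationCopyAllFork in 0003 is "always copy the main fork,
then copy each extra fork that exists in the source, using a
caller-supplied copy function". A minimal standalone sketch of that
pattern, assuming toy stand-in types (DemoFork, DemoRel, demo_copy_fn,
demo_copy_blocks and demo_copy_all_forks are all hypothetical; the real
code works on SMgrRelation and also WAL-logs fork creation where
needed):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy fork numbers; the main fork comes first, as in PostgreSQL. */
typedef enum DemoFork
{
	DEMO_MAIN = 0,
	DEMO_FSM,
	DEMO_VM,
	DEMO_INIT,
	DEMO_NFORKS
} DemoFork;

/* Toy relation: which forks exist and how many blocks each one has. */
typedef struct DemoRel
{
	bool		exists[DEMO_NFORKS];
	int			nblocks[DEMO_NFORKS];
} DemoRel;

/* Caller-supplied routine that copies a single fork. */
typedef void (*demo_copy_fn) (const DemoRel *src, DemoRel *dst,
							  DemoFork fork);

/* One possible copy routine: just mirror the block count. */
static void
demo_copy_blocks(const DemoRel *src, DemoRel *dst, DemoFork fork)
{
	dst->nblocks[fork] = src->nblocks[fork];
}

/* Copy the main fork, then every extra fork present in the source. */
static void
demo_copy_all_forks(const DemoRel *src, DemoRel *dst, demo_copy_fn copy_fork)
{
	dst->exists[DEMO_MAIN] = true;
	copy_fork(src, dst, DEMO_MAIN);

	for (int fork = DEMO_MAIN + 1; fork < DEMO_NFORKS; fork++)
	{
		if (!src->exists[fork])
			continue;			/* nothing to copy for this fork */

		dst->exists[fork] = true;
		copy_fork(src, dst, (DemoFork) fork);
	}
}
```

Passing the copy routine as a function pointer is what lets
index_copy_data keep using RelationCopyStorage while a different caller
plugs in its own block-level copy.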

v10-0004-Extend-bufmgr-interfaces.patch (text/x-patch; charset=US-ASCII)

From 824cc17de6f9ca027cc6cfae60f73086490a23db Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Thu, 10 Feb 2022 15:55:33 +0530
Subject: [PATCH v10 4/6] Extend bufmgr interfaces

Extend the ReadBufferWithoutRelcache interface to take relpersistence
as input.  Until now, this function could only be used on permanent
relations, because it was only used during XLOG replay.  As part of
the bigger patch set, it will also be used to read buffers from a
database we are not connected to, so it may now see temporary and
unlogged relations as well.
---
 src/backend/access/transam/xlogutils.c |  9 ++++++---
 src/backend/storage/buffer/bufmgr.c    | 13 +++----------
 src/include/storage/bufmgr.h           |  3 ++-
 3 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 54d5f20..c292794 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -484,7 +484,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	{
 		/* page exists in file */
 		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-										   mode, NULL);
+										   mode, NULL,
+										   RELPERSISTENCE_PERMANENT);
 	}
 	else
 	{
@@ -509,7 +510,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				ReleaseBuffer(buffer);
 			}
 			buffer = ReadBufferWithoutRelcache(rnode, forknum,
-											   P_NEW, mode, NULL);
+											   P_NEW, mode, NULL,
+											   RELPERSISTENCE_PERMANENT);
 		}
 		while (BufferGetBlockNumber(buffer) < blkno);
 		/* Handle the corner case that P_NEW returns non-consecutive pages */
@@ -519,7 +521,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			ReleaseBuffer(buffer);
 			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-											   mode, NULL);
+											   mode, NULL,
+											   RELPERSISTENCE_PERMANENT);
 		}
 	}
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c6..d6d366a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -771,24 +771,17 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 /*
  * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
  *		a relcache entry for the relation.
- *
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay.  If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
  */
 Buffer
 ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
-						  BufferAccessStrategy strategy)
+						  BufferAccessStrategy strategy, char relpersistence)
 {
 	bool		hit;
 
 	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
 
-	Assert(InRecovery);
-
-	return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
+	return ReadBuffer_common(smgr, relpersistence, forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -798,7 +791,7 @@ ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
  *
  * *hit is set to true if the request was satisfied from shared buffer cache.
  */
-static Buffer
+Buffer
 ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 				  BlockNumber blockNum, ReadBufferMode mode,
 				  BufferAccessStrategy strategy, bool *hit)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index dd01841..7b80f58 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -184,7 +184,8 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BufferAccessStrategy strategy);
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 										ForkNumber forkNum, BlockNumber blockNum,
-										ReadBufferMode mode, BufferAccessStrategy strategy);
+										ReadBufferMode mode, BufferAccessStrategy strategy,
+										char relpersistence);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
-- 
1.8.3.1

v10-0005-New-interface-to-lock-relation-id.patch (text/x-patch; charset=US-ASCII)

From 93e3ed5a9d0b385f2a62cba5c96023bb20c97845 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 24 Sep 2021 18:29:17 +0530
Subject: [PATCH v10 5/6] New interface to lock relation id

Currently, LockRelationOid provides a mechanism to lock a relation
by OID, but we must be connected to the database the relation
belongs to.  This patch adds a new interface that can lock a
relation even if we are not connected to the containing database.
---
 src/backend/storage/lmgr/lmgr.c | 28 ++++++++++++++++++++++++++++
 src/include/storage/lmgr.h      |  1 +
 2 files changed, 29 insertions(+)

diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 5ae52dd..1543da6 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -176,6 +176,34 @@ ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode)
 }
 
 /*
+ *		LockRelationId
+ *
+ * Lock, given a LockRelId.  Same as LockRelationOid, but takes a LockRelId
+ * as input.
+ */
+void
+LockRelationId(LockRelId *relid, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+	LOCALLOCK  *locallock;
+	LockAcquireResult res;
+
+	SET_LOCKTAG_RELATION(tag, relid->dbId, relid->relId);
+
+	res = LockAcquireExtended(&tag, lockmode, false, false, true, &locallock);
+
+	/*
+	 * Now that we have the lock, check for invalidation messages; see notes
+	 * in LockRelationOid.
+	 */
+	if (res != LOCKACQUIRE_ALREADY_CLEAR)
+	{
+		AcceptInvalidationMessages();
+		MarkLockClear(locallock);
+	}
+}
+
+/*
  *		UnlockRelationId
  *
  * Unlock, given a LockRelId.  This is preferred over UnlockRelationOid
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 49edbcc..be1d2c9 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -38,6 +38,7 @@ extern void RelationInitLockInfo(Relation relation);
 
 /* Lock a relation */
 extern void LockRelationOid(Oid relid, LOCKMODE lockmode);
+extern void LockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern bool ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode);
 extern void UnlockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern void UnlockRelationOid(Oid relid, LOCKMODE lockmode);
-- 
1.8.3.1

v10-0006-WAL-logged-CREATE-DATABASE.patch (text/x-patch; charset=US-ASCII)

From 2632da8b5770ac67633fbadbcdfaffc3ceeab277 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 14 Feb 2022 17:48:03 +0530
Subject: [PATCH v10 6/6] WAL logged CREATE DATABASE

Currently, CREATE DATABASE forces a checkpoint, then copies all the files,
then forces another checkpoint. The comments in the createdb() function
explain the reasons for this. This patch fixes this problem by making
CREATE DATABASE completely WAL-logged so that we can avoid the checkpoints.

The old way of creating the database is also retained, and an option is
provided to choose the strategy for creating the database.  For the new
method the user needs to specify STRATEGY=WAL_LOG, and for the old method
STRATEGY=FILE_COPY.  The default strategy is WAL_LOG.
---
 doc/src/sgml/ref/create_database.sgml  |  22 +
 src/backend/commands/dbcommands.c      | 858 ++++++++++++++++++++++++++++-----
 src/include/commands/dbcommands_xlog.h |   8 +
 src/tools/pgindent/typedefs.list       |   1 +
 4 files changed, 757 insertions(+), 132 deletions(-)

diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index f70d0c7..a906cc8 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -34,6 +34,7 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
            [ CONNECTION LIMIT [=] <replaceable class="parameter">connlimit</replaceable> ]
            [ IS_TEMPLATE [=] <replaceable class="parameter">istemplate</replaceable> ]
            [ OID [=] <replaceable class="parameter">oid</replaceable> ] ]
+           [ STRATEGY [=] <replaceable class="parameter">strategy</replaceable> ] ]
 </synopsis>
  </refsynopsisdiv>
 
@@ -240,6 +241,27 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
        </listitem>
       </varlistentry>
 
+      <varlistentry>
+       <term><replaceable class="parameter">strategy</replaceable></term>
+       <listitem>
+        <para>
+         This is used for copying the database directory.  Currently, there
+         are two strategies: <literal>WAL_LOG</literal> and
+         <literal>FILE_COPY</literal>.  If the <literal>WAL_LOG</literal>
+         strategy is used, the database is copied block by block and each
+         copied block is WAL-logged.  If the <literal>FILE_COPY</literal>
+         strategy is used, the copy is done at the file system level, so
+         individual blocks are not WAL-logged.  The
+         <literal>FILE_COPY</literal> strategy has to issue a checkpoint
+         before and after performing the copy; if shared buffers are large
+         and contain many dirty buffers, those checkpoints are costly and
+         may impact the performance of the whole system.  On the other hand,
+         WAL-logging each block means that creating the database may take
+         more time if the source database is large.
+        </para>
+       </listitem>
+      </varlistentry>
+
     </variablelist>
 
   <para>
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index c37e3c9..255570f 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -46,6 +46,7 @@
 #include "commands/dbcommands_xlog.h"
 #include "commands/defrem.h"
 #include "commands/seclabel.h"
+#include "commands/tablecmds.h"
 #include "commands/tablespace.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
@@ -63,13 +64,27 @@
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/pg_locale.h"
+#include "utils/relmapper.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
+/*
+ * Create database strategy.  The CREATEDB_WAL_LOG will copy the database at
+ * the block level and WAL log each copied block.  Whereas the
+ * CREATEDB_FILE_COPY will directly copy the database at the file level and no
+ * individual operations will be WAL logged.
+ */
+typedef enum CreateDBStrategy
+{
+	CREATEDB_WAL_LOG = 0,
+	CREATEDB_FILE_COPY = 1
+} CreateDBStrategy;
+
 typedef struct
 {
 	Oid			src_dboid;		/* source (template) DB */
 	Oid			dest_dboid;		/* DB we are trying to create */
+	CreateDBStrategy	strategy;	/* create db strategy */
 } createdb_failure_params;
 
 typedef struct
@@ -78,6 +93,19 @@ typedef struct
 	Oid			dest_tsoid;		/* tablespace we are trying to move to */
 } movedb_failure_params;
 
+/*
+ * When creating a database, we scan the pg_class of the source database to
+ * identify all the relations to be copied.  The structure is used for storing
+ * information about each relation of the source database.
+ */
+typedef struct CreateDBRelInfo
+{
+	RelFileNode		rnode;				/* physical relation identifier */
+	Oid				reloid;				/* relation oid */
+	char			relpersistence;		/* relation's persistence level */
+} CreateDBRelInfo;
+
+
 /* non-export function prototypes */
 static void createdb_failure_callback(int code, Datum arg);
 static void movedb(const char *dbname, const char *tblspcname);
@@ -92,7 +120,607 @@ static bool have_createdb_privilege(void);
 static void remove_dbtablespaces(Oid db_id);
 static bool check_db_file_conflict(Oid db_id);
 static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
+static void CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid,
+									bool isRedo);
+static CreateDBRelInfo *GetRelInfoFromTuple(HeapTupleData *tuple,
+											Oid tbid, Oid dbid, char *srcpath);
+static List *GetRelListFromPage(Page page, Buffer buf, Oid tbid, Oid dbid,
+								char *srcpath, List *rnodelist, Snapshot
+								snapshot);
+static List *GetDatabaseRelationList(Oid srctbid, Oid srcdbid, char *srcpath);
+static void RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+									ForkNumber forkNum, char relpersistence);
+static void CopyDatabaseWithWal(Oid src_dboid, Oid dboid, Oid src_tsid,
+								Oid dst_tsid);
+static void CopyDatabase(Oid src_dboid, Oid dboid, Oid src_tsid, Oid dst_tsid);
+
+/*
+ * CreateDirAndVersionFile - Create database directory and write out the
+ *							 PG_VERSION file in the database path.
+ *
+ * If isRedo is true, it's okay for the database directory to exist already.
+ *
+ * We can directly write PG_MAJORVERSION in the version file instead of copying
+ * from the source database file because these two must be the same.
+ */
+static void
+CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
+{
+	int		fd;
+	int		nbytes;
+	char	versionfile[MAXPGPATH];
+	char	buf[16];
+
+	/* Prepare version data before starting a critical section. */
+	sprintf(buf, "%s\n", PG_MAJORVERSION);
+	nbytes = strlen(PG_MAJORVERSION) + 1;
+
+	/* If we are not in WAL replay then write the WAL. */
+	if (!isRedo)
+	{
+		xl_dbase_create_rec xlrec;
+		XLogRecPtr	lsn;
+
+		/* Now errors are fatal ... */
+		START_CRIT_SECTION();
+
+		xlrec.db_id = dbid;
+		xlrec.tablespace_id = tsid;
+		xlrec.src_db_id = InvalidOid;
+		xlrec.src_tablespace_id = InvalidOid;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), sizeof(xl_dbase_create_rec));
+
+		lsn = XLogInsert(RM_DBASE_ID, XLOG_DBASE_CREATE);
+
+		/* As always, WAL must hit the disk before the data update does. */
+		XLogFlush(lsn);
+	}
+
+	/* Create database directory. */
+	if (MakePGDirectory(dbpath) < 0)
+	{
+		/* Failure other than already exists or not in WAL replay? */
+		if (errno != EEXIST || !isRedo)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create directory \"%s\": %m", dbpath)));
+	}
+
+	/*
+	 * Create PG_VERSION file in the database path.  If the file already exists
+	 * and we are in WAL replay then try again to open it in write mode.
+	 */
+	snprintf(versionfile, sizeof(versionfile), "%s/%s", dbpath, "PG_VERSION");
+
+	fd = OpenTransientFile(versionfile, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	if (fd < 0 && errno == EEXIST && isRedo)
+		fd = OpenTransientFile(versionfile, O_WRONLY | O_TRUNC | PG_BINARY);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m", versionfile)));
+
+	/* Write PG_MAJORVERSION in the PG_VERSION file. */
+	pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_WRITE);
+	errno = 0;
+	if ((int) write(fd, buf, nbytes) != nbytes)
+	{
+		/* If write didn't set errno, assume problem is no disk space. */
+		if (errno == 0)
+			errno = ENOSPC;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", versionfile)));
+	}
+	pgstat_report_wait_end();
+
+	/* Close the version file. */
+	CloseTransientFile(fd);
+
+	/* Critical section done. */
+	if (!isRedo)
+		END_CRIT_SECTION();
+}
+
+/*
+ * GetRelInfoFromTuple - Prepare a CreateDBRelInfo element from the tuple
+ *
+ * Helper function for GetRelListFromPage to prepare a single element from the
+ * pg_class tuple.
+ */
+CreateDBRelInfo *
+GetRelInfoFromTuple(HeapTupleData *tuple, Oid tbid, Oid dbid, char *srcpath)
+{
+	CreateDBRelInfo	   *relinfo;
+	Form_pg_class		classForm;
+	Oid					relfilenode = InvalidOid;
+
+	classForm = (Form_pg_class) GETSTRUCT(tuple);
+
+	/* We don't need to copy the shared objects to the target. */
+	if (classForm->reltablespace == GLOBALTABLESPACE_OID)
+		return NULL;
+
+	/*
+	 * If the object doesn't have the storage then nothing to be
+	 * done for that object so just ignore it.
+	 */
+	if (!RELKIND_HAS_STORAGE(classForm->relkind))
+		return NULL;
+
+	/*
+	 * If relfilenode is valid then directly use it.  Otherwise,
+	 * consult the relmapper for the mapped relation.
+	 */
+	if (OidIsValid(classForm->relfilenode))
+		relfilenode = classForm->relfilenode;
+	else
+		relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+										classForm->oid);
+
+	/* We must have a valid relfilenode oid. */
+	Assert(OidIsValid(relfilenode));
+
+	/* Prepare a rel info element and add it to the list. */
+	relinfo = (CreateDBRelInfo *) palloc(sizeof(CreateDBRelInfo));
+	if (OidIsValid(classForm->reltablespace))
+		relinfo->rnode.spcNode = classForm->reltablespace;
+	else
+		relinfo->rnode.spcNode = tbid;
+
+	relinfo->rnode.dbNode = dbid;
+	relinfo->rnode.relNode = relfilenode;
+	relinfo->reloid = classForm->oid;
+	relinfo->relpersistence = classForm->relpersistence;
+
+	return relinfo;
+}
+
+/*
+ * GetRelListFromPage - Helper function for GetDatabaseRelationList.
+ *
+ * Iterate over each tuple of input pg_class and get a list of all the valid
+ * relfilenodes of the given block and append them to input rnodelist.
+ */
+static List *
+GetRelListFromPage(Page page, Buffer buf, Oid tbid, Oid dbid, char *srcpath,
+				  List *rnodelist, Snapshot snapshot)
+{
+	BlockNumber		blkno = BufferGetBlockNumber(buf);
+	OffsetNumber	offnum;
+	OffsetNumber	maxoff;
+	HeapTupleData	tuple;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Iterate over each tuple on the page. */
+	for (offnum = FirstOffsetNumber;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid;
+
+		itemid = PageGetItemId(page, offnum);
+
+		/* Nothing to do if slot is empty or already dead. */
+		if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid) ||
+			ItemIdIsRedirected(itemid))
+			continue;
+
+		Assert(ItemIdIsNormal(itemid));
+		ItemPointerSet(&(tuple.t_self), blkno, offnum);
+
+		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+		tuple.t_len = ItemIdGetLength(itemid);
+		tuple.t_tableOid = RelationRelationId;
+
+		/*
+		 * If the tuple is visible then add its relfilenode info to the
+		 * list.
+		 */
+		if (HeapTupleSatisfiesVisibility(&tuple, snapshot, buf))
+		{
+			CreateDBRelInfo	   *relinfo;
+
+			relinfo = GetRelInfoFromTuple(&tuple, tbid, dbid, srcpath);
 
+			/* Add it to the list. */
+			if (relinfo != NULL)
+				rnodelist = lappend(rnodelist, relinfo);
+		}
+	}
+
+	return rnodelist;
+}
+
+/*
+ * GetDatabaseRelationList - Get relfilenode list to be copied.
+ *
+ * Iterate over each block of the pg_class relation.  From there, we will check
+ * all the visible tuples in order to get a list of all the valid relfilenodes
+ * in the source database that should be copied to the target database.
+ */
+static List *
+GetDatabaseRelationList(Oid tbid, Oid dbid, char *srcpath)
+{
+	SMgrRelation	rd_smgr;
+	RelFileNode		rnode;
+	BlockNumber		nblocks;
+	BlockNumber		blkno;
+	Buffer			buf;
+	Oid				relfilenode;
+	Page			page;
+	List		   *rnodelist = NIL;
+	LockRelId		relid;
+	Snapshot		snapshot;
+	BufferAccessStrategy bstrategy;
+
+	/* Get pg_class relfilenode. */
+	relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+													  RelationRelationId);
+	/*
+	 * We are going to read the buffers associated with the pg_class relation.
+	 * Thus, acquire the relation level lock before start scanning.  As we are
+	 * not connected to the database, we cannot use relation_open directly, so
+	 * we have to lock using relation id.
+	 */
+	relid.dbId = dbid;
+	relid.relId = RelationRelationId;
+	LockRelationId(&relid, AccessShareLock);
+
+	/* Prepare a relnode for pg_class relation. */
+	rnode.spcNode = tbid;
+	rnode.dbNode = dbid;
+	rnode.relNode = relfilenode;
+
+	/*
+	 * We are not connected to the source database so open the pg_class
+	 * relation at the smgr level and get the block count.
+	 */
+	rd_smgr = smgropen(rnode, InvalidBackendId);
+	nblocks = smgrnblocks(rd_smgr, MAIN_FORKNUM);
+
+	/*
+	 * We're going to read the whole pg_class so better to use bulk-read buffer
+	 * access strategy.
+	 */
+	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	/* Get latest snapshot for scanning the pg_class. */
+	snapshot = GetLatestSnapshot();
+
+	/* Iterate over each block on the pg_class relation. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		/*
+		 * We are not connected to the source database so directly use the lower
+		 * level bufmgr interface which operates on the rnode.
+		 */
+		buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
+										RBM_NORMAL, bstrategy,
+										RELPERSISTENCE_PERMANENT);
+
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buf);
+		if (PageIsNew(page) || PageIsEmpty(page))
+		{
+			UnlockReleaseBuffer(buf);
+			continue;
+		}
+
+		/*
+		 * Process pg_class tuple for the current page and add all the valid
+		 * relfilenode entries to the rnodelist.
+		 */
+		rnodelist = GetRelListFromPage(page, buf, tbid, dbid, srcpath,
+									   rnodelist, snapshot);
+
+		/* Release the buffer lock. */
+		UnlockReleaseBuffer(buf);
+	}
+
+	/* Release the lock. */
+	UnlockRelationId(&relid, AccessShareLock);
+
+	return rnodelist;
+}
+
+/*
+ * RelationCopyStorageUsingBuffer - Copy fork's data using bufmgr.
+ *
+ * Same as RelationCopyStorage but instead of using smgrread and smgrextend
+ * this will copy using bufmgr APIs.
+ */
+static void
+RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+							   ForkNumber forkNum, char relpersistence)
+{
+	Buffer		srcBuf;
+	Buffer		dstBuf;
+	Page		srcPage;
+	Page		dstPage;
+	bool		use_wal;
+	bool		copying_initfork;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	BufferAccessStrategy bstrategy_src;
+	BufferAccessStrategy bstrategy_dst;
+
+	/* See the comments in RelationCopyStorage. */
+	copying_initfork = relpersistence == RELPERSISTENCE_UNLOGGED &&
+		forkNum == INIT_FORKNUM;
+	use_wal = XLogIsNeeded() &&
+		(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
+
+	/* Get number of blocks in the source relation. */
+	nblocks = smgrnblocks(src, forkNum);
+
+	/*
+	 * We are going to copy whole relation from the source to the destination
+	 * so use BAS_BULKREAD strategy for the source relation and BAS_BULKWRITE
+	 * strategy for the destination relation.
+	 */
+	bstrategy_src = GetAccessStrategy(BAS_BULKREAD);
+	bstrategy_dst = GetAccessStrategy(BAS_BULKWRITE);
+
+	/* Iterate over each block of the source relation file. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		/* If we got a cancel signal during the copy of the data, quit */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Read block from source relation. */
+		srcBuf = ReadBufferWithoutRelcache(src->smgr_rnode.node, forkNum,
+										   blkno, RBM_NORMAL, bstrategy_src,
+										   relpersistence);
+		srcPage = BufferGetPage(srcBuf);
+		if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
+		{
+			ReleaseBuffer(srcBuf);
+			continue;
+		}
+
+		/* Use P_NEW to extend the relation. */
+		dstBuf = ReadBufferWithoutRelcache(dst->smgr_rnode.node, forkNum,
+										   P_NEW, RBM_NORMAL, bstrategy_dst,
+										   relpersistence);
+		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
+
+		START_CRIT_SECTION();
+
+		/* Initialize the page and write the data. */
+		dstPage = BufferGetPage(dstBuf);
+		PageInit(dstPage, BufferGetPageSize(dstBuf), 0);
+		memcpy(dstPage, srcPage, BLCKSZ);
+		MarkBufferDirty(dstBuf);
+
+		/* WAL-log the copied page. */
+		if (use_wal)
+			log_newpage_buffer(dstBuf, true);
+
+		END_CRIT_SECTION();
+
+		UnlockReleaseBuffer(dstBuf);
+		ReleaseBuffer(srcBuf);
+	}
+}
+
+/*
+ * CopyDatabaseWithWal - Copy source database to the target database with WAL
+ *
+ * Create target database directory and copy data files from the source
+ * database to the target database, block by block and WAL log all the
+ * operations.
+ */
+static void
+CopyDatabaseWithWal(Oid src_dboid, Oid dst_dboid, Oid src_tsid, Oid dst_tsid)
+{
+	char	   *srcpath;
+	char	   *dstpath;
+	List	   *rnodelist = NULL;
+	ListCell   *cell;
+	LockRelId	relid;
+	RelFileNode	srcrnode;
+	RelFileNode	dstrnode;
+	CreateDBRelInfo	*relinfo;
+
+	/* Get the source database path. */
+	srcpath = GetDatabasePath(src_dboid, src_tsid);
+
+	/* Get the destination database path. */
+	dstpath = GetDatabasePath(dst_dboid, dst_tsid);
+
+	/* Create database directory and write PG_VERSION file. */
+	CreateDirAndVersionFile(dstpath, dst_dboid, dst_tsid, false);
+
+	/* Copy relmap file from source database to the destination database. */
+	CopyRelationMap(dst_dboid, dst_tsid, srcpath, dstpath);
+
+	/* Get list of all valid relnode from the source database. */
+	rnodelist = GetDatabaseRelationList(src_tsid, src_dboid, srcpath);
+	Assert(rnodelist != NIL);
+
+	/*
+	 * The database id is the same for every relation, so set it before
+	 * entering the loop.
+	 */
+	relid.dbId = src_dboid;
+
+	/*
+	 * Iterate over each relfilenode and copy the relation data block by block
+	 * from source database to the destination database.
+	 */
+	foreach(cell, rnodelist)
+	{
+		SMgrRelation	src_smgr;
+		SMgrRelation	dst_smgr;
+
+		relinfo = lfirst(cell);
+		srcrnode = relinfo->rnode;
+
+		/*
+		 * If the relation is in the source db's default tablespace then we
+		 * need to create it in the destination db's default tablespace.
+		 * Otherwise, we need to create it in the same tablespace as in the
+		 * source database.
+		 */
+		if (srcrnode.spcNode == src_tsid)
+			dstrnode.spcNode = dst_tsid;
+		/*
+		 * In case of ALTER DATABASE SET TABLESPACE we don't need to do
+		 * anything for objects that are not in the source db's default
+		 * tablespace.  The source and destination dboid will be the same in
+		 * case of ALTER DATABASE SET TABLESPACE.
+		 */
+		else if (src_dboid == dst_dboid)
+			continue;
+		else
+			dstrnode.spcNode = srcrnode.spcNode;
+
+		dstrnode.dbNode = dst_dboid;
+		dstrnode.relNode = srcrnode.relNode;
+
+		/* Acquire the lock on relation before start copying. */
+		relid.relId = relinfo->reloid;
+		LockRelationId(&relid, AccessShareLock);
+
+		/* Open the source and the destination relation at smgr level. */
+		src_smgr = smgropen(srcrnode, InvalidBackendId);
+		dst_smgr = smgropen(dstrnode, InvalidBackendId);
+
+		/* Copy relation storage from source to the destination. */
+		RelationCopyAllFork(src_smgr, dst_smgr, relinfo->relpersistence,
+							RelationCopyStorageUsingBuffer);
+
+		/* Release the lock. */
+		UnlockRelationId(&relid, AccessShareLock);
+	}
+
+	list_free_deep(rnodelist);
+}
+
+/*
+ * CopyDatabase - Copy source database to the target database
+ *
+ * Copy source database directory to the destination directory using copydir
+ * operation.
+ */
+static void
+CopyDatabase(Oid src_dboid, Oid dst_dboid, Oid src_tsid, Oid dst_tsid)
+{
+	TableScanDesc	scan;
+	Relation		rel;
+	HeapTuple		tuple;
+
+	/*
+	 * Force a checkpoint before starting the copy. This will force all
+	 * dirty buffers, including those of unlogged tables, out to disk, to
+	 * ensure source database is up-to-date on disk for the copy.
+	 * FlushDatabaseBuffers() would suffice for that, but we also want to
+	 * process any pending unlink requests. Otherwise, if a checkpoint
+	 * happened while we're copying files, a file might be deleted just
+	 * when we're about to copy it, causing the lstat() call in copydir()
+	 * to fail with ENOENT.
+	 */
+	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
+					  CHECKPOINT_WAIT | CHECKPOINT_FLUSH_ALL);
+
+	/*
+	 * Iterate through all tablespaces of the template database, and copy
+	 * each one to the new database.
+	 */
+	rel = table_open(TableSpaceRelationId, AccessShareLock);
+	scan = table_beginscan_catalog(rel, 0, NULL);
+	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	{
+		Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
+		Oid			srctablespace = spaceform->oid;
+		Oid			dsttablespace;
+		char	   *srcpath;
+		char	   *dstpath;
+		struct stat st;
+
+		/* No need to copy global tablespace */
+		if (srctablespace == GLOBALTABLESPACE_OID)
+			continue;
+
+		srcpath = GetDatabasePath(src_dboid, srctablespace);
+
+		if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
+			directory_is_empty(srcpath))
+		{
+			/* Assume we can ignore it */
+			pfree(srcpath);
+			continue;
+		}
+
+		if (srctablespace == src_tsid)
+			dsttablespace = dst_tsid;
+		else
+			dsttablespace = srctablespace;
+
+		dstpath = GetDatabasePath(dst_dboid, dsttablespace);
+
+		/*
+		 * Copy this subdirectory to the new location
+		 *
+		 * We don't need to copy subdirectories
+		 */
+		copydir(srcpath, dstpath, false);
+
+		/* Record the filesystem change in XLOG */
+		{
+			xl_dbase_create_rec xlrec;
+
+			xlrec.db_id = dst_dboid;
+			xlrec.tablespace_id = dsttablespace;
+			xlrec.src_db_id = src_dboid;
+			xlrec.src_tablespace_id = srctablespace;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
+
+			(void) XLogInsert(RM_DBASE_ID,
+							  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
+		}
+	}
+	table_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	/*
+	 * We force a checkpoint before committing.  This effectively means
+	 * that committed XLOG_DBASE_CREATE operations will never need to be
+	 * replayed (at least not in ordinary crash recovery; we still have to
+	 * make the XLOG entry for the benefit of PITR operations). This
+	 * avoids two nasty scenarios:
+	 *
+	 * #1: When PITR is off, we don't XLOG the contents of newly created
+	 * indexes; therefore the drop-and-recreate-whole-directory behavior
+	 * of DBASE_CREATE replay would lose such indexes.
+	 *
+	 * #2: Since we have to recopy the source database during DBASE_CREATE
+	 * replay, we run the risk of copying changes in it that were
+	 * committed after the original CREATE DATABASE command but before the
+	 * system crash that led to the replay.  This is at least unexpected
+	 * and at worst could lead to inconsistencies, eg duplicate table
+	 * names.
+	 *
+	 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
+	 *
+	 * In PITR replay, the first of these isn't an issue, and the second
+	 * is only a risk if the CREATE DATABASE and subsequent template
+	 * database change both occur while a base backup is being taken.
+	 * There doesn't seem to be much we can do about that except document
+	 * it as a limitation.
+	 *
+	 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
+	 * we can avoid this.
+	 */
+	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+}
 
 /*
  * CREATE DATABASE
@@ -100,8 +728,6 @@ static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
 Oid
 createdb(ParseState *pstate, const CreatedbStmt *stmt)
 {
-	TableScanDesc scan;
-	Relation	rel;
 	Oid			src_dboid;
 	Oid			src_owner;
 	int			src_encoding = -1;
@@ -132,6 +758,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	DefElem    *dallowconnections = NULL;
 	DefElem    *dconnlimit = NULL;
 	DefElem    *dcollversion = NULL;
+	DefElem    *dstrategy = NULL;
 	char	   *dbname = stmt->dbname;
 	char	   *dbowner = NULL;
 	const char *dbtemplate = NULL;
@@ -145,6 +772,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	char	   *dbcollversion = NULL;
 	int			notherbackends;
 	int			npreparedxacts;
+	CreateDBStrategy	dbstrategy = CREATEDB_WAL_LOG;
 	createdb_failure_params fparms;
 
 	/* Extract options from the statement node tree */
@@ -250,6 +878,12 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 						(errcode(ERRCODE_INVALID_PARAMETER_VALUE)),
 						errmsg("OIDs less than %u are reserved for system objects", FirstNormalObjectId));
 		}
+		else if (strcmp(defel->defname, "strategy") == 0)
+		{
+			if (dstrategy)
+				errorConflictingDefElem(defel, pstate);
+			dstrategy = defel;
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -374,6 +1008,23 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 							dbtemplate)));
 	}
 
+	/* Validate the database creation strategy. */
+	if (dstrategy && dstrategy->arg)
+	{
+		char	*strategy;
+
+		strategy = defGetString(dstrategy);
+		if (strcmp(strategy, "wal_log") == 0)
+			dbstrategy = CREATEDB_WAL_LOG;
+		else if (strcmp(strategy, "file_copy") == 0)
+			dbstrategy = CREATEDB_FILE_COPY;
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("invalid create strategy %s", strategy),
+					 errhint("Valid strategies are \"wal_log\" and \"file_copy\".")));
+	}
+
 	/* If encoding or locales are defaulted, use source's setting */
 	if (encoding < 0)
 		encoding = src_encoding;
@@ -668,19 +1319,6 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	InvokeObjectPostCreateHook(DatabaseRelationId, dboid, 0);
 
 	/*
-	 * Force a checkpoint before starting the copy. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, if a checkpoint
-	 * happened while we're copying files, a file might be deleted just when
-	 * we're about to copy it, causing the lstat() call in copydir() to fail
-	 * with ENOENT.
-	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
-
-	/*
 	 * Once we start copying subdirectories, we need to be able to clean 'em
 	 * up if we fail.  Use an ENSURE block to make sure this happens.  (This
 	 * is not a 100% solution, because of the possibility of failure during
@@ -689,114 +1327,47 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	 */
 	fparms.src_dboid = src_dboid;
 	fparms.dest_dboid = dboid;
+	fparms.strategy = dbstrategy;
+
 	PG_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
 		/*
-		 * Iterate through all tablespaces of the template database, and copy
-		 * each one to the new database.
+		 * If the user has asked to create a database with WAL_LOG strategy
+		 * then call CopyDatabaseWithWal, which will copy the database at the
+		 * block level and it will WAL log each copied block.  Otherwise,
+		 * call CopyDatabase that will copy the database file by file.
 		 */
-		rel = table_open(TableSpaceRelationId, AccessShareLock);
-		scan = table_beginscan_catalog(rel, 0, NULL);
-		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+		if (dbstrategy == CREATEDB_WAL_LOG)
 		{
-			Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
-			Oid			srctablespace = spaceform->oid;
-			Oid			dsttablespace;
-			char	   *srcpath;
-			char	   *dstpath;
-			struct stat st;
-
-			/* No need to copy global tablespace */
-			if (srctablespace == GLOBALTABLESPACE_OID)
-				continue;
-
-			srcpath = GetDatabasePath(src_dboid, srctablespace);
-
-			if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
-				directory_is_empty(srcpath))
-			{
-				/* Assume we can ignore it */
-				pfree(srcpath);
-				continue;
-			}
-
-			if (srctablespace == src_deftablespace)
-				dsttablespace = dst_deftablespace;
-			else
-				dsttablespace = srctablespace;
-
-			dstpath = GetDatabasePath(dboid, dsttablespace);
+			CopyDatabaseWithWal(src_dboid, dboid, src_deftablespace,
+								dst_deftablespace);
 
 			/*
-			 * Copy this subdirectory to the new location
-			 *
-			 * We don't need to copy subdirectories
+			 * Close pg_database, but keep lock till commit.
 			 */
-			copydir(srcpath, dstpath, false);
-
-			/* Record the filesystem change in XLOG */
-			{
-				xl_dbase_create_rec xlrec;
-
-				xlrec.db_id = dboid;
-				xlrec.tablespace_id = dsttablespace;
-				xlrec.src_db_id = src_dboid;
-				xlrec.src_tablespace_id = srctablespace;
-
-				XLogBeginInsert();
-				XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-				(void) XLogInsert(RM_DBASE_ID,
-								  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-			}
+			table_close(pg_database_rel, NoLock);
 		}
-		table_endscan(scan);
-		table_close(rel, AccessShareLock);
+		else
+		{
+			Assert(dbstrategy == CREATEDB_FILE_COPY);
 
-		/*
-		 * We force a checkpoint before committing.  This effectively means
-		 * that committed XLOG_DBASE_CREATE operations will never need to be
-		 * replayed (at least not in ordinary crash recovery; we still have to
-		 * make the XLOG entry for the benefit of PITR operations). This
-		 * avoids two nasty scenarios:
-		 *
-		 * #1: When PITR is off, we don't XLOG the contents of newly created
-		 * indexes; therefore the drop-and-recreate-whole-directory behavior
-		 * of DBASE_CREATE replay would lose such indexes.
-		 *
-		 * #2: Since we have to recopy the source database during DBASE_CREATE
-		 * replay, we run the risk of copying changes in it that were
-		 * committed after the original CREATE DATABASE command but before the
-		 * system crash that led to the replay.  This is at least unexpected
-		 * and at worst could lead to inconsistencies, eg duplicate table
-		 * names.
-		 *
-		 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
-		 *
-		 * In PITR replay, the first of these isn't an issue, and the second
-		 * is only a risk if the CREATE DATABASE and subsequent template
-		 * database change both occur while a base backup is being taken.
-		 * There doesn't seem to be much we can do about that except document
-		 * it as a limitation.
-		 *
-		 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
-		 * we can avoid this.
-		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+			CopyDatabase(src_dboid, dboid, src_deftablespace,
+						 dst_deftablespace);
 
-		/*
-		 * Close pg_database, but keep lock till commit.
-		 */
-		table_close(pg_database_rel, NoLock);
+			/*
+			 * Close pg_database, but keep lock till commit.
+			 */
+			table_close(pg_database_rel, NoLock);
 
-		/*
-		 * Force synchronous commit, thus minimizing the window between
-		 * creation of the database files and committal of the transaction. If
-		 * we crash before committing, we'll have a DB that's taking up disk
-		 * space but is not in pg_database, which is not good.
-		 */
-		ForceSyncCommit();
+			/*
+			 * Force synchronous commit, thus minimizing the window between
+			 * creation of the database files and committal of the transaction.
+			 * If we crash before committing, we'll have a DB that's taking up
+			 * disk space but is not in pg_database, which is not good.
+			 */
+			ForceSyncCommit();
+		}
 	}
 	PG_END_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 								PointerGetDatum(&fparms));
@@ -870,6 +1441,21 @@ createdb_failure_callback(int code, Datum arg)
 	createdb_failure_params *fparms = (createdb_failure_params *) DatumGetPointer(arg);
 
 	/*
+	 * If we were copying the database at the block level then drop pages for
+	 * the destination database that are in the shared buffer cache, and tell
+	 * the checkpointer to forget any pending fsync and unlink requests for
+	 * files in the database.  The reasoning is the same as explained in
+	 * dropdb().  Unlike dropdb(), we don't need to call
+	 * pgstat_drop_database, because the database has not been created yet
+	 * and so there should not be any stats for it.
+	 */
+	if (fparms->strategy == CREATEDB_WAL_LOG)
+	{
+		DropDatabaseBuffers(fparms->dest_dboid);
+		ForgetDatabaseSyncRequests(fparms->dest_dboid);
+	}
+
+	/*
 	 * Release lock on source database before doing recursive remove. This is
 	 * not essential but it seems desirable to release the lock as soon as
 	 * possible.
@@ -2387,32 +2973,40 @@ dbase_redo(XLogReaderState *record)
 		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
 
-		/*
-		 * Our theory for replaying a CREATE is to forcibly drop the target
-		 * subdirectory if present, then re-copy the source data. This may be
-		 * more work than needed, but it is simple to implement.
-		 */
-		if (stat(dst_path, &st) == 0 && S_ISDIR(st.st_mode))
+		if (!OidIsValid(xlrec->src_db_id))
 		{
-			if (!rmtree(dst_path, true))
-				/* If this failed, copydir() below is going to error. */
-				ereport(WARNING,
-						(errmsg("some useless files may be left behind in old database directory \"%s\"",
-								dst_path)));
+			CreateDirAndVersionFile(dst_path, xlrec->db_id, xlrec->tablespace_id,
+									true);
 		}
+		else
+		{
+			/*
+			* Our theory for replaying a CREATE is to forcibly drop the target
+			* subdirectory if present, then re-copy the source data. This may be
+			* more work than needed, but it is simple to implement.
+			*/
+			if (stat(dst_path, &st) == 0 && S_ISDIR(st.st_mode))
+			{
+				if (!rmtree(dst_path, true))
+					/* If this failed, copydir() below is going to error. */
+					ereport(WARNING,
+							(errmsg("some useless files may be left behind in old database directory \"%s\"",
+									dst_path)));
+			}
 
-		/*
-		 * Force dirty buffers out to disk, to ensure source database is
-		 * up-to-date for the copy.
-		 */
-		FlushDatabaseBuffers(xlrec->src_db_id);
+			/*
+			* Force dirty buffers out to disk, to ensure source database is
+			* up-to-date for the copy.
+			*/
+			FlushDatabaseBuffers(xlrec->src_db_id);
 
-		/*
-		 * Copy this subdirectory to the new location
-		 *
-		 * We don't need to copy subdirectories
-		 */
-		copydir(src_path, dst_path, false);
+			/*
+			* Copy this subdirectory to the new location
+			*
+			* We don't need to copy subdirectories
+			*/
+			copydir(src_path, dst_path, false);
+		}
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{
diff --git a/src/include/commands/dbcommands_xlog.h b/src/include/commands/dbcommands_xlog.h
index 593a857..8f59870 100644
--- a/src/include/commands/dbcommands_xlog.h
+++ b/src/include/commands/dbcommands_xlog.h
@@ -20,6 +20,7 @@
 /* record types */
 #define XLOG_DBASE_CREATE		0x00
 #define XLOG_DBASE_DROP			0x10
+#define XLOG_DBASE_CREATEDIR	0x20
 
 typedef struct xl_dbase_create_rec
 {
@@ -30,6 +31,13 @@ typedef struct xl_dbase_create_rec
 	Oid			src_tablespace_id;
 } xl_dbase_create_rec;
 
+typedef struct xl_dbase_createdir_rec
+{
+	/* Records creating database directory */
+	Oid			db_id;
+	Oid			tablespace_id;
+} xl_dbase_createdir_rec;
+
 typedef struct xl_dbase_drop_rec
 {
 	Oid			db_id;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c1400d3..317c2f2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -460,6 +460,7 @@ CoverPos
 CreateAmStmt
 CreateCastStmt
 CreateConversionStmt
+CreateDBRelInfo
 CreateDomainStmt
 CreateEnumStmt
 CreateEventTrigStmt
-- 
1.8.3.1

#135

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#134)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Mar 3, 2022 at 11:22 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

The new version of the patch addresses the two comments raised by Ashutosh, splits the GetRelListFromPage() function as Robert suggested, and uses the latest snapshot rather than the active snapshot to scan pg_class, as Robert pointed out. I haven't yet added a test case that creates a database using the new strategy option. If we are okay with the two options, FILE_COPY and WAL_LOG, I will add test cases for them.

Reviewing 0001, the boundaries of the critical section move slightly,
but only over a memcpy, which can't fail, so that seems fine. But this
comment looks ominous:

* Note: we're cheating a little bit here by assuming that mapped files
* are either in pg_global or the database's default tablespace.

It's not clear to me how the code that follows relies on this
assumption, but the overall patch set would make that not true any
more, so there's some kind of an issue to think about there.

It's a little asymmetric that load_relmap_file() gets a subroutine
read_relmap_file() while write_relmap_file() gets a subroutine
write_relmap_file_internal(). Perhaps we could call the functions
{load,write}_named_relmap_file() or something of that sort.

Reviewing 0002, your comment updates in relmap_redo() are not
complete. Note that there's an unmodified comment that says "Write out
the new map and send sinval" just above where you modify the code to
only conditionally send sinval. I'm somewhat uncomfortable with the
shape of this logic, too. It looks weird to be sometimes calling
write_relmap_file and sometimes write_relmap_file_internal. You'd
expect functions with those names to be called at different
abstraction levels, rather than at parallel call sites. The renaming I
proposed would help with this but it's not just a cosmetic issue: the
logic to construct mapfilename is in this function in one case, but in
the called function in the other case. I can't help but think that the
write_relmap_file()/write_relmap_file_internal() split isn't entirely
the right thing.

I think part of the confusion here is that, pre-patch,
write_relmap_file() gets called during both recovery and normal
running, and the updates to shared_map or local_map are actually
nonsense during recovery, because the local map at least is local to
whatever our database is, and we don't have a database connection if
we're the startup process. After your patch, we're still going through
write_relmap_file in recovery in some cases, but really those map
updates don't seem like things that should be happening at all. And on
the other hand it's not clear to me why the CRC stuff isn't needed in
all cases, but that's only going to happen when we go through the
non-internal version of the function. You've probably spent more time
looking at this code than I have, but I'm wondering if the division
should be like this: we have one function that does the actual update,
and another function that does the update plus sets global variables.
Recovery always uses the first one, and outside of recovery we use the
first one for the create-database case and the second one otherwise.
Thoughts?
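
A minimal sketch of the division proposed above, not code from the patch set. All names here (ToyMap, toy_disk, toy_write_map_file, toy_write_map_and_update) are hypothetical, and the "files" are simulated with an in-memory array so the layering is visible without any backend infrastructure such as WAL, CRCs, or fsync.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy stand-in for RelMapFile; the real struct carries more fields. */
typedef struct ToyMap
{
	int			num_mappings;
	unsigned	filenodes[4];
} ToyMap;

/* Stand-ins for the on-disk map files, indexed by a toy database slot. */
#define TOY_NUM_DBS 3
static ToyMap toy_disk[TOY_NUM_DBS];

/* Stand-in for the backend's in-memory local_map. */
static ToyMap toy_local_map;

/*
 * Layer 1: perform only the durable update.  It touches no in-memory
 * state, so both recovery and CREATE DATABASE could call it.
 */
static void
toy_write_map_file(int dbslot, const ToyMap *newmap)
{
	assert(dbslot >= 0 && dbslot < TOY_NUM_DBS);
	memcpy(&toy_disk[dbslot], newmap, sizeof(ToyMap));
}

/*
 * Layer 2: durable update plus refresh of the in-memory copy.  Only
 * meaningful for the database this backend is connected to.
 */
static void
toy_write_map_and_update(int my_dbslot, const ToyMap *newmap)
{
	toy_write_map_file(my_dbslot, newmap);
	memcpy(&toy_local_map, newmap, sizeof(ToyMap));
}
```

With this shape, the two functions sit at different abstraction levels: the second is a thin wrapper around the first, rather than the two being alternatives at parallel call sites.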

Regarding 0003, my initial thought was to like the fact that you'd
avoided duplicating code by using a function parameter, but as I look
at it a bit more, it's not clear to me that it's enough code that we
really care about not duplicating it. I would not expect to find a
function called RelationCopyAllFork() in tablecmds.c. I'd expect to
find it in storage.c, I think. And I think I'd be surprised to find
out that it doesn't actually know anything about copying; it's
basically just a loop over the forks to which you can supply your own
copy-function. And the fact that it's got an argument of type
copy_relation_storage and the argument name is copy_storage and the
value is sometimes RelationCopyStorage is a terminological muddle, too.
So I feel like perhaps this needs more thought.
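
For readers without the patch at hand, the shape being described is roughly the following. This is a toy sketch under stated assumptions, not the patch's RelationCopyAllFork(): the types ToyFork and ToyRel, the callback type toy_copy_fork_fn, and both functions are hypothetical, and a "fork" is reduced to a block count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fork numbering, loosely modelled on ForkNumber. */
typedef enum ToyFork
{
	TOY_MAIN_FORK = 0,
	TOY_FSM_FORK,
	TOY_VM_FORK,
	TOY_INIT_FORK,
	TOY_NUM_FORKS
} ToyFork;

/* Toy relation storage: which forks exist and how many blocks each has. */
typedef struct ToyRel
{
	bool		exists[TOY_NUM_FORKS];
	int			nblocks[TOY_NUM_FORKS];
} ToyRel;

/* Caller-supplied routine that copies a single fork. */
typedef void (*toy_copy_fork_fn) (const ToyRel *src, ToyRel *dst, ToyFork fork);

/* One possible per-fork copy routine. */
static void
toy_copy_fork_plain(const ToyRel *src, ToyRel *dst, ToyFork fork)
{
	dst->nblocks[fork] = src->nblocks[fork];
}

/*
 * The loop itself knows nothing about how a fork is copied; it only
 * visits each existing fork and delegates to the callback.
 */
static void
toy_copy_all_forks(const ToyRel *src, ToyRel *dst, toy_copy_fork_fn copy_fork)
{
	assert(copy_fork != NULL);

	for (int f = 0; f < TOY_NUM_FORKS; f++)
	{
		if (!src->exists[f])
			continue;
		dst->exists[f] = true;
		copy_fork(src, dst, (ToyFork) f);
	}
}
```

The loop is only a handful of lines, which is the basis for asking whether sharing it through a function-pointer parameter is worth the indirection.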

--
Robert Haas
EDB: http://www.enterprisedb.com

#136

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Robert Haas (#135)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Mar 9, 2022 at 3:12 AM Robert Haas <robertmhaas@gmail.com> wrote:

Thanks for the review and the valuable feedback.

Reviewing 0001, the boundaries of the critical section move slightly,
but only over a memcpy, which can't fail, so that seems fine. But this
comment looks ominous:

* Note: we're cheating a little bit here by assuming that mapped files
* are either in pg_global or the database's default tablespace.

It's not clear to me how the code that follows relies on this
assumption, but the overall patch set would make that not true any
more, so there's some kind of an issue to think about there.

I think the comment is about choosing the file path: here we assume
the map file is in either the global tablespace or the database's
default tablespace. The comment remains partially true, because we
still assume the file is in the database's default tablespace (we do
not need to worry about shared relations here). But I think this
comment can move to the caller, where we construct the file path.

if (shared)
{
    snprintf(mapfilename, sizeof(mapfilename), "global/%s",
             RELMAPPER_FILENAME);
}
else
{
    snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
             dbpath, RELMAPPER_FILENAME);
}

It's a little asymmetric that load_relmap_file() gets a subroutine
read_relmap_file() while write_relmap_file() gets a subroutine
write_relmap_file_internal(). Perhaps we could call the functions
{load,write}_named_relmap_file() or something of that sort.

Yeah this should be changed.

Reviewing 0002, your comment updates in relmap_redo() are not
complete. Note that there's an unmodified comment that says "Write out
the new map and send sinval" just above where you modify the code to
only conditionally send sinval. I'm somewhat uncomfortable with the
shape of this logic, too. It looks weird to be sometimes calling
write_relmap_file and sometimes write_relmap_file_internal. You'd
expect functions with those names to be called at different
abstraction levels, rather than at parallel call sites. The renaming I
proposed would help with this but it's not just a cosmetic issue: the
logic to construct mapfilename is in this function in one case, but in
the called function in the other case. I can't help but think that the
write_relmap_file()/write_relmap_file_internal() split isn't entirely
the right thing.

I think part of the confusion here is that, pre-patch,
write_relmap_file() gets called during both recovery and normal
running, and the updates to shared_map or local_map are actually
nonsense during recovery, because the local map at least is local to
whatever our database is, and we don't have a database connection if
we're the startup process.

Yeah, you are correct about the local map, but I am not sure whether we
can rely on not updating the shared map in the startup process. How can
we guarantee that the startup process will never look into the map, now
or in the future? I agree that it is not connected to a database, so
looking into the local map makes no sense, but how do we ensure the same
for the shared map? That said, I think only three functions look at
these maps directly: RelationMapOidToFilenode(),
RelationMapFilenodeToOid() and RelationMapUpdateMap(). They are called
from very few places, and I don't think any of them should be called
during recovery. So perhaps we can add an elog() saying they must never
be called during recovery?

After your patch, we're still going through
write_relmap_file in recovery in some cases, but really those map
updates don't seem like things that should be happening at all. And on
the other hand it's not clear to me why the CRC stuff isn't needed in
all cases, but that's only going to happen when we go through the
non-internal version of the function. You've probably spent more time
looking at this code than I have, but I'm wondering if the division
should be like this: we have one function that does the actual update,
and another function that does the update plus sets global variables.
Recovery always uses the first one, and outside of recovery we use the
first one for the create-database case and the second one otherwise.
Thoughts?

Right. In fact, if you look at the logic now,
write_relmap_file_internal() already takes care of the actual update
and write_relmap_file() does the update plus setting the global
variables. So yes, we can rename as you suggested for 0001, and here
we can also change the logic as you suggested, so that recovery and
createdb call only the first function, which just does the update.

Regarding 0003, my initial thought was to like the fact that you'd
avoided duplicating code by using a function parameter, but as I look
at it a bit more, it's not clear to me that it's enough code that we
really care about not duplicating it. I would not expect to find a
function called RelationCopyAllFork() in tablecmds.c.

Okay. Actually, I see this fork-copying logic in a few different
places, such as index_copy_data() in tablecmds.c and
heapam_relation_copy_data() in heapam_handler.c. I was not sure what
the right place for this function would be, so I kept it in the same
file (tablecmds.c), because I split it out of a function in that file.

I'd expect to
find it in storage.c, I think. And I think I'd be surprised to find
out that it doesn't actually know anything about copying; it's
basically just a loop over the forks to which you can supply your own
copy-function.

Yeah, but it ultimately expects a function pointer that copies storage,
so we cannot really claim that it knows nothing about copying?

And the fact that it's got an argument of type
copy_relation_storage and the argument name is copy_storage and the
value is sometimes RelationCopyStorage is a terminological muddle, too.
So I feel like perhaps this needs more thought.

One option is to duplicate this loop in dbcommands.c as well, as we
already do in tablecmds.c and heapam_handler.c?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#137

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#136)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Mar 9, 2022 at 6:07 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Yeah, you are correct about the local map, but I am not sure whether we
can rely on not updating the shared map in the startup process. How can
we guarantee that the startup process will never look into the map, now
or in the future? I agree that it is not connected to a database, so
looking into the local map makes no sense, but how do we ensure the same
for the shared map? That said, I think only three functions look at
these maps directly: RelationMapOidToFilenode(),
RelationMapFilenodeToOid() and RelationMapUpdateMap(). They are called
from very few places, and I don't think any of them should be called
during recovery. So perhaps we can add an elog() saying they must never
be called during recovery?

Yeah, that seems reasonable.
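
As an illustration only, such a guard could look like the toy below. Everything in it is hypothetical: toy_in_recovery stands in for whatever recovery check the backend would use, the status enum replaces elog(), and the map contents are made-up values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the backend's "are we in recovery" check. */
static bool toy_in_recovery = false;

typedef enum ToyStatus
{
	TOY_OK,
	TOY_ERR_IN_RECOVERY,
	TOY_ERR_NOT_FOUND
} ToyStatus;

typedef struct ToyMapping
{
	unsigned	mapoid;
	unsigned	mapfilenode;
} ToyMapping;

/* Made-up in-memory map contents. */
static const ToyMapping toy_map[] = {
	{100, 5000},
	{101, 5001}
};

/*
 * Look up a filenode by OID, but refuse to consult the in-memory map
 * while in recovery.  A real backend would raise an error via elog()
 * instead of returning a status code.
 */
static ToyStatus
toy_map_oid_to_filenode(unsigned relid, unsigned *filenode)
{
	assert(filenode != NULL);

	if (toy_in_recovery)
		return TOY_ERR_IN_RECOVERY;

	for (size_t i = 0; i < sizeof(toy_map) / sizeof(toy_map[0]); i++)
	{
		if (toy_map[i].mapoid == relid)
		{
			*filenode = toy_map[i].mapfilenode;
			return TOY_OK;
		}
	}
	return TOY_ERR_NOT_FOUND;
}
```

Placing the check at the top of each of the three entry points would turn an accidental recovery-time caller into an immediate, visible failure rather than a read of stale map data.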

Right. In fact, if you look at the logic now,
write_relmap_file_internal() already takes care of the actual update
and write_relmap_file() does the update plus setting the global
variables. So yes, we can rename as you suggested for 0001, and here
we can also change the logic as you suggested, so that recovery and
createdb call only the first function, which just does the update.

But I think we want the path construction to be managed by the
function rather than the caller, too.
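
One way to picture this is a single helper that owns the path logic, so no caller assembles the file name itself. The helper name toy_relmap_path() is hypothetical and the example path "base/16384" is made up; the two path formats are the ones from the snippet quoted earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define RELMAPPER_FILENAME "pg_filenode.map"

/*
 * Hypothetical helper: build the relmap file path in one place.
 * 'dbpath' is ignored for the shared map, which lives under global/.
 */
static void
toy_relmap_path(char *buf, size_t buflen, bool shared, const char *dbpath)
{
	if (shared)
		snprintf(buf, buflen, "global/%s", RELMAPPER_FILENAME);
	else
	{
		assert(dbpath != NULL);
		snprintf(buf, buflen, "%s/%s", dbpath, RELMAPPER_FILENAME);
	}
}
```

A caller would then pass only the shared flag and, for a per-database map, the database path, for example toy_relmap_path(buf, sizeof(buf), false, "base/16384").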

I'd expect to
find it in storage.c, I think. And I think I'd be surprised to find
out that it doesn't actually know anything about copying; it's
basically just a loop over the forks to which you can supply your own
copy-function.

Yeah, but it ultimately expects a function pointer that copies storage,
so we cannot really claim that it knows nothing about copying?

Sure, I guess. It's just not obvious why the argument would actually
need to be a function that copies storage, or why there's more than
one way to copy storage. I'd rather keep all the code paths unified,
if we can, and set behavior via flags or something, maybe. I'm not
sure whether that's realistic, though.

And the fact that it's got an argument of type
copy_relation_storage and the argument name is copy_storage and the
value is sometimes RelationCopyStorage is a terminological muddle, too.
So I feel like perhaps this needs more thought.

One option is to duplicate this loop in dbcommands.c as well, as we
already do in tablecmds.c and heapam_handler.c?

Yeah, I think this is also worth considering.

--
Robert Haas
EDB: http://www.enterprisedb.com

#138

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Robert Haas (#137)

4 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Mar 9, 2022 at 6:44 PM Robert Haas <robertmhaas@gmail.com> wrote:

Right. In fact, if you look at the logic now,
write_relmap_file_internal() already takes care of the actual update
and write_relmap_file() does the update plus setting the global
variables. So yes, we can rename as you suggested for 0001, and here
we can also change the logic as you suggested, so that recovery and
createdb call only the first function, which just does the update.

But I think we want the path construction to be managed by the
function rather than the caller, too.

I have completely changed the logic for this refactoring. Basically,
write_relmap_file() already has parameters that control whether to
write WAL and send invalidations, and we already pass it the dbpath.
Instead of making a new function, I just pass one additional parameter
to this function saying whether we are creating a new map. With that,
the changes are much smaller, and this looks cleaner to me. Similarly,
for load_relmap_file() I just had to pass the dbpath and memory for
the destination map. Please have a look and let me know your thoughts.

Sure, I guess. It's just not obvious why the argument would actually
need to be a function that copies storage, or why there's more than
one way to copy storage. I'd rather keep all the code paths unified,
if we can, and set behavior via flags or something, maybe. I'm not
sure whether that's realistic, though.

I tried considering that, but I don't think a flag-based approach looks
good. One of the main problems I noticed is that for copying we would
need to call either RelationCopyStorage() or
RelationCopyStorageUsingBuffer() based on the input flag. But if we
move the main copy function to storage.c, then storage.c will have a
dependency on bufmgr functions, because I don't think we can keep
RelationCopyStorageUsingBuffer() inside storage.c. So for now, I have
duplicated the loop that already exists in index_copy_data() and
heapam_relation_copy_data(), kept the copy in bufmgr.c, and also moved
RelationCopyStorageUsingBuffer() into bufmgr.c. bufmgr.c already has
functions that deal with smgr, so I feel this is the right place for
the function.

Other changes:
1. 0001 and 0002 are merged: since we are no longer really refactoring
these functions and are just passing additional arguments to them, it
makes sense to combine the changes.
2. Likewise for 0003: since we are no longer refactoring existing
functions but providing new interfaces, I merged it with 0004 (which
was 0006 previously).

I think we should also write the test cases for create database
strategy. But I do not see any test case for create database for
testing the existing options. So I am wondering whether we should add
the test case only for the new option we are providing or we should
create a separate path which tests the new option as well as the
existing options.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v11-0001-Extend-relmap-interfaces.patch (text/x-patch; charset=US-ASCII)

From a479f7057649e2c6ef332ff313e9291089e193e0 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Thu, 10 Mar 2022 10:18:18 +0530
Subject: [PATCH v11 1/4] Extend relmap interfaces

Support new interfaces in relmapper, 1) Support copying the
relmap file from one database path to the other database path.
2) And another interface for getting filenode from oid.  We already
have RelationMapOidToFilenode for the same purpose but that assumes
we are connected to the database for which we want to get the mapping.
So this new interface will do the same but instead, it will get the
mapping for the input database.

These interfaces are required by the next patch, which adds support
for WAL-logged CREATE DATABASE.
---
 src/backend/utils/cache/relmapper.c | 159 ++++++++++++++++++++++++++++--------
 src/include/utils/relmapper.h       |   7 +-
 2 files changed, 132 insertions(+), 34 deletions(-)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 4f6811f..6501110 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -136,10 +136,12 @@ static void apply_map_update(RelMapFile *map, Oid relationId, Oid fileNode,
 							 bool add_okay);
 static void merge_map_updates(RelMapFile *map, const RelMapFile *updates,
 							  bool add_okay);
-static void load_relmap_file(bool shared, bool lock_held);
+static void load_relmap_file(bool shared, bool lock_held, RelMapFile *dstmap,
+							 const char *dbpath);
 static void write_relmap_file(bool shared, RelMapFile *newmap,
 							  bool write_wal, bool send_sinval, bool preserve_files,
-							  Oid dbid, Oid tsid, const char *dbpath);
+							  Oid dbid, Oid tsid, const char *dbpath,
+							  bool update_relmap);
 static void perform_relmap_update(bool shared, const RelMapFile *updates);
 
 
@@ -250,6 +252,32 @@ RelationMapFilenodeToOid(Oid filenode, bool shared)
 }
 
 /*
+ * RelationMapOidToFilenodeForDatabase
+ *
+ * Same as RelationMapOidToFilenode, but instead of reading the mapping from
+ * the database we are connected to it will read the mapping from the input
+ * database.
+ */
+Oid
+RelationMapOidToFilenodeForDatabase(const char *dbpath, Oid relationId)
+{
+	RelMapFile	map;
+	int			i;
+
+	/* Read the relmap file from the source database. */
+	load_relmap_file(false, false, &map, dbpath);
+
+	/* Iterate over the relmap entries to find the input relation oid. */
+	for (i = 0; i < map.num_mappings; i++)
+	{
+		if (relationId == map.mappings[i].mapoid)
+			return map.mappings[i].mapfilenode;
+	}
+
+	return InvalidOid;
+}
+
+/*
  * RelationMapUpdateMap
  *
  * Install a new relfilenode mapping for the specified relation.
@@ -405,12 +433,12 @@ RelationMapInvalidate(bool shared)
 	if (shared)
 	{
 		if (shared_map.magic == RELMAPPER_FILEMAGIC)
-			load_relmap_file(true, false);
+			load_relmap_file(true, false, NULL, NULL);
 	}
 	else
 	{
 		if (local_map.magic == RELMAPPER_FILEMAGIC)
-			load_relmap_file(false, false);
+			load_relmap_file(false, false, NULL, NULL);
 	}
 }
 
@@ -425,9 +453,9 @@ void
 RelationMapInvalidateAll(void)
 {
 	if (shared_map.magic == RELMAPPER_FILEMAGIC)
-		load_relmap_file(true, false);
+		load_relmap_file(true, false, NULL, NULL);
 	if (local_map.magic == RELMAPPER_FILEMAGIC)
-		load_relmap_file(false, false);
+		load_relmap_file(false, false, NULL, NULL);
 }
 
 /*
@@ -569,9 +597,9 @@ RelationMapFinishBootstrap(void)
 
 	/* Write the files; no WAL or sinval needed */
 	write_relmap_file(true, &shared_map, false, false, false,
-					  InvalidOid, GLOBALTABLESPACE_OID, NULL);
+					  InvalidOid, GLOBALTABLESPACE_OID, NULL, false);
 	write_relmap_file(false, &local_map, false, false, false,
-					  MyDatabaseId, MyDatabaseTableSpace, DatabasePath);
+					  MyDatabaseId, MyDatabaseTableSpace, DatabasePath, false);
 }
 
 /*
@@ -612,7 +640,7 @@ RelationMapInitializePhase2(void)
 	/*
 	 * Load the shared map file, die on error.
 	 */
-	load_relmap_file(true, false);
+	load_relmap_file(true, false, NULL, NULL);
 }
 
 /*
@@ -633,7 +661,7 @@ RelationMapInitializePhase3(void)
 	/*
 	 * Load the local map file, die on error.
 	 */
-	load_relmap_file(false, false);
+	load_relmap_file(false, false, NULL, NULL);
 }
 
 /*
@@ -687,15 +715,46 @@ RestoreRelationMap(char *startAddress)
 }
 
 /*
+ * CopyRelationMap
+ *
+ * Copy relmapfile from source db path to the destination db path and WAL log
+ * the operation.  This function is only called during the create database, so
+ * the destination database is not yet visible to anyone else, thus we don't
+ * need to acquire the relmap lock while updating the destination relmap.
+ */
+void
+CopyRelationMap(Oid dbid, Oid tsid, const char *srcdbpath,
+				const char *dstdbpath)
+{
+	RelMapFile map;
+
+	/* Read the relmap file from the source database. */
+	load_relmap_file(false, false, &map, srcdbpath);
+
+	/*
+	 * Write map contents into the destination database's relmap file; no
+	 * sinval needed because there could be no one else connected to the
+	 * database we are creating now.
+	 */
+	write_relmap_file(false, &map, true, false, true, dbid, tsid, dstdbpath,
+					  true);
+}
+
+/*
  * load_relmap_file -- load data from the shared or local map file
  *
  * Because the map file is essential for access to core system catalogs,
  * failure to read it is a fatal error.
  *
- * Note that the local case requires DatabasePath to be set up.
+ * Note that the local case requires DatabasePath to be set up.  But during
+ * createdb we are not connected to the source database so we will have to pass
+ * the dbpath of the source database from which we want to read the relmap
+ * file.  And, we will have to pass a valid memory for the 'dstmap' into which
+ * we want to read the relmap.
  */
 static void
-load_relmap_file(bool shared, bool lock_held)
+load_relmap_file(bool shared, bool lock_held, RelMapFile *dstmap,
+				 const char *dbpath)
 {
 	RelMapFile *map;
 	char		mapfilename[MAXPGPATH];
@@ -703,7 +762,20 @@ load_relmap_file(bool shared, bool lock_held)
 	int			fd;
 	int			r;
 
-	if (shared)
+	/*
+	 * Prepare relmap file path.  If a valid dbpath is given then read the file
+	 * from that path.
+	 */
+	if (dbpath != NULL)
+	{
+		/* We must pass a valid dstmap for reading the mapfile contents. */
+		Assert(dstmap != NULL);
+
+		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+				 dbpath, RELMAPPER_FILENAME);
+		map = dstmap;
+	}
+	else if (shared)
 	{
 		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
 				 RELMAPPER_FILENAME);
@@ -796,12 +868,15 @@ load_relmap_file(bool shared, bool lock_held)
  * Because this may be called during WAL replay when MyDatabaseId,
  * DatabasePath, etc aren't valid, we require the caller to pass in suitable
  * values.  The caller is also responsible for being sure no concurrent
- * map update could be happening.
+ * map update could be happening.  This will also be called during create
+ * database and that time we are not connected to the database for which we
+ * have to write the relmap.  So we have to pass the valid dbpath for which we
+ * want to write the relmap file and also pass create as true.
  */
 static void
 write_relmap_file(bool shared, RelMapFile *newmap,
 				  bool write_wal, bool send_sinval, bool preserve_files,
-				  Oid dbid, Oid tsid, const char *dbpath)
+				  Oid dbid, Oid tsid, const char *dbpath, bool create)
 {
 	int			fd;
 	RelMapFile *realmap;
@@ -819,10 +894,18 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 	FIN_CRC32C(newmap->crc);
 
 	/*
-	 * Open the target file.  We prefer to do this before entering the
-	 * critical section, so that an open() failure need not force PANIC.
+	 * Prepare the target mapfilename, and also set which relmap we want to
+	 * update.  But if the create is passed true then we don't need to update
+	 * the memory relmap because we are not connected to database for which
+	 * we are writing the relmap file.
 	 */
-	if (shared)
+	if (create)
+	{
+		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+				 dbpath, RELMAPPER_FILENAME);
+		realmap = NULL;
+	}
+	else if (shared)
 	{
 		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
 				 RELMAPPER_FILENAME);
@@ -853,6 +936,7 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xlrec.dbid = dbid;
 		xlrec.tsid = tsid;
 		xlrec.nbytes = sizeof(RelMapFile);
+		xlrec.create = create;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&xlrec), MinSizeOfRelmapUpdate);
@@ -935,14 +1019,17 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 	}
 
 	/*
-	 * Success, update permanent copy.  During bootstrap, we might be working
-	 * on the permanent copy itself, in which case skip the memcpy() to avoid
-	 * invoking nominally-undefined behavior.
+	 * Success, update permanent copy.  During bootstrap and the create
+	 * database, skip the memcpy().  Because during bootstrap, we might be
+	 * working on the permanent copy itself, whereas during create database
+	 * we are not connected to the database for which we are creating the
+	 * relmap file so it will be wrong to update the shared map of the current
+	 * database to which we are connected.
 	 */
-	if (realmap != newmap)
+	if (realmap != NULL && realmap != newmap)
 		memcpy(realmap, newmap, sizeof(RelMapFile));
 	else
-		Assert(!send_sinval);	/* must be bootstrapping */
+		Assert(!send_sinval);	/* must be bootstrapping or createdb */
 
 	/* Critical section done */
 	if (write_wal)
@@ -975,7 +1062,7 @@ perform_relmap_update(bool shared, const RelMapFile *updates)
 	LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
 
 	/* Be certain we see any other updates just made */
-	load_relmap_file(shared, true);
+	load_relmap_file(shared, true, NULL, NULL);
 
 	/* Prepare updated data in a local variable */
 	if (shared)
@@ -993,7 +1080,7 @@ perform_relmap_update(bool shared, const RelMapFile *updates)
 	write_relmap_file(shared, &newmap, true, true, true,
 					  (shared ? InvalidOid : MyDatabaseId),
 					  (shared ? GLOBALTABLESPACE_OID : MyDatabaseTableSpace),
-					  DatabasePath);
+					  DatabasePath, false);
 
 	/* Now we can release the lock */
 	LWLockRelease(RelationMappingLock);
@@ -1025,18 +1112,24 @@ relmap_redo(XLogReaderState *record)
 		dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
 
 		/*
-		 * Write out the new map and send sinval, but of course don't write a
-		 * new WAL entry.  There's no surrounding transaction to tell to
-		 * preserve files, either.
+		 * Write out the new map and send sinval if create is not set because
+		 * in case of create there should be no one else accessing the relmap.
+		 * But of course don't write a new WAL entry.  There's no surrounding
+		 * transaction to tell to preserve files, either.
 		 *
 		 * There shouldn't be anyone else updating relmaps during WAL replay,
-		 * but grab the lock to interlock against load_relmap_file().
+		 * but grab the lock to interlock against load_relmap_file().  But if
+		 * create is set then we don't need to lock because we are creating a
+		 * new database so there can be absolutely no one else looking at its
+		 * relmap file.
 		 */
-		LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
+		if (!xlrec->create)
+			LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
 		write_relmap_file((xlrec->dbid == InvalidOid), &newmap,
-						  false, true, false,
-						  xlrec->dbid, xlrec->tsid, dbpath);
-		LWLockRelease(RelationMappingLock);
+						  false, !xlrec->create, false,
+						  xlrec->dbid, xlrec->tsid, dbpath, xlrec->create);
+		if (!xlrec->create)
+			LWLockRelease(RelationMappingLock);
 
 		pfree(dbpath);
 	}
diff --git a/src/include/utils/relmapper.h b/src/include/utils/relmapper.h
index 9fbb5a7..b8c7ef0 100644
--- a/src/include/utils/relmapper.h
+++ b/src/include/utils/relmapper.h
@@ -29,6 +29,7 @@ typedef struct xl_relmap_update
 	Oid			dbid;			/* database ID, or 0 for shared map */
 	Oid			tsid;			/* database's tablespace, or pg_global */
 	int32		nbytes;			/* size of relmap data */
+	bool		create;			/* true if creating new relmap */
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } xl_relmap_update;
 
@@ -39,6 +40,9 @@ extern Oid	RelationMapOidToFilenode(Oid relationId, bool shared);
 
 extern Oid	RelationMapFilenodeToOid(Oid relationId, bool shared);
 
+extern Oid RelationMapOidToFilenodeForDatabase(const char *dbpath,
+											   Oid relationId);
+
 extern void RelationMapUpdateMap(Oid relationId, Oid fileNode, bool shared,
 								 bool immediate);
 
@@ -62,7 +66,8 @@ extern void RelationMapInitializePhase3(void);
 extern Size EstimateRelationMapSpace(void);
 extern void SerializeRelationMap(Size maxSize, char *startAddress);
 extern void RestoreRelationMap(char *startAddress);
-
+extern void CopyRelationMap(Oid dbid, Oid tsid, const char *srcdbpath,
+							const char *dstdbpath);
 extern void relmap_redo(XLogReaderState *record);
 extern void relmap_desc(StringInfo buf, XLogReaderState *record);
 extern const char *relmap_identify(uint8 info);
-- 
1.8.3.1

v11-0002-Extend-bufmgr-interfaces.patch (text/x-patch; charset=US-ASCII)

From f8dda8ea34673ab12872c7d3bcc7e95610d86f1a Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Thu, 10 Feb 2022 15:55:33 +0530
Subject: [PATCH v11 2/4] Extend bufmgr interfaces

Extend ReadBufferWithoutRelcache interface to take relpersistence
as input. At present, this function may only be used on permanent
relations, because we only use it during XLOG replay.  But now as
part of the bigger patch set, we will be using this for reading the
buffer from the database which we are not connected so now we might
have temporary and unlogged relations as well.
---
 src/backend/access/transam/xlogutils.c |  9 ++++++---
 src/backend/storage/buffer/bufmgr.c    | 11 ++---------
 src/include/storage/bufmgr.h           |  3 ++-
 3 files changed, 10 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 54d5f20..c292794 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -484,7 +484,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	{
 		/* page exists in file */
 		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-										   mode, NULL);
+										   mode, NULL,
+										   RELPERSISTENCE_PERMANENT);
 	}
 	else
 	{
@@ -509,7 +510,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				ReleaseBuffer(buffer);
 			}
 			buffer = ReadBufferWithoutRelcache(rnode, forknum,
-											   P_NEW, mode, NULL);
+											   P_NEW, mode, NULL,
+											   RELPERSISTENCE_PERMANENT);
 		}
 		while (BufferGetBlockNumber(buffer) < blkno);
 		/* Handle the corner case that P_NEW returns non-consecutive pages */
@@ -519,7 +521,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			ReleaseBuffer(buffer);
 			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-											   mode, NULL);
+											   mode, NULL,
+											   RELPERSISTENCE_PERMANENT);
 		}
 	}
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c6..0ed2d31 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -771,24 +771,17 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 /*
  * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
  *		a relcache entry for the relation.
- *
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay.  If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
  */
 Buffer
 ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
-						  BufferAccessStrategy strategy)
+						  BufferAccessStrategy strategy, char relpersistence)
 {
 	bool		hit;
 
 	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
 
-	Assert(InRecovery);
-
-	return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
+	return ReadBuffer_common(smgr, relpersistence, forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index dd01841..7b80f58 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -184,7 +184,8 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BufferAccessStrategy strategy);
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 										ForkNumber forkNum, BlockNumber blockNum,
-										ReadBufferMode mode, BufferAccessStrategy strategy);
+										ReadBufferMode mode, BufferAccessStrategy strategy,
+										char relpersistence);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
-- 
1.8.3.1
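
To make the interface change in 0002 concrete, here is a minimal standalone
sketch of the before and after call shapes. It does not use the real bufmgr
code: the mock_* functions are hypothetical stand-ins for
ReadBufferWithoutRelcache() and ReadBuffer_common(), and only the
RELPERSISTENCE_* codes are taken from PostgreSQL.

```c
#include <assert.h>

/* Persistence codes, with the same values PostgreSQL uses in pg_class. */
#define RELPERSISTENCE_PERMANENT	'p'
#define RELPERSISTENCE_UNLOGGED		'u'
#define RELPERSISTENCE_TEMP			't'

/*
 * Hypothetical stand-in for ReadBuffer_common(): it only reports the
 * persistence it was handed, so the two call shapes can be compared.
 */
static char
mock_read_buffer_common(char relpersistence)
{
	return relpersistence;
}

/* Old shape: persistence is hard-coded, so only permanent relations work. */
static char
mock_read_without_relcache_old(void)
{
	return mock_read_buffer_common(RELPERSISTENCE_PERMANENT);
}

/* New shape: the caller states the persistence of the relation it reads. */
static char
mock_read_without_relcache_new(char relpersistence)
{
	assert(relpersistence == RELPERSISTENCE_PERMANENT ||
		   relpersistence == RELPERSISTENCE_UNLOGGED ||
		   relpersistence == RELPERSISTENCE_TEMP);

	return mock_read_buffer_common(relpersistence);
}
```

In this sketch the redo path corresponds to passing RELPERSISTENCE_PERMANENT
to the new shape, which behaves exactly like the old one, while a reader that
is not connected to the database would pass the persistence it found in
pg_class.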

v11-0004-WAL-logged-CREATE-DATABASE.patch (text/x-patch; charset=US-ASCII)

From ceb78697560710327da2c65988e6cde3850235c3 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 14 Feb 2022 17:48:03 +0530
Subject: [PATCH v11 4/4] WAL logged CREATE DATABASE

Currently, CREATE DATABASE forces a checkpoint, then copies all the files,
then forces another checkpoint. The comments in the createdb() function
explain the reasons for this. The attached patch fixes this problem by making
create database completely WAL logged so that we can avoid the checkpoints.

We also keep the old way of creating the database, and provide an option
to choose the strategy used for the copy.  For the new method the user
needs to specify STRATEGY=WAL_LOG, and for the old method
STRATEGY=FILE_COPY.  The default strategy is WAL_LOG.
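
As a rough illustration of how the strategy name maps to the enum, here is a
standalone sketch that mirrors the validation done in createdb().  The helper
parse_createdb_strategy() is hypothetical and not part of the patch; the real
code reads the value with defGetString() and reports unknown names through
ereport(ERROR).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Same enum as the one added to dbcommands.c by the patch. */
typedef enum CreateDBStrategy
{
	CREATEDB_WAL_LOG = 0,
	CREATEDB_FILE_COPY = 1
} CreateDBStrategy;

/*
 * Hypothetical helper: returns true and sets *result for a recognized
 * strategy name, false otherwise (the point where createdb() would raise
 * ERRCODE_INVALID_PARAMETER_VALUE).
 */
static bool
parse_createdb_strategy(const char *name, CreateDBStrategy *result)
{
	assert(name != NULL && result != NULL);

	if (strcmp(name, "wal_log") == 0)
		*result = CREATEDB_WAL_LOG;
	else if (strcmp(name, "file_copy") == 0)
		*result = CREATEDB_FILE_COPY;
	else
		return false;

	return true;
}
```

Note that the comparison is case-sensitive, as in the patch, so this sketch
rejects "WAL_LOG" when it is passed in upper case.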
---
 doc/src/sgml/ref/create_database.sgml  |  22 +
 src/backend/commands/dbcommands.c      | 775 +++++++++++++++++++++++++++------
 src/backend/storage/buffer/bufmgr.c    | 132 ++++++
 src/include/commands/dbcommands_xlog.h |   8 +
 src/include/storage/bufmgr.h           |   3 +
 src/tools/pgindent/typedefs.list       |   1 +
 6 files changed, 809 insertions(+), 132 deletions(-)

diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index f70d0c7..a906cc8 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -34,6 +34,7 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
            [ CONNECTION LIMIT [=] <replaceable class="parameter">connlimit</replaceable> ]
            [ IS_TEMPLATE [=] <replaceable class="parameter">istemplate</replaceable> ]
            [ OID [=] <replaceable class="parameter">oid</replaceable> ] ]
+           [ STRATEGY [=] <replaceable class="parameter">strategy</replaceable> ] ]
 </synopsis>
  </refsynopsisdiv>
 
@@ -240,6 +241,27 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
        </listitem>
       </varlistentry>
 
+      <varlistentry>
+       <term><replaceable class="parameter">strategy</replaceable></term>
+       <listitem>
+        <para>
+         The strategy used to copy the template database.  Two strategies are
+         available: <literal>WAL_LOG</literal> and
+         <literal>FILE_COPY</literal>.  With the <literal>WAL_LOG</literal>
+         strategy, the database is copied block by block and each copied
+         block is WAL-logged.  With the <literal>FILE_COPY</literal> strategy,
+         the copy is done at the file system level, so individual blocks are
+         not WAL-logged.  The <literal>FILE_COPY</literal> strategy must
+         issue a checkpoint before and after performing the copy; if shared
+         buffers are large and contain many dirty buffers, these checkpoints
+         can be costly and may affect the performance of the whole system.
+         On the other hand, because <literal>WAL_LOG</literal> writes every
+         block to WAL, creating a database from a large source database may
+         take more time.
+        </para>
+       </listitem>
+      </varlistentry>
+
     </variablelist>
 
   <para>
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index c37e3c9..4d34e72 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -46,6 +46,7 @@
 #include "commands/dbcommands_xlog.h"
 #include "commands/defrem.h"
 #include "commands/seclabel.h"
+#include "commands/tablecmds.h"
 #include "commands/tablespace.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
@@ -63,13 +64,27 @@
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/pg_locale.h"
+#include "utils/relmapper.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
+/*
+ * Create database strategy.  CREATEDB_WAL_LOG copies the database at the
+ * block level and WAL-logs each copied block, whereas CREATEDB_FILE_COPY
+ * copies the database directly at the file level, without WAL-logging
+ * individual operations.
+ */
+typedef enum CreateDBStrategy
+{
+	CREATEDB_WAL_LOG = 0,
+	CREATEDB_FILE_COPY = 1
+} CreateDBStrategy;
+
 typedef struct
 {
 	Oid			src_dboid;		/* source (template) DB */
 	Oid			dest_dboid;		/* DB we are trying to create */
+	CreateDBStrategy	strategy;	/* create db strategy */
 } createdb_failure_params;
 
 typedef struct
@@ -78,6 +93,19 @@ typedef struct
 	Oid			dest_tsoid;		/* tablespace we are trying to move to */
 } movedb_failure_params;
 
+/*
+ * When creating a database, we scan the pg_class of the source database to
+ * identify all the relations to be copied.  The structure is used for storing
+ * information about each relation of the source database.
+ */
+typedef struct CreateDBRelInfo
+{
+	RelFileNode		rnode;				/* physical relation identifier */
+	Oid				reloid;				/* relation oid */
+	char			relpersistence;		/* relation's persistence level */
+} CreateDBRelInfo;
+
+
 /* non-export function prototypes */
 static void createdb_failure_callback(int code, Datum arg);
 static void movedb(const char *dbname, const char *tblspcname);
@@ -92,7 +120,524 @@ static bool have_createdb_privilege(void);
 static void remove_dbtablespaces(Oid db_id);
 static bool check_db_file_conflict(Oid db_id);
 static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
+static void CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid,
+									bool isRedo);
+static CreateDBRelInfo *GetRelInfoFromTuple(HeapTupleData *tuple,
+											Oid tbid, Oid dbid, char *srcpath);
+static List *GetRelListFromPage(Page page, Buffer buf, Oid tbid, Oid dbid,
+								char *srcpath, List *rnodelist, Snapshot
+								snapshot);
+static List *GetDatabaseRelationList(Oid srctbid, Oid srcdbid, char *srcpath);
+static void CopyDatabaseWithWal(Oid src_dboid, Oid dboid, Oid src_tsid,
+								Oid dst_tsid);
+static void CopyDatabase(Oid src_dboid, Oid dboid, Oid src_tsid, Oid dst_tsid);
+
+/*
+ * CreateDirAndVersionFile - Create database directory and write out the
+ *							 PG_VERSION file in the database path.
+ *
+ * If isRedo is true, it's okay for the database directory to exist already.
+ *
+ * We can directly write PG_MAJORVERSION in the version file instead of copying
+ * from the source database file because these two must be the same.
+ */
+static void
+CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
+{
+	int		fd;
+	int		nbytes;
+	char	versionfile[MAXPGPATH];
+	char	buf[16];
+
+	/* Prepare version data before starting a critical section. */
+	sprintf(buf, "%s\n", PG_MAJORVERSION);
+	nbytes = strlen(PG_MAJORVERSION) + 1;
+
+	/* If we are not in WAL replay then write the WAL. */
+	if (!isRedo)
+	{
+		xl_dbase_create_rec xlrec;
+		XLogRecPtr	lsn;
+
+		/* Now errors are fatal ... */
+		START_CRIT_SECTION();
 
+		xlrec.db_id = dbid;
+		xlrec.tablespace_id = tsid;
+		xlrec.src_db_id = InvalidOid;
+		xlrec.src_tablespace_id = InvalidOid;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), sizeof(xl_dbase_create_rec));
+
+		lsn = XLogInsert(RM_DBASE_ID, XLOG_DBASE_CREATE);
+
+		/* As always, WAL must hit the disk before the data update does. */
+		XLogFlush(lsn);
+	}
+
+	/* Create database directory. */
+	if (MakePGDirectory(dbpath) < 0)
+	{
+		/* Failure other than already exists or not in WAL replay? */
+		if (errno != EEXIST || !isRedo)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create directory \"%s\": %m", dbpath)));
+	}
+
+	/*
+	 * Create PG_VERSION file in the database path.  If the file already exists
+	 * and we are in WAL replay then try again to open it in write mode.
+	 */
+	snprintf(versionfile, sizeof(versionfile), "%s/%s", dbpath, "PG_VERSION");
+
+	fd = OpenTransientFile(versionfile, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	if (fd < 0 && errno == EEXIST && isRedo)
+		fd = OpenTransientFile(versionfile, O_WRONLY | O_TRUNC | PG_BINARY);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m", versionfile)));
+
+	/* Write PG_MAJORVERSION in the PG_VERSION file. */
+	pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_WRITE);
+	errno = 0;
+	if ((int) write(fd, buf, nbytes) != nbytes)
+	{
+		/* If write didn't set errno, assume problem is no disk space. */
+		if (errno == 0)
+			errno = ENOSPC;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", versionfile)));
+	}
+	pgstat_report_wait_end();
+
+	/* Close the version file. */
+	CloseTransientFile(fd);
+
+	/* Critical section done. */
+	if (!isRedo)
+		END_CRIT_SECTION();
+}
+
+/*
+ * GetRelInfoFromTuple - Prepare a CreateDBRelInfo element from the tuple
+ *
+ * Helper function for GetRelListFromPage to prepare a single element from the
+ * pg_class tuple.
+ */
+static CreateDBRelInfo *
+GetRelInfoFromTuple(HeapTupleData *tuple, Oid tbid, Oid dbid, char *srcpath)
+{
+	CreateDBRelInfo	   *relinfo;
+	Form_pg_class		classForm;
+	Oid					relfilenode = InvalidOid;
+
+	classForm = (Form_pg_class) GETSTRUCT(tuple);
+
+	/* We don't need to copy the shared objects to the target. */
+	if (classForm->reltablespace == GLOBALTABLESPACE_OID)
+		return NULL;
+
+	/*
+	 * If the object has no storage then there is nothing to copy, so
+	 * just ignore it.
+	 */
+	if (!RELKIND_HAS_STORAGE(classForm->relkind))
+		return NULL;
+
+	/*
+	 * If relfilenode is valid then directly use it.  Otherwise,
+	 * consult the relmapper for the mapped relation.
+	 */
+	if (OidIsValid(classForm->relfilenode))
+		relfilenode = classForm->relfilenode;
+	else
+		relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+										classForm->oid);
+
+	/* We must have a valid relfilenode oid. */
+	Assert(OidIsValid(relfilenode));
+
+	/* Prepare a rel info element; the caller appends it to the list. */
+	relinfo = (CreateDBRelInfo *) palloc(sizeof(CreateDBRelInfo));
+	if (OidIsValid(classForm->reltablespace))
+		relinfo->rnode.spcNode = classForm->reltablespace;
+	else
+		relinfo->rnode.spcNode = tbid;
+
+	relinfo->rnode.dbNode = dbid;
+	relinfo->rnode.relNode = relfilenode;
+	relinfo->reloid = classForm->oid;
+	relinfo->relpersistence = classForm->relpersistence;
+
+	return relinfo;
+}
+
+/*
+ * GetRelListFromPage - Helper function for GetDatabaseRelationList.
+ *
+ * Iterate over each tuple on the given pg_class page and append the
+ * relfilenode info of each visible relation to be copied to rnodelist.
+ */
+static List *
+GetRelListFromPage(Page page, Buffer buf, Oid tbid, Oid dbid, char *srcpath,
+				  List *rnodelist, Snapshot snapshot)
+{
+	BlockNumber		blkno = BufferGetBlockNumber(buf);
+	OffsetNumber	offnum;
+	OffsetNumber	maxoff;
+	HeapTupleData	tuple;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Iterate over each tuple on the page. */
+	for (offnum = FirstOffsetNumber;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid;
+
+		itemid = PageGetItemId(page, offnum);
+
+		/* Nothing to do if slot is empty or already dead. */
+		if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid) ||
+			ItemIdIsRedirected(itemid))
+			continue;
+
+		Assert(ItemIdIsNormal(itemid));
+		ItemPointerSet(&(tuple.t_self), blkno, offnum);
+
+		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+		tuple.t_len = ItemIdGetLength(itemid);
+		tuple.t_tableOid = RelationRelationId;
+
+		/*
+		 * If the tuple is visible then add its relfilenode info to the
+		 * list.
+		 */
+		if (HeapTupleSatisfiesVisibility(&tuple, snapshot, buf))
+		{
+			CreateDBRelInfo	   *relinfo;
+
+			relinfo = GetRelInfoFromTuple(&tuple, tbid, dbid, srcpath);
+
+			/* Add it to the list. */
+			if (relinfo != NULL)
+				rnodelist = lappend(rnodelist, relinfo);
+		}
+	}
+
+	return rnodelist;
+}
+
+/*
+ * GetDatabaseRelationList - Get relfilenode list to be copied.
+ *
+ * Iterate over each block of the pg_class relation.  From there, we will check
+ * all the visible tuples in order to get a list of all the valid relfilenodes
+ * in the source database that should be copied to the target database.
+ */
+static List *
+GetDatabaseRelationList(Oid tbid, Oid dbid, char *srcpath)
+{
+	SMgrRelation	rd_smgr;
+	RelFileNode		rnode;
+	BlockNumber		nblocks;
+	BlockNumber		blkno;
+	Buffer			buf;
+	Oid				relfilenode;
+	Page			page;
+	List		   *rnodelist = NIL;
+	LockRelId		relid;
+	Snapshot		snapshot;
+	BufferAccessStrategy bstrategy;
+
+	/* Get pg_class relfilenode. */
+	relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+													  RelationRelationId);
+	/*
+	 * We are going to read the buffers associated with the pg_class relation.
+	 * Thus, acquire the relation-level lock before scanning.  As we are not
+	 * connected to the database, we cannot use relation_open directly, so we
+	 * have to lock using the relation id.
+	 */
+	relid.dbId = dbid;
+	relid.relId = RelationRelationId;
+	LockRelationId(&relid, AccessShareLock);
+
+	/* Prepare a relnode for pg_class relation. */
+	rnode.spcNode = tbid;
+	rnode.dbNode = dbid;
+	rnode.relNode = relfilenode;
+
+	/*
+	 * We are not connected to the source database so open the pg_class
+	 * relation at the smgr level and get the block count.
+	 */
+	rd_smgr = smgropen(rnode, InvalidBackendId);
+	nblocks = smgrnblocks(rd_smgr, MAIN_FORKNUM);
+
+	/*
+	 * We're going to read the whole pg_class so better to use bulk-read buffer
+	 * access strategy.
+	 */
+	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	/* Get latest snapshot for scanning the pg_class. */
+	snapshot = GetLatestSnapshot();
+
+	/* Iterate over each block on the pg_class relation. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		/*
+		 * We are not connected to the source database so directly use the lower
+		 * level bufmgr interface which operates on the rnode.
+		 */
+		buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
+										RBM_NORMAL, bstrategy,
+										RELPERSISTENCE_PERMANENT);
+
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buf);
+		if (PageIsNew(page) || PageIsEmpty(page))
+		{
+			UnlockReleaseBuffer(buf);
+			continue;
+		}
+
+		/*
+		 * Process pg_class tuple for the current page and add all the valid
+		 * relfilenode entries to the rnodelist.
+		 */
+		rnodelist = GetRelListFromPage(page, buf, tbid, dbid, srcpath,
+									   rnodelist, snapshot);
+
+		/* Release the buffer lock. */
+		UnlockReleaseBuffer(buf);
+	}
+
+	/* Release the lock. */
+	UnlockRelationId(&relid, AccessShareLock);
+
+	return rnodelist;
+}
+
+/*
+ * CopyDatabaseWithWal - Copy source database to the target database with WAL
+ *
+ * Create target database directory and copy data files from the source
+ * database to the target database, block by block and WAL log all the
+ * operations.
+ */
+static void
+CopyDatabaseWithWal(Oid src_dboid, Oid dst_dboid, Oid src_tsid, Oid dst_tsid)
+{
+	char	   *srcpath;
+	char	   *dstpath;
+	List	   *rnodelist = NIL;
+	ListCell   *cell;
+	LockRelId	relid;
+	RelFileNode	srcrnode;
+	RelFileNode	dstrnode;
+	CreateDBRelInfo	*relinfo;
+
+	/* Get the source database path. */
+	srcpath = GetDatabasePath(src_dboid, src_tsid);
+
+	/* Get the destination database path. */
+	dstpath = GetDatabasePath(dst_dboid, dst_tsid);
+
+	/* Create database directory and write PG_VERSION file. */
+	CreateDirAndVersionFile(dstpath, dst_dboid, dst_tsid, false);
+
+	/* Copy relmap file from source database to the destination database. */
+	CopyRelationMap(dst_dboid, dst_tsid, srcpath, dstpath);
+
+	/* Get list of all valid relnode from the source database. */
+	rnodelist = GetDatabaseRelationList(src_tsid, src_dboid, srcpath);
+	Assert(rnodelist != NIL);
+
+	/*
+	 * The database id is common to all the relations, so set it before
+	 * entering the loop.
+	 */
+	relid.dbId = src_dboid;
+
+	/*
+	 * Iterate over each relfilenode and copy the relation data block by block
+	 * from source database to the destination database.
+	 */
+	foreach(cell, rnodelist)
+	{
+		SMgrRelation	src_smgr;
+		SMgrRelation	dst_smgr;
+
+		relinfo = lfirst(cell);
+		srcrnode = relinfo->rnode;
+
+		/*
+		 * If the relation is from the source db's default tablespace then we
+		 * need to create it in the destination db's default tablespace.
+		 * Otherwise, we need to create in the same tablespace as it is in the
+		 * source database.
+		 */
+		if (srcrnode.spcNode == src_tsid)
+			dstrnode.spcNode = dst_tsid;
+		/*
+		 * In case of ALTER DATABASE SET TABLESPACE we don't need to do
+		 * anything for objects that are not in the source db's default
+		 * tablespace.  The source and destination dboid will be the same
+		 * in case of ALTER DATABASE SET TABLESPACE.
+		 */
+		else if (src_dboid == dst_dboid)
+			continue;
+		else
+			dstrnode.spcNode = srcrnode.spcNode;
+
+		dstrnode.dbNode = dst_dboid;
+		dstrnode.relNode = srcrnode.relNode;
+
+		/* Acquire the lock on the relation before we start copying. */
+		relid.relId = relinfo->reloid;
+		LockRelationId(&relid, AccessShareLock);
+
+		/* Open the source and the destination relation at smgr level. */
+		src_smgr = smgropen(srcrnode, InvalidBackendId);
+		dst_smgr = smgropen(dstrnode, InvalidBackendId);
+
+		/* Copy relation storage from source to the destination. */
+		CreateAndCopyRelationData(src_smgr, dst_smgr, relinfo->relpersistence);
+
+		/* Release the lock. */
+		UnlockRelationId(&relid, AccessShareLock);
+	}
+
+	list_free_deep(rnodelist);
+}
+
+/*
+ * CopyDatabase - Copy source database to the target database
+ *
+ * Copy source database directory to the destination directory using copydir
+ * operation.
+ */
+static void
+CopyDatabase(Oid src_dboid, Oid dst_dboid, Oid src_tsid, Oid dst_tsid)
+{
+	TableScanDesc	scan;
+	Relation		rel;
+	HeapTuple		tuple;
+
+	/*
+	 * Force a checkpoint before starting the copy. This will force all
+	 * dirty buffers, including those of unlogged tables, out to disk, to
+	 * ensure source database is up-to-date on disk for the copy.
+	 * FlushDatabaseBuffers() would suffice for that, but we also want to
+	 * process any pending unlink requests. Otherwise, if a checkpoint
+	 * happened while we're copying files, a file might be deleted just
+	 * when we're about to copy it, causing the lstat() call in copydir()
+	 * to fail with ENOENT.
+	 */
+	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
+					  CHECKPOINT_WAIT | CHECKPOINT_FLUSH_ALL);
+
+	/*
+	 * Iterate through all tablespaces of the template database, and copy
+	 * each one to the new database.
+	 */
+	rel = table_open(TableSpaceRelationId, AccessShareLock);
+	scan = table_beginscan_catalog(rel, 0, NULL);
+	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	{
+		Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
+		Oid			srctablespace = spaceform->oid;
+		Oid			dsttablespace;
+		char	   *srcpath;
+		char	   *dstpath;
+		struct stat st;
+
+		/* No need to copy global tablespace */
+		if (srctablespace == GLOBALTABLESPACE_OID)
+			continue;
+
+		srcpath = GetDatabasePath(src_dboid, srctablespace);
+
+		if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
+			directory_is_empty(srcpath))
+		{
+			/* Assume we can ignore it */
+			pfree(srcpath);
+			continue;
+		}
+
+		if (srctablespace == src_tsid)
+			dsttablespace = dst_tsid;
+		else
+			dsttablespace = srctablespace;
+
+		dstpath = GetDatabasePath(dst_dboid, dsttablespace);
+
+		/*
+		 * Copy this subdirectory to the new location
+		 *
+		 * We don't need to copy subdirectories
+		 */
+		copydir(srcpath, dstpath, false);
+
+		/* Record the filesystem change in XLOG */
+		{
+			xl_dbase_create_rec xlrec;
+
+			xlrec.db_id = dst_dboid;
+			xlrec.tablespace_id = dsttablespace;
+			xlrec.src_db_id = src_dboid;
+			xlrec.src_tablespace_id = srctablespace;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
+
+			(void) XLogInsert(RM_DBASE_ID,
+							  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
+		}
+	}
+	table_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	/*
+	 * We force a checkpoint before committing.  This effectively means
+	 * that committed XLOG_DBASE_CREATE operations will never need to be
+	 * replayed (at least not in ordinary crash recovery; we still have to
+	 * make the XLOG entry for the benefit of PITR operations). This
+	 * avoids two nasty scenarios:
+	 *
+	 * #1: When PITR is off, we don't XLOG the contents of newly created
+	 * indexes; therefore the drop-and-recreate-whole-directory behavior
+	 * of DBASE_CREATE replay would lose such indexes.
+	 *
+	 * #2: Since we have to recopy the source database during DBASE_CREATE
+	 * replay, we run the risk of copying changes in it that were
+	 * committed after the original CREATE DATABASE command but before the
+	 * system crash that led to the replay.  This is at least unexpected
+	 * and at worst could lead to inconsistencies, eg duplicate table
+	 * names.
+	 *
+	 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
+	 *
+	 * In PITR replay, the first of these isn't an issue, and the second
+	 * is only a risk if the CREATE DATABASE and subsequent template
+	 * database change both occur while a base backup is being taken.
+	 * There doesn't seem to be much we can do about that except document
+	 * it as a limitation.
+	 *
+	 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
+	 * we can avoid this.
+	 */
+	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+}
 
 /*
  * CREATE DATABASE
@@ -100,8 +645,6 @@ static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
 Oid
 createdb(ParseState *pstate, const CreatedbStmt *stmt)
 {
-	TableScanDesc scan;
-	Relation	rel;
 	Oid			src_dboid;
 	Oid			src_owner;
 	int			src_encoding = -1;
@@ -132,6 +675,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	DefElem    *dallowconnections = NULL;
 	DefElem    *dconnlimit = NULL;
 	DefElem    *dcollversion = NULL;
+	DefElem    *dstrategy = NULL;
 	char	   *dbname = stmt->dbname;
 	char	   *dbowner = NULL;
 	const char *dbtemplate = NULL;
@@ -145,6 +689,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	char	   *dbcollversion = NULL;
 	int			notherbackends;
 	int			npreparedxacts;
+	CreateDBStrategy	dbstrategy = CREATEDB_WAL_LOG;
 	createdb_failure_params fparms;
 
 	/* Extract options from the statement node tree */
@@ -250,6 +795,12 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 						(errcode(ERRCODE_INVALID_PARAMETER_VALUE)),
 						errmsg("OIDs less than %u are reserved for system objects", FirstNormalObjectId));
 		}
+		else if (strcmp(defel->defname, "strategy") == 0)
+		{
+			if (dstrategy)
+				errorConflictingDefElem(defel, pstate);
+			dstrategy = defel;
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -374,6 +925,23 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 							dbtemplate)));
 	}
 
+	/* Validate the database creation strategy. */
+	if (dstrategy && dstrategy->arg)
+	{
+		char	*strategy;
+
+		strategy = defGetString(dstrategy);
+		if (strcmp(strategy, "wal_log") == 0)
+			dbstrategy = CREATEDB_WAL_LOG;
+		else if (strcmp(strategy, "file_copy") == 0)
+			dbstrategy = CREATEDB_FILE_COPY;
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("invalid create database strategy \"%s\"", strategy),
+					 errhint("Valid strategies are \"wal_log\" and \"file_copy\".")));
+	}
+
 	/* If encoding or locales are defaulted, use source's setting */
 	if (encoding < 0)
 		encoding = src_encoding;
@@ -668,19 +1236,6 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	InvokeObjectPostCreateHook(DatabaseRelationId, dboid, 0);
 
 	/*
-	 * Force a checkpoint before starting the copy. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, if a checkpoint
-	 * happened while we're copying files, a file might be deleted just when
-	 * we're about to copy it, causing the lstat() call in copydir() to fail
-	 * with ENOENT.
-	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
-
-	/*
 	 * Once we start copying subdirectories, we need to be able to clean 'em
 	 * up if we fail.  Use an ENSURE block to make sure this happens.  (This
 	 * is not a 100% solution, because of the possibility of failure during
@@ -689,114 +1244,47 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	 */
 	fparms.src_dboid = src_dboid;
 	fparms.dest_dboid = dboid;
+	fparms.strategy = dbstrategy;
+
 	PG_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
 		/*
-		 * Iterate through all tablespaces of the template database, and copy
-		 * each one to the new database.
+		 * If the user has asked to create a database with WAL_LOG strategy
+		 * then call CopyDatabaseWithWal, which will copy the database at the
+		 * block level and it will WAL log each copied block.  Otherwise,
+		 * call CopyDatabase that will copy the database file by file.
 		 */
-		rel = table_open(TableSpaceRelationId, AccessShareLock);
-		scan = table_beginscan_catalog(rel, 0, NULL);
-		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+		if (dbstrategy == CREATEDB_WAL_LOG)
 		{
-			Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
-			Oid			srctablespace = spaceform->oid;
-			Oid			dsttablespace;
-			char	   *srcpath;
-			char	   *dstpath;
-			struct stat st;
-
-			/* No need to copy global tablespace */
-			if (srctablespace == GLOBALTABLESPACE_OID)
-				continue;
-
-			srcpath = GetDatabasePath(src_dboid, srctablespace);
-
-			if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
-				directory_is_empty(srcpath))
-			{
-				/* Assume we can ignore it */
-				pfree(srcpath);
-				continue;
-			}
-
-			if (srctablespace == src_deftablespace)
-				dsttablespace = dst_deftablespace;
-			else
-				dsttablespace = srctablespace;
-
-			dstpath = GetDatabasePath(dboid, dsttablespace);
+			CopyDatabaseWithWal(src_dboid, dboid, src_deftablespace,
+								dst_deftablespace);
 
 			/*
-			 * Copy this subdirectory to the new location
-			 *
-			 * We don't need to copy subdirectories
+			 * Close pg_database, but keep lock till commit.
 			 */
-			copydir(srcpath, dstpath, false);
-
-			/* Record the filesystem change in XLOG */
-			{
-				xl_dbase_create_rec xlrec;
-
-				xlrec.db_id = dboid;
-				xlrec.tablespace_id = dsttablespace;
-				xlrec.src_db_id = src_dboid;
-				xlrec.src_tablespace_id = srctablespace;
-
-				XLogBeginInsert();
-				XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-				(void) XLogInsert(RM_DBASE_ID,
-								  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-			}
+			table_close(pg_database_rel, NoLock);
 		}
-		table_endscan(scan);
-		table_close(rel, AccessShareLock);
+		else
+		{
+			Assert(dbstrategy == CREATEDB_FILE_COPY);
 
-		/*
-		 * We force a checkpoint before committing.  This effectively means
-		 * that committed XLOG_DBASE_CREATE operations will never need to be
-		 * replayed (at least not in ordinary crash recovery; we still have to
-		 * make the XLOG entry for the benefit of PITR operations). This
-		 * avoids two nasty scenarios:
-		 *
-		 * #1: When PITR is off, we don't XLOG the contents of newly created
-		 * indexes; therefore the drop-and-recreate-whole-directory behavior
-		 * of DBASE_CREATE replay would lose such indexes.
-		 *
-		 * #2: Since we have to recopy the source database during DBASE_CREATE
-		 * replay, we run the risk of copying changes in it that were
-		 * committed after the original CREATE DATABASE command but before the
-		 * system crash that led to the replay.  This is at least unexpected
-		 * and at worst could lead to inconsistencies, eg duplicate table
-		 * names.
-		 *
-		 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
-		 *
-		 * In PITR replay, the first of these isn't an issue, and the second
-		 * is only a risk if the CREATE DATABASE and subsequent template
-		 * database change both occur while a base backup is being taken.
-		 * There doesn't seem to be much we can do about that except document
-		 * it as a limitation.
-		 *
-		 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
-		 * we can avoid this.
-		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+			CopyDatabase(src_dboid, dboid, src_deftablespace,
+						 dst_deftablespace);
 
-		/*
-		 * Close pg_database, but keep lock till commit.
-		 */
-		table_close(pg_database_rel, NoLock);
+			/*
+			 * Close pg_database, but keep lock till commit.
+			 */
+			table_close(pg_database_rel, NoLock);
 
-		/*
-		 * Force synchronous commit, thus minimizing the window between
-		 * creation of the database files and committal of the transaction. If
-		 * we crash before committing, we'll have a DB that's taking up disk
-		 * space but is not in pg_database, which is not good.
-		 */
-		ForceSyncCommit();
+			/*
+			 * Force synchronous commit, thus minimizing the window between
+			 * creation of the database files and committal of the transaction.
+			 * If we crash before committing, we'll have a DB that's taking up
+			 * disk space but is not in pg_database, which is not good.
+			 */
+			ForceSyncCommit();
+		}
 	}
 	PG_END_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 								PointerGetDatum(&fparms));
@@ -870,6 +1358,21 @@ createdb_failure_callback(int code, Datum arg)
 	createdb_failure_params *fparms = (createdb_failure_params *) DatumGetPointer(arg);
 
 	/*
+	 * If we were copying the database at the block level then drop pages for
+	 * the destination database that are in the shared buffer cache, and tell
+	 * the checkpointer to forget any pending fsync and unlink requests for
+	 * files in the database.  The reasoning is the same as explained in the
+	 * dropdb function.  But unlike dropdb we don't need to call
+	 * pgstat_drop_database, because this database has not been created yet
+	 * and so there should not be any stats for it.
+	 */
+	if (fparms->strategy == CREATEDB_WAL_LOG)
+	{
+		DropDatabaseBuffers(fparms->dest_dboid);
+		ForgetDatabaseSyncRequests(fparms->dest_dboid);
+	}
+
+	/*
 	 * Release lock on source database before doing recursive remove. This is
 	 * not essential but it seems desirable to release the lock as soon as
 	 * possible.
@@ -2387,32 +2890,40 @@ dbase_redo(XLogReaderState *record)
 		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
 
-		/*
-		 * Our theory for replaying a CREATE is to forcibly drop the target
-		 * subdirectory if present, then re-copy the source data. This may be
-		 * more work than needed, but it is simple to implement.
-		 */
-		if (stat(dst_path, &st) == 0 && S_ISDIR(st.st_mode))
+		if (!OidIsValid(xlrec->src_db_id))
 		{
-			if (!rmtree(dst_path, true))
-				/* If this failed, copydir() below is going to error. */
-				ereport(WARNING,
-						(errmsg("some useless files may be left behind in old database directory \"%s\"",
-								dst_path)));
+			CreateDirAndVersionFile(dst_path, xlrec->db_id, xlrec->tablespace_id,
+									true);
 		}
+		else
+		{
+			/*
+			* Our theory for replaying a CREATE is to forcibly drop the target
+			* subdirectory if present, then re-copy the source data. This may be
+			* more work than needed, but it is simple to implement.
+			*/
+			if (stat(dst_path, &st) == 0 && S_ISDIR(st.st_mode))
+			{
+				if (!rmtree(dst_path, true))
+					/* If this failed, copydir() below is going to error. */
+					ereport(WARNING,
+							(errmsg("some useless files may be left behind in old database directory \"%s\"",
+									dst_path)));
+			}
 
-		/*
-		 * Force dirty buffers out to disk, to ensure source database is
-		 * up-to-date for the copy.
-		 */
-		FlushDatabaseBuffers(xlrec->src_db_id);
+			/*
+			* Force dirty buffers out to disk, to ensure source database is
+			* up-to-date for the copy.
+			*/
+			FlushDatabaseBuffers(xlrec->src_db_id);
 
-		/*
-		 * Copy this subdirectory to the new location
-		 *
-		 * We don't need to copy subdirectories
-		 */
-		copydir(src_path, dst_path, false);
+			/*
+			* Copy this subdirectory to the new location
+			*
+			* We don't need to copy subdirectories
+			*/
+			copydir(src_path, dst_path, false);
+		}
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0ed2d31..be2167f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -38,6 +38,7 @@
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
 #include "executor/instrument.h"
 #include "lib/binaryheap.h"
 #include "miscadmin.h"
@@ -486,6 +487,9 @@ static void FindAndDropRelFileNodeBuffers(RelFileNode rnode,
 										  ForkNumber forkNum,
 										  BlockNumber nForkBlock,
 										  BlockNumber firstDelBlock);
+static void RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+										   ForkNumber forkNum,
+										   char relpersistence);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int	rnode_comparator(const void *p1, const void *p2);
@@ -3670,6 +3674,134 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
 }
 
 /* ---------------------------------------------------------------------
+ *		RelationCopyStorageUsingBuffer
+ *
+ *		Copy fork's data using bufmgr.  Same as RelationCopyStorage but instead
+ *		of using smgrread and smgrextend this will copy using bufmgr APIs.
+ * --------------------------------------------------------------------
+ */
+static void
+RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+							   ForkNumber forkNum, char relpersistence)
+{
+	Buffer		srcBuf;
+	Buffer		dstBuf;
+	Page		srcPage;
+	Page		dstPage;
+	bool		use_wal;
+	bool		copying_initfork;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	BufferAccessStrategy bstrategy_src;
+	BufferAccessStrategy bstrategy_dst;
+
+	/* Refer comments in RelationCopyStorage. */
+	copying_initfork = relpersistence == RELPERSISTENCE_UNLOGGED &&
+		forkNum == INIT_FORKNUM;
+	use_wal = XLogIsNeeded() &&
+		(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
+
+	/* Get number of blocks in the source relation. */
+	nblocks = smgrnblocks(src, forkNum);
+
+	/*
+	 * We are going to copy whole relation from the source to the destination
+	 * so use BAS_BULKREAD strategy for the source relation and BAS_BULKWRITE
+	 * strategy for the destination relation.
+	 */
+	bstrategy_src = GetAccessStrategy(BAS_BULKREAD);
+	bstrategy_dst = GetAccessStrategy(BAS_BULKWRITE);
+
+	/* Iterate over each block of the source relation file. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		/* If we got a cancel signal during the copy of the data, quit */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Read block from source relation. */
+		srcBuf = ReadBufferWithoutRelcache(src->smgr_rnode.node, forkNum,
+										   blkno, RBM_NORMAL, bstrategy_src,
+										   relpersistence);
+		srcPage = BufferGetPage(srcBuf);
+		if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
+		{
+			ReleaseBuffer(srcBuf);
+			continue;
+		}
+
+		/* Use P_NEW to extend the relation. */
+		dstBuf = ReadBufferWithoutRelcache(dst->smgr_rnode.node, forkNum,
+										   P_NEW, RBM_NORMAL, bstrategy_dst,
+										   relpersistence);
+		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
+
+		START_CRIT_SECTION();
+
+		/* Initialize the page and write the data. */
+		dstPage = BufferGetPage(dstBuf);
+		PageInit(dstPage, BufferGetPageSize(dstBuf), 0);
+		memcpy(dstPage, srcPage, BLCKSZ);
+		MarkBufferDirty(dstBuf);
+
+		/* WAL-log the copied page. */
+		if (use_wal)
+			log_newpage_buffer(dstBuf, true);
+
+		END_CRIT_SECTION();
+
+		UnlockReleaseBuffer(dstBuf);
+		ReleaseBuffer(srcBuf);
+	}
+}
+
+/* ---------------------------------------------------------------------
+ *		CreateAndCopyRelationData
+ *
+ *		Create and copy source smgr relation's all fork's data to the
+ *		destination.
+ * --------------------------------------------------------------------
+ */
+void
+CreateAndCopyRelationData(SMgrRelation src_smgr, SMgrRelation dst_smgr,
+						  char relpersistence)
+{
+	/*
+	 * Create and copy all forks of the relation.
+	 *
+	 * NOTE: any conflict in relfilenode value will be caught in
+	 * RelationCreateStorage().
+	 */
+	RelationCreateStorage(dst_smgr->smgr_rnode.node, relpersistence);
+
+	/* copy main fork */
+	RelationCopyStorageUsingBuffer(src_smgr, dst_smgr, MAIN_FORKNUM,
+								   relpersistence);
+
+	/* copy those extra forks that exist */
+	for (ForkNumber forkNum = MAIN_FORKNUM + 1;
+		 forkNum <= MAX_FORKNUM; forkNum++)
+	{
+		if (smgrexists(src_smgr, forkNum))
+		{
+			smgrcreate(dst_smgr, forkNum, false);
+
+			/*
+			 * WAL log creation if the relation is persistent, or this is the
+			 * init fork of an unlogged relation.
+			 */
+			if (relpersistence == RELPERSISTENCE_PERMANENT ||
+				(relpersistence == RELPERSISTENCE_UNLOGGED &&
+				 forkNum == INIT_FORKNUM))
+				log_smgrcreate(&dst_smgr->smgr_rnode.node, forkNum);
+
+			/* Copy a fork's data, block by block. */
+			RelationCopyStorageUsingBuffer(src_smgr, dst_smgr, forkNum,
+										   relpersistence);
+		}
+	}
+}
+
+/* ---------------------------------------------------------------------
  *		FlushDatabaseBuffers
  *
  *		This function writes all dirty pages of a database out to disk
diff --git a/src/include/commands/dbcommands_xlog.h b/src/include/commands/dbcommands_xlog.h
index 593a857..8f59870 100644
--- a/src/include/commands/dbcommands_xlog.h
+++ b/src/include/commands/dbcommands_xlog.h
@@ -20,6 +20,7 @@
 /* record types */
 #define XLOG_DBASE_CREATE		0x00
 #define XLOG_DBASE_DROP			0x10
+#define XLOG_DBASE_CREATEDIR	0x20
 
 typedef struct xl_dbase_create_rec
 {
@@ -30,6 +31,13 @@ typedef struct xl_dbase_create_rec
 	Oid			src_tablespace_id;
 } xl_dbase_create_rec;
 
+typedef struct xl_dbase_createdir_rec
+{
+	/* Records creating database directory */
+	Oid			db_id;
+	Oid			tablespace_id;
+} xl_dbase_createdir_rec;
+
 typedef struct xl_dbase_drop_rec
 {
 	Oid			db_id;
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 7b80f58..6d54812 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -204,6 +204,9 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
 extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels);
+extern void CreateAndCopyRelationData(struct SMgrRelationData *src_smgr,
+									  struct SMgrRelationData *dst_smgr,
+									  char relpersistence);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
 								   int nforks, BlockNumber *firstDelBlock);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d9b83f7..dcda8ca 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -460,6 +460,7 @@ CoverPos
 CreateAmStmt
 CreateCastStmt
 CreateConversionStmt
+CreateDBRelInfo
 CreateDomainStmt
 CreateEnumStmt
 CreateEventTrigStmt
-- 
1.8.3.1

v11-0003-New-interface-to-lock-relation-id.patch (text/x-patch; charset=US-ASCII)

From 2b54a88af311c67f7c36418f4e530c81d36e5a78 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 24 Sep 2021 18:29:17 +0530
Subject: [PATCH v11 3/4] New interface to lock relation id

Currently, LockRelationOid provides a mechanism to lock a relation
by OID, but we must be connected to the database to which the
relation belongs.  This patch provides a new interface that can
lock a relation even if we are not connected to the containing
database.
---
 src/backend/storage/lmgr/lmgr.c | 28 ++++++++++++++++++++++++++++
 src/include/storage/lmgr.h      |  1 +
 2 files changed, 29 insertions(+)

diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 5ae52dd..1543da6 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -176,6 +176,34 @@ ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode)
 }
 
 /*
+ *		LockRelationId
+ *
+ * Lock, given a LockRelId.  Same as LockRelationOid but take LockRelId as an
+ * input.
+ */
+void
+LockRelationId(LockRelId *relid, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+	LOCALLOCK  *locallock;
+	LockAcquireResult res;
+
+	SET_LOCKTAG_RELATION(tag, relid->dbId, relid->relId);
+
+	res = LockAcquireExtended(&tag, lockmode, false, false, true, &locallock);
+
+	/*
+	 * Now that we have the lock, check for invalidation messages; see notes
+	 * in LockRelationOid.
+	 */
+	if (res != LOCKACQUIRE_ALREADY_CLEAR)
+	{
+		AcceptInvalidationMessages();
+		MarkLockClear(locallock);
+	}
+}
+
+/*
  *		UnlockRelationId
  *
  * Unlock, given a LockRelId.  This is preferred over UnlockRelationOid
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 49edbcc..be1d2c9 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -38,6 +38,7 @@ extern void RelationInitLockInfo(Relation relation);
 
 /* Lock a relation */
 extern void LockRelationOid(Oid relid, LOCKMODE lockmode);
+extern void LockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern bool ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode);
 extern void UnlockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern void UnlockRelationOid(Oid relid, LOCKMODE lockmode);
-- 
1.8.3.1

#139

Ashutosh Sharma

ashu.coek88@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#138)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Here are some review comments on the latest patch
(v11-0004-WAL-logged-CREATE-DATABASE.patch). I actually reviewed the
v10 patch, but the comments apply to this latest version as well.

+               /* Now errors are fatal ... */
+               START_CRIT_SECTION();

Did you mean PANIC instead of FATAL?

+                               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                                        errmsg("invalid create strategy %s", strategy),
+                                        errhint("Valid strategies are \"wal_log\", and \"file_copy\".")));
+       }

Should this be - "invalid createdb strategy" instead of "invalid
create strategy"?

+               /*
+                * In case of ALTER DATABASE SET TABLESPACE we don't need to do
+                * anything for the object which are not in the source db's default
+                * tablespace.  The source and destination dboid will be same in
+                * case of ALTER DATABASE SET TABLESPACE.
+                */
+               else if (src_dboid == dst_dboid)
+                       continue;
+               else
+                       dstrnode.spcNode = srcrnode.spcNode;

Is this change still required? Do we support the WAL_LOG strategy for
ALTER DATABASE?

+               /* Open the source and the destination relation at smgr level. */
+               src_smgr = smgropen(srcrnode, InvalidBackendId);
+               dst_smgr = smgropen(dstrnode, InvalidBackendId);
+
+               /* Copy relation storage from source to the destination. */
+               CreateAndCopyRelationData(src_smgr, dst_smgr, relinfo->relpersistence);

Do we need to do smgropen for destination relfilenode here? Aren't we
already doing that inside RelationCreateStorage?

+       use_wal = XLogIsNeeded() &&
+               (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
+
+       /* Get number of blocks in the source relation. */
+       nblocks = smgrnblocks(src, forkNum);

What if the number of blocks in the source relation is zero? Should we
check for that and return immediately? We have already done smgrcreate.

+       /* We don't need to copy the shared objects to the target. */
+       if (classForm->reltablespace == GLOBALTABLESPACE_OID)
+               return NULL;
+
+       /*
+        * If the object doesn't have the storage then nothing to be
+        * done for that object so just ignore it.
+        */
+       if (!RELKIND_HAS_STORAGE(classForm->relkind))
+               return NULL;

We can probably club together above two if-checks.
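
A minimal standalone sketch of what the combined check could look like. It
is not the patch's code: relation_needs_copy() and
relkind_has_storage_sketch() are hypothetical names, and the relkind list
is a local stand-in for RELKIND_HAS_STORAGE() so that the fragment compiles
outside the PostgreSQL tree.

```c
#include <stdbool.h>

typedef unsigned int Oid;				/* stand-in for PostgreSQL's Oid */

#define SKETCH_GLOBALTABLESPACE_OID 1664u	/* OID of pg_global */

/* Stand-in for RELKIND_HAS_STORAGE(): relkinds that own a relfilenode. */
bool
relkind_has_storage_sketch(char relkind)
{
	return relkind == 'r' ||	/* ordinary table */
		relkind == 'i' ||		/* index */
		relkind == 'S' ||		/* sequence */
		relkind == 't' ||		/* TOAST table */
		relkind == 'm';			/* materialized view */
}

/*
 * Hypothetical helper: returns false for anything the copy should skip,
 * i.e. shared objects and objects without storage, using a single test.
 */
bool
relation_needs_copy(Oid reltablespace, char relkind)
{
	if (reltablespace == SKETCH_GLOBALTABLESPACE_OID ||
		!relkind_has_storage_sketch(relkind))
		return false;

	return true;
}
```

In the patch itself the same condition would simply return NULL from the
function that builds the per-relation info.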

+      <varlistentry>
+       <term><replaceable class="parameter">strategy</replaceable></term>
+       <listitem>
+        <para>
+         This is used for copying the database directory.  Currently, we have
+         two strategies the <literal>WAL_LOG</literal> and the
+         <literal>FILE_COPY</literal>.  If <literal>WAL_LOG</literal> strategy
+         is used then the database will be copied block by block and it will
+         also WAL log each copied block.  Otherwise, if <literal>FILE_COPY

I think we need to mention the default strategy in the documentation page.

--
With Regards,
Ashutosh Sharma.

On Thu, Mar 10, 2022 at 4:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Mar 9, 2022 at 6:44 PM Robert Haas <robertmhaas@gmail.com> wrote:

Right, in fact even now, if you look at the logic,
write_relmap_file_internal() takes care of the actual update and
write_relmap_file() does the update plus setting the global
variables. So yes, we can rename as you suggested in 0001, and here
too we can change the logic as you suggested, so that recovery and
createdb call only the first function, which just does the update.

But I think we want the path construction to be managed by the
function rather than the caller, too.

I have completely changed the logic for this refactoring. Basically,
write_relmap_file() already has parameters to control whether to
write WAL and send invalidations, and we are already passing the
dbpath. Instead of making a new function, I just pass one additional
parameter to this function indicating whether we are creating a new
map or not. With that, the changes are very small and this looks
cleaner to me. Similarly, for load_relmap_file() I just had to pass
the dbpath and the memory for the destination map. Please have a look
and let me know your thoughts.

Sure, I guess. It's just not obvious why the argument would actually
need to be a function that copies storage, or why there's more than
one way to copy storage. I'd rather keep all the code paths unified,
if we can, and set behavior via flags or something, maybe. I'm not
sure whether that's realistic, though.

I tried considering that, but I don't think making it flag-based looks
good. One of the main problems I noticed is that for copying we would
need to call either RelationCopyStorage() or
RelationCopyStorageUsingBuffer() based on the input flag. But if we
move the main copy function to storage.c, then storage.c will have a
dependency on bufmgr functions, because I don't think we can keep
RelationCopyStorageUsingBuffer() inside storage.c. So for now, I have
duplicated the loop that is already there in index_copy_data() and
heapam_relation_copy_data(), kept it in bufmgr.c, and also moved
RelationCopyStorageUsingBuffer() into bufmgr.c. I think bufmgr.c
already has functions that deal with smgr things, so I feel this is
the right place for the function.

Other changes:
1. 0001 and 0002 are merged: we are no longer really refactoring
these functions, just passing additional arguments to them, so it
makes sense to combine the changes.
2. Same with 0003: we are no longer refactoring existing functions
but providing new interfaces, so I merged it with 0004 (which was
0006 previously).

I think we should also write test cases for the create database
strategy. But I do not see any test case for create database that
tests the existing options. So I am wondering whether we should add
a test case only for the new option we are providing, or create a
separate patch which tests the new option as well as the existing
options.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#140

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Ashutosh Sharma (#139)

4 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Mar 10, 2022 at 7:22 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Here are some review comments on the latest patch
(v11-0004-WAL-logged-CREATE-DATABASE.patch). I actually reviewed the
v10 patch, but the comments apply to this latest version as well.
+               /* Now errors are fatal ... */
+               START_CRIT_SECTION();
Did you mean PANIC instead of FATAL?

I think "fatal" here didn't really mean the error level FATAL; it
means critical, and I have seen it used that way in other places too.
But I don't think we really need this comment, so I removed it to
avoid any confusion.

+                               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                                        errmsg("invalid create strategy %s", strategy),
+                                        errhint("Valid strategies are \"wal_log\", and \"file_copy\".")));
+       }

Should this be - "invalid createdb strategy" instead of "invalid
create strategy"?

Changed
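
For readers following along, the validation under discussion can be
sketched outside the server roughly as follows. This is only an
illustration under stated assumptions: parse_createdb_strategy() is a
hypothetical helper, not a function from the patch; the enum mirrors the
CreateDBStrategy typedef the patch adds; and the strings are matched
exactly against the lowercase spellings given in the errhint. Where this
sketch returns false, the real code reports
ERRCODE_INVALID_PARAMETER_VALUE through ereport().

```c
#include <stdbool.h>
#include <string.h>

/* Mirrors the enum added to dbcommands.c by the patch. */
typedef enum CreateDBStrategy
{
	CREATEDB_WAL_LOG = 0,
	CREATEDB_FILE_COPY = 1
} CreateDBStrategy;

/*
 * Hypothetical helper: map the STRATEGY option value to the enum.
 * Returns false (and leaves *result untouched) for an unknown value.
 */
bool
parse_createdb_strategy(const char *name, CreateDBStrategy *result)
{
	if (strcmp(name, "wal_log") == 0)
		*result = CREATEDB_WAL_LOG;
	else if (strcmp(name, "file_copy") == 0)
		*result = CREATEDB_FILE_COPY;
	else
		return false;			/* caller reports the invalid strategy */

	return true;
}
```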

+               /*
+                * In case of ALTER DATABASE SET TABLESPACE we don't need to do
+                * anything for the object which are not in the source db's default
+                * tablespace.  The source and destination dboid will be same in
+                * case of ALTER DATABASE SET TABLESPACE.
+                */
+               else if (src_dboid == dst_dboid)
+                       continue;
+               else
+                       dstrnode.spcNode = srcrnode.spcNode;

Is this change still required? Do we support the WAL_LOG strategy for
ALTER DATABASE?

Yeah, it is unreachable code now, so I removed it.

+               /* Open the source and the destination relation at smgr level. */
+               src_smgr = smgropen(srcrnode, InvalidBackendId);
+               dst_smgr = smgropen(dstrnode, InvalidBackendId);
+
+               /* Copy relation storage from source to the destination. */
+               CreateAndCopyRelationData(src_smgr, dst_smgr, relinfo->relpersistence);

Do we need to do smgropen for destination relfilenode here? Aren't we
already doing that inside RelationCreateStorage?

Yeah, I have changed the complete logic: I removed the smgropen calls
for both source and destination and moved them inside
CreateAndCopyRelationData. Please check the updated code.

+       use_wal = XLogIsNeeded() &&
+               (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
+
+       /* Get number of blocks in the source relation. */
+       nblocks = smgrnblocks(src, forkNum);
What if the number of blocks in the source relation is zero? Should we
check for that and return immediately? We have already done smgrcreate.

Yeah, it makes sense to optimize; with that we will not have to get
the buffer strategy either, so done.
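
The shape of that optimization, as a standalone sketch rather than the
actual bufmgr.c code: copy_fork_sketch() and CopyStats are hypothetical
names, and the counters merely simulate the two GetAccessStrategy() calls
and the per-block copy so that the early return is observable.

```c
typedef struct CopyStats
{
	int			strategies_allocated;	/* simulated GetAccessStrategy() calls */
	unsigned int blocks_copied;			/* simulated per-block copies */
} CopyStats;

/*
 * Hypothetical stand-in for the copy loop: nblocks plays the role of the
 * smgrnblocks() result for the source fork.
 */
void
copy_fork_sketch(unsigned int nblocks, CopyStats *stats)
{
	unsigned int blkno;

	/* Empty fork: it was already created by smgrcreate(), nothing to copy. */
	if (nblocks == 0)
		return;

	/* One bulk-read and one bulk-write strategy, as in the patch. */
	stats->strategies_allocated += 2;

	for (blkno = 0; blkno < nblocks; blkno++)
		stats->blocks_copied++;
}
```

With the check placed before the strategies are obtained, an empty fork
costs only the smgrnblocks() call.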

+       /* We don't need to copy the shared objects to the target. */
+       if (classForm->reltablespace == GLOBALTABLESPACE_OID)
+               return NULL;
+
+       /*
+        * If the object doesn't have the storage then nothing to be
+        * done for that object so just ignore it.
+        */
+       if (!RELKIND_HAS_STORAGE(classForm->relkind))
+               return NULL;

We can probably club together above two if-checks.

Done

+      <varlistentry>
+       <term><replaceable class="parameter">strategy</replaceable></term>
+       <listitem>
+        <para>
+         This is used for copying the database directory.  Currently, we have
+         two strategies the <literal>WAL_LOG</literal> and the
+         <literal>FILE_COPY</literal>.  If <literal>WAL_LOG</literal> strategy
+         is used then the database will be copied block by block and it will
+         also WAL log each copied block.  Otherwise, if <literal>FILE_COPY

I think we need to mention the default strategy in the documentation page.

Done

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v12-0003-New-interface-to-lock-relation-id.patch (text/x-patch; charset=US-ASCII)

From 2b54a88af311c67f7c36418f4e530c81d36e5a78 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 24 Sep 2021 18:29:17 +0530
Subject: [PATCH v12 3/4] New interface to lock relation id

Currently, LockRelationOid provides a mechanism to lock a relation
by OID, but we must be connected to the database to which the
relation belongs.  This patch provides a new interface that can
lock a relation even if we are not connected to the containing
database.
---
 src/backend/storage/lmgr/lmgr.c | 28 ++++++++++++++++++++++++++++
 src/include/storage/lmgr.h      |  1 +
 2 files changed, 29 insertions(+)

diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 5ae52dd..1543da6 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -176,6 +176,34 @@ ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode)
 }
 
 /*
+ *		LockRelationId
+ *
+ * Lock, given a LockRelId.  Same as LockRelationOid but take LockRelId as an
+ * input.
+ */
+void
+LockRelationId(LockRelId *relid, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+	LOCALLOCK  *locallock;
+	LockAcquireResult res;
+
+	SET_LOCKTAG_RELATION(tag, relid->dbId, relid->relId);
+
+	res = LockAcquireExtended(&tag, lockmode, false, false, true, &locallock);
+
+	/*
+	 * Now that we have the lock, check for invalidation messages; see notes
+	 * in LockRelationOid.
+	 */
+	if (res != LOCKACQUIRE_ALREADY_CLEAR)
+	{
+		AcceptInvalidationMessages();
+		MarkLockClear(locallock);
+	}
+}
+
+/*
  *		UnlockRelationId
  *
  * Unlock, given a LockRelId.  This is preferred over UnlockRelationOid
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 49edbcc..be1d2c9 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -38,6 +38,7 @@ extern void RelationInitLockInfo(Relation relation);
 
 /* Lock a relation */
 extern void LockRelationOid(Oid relid, LOCKMODE lockmode);
+extern void LockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern bool ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode);
 extern void UnlockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern void UnlockRelationOid(Oid relid, LOCKMODE lockmode);
-- 
1.8.3.1

v12-0002-Extend-bufmgr-interfaces.patch (text/x-patch; charset=US-ASCII)

From f8dda8ea34673ab12872c7d3bcc7e95610d86f1a Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Thu, 10 Feb 2022 15:55:33 +0530
Subject: [PATCH v12 2/4] Extend bufmgr interfaces

Extend the ReadBufferWithoutRelcache interface to take relpersistence
as input. At present, this function may only be used on permanent
relations, because we only use it during XLOG replay.  As part of
the bigger patch set, we will use it to read buffers from a database
to which we are not connected, so we might encounter temporary and
unlogged relations as well.
---
 src/backend/access/transam/xlogutils.c |  9 ++++++---
 src/backend/storage/buffer/bufmgr.c    | 11 ++---------
 src/include/storage/bufmgr.h           |  3 ++-
 3 files changed, 10 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 54d5f20..c292794 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -484,7 +484,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	{
 		/* page exists in file */
 		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-										   mode, NULL);
+										   mode, NULL,
+										   RELPERSISTENCE_PERMANENT);
 	}
 	else
 	{
@@ -509,7 +510,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				ReleaseBuffer(buffer);
 			}
 			buffer = ReadBufferWithoutRelcache(rnode, forknum,
-											   P_NEW, mode, NULL);
+											   P_NEW, mode, NULL,
+											   RELPERSISTENCE_PERMANENT);
 		}
 		while (BufferGetBlockNumber(buffer) < blkno);
 		/* Handle the corner case that P_NEW returns non-consecutive pages */
@@ -519,7 +521,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			ReleaseBuffer(buffer);
 			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-											   mode, NULL);
+											   mode, NULL,
+											   RELPERSISTENCE_PERMANENT);
 		}
 	}
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c6..0ed2d31 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -771,24 +771,17 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 /*
  * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
  *		a relcache entry for the relation.
- *
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay.  If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
  */
 Buffer
 ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
-						  BufferAccessStrategy strategy)
+						  BufferAccessStrategy strategy, char relpersistence)
 {
 	bool		hit;
 
 	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
 
-	Assert(InRecovery);
-
-	return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
+	return ReadBuffer_common(smgr, relpersistence, forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index dd01841..7b80f58 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -184,7 +184,8 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BufferAccessStrategy strategy);
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 										ForkNumber forkNum, BlockNumber blockNum,
-										ReadBufferMode mode, BufferAccessStrategy strategy);
+										ReadBufferMode mode, BufferAccessStrategy strategy,
+										char relpersistence);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
-- 
1.8.3.1

v12-0004-WAL-logged-CREATE-DATABASE.patch (text/x-patch; charset=US-ASCII)

From 60bcdf3dc8458481b17fd4f4053b48e69ac9f050 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 14 Feb 2022 17:48:03 +0530
Subject: [PATCH v12 4/4] WAL logged CREATE DATABASE

Currently, CREATE DATABASE forces a checkpoint, then copies all the files,
then forces another checkpoint. The comments in the createdb() function
explain the reasons for this. The attached patch fixes this problem by making
create database completely WAL logged so that we can avoid the checkpoints.

We are also maintaining the old way of creating the database, and for that we
are providing an option to choose the strategy for creating the database.
For the new method the user needs to specify STRATEGY=WAL_LOG and for the
old method STRATEGY=FILE_COPY.  The default strategy will be WAL_LOG.
---
 doc/src/sgml/ref/create_database.sgml  |  23 +
 src/backend/commands/dbcommands.c      | 756 +++++++++++++++++++++++++++------
 src/backend/storage/buffer/bufmgr.c    | 146 +++++++
 src/include/commands/dbcommands_xlog.h |   8 +
 src/include/storage/bufmgr.h           |   3 +
 src/tools/pgindent/typedefs.list       |   1 +
 6 files changed, 805 insertions(+), 132 deletions(-)

diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index f70d0c7..2f6b069 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -34,6 +34,7 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
            [ CONNECTION LIMIT [=] <replaceable class="parameter">connlimit</replaceable> ]
            [ IS_TEMPLATE [=] <replaceable class="parameter">istemplate</replaceable> ]
            [ OID [=] <replaceable class="parameter">oid</replaceable> ] ]
+           [ STRATEGY [=] <replaceable class="parameter">strategy</replaceable> ] ]
 </synopsis>
  </refsynopsisdiv>
 
@@ -240,6 +241,28 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
        </listitem>
       </varlistentry>
 
+      <varlistentry>
+       <term><replaceable class="parameter">strategy</replaceable></term>
+       <listitem>
+        <para>
+         This is used for copying the database directory.  Currently, we have
+         two strategies the <literal>WAL_LOG</literal> and the
+         <literal>FILE_COPY</literal>.  If <literal>WAL_LOG</literal> strategy
+         is used then the database will be copied block by block and it will
+         also WAL log each copied block.  Otherwise, if <literal>FILE_COPY
+         </literal> strategy is used then it will do the file system level copy
+         so individual blocks are not WAL logged.  The default strategy is
+         <literal>WAL_LOG</literal>.  If the <literal>FILE_COPY</literal>
+         strategy is used then it has to issue a checkpoint before and after
+         performing the copy and if the shared buffers are large and there are
+         a lot of dirty buffers then issuing checkpoint would be costly and it
+         may impact the performance of the whole system.  On the other hand, if
+         we WAL log each block then if the source database is large then
+         creating the database may take more time.
+        </para>
+       </listitem>
+      </varlistentry>
+
     </variablelist>
 
   <para>
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index c37e3c9..8dd19f0 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -46,6 +46,7 @@
 #include "commands/dbcommands_xlog.h"
 #include "commands/defrem.h"
 #include "commands/seclabel.h"
+#include "commands/tablecmds.h"
 #include "commands/tablespace.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
@@ -63,13 +64,27 @@
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/pg_locale.h"
+#include "utils/relmapper.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
+/*
+ * Create database strategy.  The CREATEDB_WAL_LOG will copy the database at
+ * the block level and WAL log each copied block.  Whereas the
+ * CREATEDB_FILE_COPY will directly copy the database at the file level and no
+ * individual operations will be WAL logged.
+ */
+typedef enum CreateDBStrategy
+{
+	CREATEDB_WAL_LOG = 0,
+	CREATEDB_FILE_COPY = 1
+} CreateDBStrategy;
+
 typedef struct
 {
 	Oid			src_dboid;		/* source (template) DB */
 	Oid			dest_dboid;		/* DB we are trying to create */
+	CreateDBStrategy	strategy;	/* create db strategy */
 } createdb_failure_params;
 
 typedef struct
@@ -78,6 +93,19 @@ typedef struct
 	Oid			dest_tsoid;		/* tablespace we are trying to move to */
 } movedb_failure_params;
 
+/*
+ * When creating a database, we scan the pg_class of the source database to
+ * identify all the relations to be copied.  The structure is used for storing
+ * information about each relation of the source database.
+ */
+typedef struct CreateDBRelInfo
+{
+	RelFileNode		rnode;				/* physical relation identifier */
+	Oid				reloid;				/* relation oid */
+	char			relpersistence;		/* relation's persistence level */
+} CreateDBRelInfo;
+
+
 /* non-export function prototypes */
 static void createdb_failure_callback(int code, Datum arg);
 static void movedb(const char *dbname, const char *tblspcname);
@@ -92,7 +120,505 @@ static bool have_createdb_privilege(void);
 static void remove_dbtablespaces(Oid db_id);
 static bool check_db_file_conflict(Oid db_id);
 static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
+static void CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid,
+									bool isRedo);
+static CreateDBRelInfo *GetRelInfoFromTuple(HeapTupleData *tuple,
+											Oid tbid, Oid dbid, char *srcpath);
+static List *GetRelListFromPage(Page page, Buffer buf, Oid tbid, Oid dbid,
+								char *srcpath, List *rnodelist,
+								Snapshot snapshot);
+static List *GetDatabaseRelationList(Oid srctbid, Oid srcdbid, char *srcpath);
+static void CopyDatabaseWithWal(Oid src_dboid, Oid dboid, Oid src_tsid,
+								Oid dst_tsid);
+static void CopyDatabase(Oid src_dboid, Oid dboid, Oid src_tsid, Oid dst_tsid);
+
+/*
+ * CreateDirAndVersionFile - Create database directory and write out the
+ *							 PG_VERSION file in the database path.
+ *
+ * If isRedo is true, it's okay for the database directory to exist already.
+ *
+ * We can directly write PG_MAJORVERSION in the version file instead of copying
+ * from the source database file because these two must be the same.
+ */
+static void
+CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
+{
+	int		fd;
+	int		nbytes;
+	char	versionfile[MAXPGPATH];
+	char	buf[16];
+
+	/* Prepare version data before starting a critical section. */
+	sprintf(buf, "%s\n", PG_MAJORVERSION);
+	nbytes = strlen(PG_MAJORVERSION) + 1;
+
+	/* If we are not in WAL replay then write the WAL. */
+	if (!isRedo)
+	{
+		xl_dbase_create_rec xlrec;
+		XLogRecPtr	lsn;
+
+		START_CRIT_SECTION();
+
+		xlrec.db_id = dbid;
+		xlrec.tablespace_id = tsid;
+		xlrec.src_db_id = InvalidOid;
+		xlrec.src_tablespace_id = InvalidOid;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), sizeof(xl_dbase_create_rec));
+
+		lsn = XLogInsert(RM_DBASE_ID, XLOG_DBASE_CREATE);
+
+		/* As always, WAL must hit the disk before the data update does. */
+		XLogFlush(lsn);
+	}
+
+	/* Create database directory. */
+	if (MakePGDirectory(dbpath) < 0)
+	{
+		/* Failure other than already exists or not in WAL replay? */
+		if (errno != EEXIST || !isRedo)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create directory \"%s\": %m", dbpath)));
+	}
+
+	/*
+	 * Create PG_VERSION file in the database path.  If the file already exists
+	 * and we are in WAL replay then try again to open it in write mode.
+	 */
+	snprintf(versionfile, sizeof(versionfile), "%s/%s", dbpath, "PG_VERSION");
+
+	fd = OpenTransientFile(versionfile, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	if (fd < 0 && errno == EEXIST && isRedo)
+		fd = OpenTransientFile(versionfile, O_WRONLY | O_TRUNC | PG_BINARY);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m", versionfile)));
+
+	/* Write PG_MAJORVERSION in the PG_VERSION file. */
+	pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_WRITE);
+	errno = 0;
+	if ((int) write(fd, buf, nbytes) != nbytes)
+	{
+		/* If write didn't set errno, assume problem is no disk space. */
+		if (errno == 0)
+			errno = ENOSPC;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", versionfile)));
+	}
+	pgstat_report_wait_end();
+
+	/* Close the version file. */
+	CloseTransientFile(fd);
+
+	/* Critical section done. */
+	if (!isRedo)
+		END_CRIT_SECTION();
+}
+
+/*
+ * GetRelInfoFromTuple - Prepare a CreateDBRelInfo element from the tuple
+ *
+ * Helper function for GetRelListFromPage to prepare a single element from the
+ * pg_class tuple.
+ */
+static CreateDBRelInfo *
+GetRelInfoFromTuple(HeapTupleData *tuple, Oid tbid, Oid dbid, char *srcpath)
+{
+	CreateDBRelInfo	   *relinfo;
+	Form_pg_class		classForm;
+	Oid					relfilenode = InvalidOid;
+
+	classForm = (Form_pg_class) GETSTRUCT(tuple);
+
+	/*
+	 * If this is a shared object or the object has no storage, there is
+	 * nothing to be done, so just return.
+	 */
+	if (classForm->reltablespace == GLOBALTABLESPACE_OID ||
+		!RELKIND_HAS_STORAGE(classForm->relkind))
+		return NULL;
+
+	/*
+	 * If relfilenode is valid then directly use it.  Otherwise,
+	 * consult the relmapper for the mapped relation.
+	 */
+	if (OidIsValid(classForm->relfilenode))
+		relfilenode = classForm->relfilenode;
+	else
+		relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+										classForm->oid);
+
+	/* We must have a valid relfilenode oid. */
+	Assert(OidIsValid(relfilenode));
+
+	/* Prepare a rel info element; the caller adds it to the list. */
+	relinfo = (CreateDBRelInfo *) palloc(sizeof(CreateDBRelInfo));
+	if (OidIsValid(classForm->reltablespace))
+		relinfo->rnode.spcNode = classForm->reltablespace;
+	else
+		relinfo->rnode.spcNode = tbid;
+
+	relinfo->rnode.dbNode = dbid;
+	relinfo->rnode.relNode = relfilenode;
+	relinfo->reloid = classForm->oid;
+	relinfo->relpersistence = classForm->relpersistence;
+
+	return relinfo;
+}
+
+/*
+ * GetRelListFromPage - Helper function for GetDatabaseRelationList.
+ *
+ * Iterate over each tuple of the given pg_class page and append the
+ * relfilenode info of every visible relation with storage to rnodelist.
+ */
+static List *
+GetRelListFromPage(Page page, Buffer buf, Oid tbid, Oid dbid, char *srcpath,
+				  List *rnodelist, Snapshot snapshot)
+{
+	BlockNumber		blkno = BufferGetBlockNumber(buf);
+	OffsetNumber	offnum;
+	OffsetNumber	maxoff;
+	HeapTupleData	tuple;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Iterate over each tuple on the page. */
+	for (offnum = FirstOffsetNumber;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid;
+
+		itemid = PageGetItemId(page, offnum);
+
+		/* Nothing to do if slot is empty or already dead. */
+		if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid) ||
+			ItemIdIsRedirected(itemid))
+			continue;
+
+		Assert(ItemIdIsNormal(itemid));
+		ItemPointerSet(&(tuple.t_self), blkno, offnum);
+
+		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+		tuple.t_len = ItemIdGetLength(itemid);
+		tuple.t_tableOid = RelationRelationId;
 
+		/*
+		 * If the tuple is visible then add its relfilenode info to the
+		 * list.
+		 */
+		if (HeapTupleSatisfiesVisibility(&tuple, snapshot, buf))
+		{
+			CreateDBRelInfo	   *relinfo;
+
+			relinfo = GetRelInfoFromTuple(&tuple, tbid, dbid, srcpath);
+
+			/* Add it to the list. */
+			if (relinfo != NULL)
+				rnodelist = lappend(rnodelist, relinfo);
+		}
+	}
+
+	return rnodelist;
+}
+
+/*
+ * GetDatabaseRelationList - Get relfilenode list to be copied.
+ *
+ * Iterate over each block of the pg_class relation.  From there, we will check
+ * all the visible tuples in order to get a list of all the valid relfilenodes
+ * in the source database that should be copied to the target database.
+ */
+static List *
+GetDatabaseRelationList(Oid tbid, Oid dbid, char *srcpath)
+{
+	SMgrRelation	rd_smgr;
+	RelFileNode		rnode;
+	BlockNumber		nblocks;
+	BlockNumber		blkno;
+	Buffer			buf;
+	Oid				relfilenode;
+	Page			page;
+	List		   *rnodelist = NIL;
+	LockRelId		relid;
+	Snapshot		snapshot;
+	BufferAccessStrategy bstrategy;
+
+	/* Get pg_class relfilenode. */
+	relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+													  RelationRelationId);
+	/*
+	 * We are going to read the buffers associated with the pg_class relation,
+	 * so acquire a relation-level lock before starting the scan.  As we are
+	 * not connected to the database, we cannot use relation_open() directly,
+	 * so we have to lock using the relation id.
+	 */
+	relid.dbId = dbid;
+	relid.relId = RelationRelationId;
+	LockRelationId(&relid, AccessShareLock);
+
+	/* Prepare a relnode for pg_class relation. */
+	rnode.spcNode = tbid;
+	rnode.dbNode = dbid;
+	rnode.relNode = relfilenode;
+
+	/*
+	 * We are not connected to the source database so open the pg_class
+	 * relation at the smgr level and get the block count.
+	 */
+	rd_smgr = smgropen(rnode, InvalidBackendId);
+	nblocks = smgrnblocks(rd_smgr, MAIN_FORKNUM);
+
+	/*
+	 * We're going to read the whole of pg_class, so use the bulk-read buffer
+	 * access strategy.
+	 */
+	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	/* Get latest snapshot for scanning the pg_class. */
+	snapshot = GetLatestSnapshot();
+
+	/* Iterate over each block on the pg_class relation. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		/*
+		 * We are not connected to the source database so directly use the lower
+		 * level bufmgr interface which operates on the rnode.
+		 */
+		buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
+										RBM_NORMAL, bstrategy,
+										RELPERSISTENCE_PERMANENT);
+
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buf);
+		if (PageIsNew(page) || PageIsEmpty(page))
+		{
+			UnlockReleaseBuffer(buf);
+			continue;
+		}
+
+		/*
+		 * Process pg_class tuple for the current page and add all the valid
+		 * relfilenode entries to the rnodelist.
+		 */
+		rnodelist = GetRelListFromPage(page, buf, tbid, dbid, srcpath,
+									   rnodelist, snapshot);
+
+		/* Release the buffer lock and pin. */
+		UnlockReleaseBuffer(buf);
+	}
+
+	/* Release the lock. */
+	UnlockRelationId(&relid, AccessShareLock);
+
+	return rnodelist;
+}
+
+/*
+ * CopyDatabaseWithWal - Copy source database to the target database with WAL
+ *
+ * Create target database directory and copy data files from the source
+ * database to the target database, block by block and WAL log all the
+ * operations.
+ */
+static void
+CopyDatabaseWithWal(Oid src_dboid, Oid dst_dboid, Oid src_tsid, Oid dst_tsid)
+{
+	char	   *srcpath;
+	char	   *dstpath;
+	List	   *rnodelist = NIL;
+	ListCell   *cell;
+	LockRelId	relid;
+	RelFileNode	srcrnode;
+	RelFileNode	dstrnode;
+	CreateDBRelInfo	*relinfo;
+
+	/* Get the source database path. */
+	srcpath = GetDatabasePath(src_dboid, src_tsid);
+
+	/* Get the destination database path. */
+	dstpath = GetDatabasePath(dst_dboid, dst_tsid);
+
+	/* Create database directory and write PG_VERSION file. */
+	CreateDirAndVersionFile(dstpath, dst_dboid, dst_tsid, false);
+
+	/* Copy relmap file from source database to the destination database. */
+	CopyRelationMap(dst_dboid, dst_tsid, srcpath, dstpath);
+
+	/* Get the list of all valid relfilenodes in the source database. */
+	rnodelist = GetDatabaseRelationList(src_tsid, src_dboid, srcpath);
+	Assert(rnodelist != NIL);
+
+	/*
+	 * The database id is the same for all relations, so set it before
+	 * entering the loop.
+	 */
+	relid.dbId = src_dboid;
+
+	/*
+	 * Iterate over each relfilenode and copy the relation data block by block
+	 * from source database to the destination database.
+	 */
+	foreach(cell, rnodelist)
+	{
+		relinfo = lfirst(cell);
+		srcrnode = relinfo->rnode;
+
+		/*
+		 * If the relation is in the source db's default tablespace then we
+		 * need to create it in the destination db's default tablespace.
+		 * Otherwise, we need to create it in the same tablespace as in the
+		 * source database.
+		 */
+		if (srcrnode.spcNode == src_tsid)
+			dstrnode.spcNode = dst_tsid;
+		else
+			dstrnode.spcNode = srcrnode.spcNode;
+
+		dstrnode.dbNode = dst_dboid;
+		dstrnode.relNode = srcrnode.relNode;
+
+		/* Acquire a lock on the relation before starting the copy. */
+		relid.relId = relinfo->reloid;
+		LockRelationId(&relid, AccessShareLock);
+
+		/* Copy relation storage from source to the destination. */
+		CreateAndCopyRelationData(srcrnode, dstrnode, relinfo->relpersistence);
+
+		/* Release the lock. */
+		UnlockRelationId(&relid, AccessShareLock);
+	}
+
+	list_free_deep(rnodelist);
+}
+
+/*
+ * CopyDatabase - Copy source database to the target database
+ *
+ * Copy source database directory to the destination directory using copydir
+ * operation.
+ */
+static void
+CopyDatabase(Oid src_dboid, Oid dst_dboid, Oid src_tsid, Oid dst_tsid)
+{
+	TableScanDesc	scan;
+	Relation		rel;
+	HeapTuple		tuple;
+
+	/*
+	 * Force a checkpoint before starting the copy. This will force all
+	 * dirty buffers, including those of unlogged tables, out to disk, to
+	 * ensure source database is up-to-date on disk for the copy.
+	 * FlushDatabaseBuffers() would suffice for that, but we also want to
+	 * process any pending unlink requests. Otherwise, if a checkpoint
+	 * happened while we're copying files, a file might be deleted just
+	 * when we're about to copy it, causing the lstat() call in copydir()
+	 * to fail with ENOENT.
+	 */
+	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
+					  CHECKPOINT_WAIT | CHECKPOINT_FLUSH_ALL);
+
+	/*
+	 * Iterate through all tablespaces of the template database, and copy
+	 * each one to the new database.
+	 */
+	rel = table_open(TableSpaceRelationId, AccessShareLock);
+	scan = table_beginscan_catalog(rel, 0, NULL);
+	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	{
+		Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
+		Oid			srctablespace = spaceform->oid;
+		Oid			dsttablespace;
+		char	   *srcpath;
+		char	   *dstpath;
+		struct stat st;
+
+		/* No need to copy global tablespace */
+		if (srctablespace == GLOBALTABLESPACE_OID)
+			continue;
+
+		srcpath = GetDatabasePath(src_dboid, srctablespace);
+
+		if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
+			directory_is_empty(srcpath))
+		{
+			/* Assume we can ignore it */
+			pfree(srcpath);
+			continue;
+		}
+
+		if (srctablespace == src_tsid)
+			dsttablespace = dst_tsid;
+		else
+			dsttablespace = srctablespace;
+
+		dstpath = GetDatabasePath(dst_dboid, dsttablespace);
+
+		/*
+		 * Copy this subdirectory to the new location
+		 *
+		 * We don't need to copy subdirectories
+		 */
+		copydir(srcpath, dstpath, false);
+
+		/* Record the filesystem change in XLOG */
+		{
+			xl_dbase_create_rec xlrec;
+
+			xlrec.db_id = dst_dboid;
+			xlrec.tablespace_id = dsttablespace;
+			xlrec.src_db_id = src_dboid;
+			xlrec.src_tablespace_id = srctablespace;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
+
+			(void) XLogInsert(RM_DBASE_ID,
+							  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
+		}
+	}
+	table_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	/*
+	 * We force a checkpoint before committing.  This effectively means
+	 * that committed XLOG_DBASE_CREATE operations will never need to be
+	 * replayed (at least not in ordinary crash recovery; we still have to
+	 * make the XLOG entry for the benefit of PITR operations). This
+	 * avoids two nasty scenarios:
+	 *
+	 * #1: When PITR is off, we don't XLOG the contents of newly created
+	 * indexes; therefore the drop-and-recreate-whole-directory behavior
+	 * of DBASE_CREATE replay would lose such indexes.
+	 *
+	 * #2: Since we have to recopy the source database during DBASE_CREATE
+	 * replay, we run the risk of copying changes in it that were
+	 * committed after the original CREATE DATABASE command but before the
+	 * system crash that led to the replay.  This is at least unexpected
+	 * and at worst could lead to inconsistencies, eg duplicate table
+	 * names.
+	 *
+	 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
+	 *
+	 * In PITR replay, the first of these isn't an issue, and the second
+	 * is only a risk if the CREATE DATABASE and subsequent template
+	 * database change both occur while a base backup is being taken.
+	 * There doesn't seem to be much we can do about that except document
+	 * it as a limitation.
+	 *
+	 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
+	 * we can avoid this.
+	 */
+	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+}
 
 /*
  * CREATE DATABASE
@@ -100,8 +626,6 @@ static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
 Oid
 createdb(ParseState *pstate, const CreatedbStmt *stmt)
 {
-	TableScanDesc scan;
-	Relation	rel;
 	Oid			src_dboid;
 	Oid			src_owner;
 	int			src_encoding = -1;
@@ -132,6 +656,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	DefElem    *dallowconnections = NULL;
 	DefElem    *dconnlimit = NULL;
 	DefElem    *dcollversion = NULL;
+	DefElem    *dstrategy = NULL;
 	char	   *dbname = stmt->dbname;
 	char	   *dbowner = NULL;
 	const char *dbtemplate = NULL;
@@ -145,6 +670,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	char	   *dbcollversion = NULL;
 	int			notherbackends;
 	int			npreparedxacts;
+	CreateDBStrategy	dbstrategy = CREATEDB_WAL_LOG;
 	createdb_failure_params fparms;
 
 	/* Extract options from the statement node tree */
@@ -250,6 +776,12 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 						(errcode(ERRCODE_INVALID_PARAMETER_VALUE)),
 						errmsg("OIDs less than %u are reserved for system objects", FirstNormalObjectId));
 		}
+		else if (strcmp(defel->defname, "strategy") == 0)
+		{
+			if (dstrategy)
+				errorConflictingDefElem(defel, pstate);
+			dstrategy = defel;
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -374,6 +906,23 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 							dbtemplate)));
 	}
 
+	/* Validate the database creation strategy. */
+	if (dstrategy && dstrategy->arg)
+	{
+		char	*strategy;
+
+		strategy = defGetString(dstrategy);
+		if (strcmp(strategy, "wal_log") == 0)
+			dbstrategy = CREATEDB_WAL_LOG;
+		else if (strcmp(strategy, "file_copy") == 0)
+			dbstrategy = CREATEDB_FILE_COPY;
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("invalid create database strategy %s", strategy),
+					 errhint("Valid strategies are \"wal_log\" and \"file_copy\".")));
+	}
+
 	/* If encoding or locales are defaulted, use source's setting */
 	if (encoding < 0)
 		encoding = src_encoding;
@@ -668,19 +1217,6 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	InvokeObjectPostCreateHook(DatabaseRelationId, dboid, 0);
 
 	/*
-	 * Force a checkpoint before starting the copy. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, if a checkpoint
-	 * happened while we're copying files, a file might be deleted just when
-	 * we're about to copy it, causing the lstat() call in copydir() to fail
-	 * with ENOENT.
-	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
-
-	/*
 	 * Once we start copying subdirectories, we need to be able to clean 'em
 	 * up if we fail.  Use an ENSURE block to make sure this happens.  (This
 	 * is not a 100% solution, because of the possibility of failure during
@@ -689,114 +1225,47 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	 */
 	fparms.src_dboid = src_dboid;
 	fparms.dest_dboid = dboid;
+	fparms.strategy = dbstrategy;
+
 	PG_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
 		/*
-		 * Iterate through all tablespaces of the template database, and copy
-		 * each one to the new database.
+		 * If the user has asked to create a database with the WAL_LOG
+		 * strategy, call CopyDatabaseWithWal(), which copies the database at
+		 * the block level and WAL-logs each copied block.  Otherwise, call
+		 * CopyDatabase(), which copies the database file by file.
 		 */
-		rel = table_open(TableSpaceRelationId, AccessShareLock);
-		scan = table_beginscan_catalog(rel, 0, NULL);
-		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+		if (dbstrategy == CREATEDB_WAL_LOG)
 		{
-			Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
-			Oid			srctablespace = spaceform->oid;
-			Oid			dsttablespace;
-			char	   *srcpath;
-			char	   *dstpath;
-			struct stat st;
-
-			/* No need to copy global tablespace */
-			if (srctablespace == GLOBALTABLESPACE_OID)
-				continue;
-
-			srcpath = GetDatabasePath(src_dboid, srctablespace);
-
-			if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
-				directory_is_empty(srcpath))
-			{
-				/* Assume we can ignore it */
-				pfree(srcpath);
-				continue;
-			}
-
-			if (srctablespace == src_deftablespace)
-				dsttablespace = dst_deftablespace;
-			else
-				dsttablespace = srctablespace;
-
-			dstpath = GetDatabasePath(dboid, dsttablespace);
+			CopyDatabaseWithWal(src_dboid, dboid, src_deftablespace,
+								dst_deftablespace);
 
 			/*
-			 * Copy this subdirectory to the new location
-			 *
-			 * We don't need to copy subdirectories
+			 * Close pg_database, but keep lock till commit.
 			 */
-			copydir(srcpath, dstpath, false);
-
-			/* Record the filesystem change in XLOG */
-			{
-				xl_dbase_create_rec xlrec;
-
-				xlrec.db_id = dboid;
-				xlrec.tablespace_id = dsttablespace;
-				xlrec.src_db_id = src_dboid;
-				xlrec.src_tablespace_id = srctablespace;
-
-				XLogBeginInsert();
-				XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-				(void) XLogInsert(RM_DBASE_ID,
-								  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-			}
+			table_close(pg_database_rel, NoLock);
 		}
-		table_endscan(scan);
-		table_close(rel, AccessShareLock);
+		else
+		{
+			Assert(dbstrategy == CREATEDB_FILE_COPY);
 
-		/*
-		 * We force a checkpoint before committing.  This effectively means
-		 * that committed XLOG_DBASE_CREATE operations will never need to be
-		 * replayed (at least not in ordinary crash recovery; we still have to
-		 * make the XLOG entry for the benefit of PITR operations). This
-		 * avoids two nasty scenarios:
-		 *
-		 * #1: When PITR is off, we don't XLOG the contents of newly created
-		 * indexes; therefore the drop-and-recreate-whole-directory behavior
-		 * of DBASE_CREATE replay would lose such indexes.
-		 *
-		 * #2: Since we have to recopy the source database during DBASE_CREATE
-		 * replay, we run the risk of copying changes in it that were
-		 * committed after the original CREATE DATABASE command but before the
-		 * system crash that led to the replay.  This is at least unexpected
-		 * and at worst could lead to inconsistencies, eg duplicate table
-		 * names.
-		 *
-		 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
-		 *
-		 * In PITR replay, the first of these isn't an issue, and the second
-		 * is only a risk if the CREATE DATABASE and subsequent template
-		 * database change both occur while a base backup is being taken.
-		 * There doesn't seem to be much we can do about that except document
-		 * it as a limitation.
-		 *
-		 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
-		 * we can avoid this.
-		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+			CopyDatabase(src_dboid, dboid, src_deftablespace,
+						 dst_deftablespace);
 
-		/*
-		 * Close pg_database, but keep lock till commit.
-		 */
-		table_close(pg_database_rel, NoLock);
+			/*
+			 * Close pg_database, but keep lock till commit.
+			 */
+			table_close(pg_database_rel, NoLock);
 
-		/*
-		 * Force synchronous commit, thus minimizing the window between
-		 * creation of the database files and committal of the transaction. If
-		 * we crash before committing, we'll have a DB that's taking up disk
-		 * space but is not in pg_database, which is not good.
-		 */
-		ForceSyncCommit();
+			/*
+			 * Force synchronous commit, thus minimizing the window between
+			 * creation of the database files and committal of the transaction.
+			 * If we crash before committing, we'll have a DB that's taking up
+			 * disk space but is not in pg_database, which is not good.
+			 */
+			ForceSyncCommit();
+		}
 	}
 	PG_END_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 								PointerGetDatum(&fparms));
@@ -870,6 +1339,21 @@ createdb_failure_callback(int code, Datum arg)
 	createdb_failure_params *fparms = (createdb_failure_params *) DatumGetPointer(arg);
 
 	/*
+	 * If we were copying the database at the block level then drop pages for
+	 * the destination database that are in the shared buffer cache, and tell
+	 * the checkpointer to forget any pending fsync and unlink requests for
+	 * files in the database.  The reasoning is the same as explained in
+	 * dropdb().  Unlike dropdb(), we don't need to call
+	 * pgstat_drop_database() because this database has not been created yet,
+	 * so there should not be any stats for it.
+	 */
+	if (fparms->strategy == CREATEDB_WAL_LOG)
+	{
+		DropDatabaseBuffers(fparms->dest_dboid);
+		ForgetDatabaseSyncRequests(fparms->dest_dboid);
+	}
+
+	/*
 	 * Release lock on source database before doing recursive remove. This is
 	 * not essential but it seems desirable to release the lock as soon as
 	 * possible.
@@ -2387,32 +2871,40 @@ dbase_redo(XLogReaderState *record)
 		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
 
-		/*
-		 * Our theory for replaying a CREATE is to forcibly drop the target
-		 * subdirectory if present, then re-copy the source data. This may be
-		 * more work than needed, but it is simple to implement.
-		 */
-		if (stat(dst_path, &st) == 0 && S_ISDIR(st.st_mode))
+		if (!OidIsValid(xlrec->src_db_id))
 		{
-			if (!rmtree(dst_path, true))
-				/* If this failed, copydir() below is going to error. */
-				ereport(WARNING,
-						(errmsg("some useless files may be left behind in old database directory \"%s\"",
-								dst_path)));
+			CreateDirAndVersionFile(dst_path, xlrec->db_id, xlrec->tablespace_id,
+									true);
 		}
+		else
+		{
+			/*
+			 * Our theory for replaying a CREATE is to forcibly drop the target
+			 * subdirectory if present, then re-copy the source data. This may be
+			 * more work than needed, but it is simple to implement.
+			 */
+			if (stat(dst_path, &st) == 0 && S_ISDIR(st.st_mode))
+			{
+				if (!rmtree(dst_path, true))
+					/* If this failed, copydir() below is going to error. */
+					ereport(WARNING,
+							(errmsg("some useless files may be left behind in old database directory \"%s\"",
+									dst_path)));
+			}
 
-		/*
-		 * Force dirty buffers out to disk, to ensure source database is
-		 * up-to-date for the copy.
-		 */
-		FlushDatabaseBuffers(xlrec->src_db_id);
+			/*
+			 * Force dirty buffers out to disk, to ensure source database is
+			 * up-to-date for the copy.
+			 */
+			FlushDatabaseBuffers(xlrec->src_db_id);
 
-		/*
-		 * Copy this subdirectory to the new location
-		 *
-		 * We don't need to copy subdirectories
-		 */
-		copydir(src_path, dst_path, false);
+			/*
+			 * Copy this subdirectory to the new location
+			 *
+			 * We don't need to copy subdirectories
+			 */
+			copydir(src_path, dst_path, false);
+		}
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0ed2d31..156806d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -38,6 +38,7 @@
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
 #include "executor/instrument.h"
 #include "lib/binaryheap.h"
 #include "miscadmin.h"
@@ -486,6 +487,9 @@ static void FindAndDropRelFileNodeBuffers(RelFileNode rnode,
 										  ForkNumber forkNum,
 										  BlockNumber nForkBlock,
 										  BlockNumber firstDelBlock);
+static void RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+										   ForkNumber forkNum,
+										   char relpersistence);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int	rnode_comparator(const void *p1, const void *p2);
@@ -3670,6 +3674,148 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
 }
 
 /* ---------------------------------------------------------------------
+ *		RelationCopyStorageUsingBuffer
+ *
+ *		Copy fork's data using bufmgr.  Same as RelationCopyStorage but instead
+ *		of using smgrread and smgrextend this will copy using bufmgr APIs.
+ * --------------------------------------------------------------------
+ */
+static void
+RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+							   ForkNumber forkNum, char relpersistence)
+{
+	Buffer		srcBuf;
+	Buffer		dstBuf;
+	Page		srcPage;
+	Page		dstPage;
+	bool		use_wal;
+	bool		copying_initfork;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	BufferAccessStrategy bstrategy_src;
+	BufferAccessStrategy bstrategy_dst;
+
+	/* See the comments in RelationCopyStorage(). */
+	copying_initfork = relpersistence == RELPERSISTENCE_UNLOGGED &&
+		forkNum == INIT_FORKNUM;
+	use_wal = XLogIsNeeded() &&
+		(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
+
+	/* Get number of blocks in the source relation. */
+	nblocks = smgrnblocks(src, forkNum);
+
+	/* Nothing to copy, so exit early. */
+	if (nblocks == 0)
+		return;
+
+	/*
+	 * We are going to copy the whole relation from the source to the
+	 * destination, so use the BAS_BULKREAD strategy for the source relation
+	 * and the BAS_BULKWRITE strategy for the destination relation.
+	 */
+	bstrategy_src = GetAccessStrategy(BAS_BULKREAD);
+	bstrategy_dst = GetAccessStrategy(BAS_BULKWRITE);
+
+	/* Iterate over each block of the source relation file. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		/* If we got a cancel signal during the copy of the data, quit */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Read block from source relation. */
+		srcBuf = ReadBufferWithoutRelcache(src->smgr_rnode.node, forkNum,
+										   blkno, RBM_NORMAL, bstrategy_src,
+										   relpersistence);
+		srcPage = BufferGetPage(srcBuf);
+		if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
+		{
+			ReleaseBuffer(srcBuf);
+			continue;
+		}
+
+		/* Use P_NEW to extend the relation. */
+		dstBuf = ReadBufferWithoutRelcache(dst->smgr_rnode.node, forkNum,
+										   P_NEW, RBM_NORMAL, bstrategy_dst,
+										   relpersistence);
+		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
+
+		START_CRIT_SECTION();
+
+		/* Initialize the page and write the data. */
+		dstPage = BufferGetPage(dstBuf);
+		PageInit(dstPage, BufferGetPageSize(dstBuf), 0);
+		memcpy(dstPage, srcPage, BLCKSZ);
+		MarkBufferDirty(dstBuf);
+
+		/* WAL-log the copied page. */
+		if (use_wal)
+			log_newpage_buffer(dstBuf, true);
+
+		END_CRIT_SECTION();
+
+		UnlockReleaseBuffer(dstBuf);
+		ReleaseBuffer(srcBuf);
+	}
+}
+
+/* ---------------------------------------------------------------------
+ *		CreateAndCopyRelationData
+ *
+ *		Create destination relation storage and copy the data of all of the
+ *		source relation's forks to the destination.
+ * --------------------------------------------------------------------
+ */
+void
+CreateAndCopyRelationData(RelFileNode src_rnode, RelFileNode dst_rnode,
+						  char relpersistence)
+{
+	SMgrRelation	src_smgr;
+	SMgrRelation	dst_smgr;
+
+	/* Open the source relation at smgr level. */
+	src_smgr = smgropen(src_rnode, InvalidBackendId);
+
+	/*
+	 * Create and copy all forks of the relation.
+	 *
+	 * NOTE: any conflict in relfilenode value will be caught in
+	 * RelationCreateStorage().
+	 */
+	dst_smgr = RelationCreateStorage(dst_rnode, relpersistence);
+
+	/* copy main fork */
+	RelationCopyStorageUsingBuffer(src_smgr, dst_smgr, MAIN_FORKNUM,
+								   relpersistence);
+
+	/* copy those extra forks that exist */
+	for (ForkNumber forkNum = MAIN_FORKNUM + 1;
+		 forkNum <= MAX_FORKNUM; forkNum++)
+	{
+		if (smgrexists(src_smgr, forkNum))
+		{
+			smgrcreate(dst_smgr, forkNum, false);
+
+			/*
+			 * WAL log creation if the relation is persistent, or this is the
+			 * init fork of an unlogged relation.
+			 */
+			if (relpersistence == RELPERSISTENCE_PERMANENT ||
+				(relpersistence == RELPERSISTENCE_UNLOGGED &&
+				 forkNum == INIT_FORKNUM))
+				log_smgrcreate(&dst_rnode, forkNum);
+
+			/* Copy a fork's data, block by block. */
+			RelationCopyStorageUsingBuffer(src_smgr, dst_smgr, forkNum,
+										   relpersistence);
+		}
+	}
+
+	/* Close the smgr rel */
+	smgrclose(src_smgr);
+	smgrclose(dst_smgr);
+}
+
+/* ---------------------------------------------------------------------
  *		FlushDatabaseBuffers
  *
  *		This function writes all dirty pages of a database out to disk
diff --git a/src/include/commands/dbcommands_xlog.h b/src/include/commands/dbcommands_xlog.h
index 593a857..8f59870 100644
--- a/src/include/commands/dbcommands_xlog.h
+++ b/src/include/commands/dbcommands_xlog.h
@@ -20,6 +20,7 @@
 /* record types */
 #define XLOG_DBASE_CREATE		0x00
 #define XLOG_DBASE_DROP			0x10
+#define XLOG_DBASE_CREATEDIR	0x20
 
 typedef struct xl_dbase_create_rec
 {
@@ -30,6 +31,13 @@ typedef struct xl_dbase_create_rec
 	Oid			src_tablespace_id;
 } xl_dbase_create_rec;
 
+typedef struct xl_dbase_createdir_rec
+{
+	/* Records creating database directory */
+	Oid			db_id;
+	Oid			tablespace_id;
+} xl_dbase_createdir_rec;
+
 typedef struct xl_dbase_drop_rec
 {
 	Oid			db_id;
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 7b80f58..a5659c0 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -204,6 +204,9 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
 extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels);
+extern void CreateAndCopyRelationData(RelFileNode src_rnode,
+									  RelFileNode dst_rnode,
+									  char relpersistence);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
 								   int nforks, BlockNumber *firstDelBlock);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d9b83f7..dcda8ca 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -460,6 +460,7 @@ CoverPos
 CreateAmStmt
 CreateCastStmt
 CreateConversionStmt
+CreateDBRelInfo
 CreateDomainStmt
 CreateEnumStmt
 CreateEventTrigStmt
-- 
1.8.3.1
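
The WAL decision made at the top of RelationCopyStorageUsingBuffer can be isolated as a small predicate. What follows is a minimal, self-contained sketch and not code from the patch: the SKETCH_* names and sketch_copy_needs_wal() are hypothetical stand-ins for the relpersistence codes and fork numbers, and XLogIsNeeded() is modeled as a plain boolean argument.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the RELPERSISTENCE_* codes (sketch only). */
#define SKETCH_PERMANENT 'p'
#define SKETCH_UNLOGGED  'u'
#define SKETCH_TEMP      't'

/* Hypothetical stand-ins for ForkNumber values (sketch only). */
enum sketch_fork
{
	SKETCH_MAIN_FORK = 0,
	SKETCH_FSM_FORK,
	SKETCH_VM_FORK,
	SKETCH_INIT_FORK
};

/*
 * Decide whether copied pages should be WAL-logged.  xlog_is_needed models
 * the result of XLogIsNeeded(); it is passed in so the sketch has no
 * dependency on server state.
 */
static bool
sketch_copy_needs_wal(bool xlog_is_needed, char relpersistence,
					  enum sketch_fork forknum)
{
	bool		copying_initfork;

	assert(relpersistence == SKETCH_PERMANENT ||
		   relpersistence == SKETCH_UNLOGGED ||
		   relpersistence == SKETCH_TEMP);

	/* The init fork of an unlogged relation is logged like permanent data. */
	copying_initfork = (relpersistence == SKETCH_UNLOGGED &&
						forknum == SKETCH_INIT_FORK);

	return xlog_is_needed &&
		(relpersistence == SKETCH_PERMANENT || copying_initfork);
}
```

Under this model the main fork of an unlogged relation is copied without WAL, while its init fork is logged in the same way as permanent data.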

v12-0001-Extend-relmap-interfaces.patch

From a479f7057649e2c6ef332ff313e9291089e193e0 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Thu, 10 Mar 2022 10:18:18 +0530
Subject: [PATCH v12 1/4] Extend relmap interfaces

Add two new interfaces to the relmapper: 1) copying the relmap
file from one database path to another database path, and
2) getting the filenode for a relation OID.  We already have
RelationMapOidToFilenode for the latter purpose, but it assumes
we are connected to the database whose mapping we want.  The new
interface does the same thing, but it reads the mapping for the
input database instead.

These interfaces are required by the next patch, to support
WAL-logged CREATE DATABASE.
---
 src/backend/utils/cache/relmapper.c | 159 ++++++++++++++++++++++++++++--------
 src/include/utils/relmapper.h       |   7 +-
 2 files changed, 132 insertions(+), 34 deletions(-)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 4f6811f..6501110 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -136,10 +136,12 @@ static void apply_map_update(RelMapFile *map, Oid relationId, Oid fileNode,
 							 bool add_okay);
 static void merge_map_updates(RelMapFile *map, const RelMapFile *updates,
 							  bool add_okay);
-static void load_relmap_file(bool shared, bool lock_held);
+static void load_relmap_file(bool shared, bool lock_held, RelMapFile *dstmap,
+							 const char *dbpath);
 static void write_relmap_file(bool shared, RelMapFile *newmap,
 							  bool write_wal, bool send_sinval, bool preserve_files,
-							  Oid dbid, Oid tsid, const char *dbpath);
+							  Oid dbid, Oid tsid, const char *dbpath,
+							  bool create);
 static void perform_relmap_update(bool shared, const RelMapFile *updates);
 
 
@@ -250,6 +252,32 @@ RelationMapFilenodeToOid(Oid filenode, bool shared)
 }
 
 /*
+ * RelationMapOidToFilenodeForDatabase
+ *
+ * Same as RelationMapOidToFilenode, but instead of reading the mapping from
+ * the database we are connected to it will read the mapping from the input
+ * database.
+ */
+Oid
+RelationMapOidToFilenodeForDatabase(const char *dbpath, Oid relationId)
+{
+	RelMapFile	map;
+	int			i;
+
+	/* Read the relmap file from the source database. */
+	load_relmap_file(false, false, &map, dbpath);
+
+	/* Iterate over the relmap entries to find the input relation oid. */
+	for (i = 0; i < map.num_mappings; i++)
+	{
+		if (relationId == map.mappings[i].mapoid)
+			return map.mappings[i].mapfilenode;
+	}
+
+	return InvalidOid;
+}
+
+/*
  * RelationMapUpdateMap
  *
  * Install a new relfilenode mapping for the specified relation.
@@ -405,12 +433,12 @@ RelationMapInvalidate(bool shared)
 	if (shared)
 	{
 		if (shared_map.magic == RELMAPPER_FILEMAGIC)
-			load_relmap_file(true, false);
+			load_relmap_file(true, false, NULL, NULL);
 	}
 	else
 	{
 		if (local_map.magic == RELMAPPER_FILEMAGIC)
-			load_relmap_file(false, false);
+			load_relmap_file(false, false, NULL, NULL);
 	}
 }
 
@@ -425,9 +453,9 @@ void
 RelationMapInvalidateAll(void)
 {
 	if (shared_map.magic == RELMAPPER_FILEMAGIC)
-		load_relmap_file(true, false);
+		load_relmap_file(true, false, NULL, NULL);
 	if (local_map.magic == RELMAPPER_FILEMAGIC)
-		load_relmap_file(false, false);
+		load_relmap_file(false, false, NULL, NULL);
 }
 
 /*
@@ -569,9 +597,9 @@ RelationMapFinishBootstrap(void)
 
 	/* Write the files; no WAL or sinval needed */
 	write_relmap_file(true, &shared_map, false, false, false,
-					  InvalidOid, GLOBALTABLESPACE_OID, NULL);
+					  InvalidOid, GLOBALTABLESPACE_OID, NULL, false);
 	write_relmap_file(false, &local_map, false, false, false,
-					  MyDatabaseId, MyDatabaseTableSpace, DatabasePath);
+					  MyDatabaseId, MyDatabaseTableSpace, DatabasePath, false);
 }
 
 /*
@@ -612,7 +640,7 @@ RelationMapInitializePhase2(void)
 	/*
 	 * Load the shared map file, die on error.
 	 */
-	load_relmap_file(true, false);
+	load_relmap_file(true, false, NULL, NULL);
 }
 
 /*
@@ -633,7 +661,7 @@ RelationMapInitializePhase3(void)
 	/*
 	 * Load the local map file, die on error.
 	 */
-	load_relmap_file(false, false);
+	load_relmap_file(false, false, NULL, NULL);
 }
 
 /*
@@ -687,15 +715,46 @@ RestoreRelationMap(char *startAddress)
 }
 
 /*
+ * CopyRelationMap
+ *
+ * Copy relmapfile from source db path to the destination db path and WAL log
+ * the operation.  This function is only called during the create database, so
+ * the destination database is not yet visible to anyone else, thus we don't
+ * need to acquire the relmap lock while updating the destination relmap.
+ */
+void
+CopyRelationMap(Oid dbid, Oid tsid, const char *srcdbpath,
+				const char *dstdbpath)
+{
+	RelMapFile map;
+
+	/* Read the relmap file from the source database. */
+	load_relmap_file(false, false, &map, srcdbpath);
+
+	/*
+	 * Write map contents into the destination database's relmap file; no
+	 * sinval needed because there could be no one else connected to the
+	 * database we are creating now.
+	 */
+	write_relmap_file(false, &map, true, false, true, dbid, tsid, dstdbpath,
+					  true);
+}
+
+/*
  * load_relmap_file -- load data from the shared or local map file
  *
  * Because the map file is essential for access to core system catalogs,
  * failure to read it is a fatal error.
  *
- * Note that the local case requires DatabasePath to be set up.
+ * Note that the local case requires DatabasePath to be set up.  But during
+ * createdb we are not connected to the source database so we will have to pass
+ * the dbpath of the source database from which we want to read the relmap
+ * file.  And, we will have to pass a valid memory for the 'dstmap' into which
+ * we want to read the relmap.
  */
 static void
-load_relmap_file(bool shared, bool lock_held)
+load_relmap_file(bool shared, bool lock_held, RelMapFile *dstmap,
+				 const char *dbpath)
 {
 	RelMapFile *map;
 	char		mapfilename[MAXPGPATH];
@@ -703,7 +762,20 @@ load_relmap_file(bool shared, bool lock_held)
 	int			fd;
 	int			r;
 
-	if (shared)
+	/*
+	 * Prepare relmap file path.  If a valid dbpath is given then read the file
+	 * from that path.
+	 */
+	if (dbpath != NULL)
+	{
+		/* We must pass a valid dstmap for reading the mapfile contents. */
+		Assert(dstmap != NULL);
+
+		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+				 dbpath, RELMAPPER_FILENAME);
+		map = dstmap;
+	}
+	else if (shared)
 	{
 		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
 				 RELMAPPER_FILENAME);
@@ -796,12 +868,15 @@ load_relmap_file(bool shared, bool lock_held)
  * Because this may be called during WAL replay when MyDatabaseId,
  * DatabasePath, etc aren't valid, we require the caller to pass in suitable
  * values.  The caller is also responsible for being sure no concurrent
- * map update could be happening.
+ * map update could be happening.  This will also be called during create
+ * database and that time we are not connected to the database for which we
+ * have to write the relmap.  So we have to pass the valid dbpath for which we
+ * want to write the relmap file and also pass create as true.
  */
 static void
 write_relmap_file(bool shared, RelMapFile *newmap,
 				  bool write_wal, bool send_sinval, bool preserve_files,
-				  Oid dbid, Oid tsid, const char *dbpath)
+				  Oid dbid, Oid tsid, const char *dbpath, bool create)
 {
 	int			fd;
 	RelMapFile *realmap;
@@ -819,10 +894,18 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 	FIN_CRC32C(newmap->crc);
 
 	/*
-	 * Open the target file.  We prefer to do this before entering the
-	 * critical section, so that an open() failure need not force PANIC.
+	 * Prepare the target mapfilename, and also set which relmap we want to
+	 * update.  But if the create is passed true then we don't need to update
+	 * the memory relmap because we are not connected to database for which
+	 * we are writing the relmap file.
 	 */
-	if (shared)
+	if (create)
+	{
+		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+				 dbpath, RELMAPPER_FILENAME);
+		realmap = NULL;
+	}
+	else if (shared)
 	{
 		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
 				 RELMAPPER_FILENAME);
@@ -853,6 +936,7 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		xlrec.dbid = dbid;
 		xlrec.tsid = tsid;
 		xlrec.nbytes = sizeof(RelMapFile);
+		xlrec.create = create;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) (&xlrec), MinSizeOfRelmapUpdate);
@@ -935,14 +1019,17 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 	}
 
 	/*
-	 * Success, update permanent copy.  During bootstrap, we might be working
-	 * on the permanent copy itself, in which case skip the memcpy() to avoid
-	 * invoking nominally-undefined behavior.
+	 * Success, update permanent copy.  During bootstrap and the create
+	 * database, skip the memcpy().  Because during bootstrap, we might be
+	 * working on the permanent copy itself, whereas during create database
+	 * we are not connected to the database for which we are creating the
+	 * relmap file so it will be wrong to update the shared map of the current
+	 * database to which we are connected.
 	 */
-	if (realmap != newmap)
+	if (realmap != NULL && realmap != newmap)
 		memcpy(realmap, newmap, sizeof(RelMapFile));
 	else
-		Assert(!send_sinval);	/* must be bootstrapping */
+		Assert(!send_sinval);	/* must be bootstrapping or createdb */
 
 	/* Critical section done */
 	if (write_wal)
@@ -975,7 +1062,7 @@ perform_relmap_update(bool shared, const RelMapFile *updates)
 	LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
 
 	/* Be certain we see any other updates just made */
-	load_relmap_file(shared, true);
+	load_relmap_file(shared, true, NULL, NULL);
 
 	/* Prepare updated data in a local variable */
 	if (shared)
@@ -993,7 +1080,7 @@ perform_relmap_update(bool shared, const RelMapFile *updates)
 	write_relmap_file(shared, &newmap, true, true, true,
 					  (shared ? InvalidOid : MyDatabaseId),
 					  (shared ? GLOBALTABLESPACE_OID : MyDatabaseTableSpace),
-					  DatabasePath);
+					  DatabasePath, false);
 
 	/* Now we can release the lock */
 	LWLockRelease(RelationMappingLock);
@@ -1025,18 +1112,24 @@ relmap_redo(XLogReaderState *record)
 		dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
 
 		/*
-		 * Write out the new map and send sinval, but of course don't write a
-		 * new WAL entry.  There's no surrounding transaction to tell to
-		 * preserve files, either.
+		 * Write out the new map and send sinval if create is not set because
+		 * in case of create there should be no one else accessing the relmap.
+		 * But of course don't write a new WAL entry.  There's no surrounding
+		 * transaction to tell to preserve files, either.
 		 *
 		 * There shouldn't be anyone else updating relmaps during WAL replay,
-		 * but grab the lock to interlock against load_relmap_file().
+		 * but grab the lock to interlock against load_relmap_file().  But if
+		 * create is set then we don't need to lock because we are creating a
+		 * new database so there can be absolutely no one else looking at its
+		 * relmap file.
 		 */
-		LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
+		if (!xlrec->create)
+			LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
 		write_relmap_file((xlrec->dbid == InvalidOid), &newmap,
-						  false, true, false,
-						  xlrec->dbid, xlrec->tsid, dbpath);
-		LWLockRelease(RelationMappingLock);
+						  false, !xlrec->create, false,
+						  xlrec->dbid, xlrec->tsid, dbpath, xlrec->create);
+		if (!xlrec->create)
+			LWLockRelease(RelationMappingLock);
 
 		pfree(dbpath);
 	}
diff --git a/src/include/utils/relmapper.h b/src/include/utils/relmapper.h
index 9fbb5a7..b8c7ef0 100644
--- a/src/include/utils/relmapper.h
+++ b/src/include/utils/relmapper.h
@@ -29,6 +29,7 @@ typedef struct xl_relmap_update
 	Oid			dbid;			/* database ID, or 0 for shared map */
 	Oid			tsid;			/* database's tablespace, or pg_global */
 	int32		nbytes;			/* size of relmap data */
+	bool		create;			/* true if creating new relmap */
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } xl_relmap_update;
 
@@ -39,6 +40,9 @@ extern Oid	RelationMapOidToFilenode(Oid relationId, bool shared);
 
 extern Oid	RelationMapFilenodeToOid(Oid relationId, bool shared);
 
+extern Oid RelationMapOidToFilenodeForDatabase(const char *dbpath,
+											   Oid relationId);
+
 extern void RelationMapUpdateMap(Oid relationId, Oid fileNode, bool shared,
 								 bool immediate);
 
@@ -62,7 +66,8 @@ extern void RelationMapInitializePhase3(void);
 extern Size EstimateRelationMapSpace(void);
 extern void SerializeRelationMap(Size maxSize, char *startAddress);
 extern void RestoreRelationMap(char *startAddress);
-
+extern void CopyRelationMap(Oid dbid, Oid tsid, const char *srcdbpath,
+							const char *dstdbpath);
 extern void relmap_redo(XLogReaderState *record);
 extern void relmap_desc(StringInfo buf, XLogReaderState *record);
 extern const char *relmap_identify(uint8 info);
-- 
1.8.3.1
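
The lookup performed by RelationMapOidToFilenodeForDatabase is a linear scan over the map that was just loaded. A minimal sketch of that scan follows, assuming simplified hypothetical types (SketchRelMap, SketchOid) in place of RelMapFile and Oid, and leaving out the file read and CRC validation entirely. The OID and filenode values used with it are arbitrary examples.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_MAPPINGS 8	/* arbitrary small limit for the sketch */
#define SKETCH_INVALID_OID	0u	/* "not found" marker */

typedef uint32_t SketchOid;

/* Hypothetical, simplified stand-in for one relmap entry. */
typedef struct SketchMapping
{
	SketchOid	mapoid;			/* relation OID */
	SketchOid	mapfilenode;	/* its current filenode */
} SketchMapping;

/* Hypothetical, simplified stand-in for an in-memory relmap. */
typedef struct SketchRelMap
{
	int32_t		num_mappings;
	SketchMapping mappings[SKETCH_MAX_MAPPINGS];
} SketchRelMap;

/*
 * Return the filenode mapped to relid, or SKETCH_INVALID_OID if the map
 * has no entry for it.  The map is caller-owned, so no global state of
 * the current database is consulted.
 */
static SketchOid
sketch_oid_to_filenode(const SketchRelMap *map, SketchOid relid)
{
	assert(map != NULL);
	assert(map->num_mappings >= 0 &&
		   map->num_mappings <= SKETCH_MAX_MAPPINGS);

	for (int i = 0; i < map->num_mappings; i++)
	{
		if (map->mappings[i].mapoid == relid)
			return map->mappings[i].mapfilenode;
	}

	return SKETCH_INVALID_OID;
}
```

Because the map is passed in rather than taken from a process-global, the same routine works for any database whose map file has been read into memory.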

#141

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#138)

1 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Mar 10, 2022 at 6:02 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have completely changed the logic for this refactoring. Basically,
write_relmap_file() already has parameters to control whether to write
WAL and send inval, and we are already passing the dbpath. Instead of
making a new function, I just pass one additional parameter to this
function indicating whether we are creating a new map or not. I think
the changes are very small that way, and this looks cleaner to me.
Similarly, for load_relmap_file() I just had to pass the dbpath and
memory for the destination map. Please have a look and let me know
your thoughts.

It's not terrible, but how about something like the attached instead?
I think this has the effect of reducing the number of cases that the
low-level code needs to know about from 2 to 1, instead of making it
go up from 2 to 3.
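
To illustrate the idea: if the shared map is addressed by the path "global", the low-level code only ever has to build one kind of file name from a caller-supplied path. The following is a minimal sketch under that assumption, not code from either patch. sketch_relmap_path() and SKETCH_RELMAPPER_FILENAME are hypothetical names, and the file name string is an assumed value for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed map file name, for illustration only. */
#define SKETCH_RELMAPPER_FILENAME "pg_filenode.map"

/*
 * Build the map file path for any database path.  The caller passes
 * "global" for the shared map, so there is no shared/local branch here.
 * Returns 0 on success, -1 if the result did not fit in buf.
 */
static int
sketch_relmap_path(char *buf, size_t buflen, const char *dbpath)
{
	int			n;

	assert(buf != NULL && dbpath != NULL);

	n = snprintf(buf, buflen, "%s/%s", dbpath, SKETCH_RELMAPPER_FILENAME);
	if (n < 0 || (size_t) n >= buflen)
		return -1;
	return 0;
}
```

With this shape, reading or writing a map for the shared catalogs, the current database, or a database being created differs only in the path the caller supplies.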

I think we should also write the test cases for create database
strategy. But I do not see any test case for create database for
testing the existing options. So I am wondering whether we should add
the test case only for the new option we are providing or we should
create a separate path which tests the new option as well as the
existing options.

FWIW, src/bin/scripts/t/020_createdb.pl does a little bit of testing
of this kind.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

relmap-refactor-rmh.patch

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 4f6811f571..f172f61b58 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -136,9 +136,11 @@ static void apply_map_update(RelMapFile *map, Oid relationId, Oid fileNode,
 							 bool add_okay);
 static void merge_map_updates(RelMapFile *map, const RelMapFile *updates,
 							  bool add_okay);
-static void load_relmap_file(bool shared, bool lock_held);
-static void write_relmap_file(bool shared, RelMapFile *newmap,
-							  bool write_wal, bool send_sinval, bool preserve_files,
+static void read_relmap_file(bool shared, bool lock_held);
+static void load_relmap_file(RelMapFile *map, char *dbpath, bool lock_held,
+							 int elevel);
+static void write_relmap_file(RelMapFile *newmap, bool write_wal,
+							  bool send_sinval, bool preserve_files,
 							  Oid dbid, Oid tsid, const char *dbpath);
 static void perform_relmap_update(bool shared, const RelMapFile *updates);
 
@@ -405,12 +407,12 @@ RelationMapInvalidate(bool shared)
 	if (shared)
 	{
 		if (shared_map.magic == RELMAPPER_FILEMAGIC)
-			load_relmap_file(true, false);
+			read_relmap_file(true, false);
 	}
 	else
 	{
 		if (local_map.magic == RELMAPPER_FILEMAGIC)
-			load_relmap_file(false, false);
+			read_relmap_file(false, false);
 	}
 }
 
@@ -425,9 +427,9 @@ void
 RelationMapInvalidateAll(void)
 {
 	if (shared_map.magic == RELMAPPER_FILEMAGIC)
-		load_relmap_file(true, false);
+		read_relmap_file(true, false);
 	if (local_map.magic == RELMAPPER_FILEMAGIC)
-		load_relmap_file(false, false);
+		read_relmap_file(false, false);
 }
 
 /*
@@ -568,9 +570,9 @@ RelationMapFinishBootstrap(void)
 	Assert(pending_local_updates.num_mappings == 0);
 
 	/* Write the files; no WAL or sinval needed */
-	write_relmap_file(true, &shared_map, false, false, false,
-					  InvalidOid, GLOBALTABLESPACE_OID, NULL);
-	write_relmap_file(false, &local_map, false, false, false,
+	write_relmap_file(&shared_map, false, false, false,
+					  InvalidOid, GLOBALTABLESPACE_OID, "global");
+	write_relmap_file(&local_map, false, false, false,
 					  MyDatabaseId, MyDatabaseTableSpace, DatabasePath);
 }
 
@@ -612,7 +614,7 @@ RelationMapInitializePhase2(void)
 	/*
 	 * Load the shared map file, die on error.
 	 */
-	load_relmap_file(true, false);
+	read_relmap_file(true, false);
 }
 
 /*
@@ -633,7 +635,7 @@ RelationMapInitializePhase3(void)
 	/*
 	 * Load the local map file, die on error.
 	 */
-	load_relmap_file(false, false);
+	read_relmap_file(false, false);
 }
 
 /*
@@ -687,39 +689,48 @@ RestoreRelationMap(char *startAddress)
 }
 
 /*
- * load_relmap_file -- load data from the shared or local map file
+ * read_relmap_file -- load the shared or local map file
  *
- * Because the map file is essential for access to core system catalogs,
- * failure to read it is a fatal error.
+ * Because these files are essential for access to core system catalogs,
+ * failure to load either of them is a fatal error.
  *
  * Note that the local case requires DatabasePath to be set up.
  */
 static void
-load_relmap_file(bool shared, bool lock_held)
+read_relmap_file(bool shared, bool lock_held)
+{
+	if (shared)
+		load_relmap_file(&shared_map, "global", lock_held, FATAL);
+	else
+		load_relmap_file(&local_map, DatabasePath, lock_held, FATAL);
+}
+
+/*
+ * load_relmap_file -- load data from any relation mapper file
+ *
+ * dbpath must be the relevant database path, or "global" for shared relations.
+ *
+ * RelationMappingLock will be acquired and released unless lock_held = true.
+ *
+ * Errors will be reported at the indicated elevel, which should be at least
+ * ERROR.
+ */
+static void
+load_relmap_file(RelMapFile *map, char *dbpath, bool lock_held, int elevel)
 {
-	RelMapFile *map;
 	char		mapfilename[MAXPGPATH];
 	pg_crc32c	crc;
 	int			fd;
 	int			r;
 
-	if (shared)
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
-				 RELMAPPER_FILENAME);
-		map = &shared_map;
-	}
-	else
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 DatabasePath, RELMAPPER_FILENAME);
-		map = &local_map;
-	}
+	Assert(elevel >= ERROR);
 
-	/* Read data ... */
+	/* Open the target file. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s", dbpath,
+			 RELMAPPER_FILENAME);
 	fd = OpenTransientFile(mapfilename, O_RDONLY | PG_BINARY);
 	if (fd < 0)
-		ereport(FATAL,
+		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not open file \"%s\": %m",
 						mapfilename)));
@@ -734,16 +745,17 @@ load_relmap_file(bool shared, bool lock_held)
 	if (!lock_held)
 		LWLockAcquire(RelationMappingLock, LW_SHARED);
 
+	/* Now read the data. */
 	pgstat_report_wait_start(WAIT_EVENT_RELATION_MAP_READ);
 	r = read(fd, map, sizeof(RelMapFile));
 	if (r != sizeof(RelMapFile))
 	{
 		if (r < 0)
-			ereport(FATAL,
+			ereport(elevel,
 					(errcode_for_file_access(),
 					 errmsg("could not read file \"%s\": %m", mapfilename)));
 		else
-			ereport(FATAL,
+			ereport(elevel,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg("could not read file \"%s\": read %d of %zu",
 							mapfilename, r, sizeof(RelMapFile))));
@@ -754,7 +766,7 @@ load_relmap_file(bool shared, bool lock_held)
 		LWLockRelease(RelationMappingLock);
 
 	if (CloseTransientFile(fd) != 0)
-		ereport(FATAL,
+		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m",
 						mapfilename)));
@@ -763,7 +775,7 @@ load_relmap_file(bool shared, bool lock_held)
 	if (map->magic != RELMAPPER_FILEMAGIC ||
 		map->num_mappings < 0 ||
 		map->num_mappings > MAX_MAPPINGS)
-		ereport(FATAL,
+		ereport(elevel,
 				(errmsg("relation mapping file \"%s\" contains invalid data",
 						mapfilename)));
 
@@ -773,7 +785,7 @@ load_relmap_file(bool shared, bool lock_held)
 	FIN_CRC32C(crc);
 
 	if (!EQ_CRC32C(crc, map->crc))
-		ereport(FATAL,
+		ereport(elevel,
 				(errmsg("relation mapping file \"%s\" contains incorrect checksum",
 						mapfilename)));
 }
@@ -795,16 +807,16 @@ load_relmap_file(bool shared, bool lock_held)
  *
  * Because this may be called during WAL replay when MyDatabaseId,
  * DatabasePath, etc aren't valid, we require the caller to pass in suitable
- * values.  The caller is also responsible for being sure no concurrent
- * map update could be happening.
+ * values. Pass dbpath as "global" for the shared map.
+ *
+ * The caller is also responsible for being sure no concurrent map update
+ * could be happening.
  */
 static void
-write_relmap_file(bool shared, RelMapFile *newmap,
-				  bool write_wal, bool send_sinval, bool preserve_files,
-				  Oid dbid, Oid tsid, const char *dbpath)
+write_relmap_file(RelMapFile *newmap, bool write_wal, bool send_sinval,
+				  bool preserve_files, Oid dbid, Oid tsid, const char *dbpath)
 {
 	int			fd;
-	RelMapFile *realmap;
 	char		mapfilename[MAXPGPATH];
 
 	/*
@@ -822,19 +834,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 	 * Open the target file.  We prefer to do this before entering the
 	 * critical section, so that an open() failure need not force PANIC.
 	 */
-	if (shared)
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
-				 RELMAPPER_FILENAME);
-		realmap = &shared_map;
-	}
-	else
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 dbpath, RELMAPPER_FILENAME);
-		realmap = &local_map;
-	}
-
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 dbpath, RELMAPPER_FILENAME);
 	fd = OpenTransientFile(mapfilename, O_WRONLY | O_CREAT | PG_BINARY);
 	if (fd < 0)
 		ereport(ERROR,
@@ -934,16 +935,6 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		}
 	}
 
-	/*
-	 * Success, update permanent copy.  During bootstrap, we might be working
-	 * on the permanent copy itself, in which case skip the memcpy() to avoid
-	 * invoking nominally-undefined behavior.
-	 */
-	if (realmap != newmap)
-		memcpy(realmap, newmap, sizeof(RelMapFile));
-	else
-		Assert(!send_sinval);	/* must be bootstrapping */
-
 	/* Critical section done */
 	if (write_wal)
 		END_CRIT_SECTION();
@@ -975,7 +966,7 @@ perform_relmap_update(bool shared, const RelMapFile *updates)
 	LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
 
 	/* Be certain we see any other updates just made */
-	load_relmap_file(shared, true);
+	read_relmap_file(shared, true);
 
 	/* Prepare updated data in a local variable */
 	if (shared)
@@ -990,10 +981,19 @@ perform_relmap_update(bool shared, const RelMapFile *updates)
 	merge_map_updates(&newmap, updates, allowSystemTableMods);
 
 	/* Write out the updated map and do other necessary tasks */
-	write_relmap_file(shared, &newmap, true, true, true,
+	write_relmap_file(&newmap, true, true, true,
 					  (shared ? InvalidOid : MyDatabaseId),
 					  (shared ? GLOBALTABLESPACE_OID : MyDatabaseTableSpace),
-					  DatabasePath);
+					  (shared ? "global" : DatabasePath));
+
+	/*
+	 * We successfully wrote the updated file, so it's now safe to rely on the
+	 * new values in this process, too.
+	 */
+	if (shared)
+		memcpy(&shared_map, &newmap, sizeof(RelMapFile));
+	else
+		memcpy(&local_map, &newmap, sizeof(RelMapFile));
 
 	/* Now we can release the lock */
 	LWLockRelease(RelationMappingLock);
@@ -1021,8 +1021,10 @@ relmap_redo(XLogReaderState *record)
 				 xlrec->nbytes);
 		memcpy(&newmap, xlrec->data, sizeof(newmap));
 
-		/* We need to construct the pathname for this database */
-		dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
+		if (xlrec->dbid != InvalidOid)
+			dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
+		else
+			dbpath = pstrdup("global");
 
 		/*
 		 * Write out the new map and send sinval, but of course don't write a
@@ -1030,11 +1032,10 @@ relmap_redo(XLogReaderState *record)
 		 * preserve files, either.
 		 *
 		 * There shouldn't be anyone else updating relmaps during WAL replay,
-		 * but grab the lock to interlock against load_relmap_file().
+		 * but grab the lock to interlock against read_relmap_file().
 		 */
 		LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
-		write_relmap_file((xlrec->dbid == InvalidOid), &newmap,
-						  false, true, false,
+		write_relmap_file(&newmap, false, true, false,
 						  xlrec->dbid, xlrec->tsid, dbpath);
 		LWLockRelease(RelationMappingLock);

#142

Ashutosh Sharma

ashu.coek88@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#140)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Thanks Dilip for working on the review comments. I'll take a look at
the new version of patch and let you know my comments, if any.

--
With Regards,
Ashutosh Sharma.

On Thu, Mar 10, 2022 at 8:38 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Mar 10, 2022 at 7:22 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Here are some review comments on the latest patch
(v11-0004-WAL-logged-CREATE-DATABASE.patch). I actually did the review
of the v10 patch but that applies for this latest version as well.
+               /* Now errors are fatal ... */
+               START_CRIT_SECTION();
Did you mean PANIC instead of FATAL?
I think here fatal didn't really mean the error level FATAL, that
means critical and I have seen it is used in other places also. But I
really don't think we need this comments to removed to avoid any
confusion.
==
+
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                                        errmsg("invalid create
strategy %s", strategy),
+                                        errhint("Valid strategies are
\"wal_log\", and \"file_copy\".")));
+       }
Should this be - "invalid createdb strategy" instead of "invalid
create strategy"?
Changed
==
+               /*
+                * In case of ALTER DATABASE SET TABLESPACE we don't need to do
+                * anything for the object which are not in the source
db's default
+                * tablespace.  The source and destination dboid will be same in
+                * case of ALTER DATABASE SET TABLESPACE.
+                */
+               else if (src_dboid == dst_dboid)
+                       continue;
+               else
+                       dstrnode.spcNode = srcrnode.spcNode;
Is this change still required? Do we support the WAL_COPY strategy for
ALTER DATABASE?
Yeah now it is unreachable code so removed.
==
+               /* Open the source and the destination relation at
smgr level. */
+               src_smgr = smgropen(srcrnode, InvalidBackendId);
+               dst_smgr = smgropen(dstrnode, InvalidBackendId);
+
+               /* Copy relation storage from source to the destination. */
+               CreateAndCopyRelationData(src_smgr, dst_smgr,
relinfo->relpersistence);
Do we need to do smgropen for destination relfilenode here? Aren't we
already doing that inside RelationCreateStorage?
Yeah, I have changed the complete logic: I removed the smgropen() calls
for both source and destination here and moved them inside
CreateAndCopyRelationData. Please check the updated code.
==
+       use_wal = XLogIsNeeded() &&
+               (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
+
+       /* Get number of blocks in the source relation. */
+       nblocks = smgrnblocks(src, forkNum);
What if the number of blocks in the source relation is zero? Should we
check for that and return immediately? We have already done smgrcreate.
Yeah, it makes sense to optimize that; with that check we also don't have
to get the buffer strategy, so done.
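
A minimal, self-contained model of the two decisions in this hunk (whether
to WAL-log the copy, and the early return for an empty fork) might look
like this. ForkModel, copy_needs_wal() and copy_fork() are hypothetical
stand-ins invented for the sketch; the real code works on smgr relations
and shared buffers, not on in-memory arrays.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define BLCKSZ 8192
#define RELPERSISTENCE_PERMANENT 'p'

/* Hypothetical in-memory stand-in for one relation fork. */
typedef struct ForkModel
{
	unsigned	nblocks;
	char	  (*blocks)[BLCKSZ];
} ForkModel;

/* Same boolean expression as the use_wal assignment quoted above. */
static bool
copy_needs_wal(bool xlog_is_needed, char relpersistence, bool copying_initfork)
{
	return xlog_is_needed &&
		(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
}

/*
 * Copy all blocks from src to dst and return the number copied.
 * *strategy_allocated models whether a buffer access strategy had to be
 * set up; for an empty source fork we return before doing that.
 */
static unsigned
copy_fork(const ForkModel *src, ForkModel *dst, bool *strategy_allocated)
{
	*strategy_allocated = false;

	/* Nothing to copy: skip all further setup. */
	if (src->nblocks == 0)
		return 0;

	*strategy_allocated = true;

	for (unsigned i = 0; i < src->nblocks; i++)
		memcpy(dst->blocks[i], src->blocks[i], BLCKSZ);
	dst->nblocks = src->nblocks;

	return src->nblocks;
}
```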
==
+       /* We don't need to copy the shared objects to the target. */
+       if (classForm->reltablespace == GLOBALTABLESPACE_OID)
+               return NULL;
+
+       /*
+        * If the object doesn't have the storage then nothing to be
+        * done for that object so just ignore it.
+        */
+       if (!RELKIND_HAS_STORAGE(classForm->relkind))
+               return NULL;
We can probably club together above two if-checks.
Done
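
As a rough standalone illustration of the combined check: the sketch below
folds both conditions into one predicate. relation_needs_copy() and
relkind_has_storage() are hypothetical names; the OID value and the list of
relkinds are written from memory of the PostgreSQL headers and should be
treated as assumptions rather than authoritative definitions.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int Oid;

/* Assumed value of the pg_global tablespace OID. */
#define GLOBALTABLESPACE_OID 1664

/*
 * Stand-in for RELKIND_HAS_STORAGE: assumed to cover plain relations,
 * indexes, sequences, TOAST tables and materialized views.
 */
static bool
relkind_has_storage(char relkind)
{
	switch (relkind)
	{
		case 'r':
		case 'i':
		case 'S':
		case 't':
		case 'm':
			return true;
		default:
			return false;
	}
}

/*
 * A pg_class entry needs copying unless it is a shared object or has no
 * storage; this is the negation of the single combined if-check.
 */
static bool
relation_needs_copy(Oid reltablespace, char relkind)
{
	return !(reltablespace == GLOBALTABLESPACE_OID ||
			 !relkind_has_storage(relkind));
}
```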
==
+      <varlistentry>
+       <term><replaceable class="parameter">strategy</replaceable></term>
+       <listitem>
+        <para>
+         This is used for copying the database directory.  Currently, we have
+         two strategies the <literal>WAL_LOG</literal> and the
+         <literal>FILE_COPY</literal>.  If <literal>WAL_LOG</literal> strategy
+         is used then the database will be copied block by block and it will
+         also WAL log each copied block.  Otherwise, if <literal>FILE_COPY
I think we need to mention the default strategy in the documentation page.
Done

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#143

Ashutosh Sharma

ashu.coek88@gmail.com

almost 4 years ago

In reply to: Robert Haas (#141)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Mar 10, 2022 at 10:18 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Mar 10, 2022 at 6:02 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have completely changed the logic for this refactoring. Basically,
write_relmap_file() already has parameters to control whether to write
WAL and send invalidations, and we are already passing the dbpath.
Instead of making a new function, I just pass one additional parameter
to this function indicating whether we are creating a new map or not.
I think the changes are much smaller that way and this looks cleaner to
me. Similarly, for load_relmap_file() I just had to pass the dbpath and
the memory for the destination map. Please have a look and let me know
your thoughts.

It's not terrible, but how about something like the attached instead?
I think this has the effect of reducing the number of cases that the
low-level code needs to know about from 2 to 1, instead of making it
go up from 2 to 3.

Looks better, but why do you want to pass elevel to the
load_relmap_file()? Are we calling this function from somewhere other
than read_relmap_file()? If not, do we have any plans to call this
function directly bypassing read_relmap_file for any upcoming patch?
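
To make the shape under discussion concrete, here is a small standalone
model of a wrapper that fixes the path and error level, on top of a general
loader that takes both as parameters. LoadRequest, plan_load(), plan_read()
and the MODEL_* levels are hypothetical names invented for this sketch; it
only builds the file name and records the level, and does not read any
file. Whether a caller with a non-FATAL level is planned is not stated at
this point in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define RELMAPPER_FILENAME "pg_filenode.map"
#define MAXPGPATH 1024

/* Stand-ins for error levels; only their ordering matters here. */
enum { MODEL_ERROR = 1, MODEL_FATAL = 2 };

/* Hypothetical record of what the loader would do. */
typedef struct LoadRequest
{
	char		path[MAXPGPATH];
	int			elevel;
} LoadRequest;

/* General loader: any dbpath ("global" for shared), any level >= ERROR. */
static void
plan_load(LoadRequest *req, const char *dbpath, int elevel)
{
	assert(elevel >= MODEL_ERROR);
	snprintf(req->path, sizeof(req->path), "%s/%s", dbpath,
			 RELMAPPER_FILENAME);
	req->elevel = elevel;
}

/* Wrapper for the shared/local maps: always FATAL, path chosen here. */
static void
plan_read(LoadRequest *req, bool shared, const char *database_path)
{
	plan_load(req, shared ? "global" : database_path, MODEL_FATAL);
}
```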

--
With Regards,
Ashutosh Sharma.

#144

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Robert Haas (#141)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Mar 10, 2022 at 10:18 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Mar 10, 2022 at 6:02 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have completely changed the logic for this refactoring. Basically,
write_relmap_file() already has parameters to control whether to write
WAL and send invalidations, and we are already passing the dbpath.
Instead of making a new function, I just pass one additional parameter
to this function indicating whether we are creating a new map or not.
I think the changes are much smaller that way and this looks cleaner to
me. Similarly, for load_relmap_file() I just had to pass the dbpath and
the memory for the destination map. Please have a look and let me know
your thoughts.

It's not terrible, but how about something like the attached instead?
I think this has the effect of reducing the number of cases that the
low-level code needs to know about from 2 to 1, instead of making it
go up from 2 to 3.

Yeah this looks cleaner, I will rebase the remaining patch.

I think we should also write test cases for the create database
strategy. But I do not see any existing test case for create database
that tests the existing options. So I am wondering whether we should add
a test case only for the new option we are providing, or create a
separate patch that tests the new option as well as the existing
options.

FWIW, src/bin/scripts/t/020_createdb.pl does a little bit of testing
of this kind.

Okay, I think we need to support the strategy in createdb bin as well.
I will do that.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#145

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#144)

6 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Fri, Mar 11, 2022 at 11:52 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Mar 10, 2022 at 10:18 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Mar 10, 2022 at 6:02 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have completely changed the logic for this refactoring. Basically,
write_relmap_file() already has parameters to control whether to write
WAL and send invalidations, and we are already passing the dbpath.
Instead of making a new function, I just pass one additional parameter
to this function indicating whether we are creating a new map or not.
I think the changes are much smaller that way and this looks cleaner to
me. Similarly, for load_relmap_file() I just had to pass the dbpath and
the memory for the destination map. Please have a look and let me know
your thoughts.

It's not terrible, but how about something like the attached instead?
I think this has the effect of reducing the number of cases that the
low-level code needs to know about from 2 to 1, instead of making it
go up from 2 to 3.

Yeah this looks cleaner, I will rebase the remaining patch.

Here is the updated version of the patch set.

Changes:
1) It takes Robert's patch as the first refactoring patch.
2) The other new relmapper APIs are rebased on top of that in 0002.
3) Some code refactoring in the main patch 0005, plus one problem fix:
earlier, in the WAL-log method, I had removed ForceSyncCommit(), but IMHO
it is equally valid whether we use file_copy or wal_log, because in both
cases we are creating disk files.
4) Support for the strategy in the createdb tool, with a test case, as
part of 0006.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v13-0001-Refactor-relmap-load-and-relmap-write-functions.patch (text/x-patch; charset=US-ASCII)

From 7bcc39c740a2a737f1ebd6c7b0441c9df4fab6d3 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 11 Mar 2022 09:03:09 +0530
Subject: [PATCH v13 1/6] Refactor relmap load and relmap write functions

Currently, the relmap reading and writing interfaces are tightly
coupled with the shared_map and local_map of the database the backend
is connected to.  But the higher-level patches in this set need
interfaces that can read a relmap into any caller-supplied memory and,
when writing, accept the map to be written.

So this patch refactors the existing code to expose read and write
interfaces that are independent of shared_map and local_map, without
changing any logic.

Author: Robert Haas
---
 src/backend/utils/cache/relmapper.c | 147 ++++++++++++++++++------------------
 1 file changed, 74 insertions(+), 73 deletions(-)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 4f6811f..f172f61 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -136,9 +136,11 @@ static void apply_map_update(RelMapFile *map, Oid relationId, Oid fileNode,
 							 bool add_okay);
 static void merge_map_updates(RelMapFile *map, const RelMapFile *updates,
 							  bool add_okay);
-static void load_relmap_file(bool shared, bool lock_held);
-static void write_relmap_file(bool shared, RelMapFile *newmap,
-							  bool write_wal, bool send_sinval, bool preserve_files,
+static void read_relmap_file(bool shared, bool lock_held);
+static void load_relmap_file(RelMapFile *map, char *dbpath, bool lock_held,
+							 int elevel);
+static void write_relmap_file(RelMapFile *newmap, bool write_wal,
+							  bool send_sinval, bool preserve_files,
 							  Oid dbid, Oid tsid, const char *dbpath);
 static void perform_relmap_update(bool shared, const RelMapFile *updates);
 
@@ -405,12 +407,12 @@ RelationMapInvalidate(bool shared)
 	if (shared)
 	{
 		if (shared_map.magic == RELMAPPER_FILEMAGIC)
-			load_relmap_file(true, false);
+			read_relmap_file(true, false);
 	}
 	else
 	{
 		if (local_map.magic == RELMAPPER_FILEMAGIC)
-			load_relmap_file(false, false);
+			read_relmap_file(false, false);
 	}
 }
 
@@ -425,9 +427,9 @@ void
 RelationMapInvalidateAll(void)
 {
 	if (shared_map.magic == RELMAPPER_FILEMAGIC)
-		load_relmap_file(true, false);
+		read_relmap_file(true, false);
 	if (local_map.magic == RELMAPPER_FILEMAGIC)
-		load_relmap_file(false, false);
+		read_relmap_file(false, false);
 }
 
 /*
@@ -568,9 +570,9 @@ RelationMapFinishBootstrap(void)
 	Assert(pending_local_updates.num_mappings == 0);
 
 	/* Write the files; no WAL or sinval needed */
-	write_relmap_file(true, &shared_map, false, false, false,
-					  InvalidOid, GLOBALTABLESPACE_OID, NULL);
-	write_relmap_file(false, &local_map, false, false, false,
+	write_relmap_file(&shared_map, false, false, false,
+					  InvalidOid, GLOBALTABLESPACE_OID, "global");
+	write_relmap_file(&local_map, false, false, false,
 					  MyDatabaseId, MyDatabaseTableSpace, DatabasePath);
 }
 
@@ -612,7 +614,7 @@ RelationMapInitializePhase2(void)
 	/*
 	 * Load the shared map file, die on error.
 	 */
-	load_relmap_file(true, false);
+	read_relmap_file(true, false);
 }
 
 /*
@@ -633,7 +635,7 @@ RelationMapInitializePhase3(void)
 	/*
 	 * Load the local map file, die on error.
 	 */
-	load_relmap_file(false, false);
+	read_relmap_file(false, false);
 }
 
 /*
@@ -687,39 +689,48 @@ RestoreRelationMap(char *startAddress)
 }
 
 /*
- * load_relmap_file -- load data from the shared or local map file
+ * read_relmap_file -- load the shared or local map file
  *
- * Because the map file is essential for access to core system catalogs,
- * failure to read it is a fatal error.
+ * Because these files are essential for access to core system catalogs,
+ * failure to load either of them is a fatal error.
  *
  * Note that the local case requires DatabasePath to be set up.
  */
 static void
-load_relmap_file(bool shared, bool lock_held)
+read_relmap_file(bool shared, bool lock_held)
+{
+	if (shared)
+		load_relmap_file(&shared_map, "global", lock_held, FATAL);
+	else
+		load_relmap_file(&local_map, DatabasePath, lock_held, FATAL);
+}
+
+/*
+ * load_relmap_file -- load data from any relation mapper file
+ *
+ * dbpath must be the relevant database path, or "global" for shared relations.
+ *
+ * RelationMappingLock will be acquired and released unless lock_held = true.
+ *
+ * Errors will be reported at the indicated elevel, which should be at least
+ * ERROR.
+ */
+static void
+load_relmap_file(RelMapFile *map, char *dbpath, bool lock_held, int elevel)
 {
-	RelMapFile *map;
 	char		mapfilename[MAXPGPATH];
 	pg_crc32c	crc;
 	int			fd;
 	int			r;
 
-	if (shared)
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
-				 RELMAPPER_FILENAME);
-		map = &shared_map;
-	}
-	else
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 DatabasePath, RELMAPPER_FILENAME);
-		map = &local_map;
-	}
+	Assert(elevel >= ERROR);
 
-	/* Read data ... */
+	/* Open the target file. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s", dbpath,
+			 RELMAPPER_FILENAME);
 	fd = OpenTransientFile(mapfilename, O_RDONLY | PG_BINARY);
 	if (fd < 0)
-		ereport(FATAL,
+		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not open file \"%s\": %m",
 						mapfilename)));
@@ -734,16 +745,17 @@ load_relmap_file(bool shared, bool lock_held)
 	if (!lock_held)
 		LWLockAcquire(RelationMappingLock, LW_SHARED);
 
+	/* Now read the data. */
 	pgstat_report_wait_start(WAIT_EVENT_RELATION_MAP_READ);
 	r = read(fd, map, sizeof(RelMapFile));
 	if (r != sizeof(RelMapFile))
 	{
 		if (r < 0)
-			ereport(FATAL,
+			ereport(elevel,
 					(errcode_for_file_access(),
 					 errmsg("could not read file \"%s\": %m", mapfilename)));
 		else
-			ereport(FATAL,
+			ereport(elevel,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg("could not read file \"%s\": read %d of %zu",
 							mapfilename, r, sizeof(RelMapFile))));
@@ -754,7 +766,7 @@ load_relmap_file(bool shared, bool lock_held)
 		LWLockRelease(RelationMappingLock);
 
 	if (CloseTransientFile(fd) != 0)
-		ereport(FATAL,
+		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m",
 						mapfilename)));
@@ -763,7 +775,7 @@ load_relmap_file(bool shared, bool lock_held)
 	if (map->magic != RELMAPPER_FILEMAGIC ||
 		map->num_mappings < 0 ||
 		map->num_mappings > MAX_MAPPINGS)
-		ereport(FATAL,
+		ereport(elevel,
 				(errmsg("relation mapping file \"%s\" contains invalid data",
 						mapfilename)));
 
@@ -773,7 +785,7 @@ load_relmap_file(bool shared, bool lock_held)
 	FIN_CRC32C(crc);
 
 	if (!EQ_CRC32C(crc, map->crc))
-		ereport(FATAL,
+		ereport(elevel,
 				(errmsg("relation mapping file \"%s\" contains incorrect checksum",
 						mapfilename)));
 }
@@ -795,16 +807,16 @@ load_relmap_file(bool shared, bool lock_held)
  *
  * Because this may be called during WAL replay when MyDatabaseId,
  * DatabasePath, etc aren't valid, we require the caller to pass in suitable
- * values.  The caller is also responsible for being sure no concurrent
- * map update could be happening.
+ * values. Pass dbpath as "global" for the shared map.
+ *
+ * The caller is also responsible for being sure no concurrent map update
+ * could be happening.
  */
 static void
-write_relmap_file(bool shared, RelMapFile *newmap,
-				  bool write_wal, bool send_sinval, bool preserve_files,
-				  Oid dbid, Oid tsid, const char *dbpath)
+write_relmap_file(RelMapFile *newmap, bool write_wal, bool send_sinval,
+				  bool preserve_files, Oid dbid, Oid tsid, const char *dbpath)
 {
 	int			fd;
-	RelMapFile *realmap;
 	char		mapfilename[MAXPGPATH];
 
 	/*
@@ -822,19 +834,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 	 * Open the target file.  We prefer to do this before entering the
 	 * critical section, so that an open() failure need not force PANIC.
 	 */
-	if (shared)
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
-				 RELMAPPER_FILENAME);
-		realmap = &shared_map;
-	}
-	else
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 dbpath, RELMAPPER_FILENAME);
-		realmap = &local_map;
-	}
-
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 dbpath, RELMAPPER_FILENAME);
 	fd = OpenTransientFile(mapfilename, O_WRONLY | O_CREAT | PG_BINARY);
 	if (fd < 0)
 		ereport(ERROR,
@@ -934,16 +935,6 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		}
 	}
 
-	/*
-	 * Success, update permanent copy.  During bootstrap, we might be working
-	 * on the permanent copy itself, in which case skip the memcpy() to avoid
-	 * invoking nominally-undefined behavior.
-	 */
-	if (realmap != newmap)
-		memcpy(realmap, newmap, sizeof(RelMapFile));
-	else
-		Assert(!send_sinval);	/* must be bootstrapping */
-
 	/* Critical section done */
 	if (write_wal)
 		END_CRIT_SECTION();
@@ -975,7 +966,7 @@ perform_relmap_update(bool shared, const RelMapFile *updates)
 	LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
 
 	/* Be certain we see any other updates just made */
-	load_relmap_file(shared, true);
+	read_relmap_file(shared, true);
 
 	/* Prepare updated data in a local variable */
 	if (shared)
@@ -990,10 +981,19 @@ perform_relmap_update(bool shared, const RelMapFile *updates)
 	merge_map_updates(&newmap, updates, allowSystemTableMods);
 
 	/* Write out the updated map and do other necessary tasks */
-	write_relmap_file(shared, &newmap, true, true, true,
+	write_relmap_file(&newmap, true, true, true,
 					  (shared ? InvalidOid : MyDatabaseId),
 					  (shared ? GLOBALTABLESPACE_OID : MyDatabaseTableSpace),
-					  DatabasePath);
+					  (shared ? "global" : DatabasePath));
+
+	/*
+	 * We successfully wrote the updated file, so it's now safe to rely on the
+	 * new values in this process, too.
+	 */
+	if (shared)
+		memcpy(&shared_map, &newmap, sizeof(RelMapFile));
+	else
+		memcpy(&local_map, &newmap, sizeof(RelMapFile));
 
 	/* Now we can release the lock */
 	LWLockRelease(RelationMappingLock);
@@ -1021,8 +1021,10 @@ relmap_redo(XLogReaderState *record)
 				 xlrec->nbytes);
 		memcpy(&newmap, xlrec->data, sizeof(newmap));
 
-		/* We need to construct the pathname for this database */
-		dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
+		if (xlrec->dbid != InvalidOid)
+			dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
+		else
+			dbpath = pstrdup("global");
 
 		/*
 		 * Write out the new map and send sinval, but of course don't write a
@@ -1030,11 +1032,10 @@ relmap_redo(XLogReaderState *record)
 		 * preserve files, either.
 		 *
 		 * There shouldn't be anyone else updating relmaps during WAL replay,
-		 * but grab the lock to interlock against load_relmap_file().
+		 * but grab the lock to interlock against read_relmap_file().
 		 */
 		LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
-		write_relmap_file((xlrec->dbid == InvalidOid), &newmap,
-						  false, true, false,
+		write_relmap_file(&newmap, false, true, false,
 						  xlrec->dbid, xlrec->tsid, dbpath);
 		LWLockRelease(RelationMappingLock);
 
-- 
1.8.3.1
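
One effect of the refactoring above is that write_relmap_file() no longer
touches shared_map or local_map; the caller copies the new map into its
in-memory state only after the write has succeeded. The standalone sketch
below models that ordering. MapModel, MapWriter, update_map() and the two
stub writers are hypothetical, and a boolean return value stands in for the
error reporting that the real code does through ereport().

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical, much-reduced stand-in for RelMapFile. */
typedef struct MapModel
{
	int			magic;
	int			num_mappings;
} MapModel;

/* Stand-in for the on-disk write; returns false on failure. */
typedef bool (*MapWriter) (const MapModel *map);

static bool
writer_ok(const MapModel *map)
{
	(void) map;
	return true;
}

static bool
writer_fail(const MapModel *map)
{
	(void) map;
	return false;
}

/*
 * Write the new map first; update the live in-memory copy only if the
 * write succeeded, so a failed write leaves the live copy untouched.
 */
static bool
update_map(MapModel *live, const MapModel *newmap, MapWriter write_map)
{
	if (!write_map(newmap))
		return false;

	memcpy(live, newmap, sizeof(MapModel));
	return true;
}
```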

v13-0003-Extend-bufmgr-interfaces.patch (text/x-patch; charset=US-ASCII)

From b34b556c885eebaa4f37bdf293b334af4978d255 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Thu, 10 Feb 2022 15:55:33 +0530
Subject: [PATCH v13 3/6] Extend bufmgr interfaces

Extend the ReadBufferWithoutRelcache interface to take relpersistence
as input.  At present, this function may only be used on permanent
relations, because we only use it during XLOG replay.  But as part of
the bigger patch set, we will use it to read buffers from a database
we are not connected to, so we might now encounter temporary and
unlogged relations as well.
---
 src/backend/access/transam/xlogutils.c |  9 ++++++---
 src/backend/storage/buffer/bufmgr.c    | 11 ++---------
 src/include/storage/bufmgr.h           |  3 ++-
 3 files changed, 10 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 54d5f20..c292794 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -484,7 +484,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	{
 		/* page exists in file */
 		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-										   mode, NULL);
+										   mode, NULL,
+										   RELPERSISTENCE_PERMANENT);
 	}
 	else
 	{
@@ -509,7 +510,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				ReleaseBuffer(buffer);
 			}
 			buffer = ReadBufferWithoutRelcache(rnode, forknum,
-											   P_NEW, mode, NULL);
+											   P_NEW, mode, NULL,
+											   RELPERSISTENCE_PERMANENT);
 		}
 		while (BufferGetBlockNumber(buffer) < blkno);
 		/* Handle the corner case that P_NEW returns non-consecutive pages */
@@ -519,7 +521,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			ReleaseBuffer(buffer);
 			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-											   mode, NULL);
+											   mode, NULL,
+											   RELPERSISTENCE_PERMANENT);
 		}
 	}
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c6..0ed2d31 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -771,24 +771,17 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 /*
  * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
  *		a relcache entry for the relation.
- *
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay.  If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
  */
 Buffer
 ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
-						  BufferAccessStrategy strategy)
+						  BufferAccessStrategy strategy, char relpersistence)
 {
 	bool		hit;
 
 	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
 
-	Assert(InRecovery);
-
-	return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
+	return ReadBuffer_common(smgr, relpersistence, forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index dd01841..7b80f58 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -184,7 +184,8 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BufferAccessStrategy strategy);
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 										ForkNumber forkNum, BlockNumber blockNum,
-										ReadBufferMode mode, BufferAccessStrategy strategy);
+										ReadBufferMode mode, BufferAccessStrategy strategy,
+										char relpersistence);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
-- 
1.8.3.1

v13-0005-WAL-logged-CREATE-DATABASE.patch (text/x-patch; charset=US-ASCII)

From 5d1bbe9d82f80577a29498cc4cad9f2a1b23e49b Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 14 Feb 2022 17:48:03 +0530
Subject: [PATCH v13 5/6] WAL logged CREATE DATABASE

Currently, CREATE DATABASE forces a checkpoint, then copies all the files,
then forces another checkpoint. The comments in the createdb() function
explain the reasons for this. The attached patch fixes this problem by making
create database completely WAL logged so that we can avoid the checkpoints.

We also retain the old way of creating the database, and provide an
option to choose the strategy for creating the database.  For the new
method the user needs to specify STRATEGY=WAL_LOG, and for the old
method STRATEGY=FILE_COPY.  The default strategy is WAL_LOG.
---
 doc/src/sgml/ref/create_database.sgml  |  23 ++
 src/backend/commands/dbcommands.c      | 735 +++++++++++++++++++++++++++------
 src/backend/storage/buffer/bufmgr.c    | 146 +++++++
 src/include/commands/dbcommands_xlog.h |   8 +
 src/include/storage/bufmgr.h           |   3 +
 src/tools/pgindent/typedefs.list       |   1 +
 6 files changed, 789 insertions(+), 127 deletions(-)

diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index f70d0c7..2f6b069 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -34,6 +34,7 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
            [ CONNECTION LIMIT [=] <replaceable class="parameter">connlimit</replaceable> ]
            [ IS_TEMPLATE [=] <replaceable class="parameter">istemplate</replaceable> ]
            [ OID [=] <replaceable class="parameter">oid</replaceable> ] ]
+           [ STRATEGY [=] <replaceable class="parameter">strategy</replaceable> ] ]
 </synopsis>
  </refsynopsisdiv>
 
@@ -240,6 +241,28 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
        </listitem>
       </varlistentry>
 
+      <varlistentry>
+       <term><replaceable class="parameter">strategy</replaceable></term>
+       <listitem>
+        <para>
+         The strategy used for copying the database directory.  Currently,
+         two strategies are available: <literal>WAL_LOG</literal> and
+         <literal>FILE_COPY</literal>.  If the <literal>WAL_LOG</literal>
+         strategy is used, the database is copied block by block and each
+         copied block is also WAL-logged.  If the
+         <literal>FILE_COPY</literal> strategy is used, a file-system-level
+         copy is performed, so individual blocks are not WAL-logged.  The
+         default strategy is <literal>WAL_LOG</literal>.  The
+         <literal>FILE_COPY</literal> strategy has to issue a checkpoint
+         before and after performing the copy; if shared buffers are large
+         and contain many dirty buffers, these checkpoints can be costly and
+         may impact the performance of the whole system.  On the other hand,
+         with <literal>WAL_LOG</literal>, creating the database may take
+         longer if the source database is large.
+        </para>
+       </listitem>
+      </varlistentry>
+
     </variablelist>
 
   <para>
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index c37e3c9..929908f 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -46,6 +46,7 @@
 #include "commands/dbcommands_xlog.h"
 #include "commands/defrem.h"
 #include "commands/seclabel.h"
+#include "commands/tablecmds.h"
 #include "commands/tablespace.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
@@ -63,13 +64,27 @@
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/pg_locale.h"
+#include "utils/relmapper.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
+/*
+ * Create database strategy.  The CREATEDB_WAL_LOG will copy the database at
+ * the block level and WAL log each copied block.  Whereas the
+ * CREATEDB_FILE_COPY will directly copy the database at the file level and no
+ * individual operations will be WAL logged.
+ */
+typedef enum CreateDBStrategy
+{
+	CREATEDB_WAL_LOG = 0,
+	CREATEDB_FILE_COPY = 1
+} CreateDBStrategy;
+
 typedef struct
 {
 	Oid			src_dboid;		/* source (template) DB */
 	Oid			dest_dboid;		/* DB we are trying to create */
+	CreateDBStrategy	strategy;	/* create db strategy */
 } createdb_failure_params;
 
 typedef struct
@@ -78,6 +93,19 @@ typedef struct
 	Oid			dest_tsoid;		/* tablespace we are trying to move to */
 } movedb_failure_params;
 
+/*
+ * When creating a database, we scan the pg_class of the source database to
+ * identify all the relations to be copied.  The structure is used for storing
+ * information about each relation of the source database.
+ */
+typedef struct CreateDBRelInfo
+{
+	RelFileNode		rnode;				/* physical relation identifier */
+	Oid				reloid;				/* relation oid */
+	char			relpersistence;		/* relation's persistence level */
+} CreateDBRelInfo;
+
+
 /* non-export function prototypes */
 static void createdb_failure_callback(int code, Datum arg);
 static void movedb(const char *dbname, const char *tblspcname);
@@ -92,7 +120,505 @@ static bool have_createdb_privilege(void);
 static void remove_dbtablespaces(Oid db_id);
 static bool check_db_file_conflict(Oid db_id);
 static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
+static void CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid,
+									bool isRedo);
+static CreateDBRelInfo *GetRelInfoFromTuple(HeapTupleData *tuple,
+											Oid tbid, Oid dbid, char *srcpath);
+static List *GetRelListFromPage(Page page, Buffer buf, Oid tbid, Oid dbid,
+								char *srcpath, List *rnodelist, Snapshot
+								snapshot);
+static List *GetDatabaseRelationList(Oid srctbid, Oid srcdbid, char *srcpath);
+static void CopyDatabaseWithWal(Oid src_dboid, Oid dboid, Oid src_tsid,
+								Oid dst_tsid);
+static void CopyDatabase(Oid src_dboid, Oid dboid, Oid src_tsid, Oid dst_tsid);
+
+/*
+ * CreateDirAndVersionFile - Create database directory and write out the
+ *							 PG_VERSION file in the database path.
+ *
+ * If isRedo is true, it's okay for the database directory to exist already.
+ *
+ * We can directly write PG_MAJORVERSION in the version file instead of copying
+ * from the source database file because these two must be the same.
+ */
+static void
+CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
+{
+	int		fd;
+	int		nbytes;
+	char	versionfile[MAXPGPATH];
+	char	buf[16];
+
+	/* Prepare version data before starting a critical section. */
+	sprintf(buf, "%s\n", PG_MAJORVERSION);
+	nbytes = strlen(PG_MAJORVERSION) + 1;
+
+	/* If we are not in WAL replay then write the WAL. */
+	if (!isRedo)
+	{
+		xl_dbase_create_rec xlrec;
+		XLogRecPtr	lsn;
+
+		START_CRIT_SECTION();
+
+		xlrec.db_id = dbid;
+		xlrec.tablespace_id = tsid;
+		xlrec.src_db_id = InvalidOid;
+		xlrec.src_tablespace_id = InvalidOid;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), sizeof(xl_dbase_create_rec));
+
+		lsn = XLogInsert(RM_DBASE_ID, XLOG_DBASE_CREATE);
+
+		/* As always, WAL must hit the disk before the data update does. */
+		XLogFlush(lsn);
+	}
+
+	/* Create database directory. */
+	if (MakePGDirectory(dbpath) < 0)
+	{
+		/* Failure other than already exists or not in WAL replay? */
+		if (errno != EEXIST || !isRedo)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create directory \"%s\": %m", dbpath)));
+	}
+
+	/*
+	 * Create PG_VERSION file in the database path.  If the file already exists
+	 * and we are in WAL replay then try again to open it in write mode.
+	 */
+	snprintf(versionfile, sizeof(versionfile), "%s/%s", dbpath, "PG_VERSION");
+
+	fd = OpenTransientFile(versionfile, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	if (fd < 0 && errno == EEXIST && isRedo)
+		fd = OpenTransientFile(versionfile, O_WRONLY | O_TRUNC | PG_BINARY);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m", versionfile)));
+
+	/* Write PG_MAJORVERSION in the PG_VERSION file. */
+	pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_WRITE);
+	errno = 0;
+	if ((int) write(fd, buf, nbytes) != nbytes)
+	{
+		/* If write didn't set errno, assume problem is no disk space. */
+		if (errno == 0)
+			errno = ENOSPC;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", versionfile)));
+	}
+	pgstat_report_wait_end();
+
+	/* Close the version file. */
+	CloseTransientFile(fd);
+
+	/* Critical section done. */
+	if (!isRedo)
+		END_CRIT_SECTION();
+}
+
+/*
+ * GetRelInfoFromTuple - Prepare a CreateDBRelInfo element from the tuple
+ *
+ * Helper function for GetRelListFromPage to prepare a single element from the
+ * pg_class tuple.
+ */
+CreateDBRelInfo *
+GetRelInfoFromTuple(HeapTupleData *tuple, Oid tbid, Oid dbid, char *srcpath)
+{
+	CreateDBRelInfo	   *relinfo;
+	Form_pg_class		classForm;
+	Oid					relfilenode = InvalidOid;
+
+	classForm = (Form_pg_class) GETSTRUCT(tuple);
+
+	/*
+	 * If this is a shared object or the object doesn't have the storage then
+	 * nothing to be done, so just return.
+	 */
+	if (classForm->reltablespace == GLOBALTABLESPACE_OID ||
+		!RELKIND_HAS_STORAGE(classForm->relkind))
+		return NULL;
+
+	/*
+	 * If relfilenode is valid then directly use it.  Otherwise,
+	 * consult the relmapper for the mapped relation.
+	 */
+	if (OidIsValid(classForm->relfilenode))
+		relfilenode = classForm->relfilenode;
+	else
+		relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+										classForm->oid);
+
+	/* We must have a valid relfilenode oid. */
+	Assert(OidIsValid(relfilenode));
+
+	/* Prepare a rel info element and add it to the list. */
+	relinfo = (CreateDBRelInfo *) palloc(sizeof(CreateDBRelInfo));
+	if (OidIsValid(classForm->reltablespace))
+		relinfo->rnode.spcNode = classForm->reltablespace;
+	else
+		relinfo->rnode.spcNode = tbid;
+
+	relinfo->rnode.dbNode = dbid;
+	relinfo->rnode.relNode = relfilenode;
+	relinfo->reloid = classForm->oid;
+	relinfo->relpersistence = classForm->relpersistence;
+
+	return relinfo;
+}
+
+/*
+ * GetRelListFromPage - Helper function for GetDatabaseRelationList.
+ *
+ * Iterate over each tuple of input pg_class and get a list of all the valid
+ * relfilenodes of the given block and append them to input rnodelist.
+ */
+static List *
+GetRelListFromPage(Page page, Buffer buf, Oid tbid, Oid dbid, char *srcpath,
+				  List *rnodelist, Snapshot snapshot)
+{
+	BlockNumber		blkno = BufferGetBlockNumber(buf);
+	OffsetNumber	offnum;
+	OffsetNumber	maxoff;
+	HeapTupleData	tuple;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Iterate over each tuple on the page. */
+	for (offnum = FirstOffsetNumber;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid;
+
+		itemid = PageGetItemId(page, offnum);
+
+		/* Nothing to do if slot is empty or already dead. */
+		if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid) ||
+			ItemIdIsRedirected(itemid))
+			continue;
+
+		Assert(ItemIdIsNormal(itemid));
+		ItemPointerSet(&(tuple.t_self), blkno, offnum);
+
+		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+		tuple.t_len = ItemIdGetLength(itemid);
+		tuple.t_tableOid = RelationRelationId;
+
+		/*
+		 * If the tuple is visible then add its relfilenode info to the
+		 * list.
+		 */
+		if (HeapTupleSatisfiesVisibility(&tuple, snapshot, buf))
+		{
+			CreateDBRelInfo	   *relinfo;
+
+			relinfo = GetRelInfoFromTuple(&tuple, tbid, dbid, srcpath);
+
+			/* Add it to the list. */
+			if (relinfo != NULL)
+				rnodelist = lappend(rnodelist, relinfo);
+		}
+	}
+
+	return rnodelist;
+}
+
+/*
+ * GetDatabaseRelationList - Get relfilenode list to be copied.
+ *
+ * Iterate over each block of the pg_class relation.  From there, we will check
+ * all the visible tuples in order to get a list of all the valid relfilenodes
+ * in the source database that should be copied to the target database.
+ */
+static List *
+GetDatabaseRelationList(Oid tbid, Oid dbid, char *srcpath)
+{
+	SMgrRelation	rd_smgr;
+	RelFileNode		rnode;
+	BlockNumber		nblocks;
+	BlockNumber		blkno;
+	Buffer			buf;
+	Oid				relfilenode;
+	Page			page;
+	List		   *rnodelist = NIL;
+	LockRelId		relid;
+	Snapshot		snapshot;
+	BufferAccessStrategy bstrategy;
+
+	/* Get pg_class relfilenode. */
+	relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+													  RelationRelationId);
+	/*
+	 * We are going to read the buffers associated with the pg_class relation,
+	 * so acquire a relation-level lock before we start scanning.  As we are
+	 * not connected to the database, we cannot use relation_open directly, so
+	 * we have to lock using the relation id.
+	 */
+	relid.dbId = dbid;
+	relid.relId = RelationRelationId;
+	LockRelationId(&relid, AccessShareLock);
+
+	/* Prepare a relnode for pg_class relation. */
+	rnode.spcNode = tbid;
+	rnode.dbNode = dbid;
+	rnode.relNode = relfilenode;
+
+	/*
+	 * We are not connected to the source database so open the pg_class
+	 * relation at the smgr level and get the block count.
+	 */
+	rd_smgr = smgropen(rnode, InvalidBackendId);
+	nblocks = smgrnblocks(rd_smgr, MAIN_FORKNUM);
+
+	/*
+	 * We're going to read the whole pg_class so better to use bulk-read buffer
+	 * access strategy.
+	 */
+	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	/* Get latest snapshot for scanning the pg_class. */
+	snapshot = GetLatestSnapshot();
+
+	/* Iterate over each block on the pg_class relation. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		/*
+		 * We are not connected to the source database so directly use the lower
+		 * level bufmgr interface which operates on the rnode.
+		 */
+		buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
+										RBM_NORMAL, bstrategy,
+										RELPERSISTENCE_PERMANENT);
+
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buf);
+		if (PageIsNew(page) || PageIsEmpty(page))
+		{
+			UnlockReleaseBuffer(buf);
+			continue;
+		}
+
+		/*
+		 * Process pg_class tuple for the current page and add all the valid
+		 * relfilenode entries to the rnodelist.
+		 */
+		rnodelist = GetRelListFromPage(page, buf, tbid, dbid, srcpath,
+									   rnodelist, snapshot);
+
+		/* Release the buffer lock. */
+		UnlockReleaseBuffer(buf);
+	}
+
+	/* Release the lock. */
+	UnlockRelationId(&relid, AccessShareLock);
+
+	return rnodelist;
+}
+
+/*
+ * CopyDatabaseWithWal - Copy source database to the target database with WAL
+ *
+ * Create target database directory and copy data files from the source
+ * database to the target database, block by block and WAL log all the
+ * operations.
+ */
+static void
+CopyDatabaseWithWal(Oid src_dboid, Oid dst_dboid, Oid src_tsid, Oid dst_tsid)
+{
+	char	   *srcpath;
+	char	   *dstpath;
+	List	   *rnodelist = NULL;
+	ListCell   *cell;
+	LockRelId	relid;
+	RelFileNode	srcrnode;
+	RelFileNode	dstrnode;
+	CreateDBRelInfo	*relinfo;
+
+	/* Get the source database path. */
+	srcpath = GetDatabasePath(src_dboid, src_tsid);
+
+	/* Get the destination database path. */
+	dstpath = GetDatabasePath(dst_dboid, dst_tsid);
+
+	/* Create database directory and write PG_VERSION file. */
+	CreateDirAndVersionFile(dstpath, dst_dboid, dst_tsid, false);
+
+	/* Copy relmap file from source database to the destination database. */
+	RelationMapCopy(dst_dboid, dst_tsid, srcpath, dstpath);
+
+	/* Get list of all valid relnode from the source database. */
+	rnodelist = GetDatabaseRelationList(src_tsid, src_dboid, srcpath);
+	Assert(rnodelist != NIL);
+
+	/*
+	 * The database id is the same for all the relations, so set it before
+	 * entering the loop.
+	 */
+	relid.dbId = src_dboid;
+
+	/*
+	 * Iterate over each relfilenode and copy the relation data block by block
+	 * from source database to the destination database.
+	 */
+	foreach(cell, rnodelist)
+	{
+		relinfo = lfirst(cell);
+		srcrnode = relinfo->rnode;
+
+		/*
+		 * If the relation is in the source db's default tablespace then we
+		 * need to create it in the destination db's default tablespace.
+		 * Otherwise, we need to create it in the same tablespace as in the
+		 * source database.
+		 */
+		if (srcrnode.spcNode == src_tsid)
+			dstrnode.spcNode = dst_tsid;
+		else
+			dstrnode.spcNode = srcrnode.spcNode;
+
+		dstrnode.dbNode = dst_dboid;
+		dstrnode.relNode = srcrnode.relNode;
+
+		/* Acquire the lock on the relation before we start copying. */
+		relid.relId = relinfo->reloid;
+		LockRelationId(&relid, AccessShareLock);
+
+		/* Copy relation storage from source to the destination. */
+		CreateAndCopyRelationData(srcrnode, dstrnode, relinfo->relpersistence);
+
+		/* Release the lock. */
+		UnlockRelationId(&relid, AccessShareLock);
+	}
 
+	list_free_deep(rnodelist);
+}
+
+/*
+ * CopyDatabase - Copy source database to the target database
+ *
+ * Copy source database directory to the destination directory using copydir
+ * operation.
+ */
+static void
+CopyDatabase(Oid src_dboid, Oid dst_dboid, Oid src_tsid, Oid dst_tsid)
+{
+	TableScanDesc	scan;
+	Relation		rel;
+	HeapTuple		tuple;
+
+	/*
+	 * Force a checkpoint before starting the copy. This will force all
+	 * dirty buffers, including those of unlogged tables, out to disk, to
+	 * ensure source database is up-to-date on disk for the copy.
+	 * FlushDatabaseBuffers() would suffice for that, but we also want to
+	 * process any pending unlink requests. Otherwise, if a checkpoint
+	 * happened while we're copying files, a file might be deleted just
+	 * when we're about to copy it, causing the lstat() call in copydir()
+	 * to fail with ENOENT.
+	 */
+	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
+					  CHECKPOINT_WAIT | CHECKPOINT_FLUSH_ALL);
+
+	/*
+	 * Iterate through all tablespaces of the template database, and copy
+	 * each one to the new database.
+	 */
+	rel = table_open(TableSpaceRelationId, AccessShareLock);
+	scan = table_beginscan_catalog(rel, 0, NULL);
+	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	{
+		Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
+		Oid			srctablespace = spaceform->oid;
+		Oid			dsttablespace;
+		char	   *srcpath;
+		char	   *dstpath;
+		struct stat st;
+
+		/* No need to copy global tablespace */
+		if (srctablespace == GLOBALTABLESPACE_OID)
+			continue;
+
+		srcpath = GetDatabasePath(src_dboid, srctablespace);
+
+		if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
+			directory_is_empty(srcpath))
+		{
+			/* Assume we can ignore it */
+			pfree(srcpath);
+			continue;
+		}
+
+		if (srctablespace == src_tsid)
+			dsttablespace = dst_tsid;
+		else
+			dsttablespace = srctablespace;
+
+		dstpath = GetDatabasePath(dst_dboid, dsttablespace);
+
+		/*
+		 * Copy this subdirectory to the new location
+		 *
+		 * We don't need to copy subdirectories
+		 */
+		copydir(srcpath, dstpath, false);
+
+		/* Record the filesystem change in XLOG */
+		{
+			xl_dbase_create_rec xlrec;
+
+			xlrec.db_id = dst_dboid;
+			xlrec.tablespace_id = dsttablespace;
+			xlrec.src_db_id = src_dboid;
+			xlrec.src_tablespace_id = srctablespace;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
+
+			(void) XLogInsert(RM_DBASE_ID,
+							  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
+		}
+	}
+	table_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	/*
+	 * We force a checkpoint before committing.  This effectively means
+	 * that committed XLOG_DBASE_CREATE operations will never need to be
+	 * replayed (at least not in ordinary crash recovery; we still have to
+	 * make the XLOG entry for the benefit of PITR operations). This
+	 * avoids two nasty scenarios:
+	 *
+	 * #1: When PITR is off, we don't XLOG the contents of newly created
+	 * indexes; therefore the drop-and-recreate-whole-directory behavior
+	 * of DBASE_CREATE replay would lose such indexes.
+	 *
+	 * #2: Since we have to recopy the source database during DBASE_CREATE
+	 * replay, we run the risk of copying changes in it that were
+	 * committed after the original CREATE DATABASE command but before the
+	 * system crash that led to the replay.  This is at least unexpected
+	 * and at worst could lead to inconsistencies, eg duplicate table
+	 * names.
+	 *
+	 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
+	 *
+	 * In PITR replay, the first of these isn't an issue, and the second
+	 * is only a risk if the CREATE DATABASE and subsequent template
+	 * database change both occur while a base backup is being taken.
+	 * There doesn't seem to be much we can do about that except document
+	 * it as a limitation.
+	 *
+	 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
+	 * we can avoid this.
+	 */
+	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+}
 
 /*
  * CREATE DATABASE
@@ -100,8 +626,6 @@ static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
 Oid
 createdb(ParseState *pstate, const CreatedbStmt *stmt)
 {
-	TableScanDesc scan;
-	Relation	rel;
 	Oid			src_dboid;
 	Oid			src_owner;
 	int			src_encoding = -1;
@@ -132,6 +656,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	DefElem    *dallowconnections = NULL;
 	DefElem    *dconnlimit = NULL;
 	DefElem    *dcollversion = NULL;
+	DefElem    *dstrategy = NULL;
 	char	   *dbname = stmt->dbname;
 	char	   *dbowner = NULL;
 	const char *dbtemplate = NULL;
@@ -145,6 +670,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	char	   *dbcollversion = NULL;
 	int			notherbackends;
 	int			npreparedxacts;
+	CreateDBStrategy	dbstrategy = CREATEDB_WAL_LOG;
 	createdb_failure_params fparms;
 
 	/* Extract options from the statement node tree */
@@ -250,6 +776,12 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 						(errcode(ERRCODE_INVALID_PARAMETER_VALUE)),
 						errmsg("OIDs less than %u are reserved for system objects", FirstNormalObjectId));
 		}
+		else if (strcmp(defel->defname, "strategy") == 0)
+		{
+			if (dstrategy)
+				errorConflictingDefElem(defel, pstate);
+			dstrategy = defel;
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -374,6 +906,23 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 							dbtemplate)));
 	}
 
+	/* Validate the database creation strategy. */
+	if (dstrategy && dstrategy->arg)
+	{
+		char	*strategy;
+
+		strategy = defGetString(dstrategy);
+		if (strcmp(strategy, "wal_log") == 0)
+			dbstrategy = CREATEDB_WAL_LOG;
+		else if (strcmp(strategy, "file_copy") == 0)
+			dbstrategy = CREATEDB_FILE_COPY;
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("invalid create database strategy %s", strategy),
+					 errhint("Valid strategies are \"wal_log\", and \"file_copy\".")));
+	}
+
 	/* If encoding or locales are defaulted, use source's setting */
 	if (encoding < 0)
 		encoding = src_encoding;
@@ -668,19 +1217,6 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	InvokeObjectPostCreateHook(DatabaseRelationId, dboid, 0);
 
 	/*
-	 * Force a checkpoint before starting the copy. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, if a checkpoint
-	 * happened while we're copying files, a file might be deleted just when
-	 * we're about to copy it, causing the lstat() call in copydir() to fail
-	 * with ENOENT.
-	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
-
-	/*
 	 * Once we start copying subdirectories, we need to be able to clean 'em
 	 * up if we fail.  Use an ENSURE block to make sure this happens.  (This
 	 * is not a 100% solution, because of the possibility of failure during
@@ -689,101 +1225,23 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	 */
 	fparms.src_dboid = src_dboid;
 	fparms.dest_dboid = dboid;
+	fparms.strategy = dbstrategy;
+
 	PG_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
 		/*
-		 * Iterate through all tablespaces of the template database, and copy
-		 * each one to the new database.
+		 * If the user has asked to create a database with WAL_LOG strategy
+		 * then call CopyDatabaseWithWal, which will copy the database at the
+		 * block level and it will WAL log each copied block.  Otherwise,
+		 * call CopyDatabase that will copy the database file by file.
 		 */
-		rel = table_open(TableSpaceRelationId, AccessShareLock);
-		scan = table_beginscan_catalog(rel, 0, NULL);
-		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
-		{
-			Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
-			Oid			srctablespace = spaceform->oid;
-			Oid			dsttablespace;
-			char	   *srcpath;
-			char	   *dstpath;
-			struct stat st;
-
-			/* No need to copy global tablespace */
-			if (srctablespace == GLOBALTABLESPACE_OID)
-				continue;
-
-			srcpath = GetDatabasePath(src_dboid, srctablespace);
-
-			if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
-				directory_is_empty(srcpath))
-			{
-				/* Assume we can ignore it */
-				pfree(srcpath);
-				continue;
-			}
-
-			if (srctablespace == src_deftablespace)
-				dsttablespace = dst_deftablespace;
-			else
-				dsttablespace = srctablespace;
-
-			dstpath = GetDatabasePath(dboid, dsttablespace);
-
-			/*
-			 * Copy this subdirectory to the new location
-			 *
-			 * We don't need to copy subdirectories
-			 */
-			copydir(srcpath, dstpath, false);
-
-			/* Record the filesystem change in XLOG */
-			{
-				xl_dbase_create_rec xlrec;
-
-				xlrec.db_id = dboid;
-				xlrec.tablespace_id = dsttablespace;
-				xlrec.src_db_id = src_dboid;
-				xlrec.src_tablespace_id = srctablespace;
-
-				XLogBeginInsert();
-				XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-				(void) XLogInsert(RM_DBASE_ID,
-								  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-			}
-		}
-		table_endscan(scan);
-		table_close(rel, AccessShareLock);
-
-		/*
-		 * We force a checkpoint before committing.  This effectively means
-		 * that committed XLOG_DBASE_CREATE operations will never need to be
-		 * replayed (at least not in ordinary crash recovery; we still have to
-		 * make the XLOG entry for the benefit of PITR operations). This
-		 * avoids two nasty scenarios:
-		 *
-		 * #1: When PITR is off, we don't XLOG the contents of newly created
-		 * indexes; therefore the drop-and-recreate-whole-directory behavior
-		 * of DBASE_CREATE replay would lose such indexes.
-		 *
-		 * #2: Since we have to recopy the source database during DBASE_CREATE
-		 * replay, we run the risk of copying changes in it that were
-		 * committed after the original CREATE DATABASE command but before the
-		 * system crash that led to the replay.  This is at least unexpected
-		 * and at worst could lead to inconsistencies, eg duplicate table
-		 * names.
-		 *
-		 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
-		 *
-		 * In PITR replay, the first of these isn't an issue, and the second
-		 * is only a risk if the CREATE DATABASE and subsequent template
-		 * database change both occur while a base backup is being taken.
-		 * There doesn't seem to be much we can do about that except document
-		 * it as a limitation.
-		 *
-		 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
-		 * we can avoid this.
-		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+		if (dbstrategy == CREATEDB_WAL_LOG)
+			CopyDatabaseWithWal(src_dboid, dboid, src_deftablespace,
+								dst_deftablespace);
+		else
+			CopyDatabase(src_dboid, dboid, src_deftablespace,
+						 dst_deftablespace);
 
 		/*
 		 * Close pg_database, but keep lock till commit.
@@ -870,6 +1328,21 @@ createdb_failure_callback(int code, Datum arg)
 	createdb_failure_params *fparms = (createdb_failure_params *) DatumGetPointer(arg);
 
 	/*
+	 * If we were copying the database at the block level, then drop pages for
+	 * the destination database that are in the shared buffer cache, and tell
+	 * the checkpointer to forget any pending fsync and unlink requests for
+	 * files in the database.  The reasoning is the same as explained in the
+	 * dropdb function.  Unlike dropdb, we don't need to call
+	 * pgstat_drop_database because this database has not been created yet, so
+	 * there should not be any stats for it.
+	 */
+	if (fparms->strategy == CREATEDB_WAL_LOG)
+	{
+		DropDatabaseBuffers(fparms->dest_dboid);
+		ForgetDatabaseSyncRequests(fparms->dest_dboid);
+	}
+
+	/*
 	 * Release lock on source database before doing recursive remove. This is
 	 * not essential but it seems desirable to release the lock as soon as
 	 * possible.
@@ -2387,32 +2860,40 @@ dbase_redo(XLogReaderState *record)
 		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
 
-		/*
-		 * Our theory for replaying a CREATE is to forcibly drop the target
-		 * subdirectory if present, then re-copy the source data. This may be
-		 * more work than needed, but it is simple to implement.
-		 */
-		if (stat(dst_path, &st) == 0 && S_ISDIR(st.st_mode))
+		if (!OidIsValid(xlrec->src_db_id))
 		{
-			if (!rmtree(dst_path, true))
-				/* If this failed, copydir() below is going to error. */
-				ereport(WARNING,
-						(errmsg("some useless files may be left behind in old database directory \"%s\"",
-								dst_path)));
+			CreateDirAndVersionFile(dst_path, xlrec->db_id, xlrec->tablespace_id,
+									true);
 		}
+		else
+		{
+			/*
+			 * Our theory for replaying a CREATE is to forcibly drop the target
+			 * subdirectory if present, then re-copy the source data. This may be
+			 * more work than needed, but it is simple to implement.
+			 */
+			if (stat(dst_path, &st) == 0 && S_ISDIR(st.st_mode))
+			{
+				if (!rmtree(dst_path, true))
+					/* If this failed, copydir() below is going to error. */
+					ereport(WARNING,
+							(errmsg("some useless files may be left behind in old database directory \"%s\"",
+									dst_path)));
+			}
 
-		/*
-		 * Force dirty buffers out to disk, to ensure source database is
-		 * up-to-date for the copy.
-		 */
-		FlushDatabaseBuffers(xlrec->src_db_id);
+			/*
+			 * Force dirty buffers out to disk, to ensure source database is
+			 * up-to-date for the copy.
+			 */
+			FlushDatabaseBuffers(xlrec->src_db_id);
 
-		/*
-		 * Copy this subdirectory to the new location
-		 *
-		 * We don't need to copy subdirectories
-		 */
-		copydir(src_path, dst_path, false);
+			/*
+			 * Copy this subdirectory to the new location
+			 *
+			 * We don't need to copy subdirectories
+			 */
+			copydir(src_path, dst_path, false);
+		}
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0ed2d31..156806d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -38,6 +38,7 @@
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
 #include "executor/instrument.h"
 #include "lib/binaryheap.h"
 #include "miscadmin.h"
@@ -486,6 +487,9 @@ static void FindAndDropRelFileNodeBuffers(RelFileNode rnode,
 										  ForkNumber forkNum,
 										  BlockNumber nForkBlock,
 										  BlockNumber firstDelBlock);
+static void RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+										   ForkNumber forkNum,
+										   char relpersistence);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int	rnode_comparator(const void *p1, const void *p2);
@@ -3670,6 +3674,148 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
 }
 
 /* ---------------------------------------------------------------------
+ *		RelationCopyStorageUsingBuffer
+ *
+ *		Copy fork's data using bufmgr.  Same as RelationCopyStorage but instead
+ *		of using smgrread and smgrextend this will copy using bufmgr APIs.
+ * --------------------------------------------------------------------
+ */
+static void
+RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+							   ForkNumber forkNum, char relpersistence)
+{
+	Buffer		srcBuf;
+	Buffer		dstBuf;
+	Page		srcPage;
+	Page		dstPage;
+	bool		use_wal;
+	bool		copying_initfork;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	BufferAccessStrategy bstrategy_src;
+	BufferAccessStrategy bstrategy_dst;
+
+	/* Refer to the comments in RelationCopyStorage. */
+	copying_initfork = relpersistence == RELPERSISTENCE_UNLOGGED &&
+		forkNum == INIT_FORKNUM;
+	use_wal = XLogIsNeeded() &&
+		(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
+
+	/* Get number of blocks in the source relation. */
+	nblocks = smgrnblocks(src, forkNum);
+
+	/* Nothing to copy so directly exit. */
+	if (nblocks == 0)
+		return;
+
+	/*
+	 * We are going to copy whole relation from the source to the destination
+	 * so use BAS_BULKREAD strategy for the source relation and BAS_BULKWRITE
+	 * strategy for the destination relation.
+	 */
+	bstrategy_src = GetAccessStrategy(BAS_BULKREAD);
+	bstrategy_dst = GetAccessStrategy(BAS_BULKWRITE);
+
+	/* Iterate over each block of the source relation file. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		/* If we got a cancel signal during the copy of the data, quit */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Read block from source relation. */
+		srcBuf = ReadBufferWithoutRelcache(src->smgr_rnode.node, forkNum,
+										   blkno, RBM_NORMAL, bstrategy_src,
+										   relpersistence);
+		srcPage = BufferGetPage(srcBuf);
+		if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
+		{
+			ReleaseBuffer(srcBuf);
+			continue;
+		}
+
+		/* Use P_NEW to extend the relation. */
+		dstBuf = ReadBufferWithoutRelcache(dst->smgr_rnode.node, forkNum,
+										   P_NEW, RBM_NORMAL, bstrategy_dst,
+										   relpersistence);
+		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
+
+		START_CRIT_SECTION();
+
+		/* Initialize the page and write the data. */
+		dstPage = BufferGetPage(dstBuf);
+		PageInit(dstPage, BufferGetPageSize(dstBuf), 0);
+		memcpy(dstPage, srcPage, BLCKSZ);
+		MarkBufferDirty(dstBuf);
+
+		/* WAL-log the copied page. */
+		if (use_wal)
+			log_newpage_buffer(dstBuf, true);
+
+		END_CRIT_SECTION();
+
+		UnlockReleaseBuffer(dstBuf);
+		ReleaseBuffer(srcBuf);
+	}
+}
+
+/* ---------------------------------------------------------------------
+ *		CreateAndCopyRelationData
+ *
+ *		Create destination relation storage and copy the data of all of the
+ *		source relation's forks to the destination.
+ * --------------------------------------------------------------------
+ */
+void
+CreateAndCopyRelationData(RelFileNode src_rnode, RelFileNode dst_rnode,
+						  char relpersistence)
+{
+	SMgrRelation	src_smgr;
+	SMgrRelation	dst_smgr;
+
+	/* Open the source relation at smgr level. */
+	src_smgr = smgropen(src_rnode, InvalidBackendId);
+
+	/*
+	 * Create and copy all forks of the relation.
+	 *
+	 * NOTE: any conflict in relfilenode value will be caught in
+	 * RelationCreateStorage().
+	 */
+	dst_smgr = RelationCreateStorage(dst_rnode, relpersistence);
+
+	/* copy main fork */
+	RelationCopyStorageUsingBuffer(src_smgr, dst_smgr, MAIN_FORKNUM,
+								   relpersistence);
+
+	/* copy those extra forks that exist */
+	for (ForkNumber forkNum = MAIN_FORKNUM + 1;
+		 forkNum <= MAX_FORKNUM; forkNum++)
+	{
+		if (smgrexists(src_smgr, forkNum))
+		{
+			smgrcreate(dst_smgr, forkNum, false);
+
+			/*
+			 * WAL log creation if the relation is persistent, or this is the
+			 * init fork of an unlogged relation.
+			 */
+			if (relpersistence == RELPERSISTENCE_PERMANENT ||
+				(relpersistence == RELPERSISTENCE_UNLOGGED &&
+				 forkNum == INIT_FORKNUM))
+				log_smgrcreate(&dst_rnode, forkNum);
+
+			/* Copy a fork's data, block by block. */
+			RelationCopyStorageUsingBuffer(src_smgr, dst_smgr, forkNum,
+										   relpersistence);
+		}
+	}
+
+	/* Close the smgr rel */
+	smgrclose(src_smgr);
+	smgrclose(dst_smgr);
+}
+
+/* ---------------------------------------------------------------------
  *		FlushDatabaseBuffers
  *
  *		This function writes all dirty pages of a database out to disk
diff --git a/src/include/commands/dbcommands_xlog.h b/src/include/commands/dbcommands_xlog.h
index 593a857..8f59870 100644
--- a/src/include/commands/dbcommands_xlog.h
+++ b/src/include/commands/dbcommands_xlog.h
@@ -20,6 +20,7 @@
 /* record types */
 #define XLOG_DBASE_CREATE		0x00
 #define XLOG_DBASE_DROP			0x10
+#define XLOG_DBASE_CREATEDIR	0x20
 
 typedef struct xl_dbase_create_rec
 {
@@ -30,6 +31,13 @@ typedef struct xl_dbase_create_rec
 	Oid			src_tablespace_id;
 } xl_dbase_create_rec;
 
+typedef struct xl_dbase_createdir_rec
+{
+	/* Records creating database directory */
+	Oid			db_id;
+	Oid			tablespace_id;
+} xl_dbase_createdir_rec;
+
 typedef struct xl_dbase_drop_rec
 {
 	Oid			db_id;
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 7b80f58..a5659c0 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -204,6 +204,9 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
 extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels);
+extern void CreateAndCopyRelationData(RelFileNode src_rnode,
+									  RelFileNode dst_rnode,
+									  char relpersistence);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
 								   int nforks, BlockNumber *firstDelBlock);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index eaf3e7a..8d92c37 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -460,6 +460,7 @@ CoverPos
 CreateAmStmt
 CreateCastStmt
 CreateConversionStmt
+CreateDBRelInfo
 CreateDomainStmt
 CreateEnumStmt
 CreateEventTrigStmt
-- 
1.8.3.1
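
For readers following the replay path of CreateDirAndVersionFile above: the file is first created exclusively, and only during redo is an already-existing file reopened and truncated, so replaying the same record twice is harmless. Below is a minimal standalone sketch of that pattern using plain POSIX calls. The function name write_version_file and its return convention are hypothetical, chosen for illustration; the patch itself uses OpenTransientFile and ereport instead.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of the patch: create `path` exclusively and
 * write `content` into it.  When is_redo is true, an existing file is
 * truncated and rewritten instead of being treated as an error.
 * Returns 0 on success, -1 on failure with errno set.
 */
int
write_version_file(const char *path, const char *content, bool is_redo)
{
	size_t		nbytes = strlen(content);
	int			fd;

	fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (fd < 0 && errno == EEXIST && is_redo)
		fd = open(path, O_WRONLY | O_TRUNC);
	if (fd < 0)
		return -1;

	errno = 0;
	if (write(fd, content, nbytes) != (ssize_t) nbytes)
	{
		int			save_errno;

		/* If write didn't set errno, assume the problem is no disk space. */
		if (errno == 0)
			errno = ENOSPC;
		save_errno = errno;
		close(fd);
		errno = save_errno;
		return -1;
	}

	return close(fd);
}
```

Outside of redo, a second call on the same path fails with EEXIST, which is the behaviour a normal CREATE DATABASE wants; with is_redo set, the second call succeeds and leaves the same content in place.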

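Two small decisions in the patch above are easy to lose in the diff: how the strategy option is interpreted (wal_log when it is not given), and how a relation's tablespace in the source database maps to a tablespace in the new database. The sketch below models both as pure functions. All names here (CopyStrategy, parse_strategy, dest_tablespace) are hypothetical stand-ins, and an invalid strategy is returned as a value rather than raised with ereport as the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef unsigned int Oid;		/* stand-in for PostgreSQL's Oid */

/* Hypothetical mirror of the two strategies, plus an error value. */
typedef enum
{
	STRATEGY_WAL_LOG,
	STRATEGY_FILE_COPY,
	STRATEGY_INVALID
} CopyStrategy;

/* A NULL name models "option not given", which defaults to wal_log. */
CopyStrategy
parse_strategy(const char *name)
{
	if (name == NULL || strcmp(name, "wal_log") == 0)
		return STRATEGY_WAL_LOG;
	if (strcmp(name, "file_copy") == 0)
		return STRATEGY_FILE_COPY;
	return STRATEGY_INVALID;
}

/*
 * Relations stored in the source database's default tablespace go to the
 * destination database's default tablespace; relations stored anywhere
 * else stay in the tablespace they already use.
 */
Oid
dest_tablespace(Oid rel_spc, Oid src_default, Oid dst_default)
{
	return (rel_spc == src_default) ? dst_default : rel_spc;
}
```

The OID values used when calling dest_tablespace are arbitrary; only the equality with the source default matters.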
Attachment: v13-0004-New-interface-to-lock-relation-id.patch (text/x-patch; charset=US-ASCII)

From b6ecfdcae21145a2068435dff199a81957e9f502 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 24 Sep 2021 18:29:17 +0530
Subject: [PATCH v13 4/6] New interface to lock relation id

Currently, LockRelationOid provides a mechanism to lock a relation by
OID, but it requires being connected to the database to which the
relation belongs.  This patch provides a new interface that can lock
a relation even if we are not connected to the containing database.
---
 src/backend/storage/lmgr/lmgr.c | 28 ++++++++++++++++++++++++++++
 src/include/storage/lmgr.h      |  1 +
 2 files changed, 29 insertions(+)

diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 5ae52dd..1543da6 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -176,6 +176,34 @@ ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode)
 }
 
 /*
+ *		LockRelationId
+ *
+ * Lock, given a LockRelId.  Same as LockRelationOid but takes a LockRelId as
+ * input.
+ */
+void
+LockRelationId(LockRelId *relid, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+	LOCALLOCK  *locallock;
+	LockAcquireResult res;
+
+	SET_LOCKTAG_RELATION(tag, relid->dbId, relid->relId);
+
+	res = LockAcquireExtended(&tag, lockmode, false, false, true, &locallock);
+
+	/*
+	 * Now that we have the lock, check for invalidation messages; see notes
+	 * in LockRelationOid.
+	 */
+	if (res != LOCKACQUIRE_ALREADY_CLEAR)
+	{
+		AcceptInvalidationMessages();
+		MarkLockClear(locallock);
+	}
+}
+
+/*
  *		UnlockRelationId
  *
  * Unlock, given a LockRelId.  This is preferred over UnlockRelationOid
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 49edbcc..be1d2c9 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -38,6 +38,7 @@ extern void RelationInitLockInfo(Relation relation);
 
 /* Lock a relation */
 extern void LockRelationOid(Oid relid, LOCKMODE lockmode);
+extern void LockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern bool ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode);
 extern void UnlockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern void UnlockRelationOid(Oid relid, LOCKMODE lockmode);
-- 
1.8.3.1

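To make the point of LockRelationId concrete: a relation lock is identified by the pair (database id, relation id), so the only thing that changes when locking a relation in another database is where the database id comes from. The toy model below is not PostgreSQL code; ToyLockTag and the helper names are hypothetical and only illustrate why a caller-supplied database id is enough to take the same lock that backends connected to that database would take.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int Oid;		/* stand-in for PostgreSQL's Oid */

/* Toy lock identity: which database, which relation. */
typedef struct ToyLockTag
{
	Oid			db_id;
	Oid			rel_id;
} ToyLockTag;

/* Models locking by relation OID only: the backend's own database is used. */
ToyLockTag
toy_tag_for_connected_db(Oid my_db_id, Oid rel_id)
{
	ToyLockTag	tag = {my_db_id, rel_id};

	return tag;
}

/* Models locking by LockRelId: the caller names the database explicitly. */
ToyLockTag
toy_tag_for_database(Oid db_id, Oid rel_id)
{
	ToyLockTag	tag = {db_id, rel_id};

	return tag;
}

/* Two requests refer to the same lock only if both fields match. */
bool
toy_same_lock(ToyLockTag a, ToyLockTag b)
{
	return a.db_id == b.db_id && a.rel_id == b.rel_id;
}
```

In this model, a backend connected to database 5 that builds a tag for (16384, 1259) refers to the same lock as a backend connected to database 16384 locking relation 1259, and to a different lock than relation 1259 in its own database.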
Attachment: v13-0002-Extend-relmap-interfaces.patch (text/x-patch; charset=US-ASCII)

From e31ff652e30665bc12ddc7f40309e758768c3c05 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 11 Mar 2022 10:09:42 +0530
Subject: [PATCH v13 2/6] Extend relmap interfaces

Support new interfaces in relmapper: 1) copying the relmap file from
one database path to another database path, and 2) getting the
filenode from an OID.  We already have RelationMapOidToFilenode for
the latter purpose, but it assumes we are connected to the database
whose mapping we want.  The new interface does the same, but reads the
mapping for the input database instead.

These interfaces are required by the next patch, which supports
WAL-logged CREATE DATABASE.
---
 src/backend/utils/cache/relmapper.c | 60 +++++++++++++++++++++++++++++++++++++
 src/include/utils/relmapper.h       |  4 ++-
 2 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index f172f61..f5a1964 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -252,6 +252,60 @@ RelationMapFilenodeToOid(Oid filenode, bool shared)
 }
 
 /*
+ * RelationMapOidToFilenodeForDatabase
+ *
+ * Same as RelationMapOidToFilenode, but instead of reading the mapping from
+ * the database we are connected to it will read the mapping from the input
+ * database.
+ */
+Oid
+RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId)
+{
+	RelMapFile	map;
+	int			i;
+
+	/* Read the relmap file from the source database. */
+	load_relmap_file(&map, dbpath, false, ERROR);
+
+	/* Iterate over the relmap entries to find the input relation oid. */
+	for (i = 0; i < map.num_mappings; i++)
+	{
+		if (relationId == map.mappings[i].mapoid)
+			return map.mappings[i].mapfilenode;
+	}
+
+	return InvalidOid;
+}
+
+/*
+ * RelationMapCopy
+ *
+ * Copy relmapfile from source db path to the destination db path and WAL log
+ * the operation.
+ */
+void
+RelationMapCopy(Oid dbid, Oid tsid, char *srcdbpath, char *dstdbpath)
+{
+	RelMapFile map;
+
+	/*
+	 * Read the relmap file from the source database.  This function is only
+	 * called during the create database, so elevel can be ERROR.
+	 */
+	load_relmap_file(&map, srcdbpath, false, ERROR);
+
+	/*
+	 * Write map contents into the destination database's relmap file. No
+	 * sinval needed because we are creating new file while creating a new
+	 * database so no one else must be accessing this file and for the same
+	 * reason we don't need to acquire the RelationMappingLock as well.  And,
+	 * we also don't need to preserve files because we are creating a new
+	 * database so in case of an error relation files will be deleted anyway.
+	 */
+	write_relmap_file(&map, true, false, false, dbid, tsid, dstdbpath);
+}
+
+/*
  * RelationMapUpdateMap
  *
  * Install a new relfilenode mapping for the specified relation.
@@ -1033,6 +1087,12 @@ relmap_redo(XLogReaderState *record)
 		 *
 		 * There shouldn't be anyone else updating relmaps during WAL replay,
 		 * but grab the lock to interlock against read_relmap_file().
+		 *
+		 * Note - this WAL is also written for copying the relmap file while
+		 * creating a database.  Therefore, it makes no sense to acquire a
+		 * relmap lock or send sinval.  But if we want to avoid that, then we
+		 * must set an extra flag in WAL.  So let it grab the lock and send
+		 * sinval because there is no harm in that.
 		 */
 		LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
 		write_relmap_file(&newmap, false, true, false,
diff --git a/src/include/utils/relmapper.h b/src/include/utils/relmapper.h
index 9fbb5a7..f10353e 100644
--- a/src/include/utils/relmapper.h
+++ b/src/include/utils/relmapper.h
@@ -38,7 +38,9 @@ typedef struct xl_relmap_update
 extern Oid	RelationMapOidToFilenode(Oid relationId, bool shared);
 
 extern Oid	RelationMapFilenodeToOid(Oid relationId, bool shared);
-
+extern Oid RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId);
+extern void RelationMapCopy(Oid dbid, Oid tsid, char *srcdbpath,
+							char *dstdbpath);
 extern void RelationMapUpdateMap(Oid relationId, Oid fileNode, bool shared,
 								 bool immediate);
 
-- 
1.8.3.1

v13-0006-Support-create-database-strategy-in-createdb-too.patchtext/x-patch; charset=US-ASCII; name=v13-0006-Support-create-database-strategy-in-createdb-too.patchDownload

From cda2992685caa6112e9cc5d2f1fd831f0fbb1321 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 11 Mar 2022 11:48:55 +0530
Subject: [PATCH v13 6/6] Support create database strategy in createdb tool

Support create database strategy in createdb tool and add test case
---
 src/bin/scripts/createdb.c        | 10 +++++++++-
 src/bin/scripts/t/020_createdb.pl | 20 ++++++++++++++++++++
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c
index b0c6805..479b295 100644
--- a/src/bin/scripts/createdb.c
+++ b/src/bin/scripts/createdb.c
@@ -37,6 +37,7 @@ main(int argc, char *argv[])
 		{"lc-collate", required_argument, NULL, 1},
 		{"lc-ctype", required_argument, NULL, 2},
 		{"locale", required_argument, NULL, 'l'},
+		{"strategy", required_argument, NULL, 'S'},
 		{"maintenance-db", required_argument, NULL, 3},
 		{NULL, 0, NULL, 0}
 	};
@@ -61,6 +62,7 @@ main(int argc, char *argv[])
 	char	   *lc_collate = NULL;
 	char	   *lc_ctype = NULL;
 	char	   *locale = NULL;
+	char	   *strategy = NULL;
 
 	PQExpBufferData sql;
 
@@ -73,7 +75,7 @@ main(int argc, char *argv[])
 
 	handle_help_version_opts(argc, argv, "createdb", help);
 
-	while ((c = getopt_long(argc, argv, "h:p:U:wWeO:D:T:E:l:", long_options, &optindex)) != -1)
+	while ((c = getopt_long(argc, argv, "h:p:U:wWeO:D:T:E:l:S:", long_options, &optindex)) != -1)
 	{
 		switch (c)
 		{
@@ -119,6 +121,9 @@ main(int argc, char *argv[])
 			case 3:
 				maintenance_db = pg_strdup(optarg);
 				break;
+			case 'S':
+				strategy = pg_strdup(optarg);
+				break;
 			default:
 				fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
 				exit(1);
@@ -217,6 +222,8 @@ main(int argc, char *argv[])
 		appendPQExpBufferStr(&sql, " LC_CTYPE ");
 		appendStringLiteralConn(&sql, lc_ctype, conn);
 	}
+	if (strategy)
+		appendPQExpBuffer(&sql, " STRATEGY=%s ", fmtId(strategy));
 
 	appendPQExpBufferChar(&sql, ';');
 
@@ -273,6 +280,7 @@ help(const char *progname)
 	printf(_("  -l, --locale=LOCALE          locale settings for the database\n"));
 	printf(_("      --lc-collate=LOCALE      LC_COLLATE setting for the database\n"));
 	printf(_("      --lc-ctype=LOCALE        LC_CTYPE setting for the database\n"));
+	printf(_("  -S, --strategy=STRATEGY      database creation strategy wal_log or file_copy\n"));
 	printf(_("  -O, --owner=OWNER            database user to own the new database\n"));
 	printf(_("  -T, --template=TEMPLATE      template database to copy\n"));
 	printf(_("  -V, --version                output version information, then exit\n"));
diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index 6392454..2f9f3be 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -76,4 +76,24 @@ $node->command_checks_all(
 	],
 	'createdb with incorrect --lc-ctype');
 
+$node->command_checks_all(
+	[ 'createdb', '--strategy', "foo", 'foobar2' ],
+	1,
+	[qr/^$/],
+	[
+		qr/^createdb: error: database creation failed: ERROR:  invalid create database strategy|^createdb: error: database creation failed: ERROR:  invalid create database strategy foo/s
+	],
+	'createdb with incorrect --strategy');
+
+# Check database creation strategy
+$node->issues_sql_like(
+	[ 'createdb', '-T', 'foobar2', 'foobar4', '-S', 'wal_log'],
+	qr/statement: CREATE DATABASE foobar4 TEMPLATE foobar2 STRATEGY=wal_log/,
+	'create database with WAL_LOG strategy');
+
+$node->issues_sql_like(
+	[ 'createdb', '-T', 'foobar2', 'foobar5', '-S', 'file_copy'],
+	qr/statement: CREATE DATABASE foobar5 TEMPLATE foobar2 STRATEGY=file_copy/,
+	'create database with FILE_COPY strategy');
+
 done_testing();
-- 
1.8.3.1

#146

Ashutosh Sharma

ashu.coek88@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#145)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

You may also need to add documentation to app-createdb.sgml. Currently
you have only added it to create_database.sgml. Also, I had a quick
look at the new changes in v13-0005-WAL-logged-CREATE-DATABASE.patch
and they seemed fine to me, although I haven't paid much attention to
the comments and other cosmetic stuff.

--
With Regards,
Ashutosh Sharma.

On Fri, Mar 11, 2022 at 3:51 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Mar 11, 2022 at 11:52 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Mar 10, 2022 at 10:18 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Mar 10, 2022 at 6:02 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have completely changed the logic for this refactoring. Basically,
write_relmap_file() already has parameters to control whether to write
WAL and send invalidations, and we are already passing the dbpath.
Instead of making a new function, I just pass one additional parameter
to this function indicating whether we are creating a new map or not.
I think with that the changes are much smaller, and this looks cleaner
to me. Similarly, for load_relmap_file() I just had to pass the dbpath
and the memory for the destination map. Please have a look and let me
know your thoughts.

It's not terrible, but how about something like the attached instead?
I think this has the effect of reducing the number of cases that the
low-level code needs to know about from 2 to 1, instead of making it
go up from 2 to 3.

Yeah this looks cleaner, I will rebase the remaining patch.

Here is the updated version of the patch set.

Changes: 1) It takes Robert's patch as the first refactoring patch.
2) The other new relmapper APIs are rebased on top of that in 0002.
3) Some code refactoring in the main patch 0005, plus one fix: earlier
I had removed ForceSyncCommit() from the wal_log method, but IMHO it is
equally valid whether we use file_copy or wal_log, because in both
cases we are creating disk files. 4) Support for the strategy option in
the createdb tool, with a test case, as part of 0006.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#147

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Ashutosh Sharma (#143)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Fri, Mar 11, 2022 at 12:15 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Looks better, but why do you want to pass elevel to the
load_relmap_file()? Are we calling this function from somewhere other
than read_relmap_file()? If not, do we have any plans to call this
function directly bypassing read_relmap_file for any upcoming patch?

If it fails during CREATE DATABASE, it should be ERROR, not FATAL. In
that case, we only need to stop trying to create a database; we don't
need to terminate the session. On the other hand if we can't read our
own database's relmap files, that's an unrecoverable error, because we
will not be able to run any queries at all, so FATAL is appropriate.
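
A minimal sketch of the rule described above, written as a standalone toy rather than PostgreSQL code. The enum and function names here (FailureLevel, RelmapCaller, relmap_failure_level, session_survives) are hypothetical stand-ins and do not appear in the patch; they only model the idea that the caller, not the loader, decides how severe a failed relmap read is.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for PostgreSQL's ERROR and FATAL severities. */
typedef enum FailureLevel
{
	LEVEL_ERROR,				/* abort the current statement only */
	LEVEL_FATAL					/* terminate the whole session */
} FailureLevel;

/* Hypothetical classification of who wants to read a relmap file. */
typedef enum RelmapCaller
{
	CALLER_OWN_DATABASE,		/* backend reading its own database's map */
	CALLER_CREATE_DATABASE		/* CREATE DATABASE reading the template's map */
} RelmapCaller;

/*
 * Pick the severity to pass down to the loader.  Failing to read our own
 * map leaves the session unable to run any query, so it is fatal; failing
 * to read a template's map only means that CREATE DATABASE cannot proceed.
 */
static FailureLevel
relmap_failure_level(RelmapCaller caller)
{
	assert(caller == CALLER_OWN_DATABASE || caller == CALLER_CREATE_DATABASE);

	return (caller == CALLER_OWN_DATABASE) ? LEVEL_FATAL : LEVEL_ERROR;
}

/* In this toy model the session keeps running only after an ERROR. */
static bool
session_survives(FailureLevel level)
{
	return level == LEVEL_ERROR;
}
```

The point of passing the level as a parameter is that the low-level loader stays the same for both callers, and only the call site encodes how recoverable a failure is.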

--
Robert Haas
EDB: http://www.enterprisedb.com

#148

Ashutosh Sharma

ashu.coek88@gmail.com

almost 4 years ago

In reply to: Robert Haas (#147)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Fri, Mar 11, 2022 at 8:21 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Mar 11, 2022 at 12:15 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Looks better, but why do you want to pass elevel to the
load_relmap_file()? Are we calling this function from somewhere other
than read_relmap_file()? If not, do we have any plans to call this
function directly bypassing read_relmap_file for any upcoming patch?

If it fails during CREATE DATABASE, it should be ERROR, not FATAL. In
that case, we only need to stop trying to create a database; we don't
need to terminate the session. On the other hand if we can't read our
own database's relmap files, that's an unrecoverable error, because we
will not be able to run any queries at all, so FATAL is appropriate.

OK. I can see it being used in the v13 patch. In the previous patches
it was hard-coded with FATAL. Also, we simply error out when doing
file copy as I can see in the copy_file function. So yes FATAL is not
the right option to use when creating a database. Thanks.

--
With Regards,
Ashutosh Sharma.

#149

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#145)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Fri, Mar 11, 2022 at 5:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Changes: 1) It takes Robert's patch as the first refactoring patch.
2) The other new relmapper APIs are rebased on top of that in 0002.
3) Some code refactoring in the main patch 0005, plus one fix: earlier
I had removed ForceSyncCommit() from the wal_log method, but IMHO it is
equally valid whether we use file_copy or wal_log, because in both
cases we are creating disk files. 4) Support for the strategy option in
the createdb tool, with a test case, as part of 0006.

I don't think you've adequately considered temporary relations here.
It seems to me that ReadBufferWithoutRelcache() could not be safe on a
temprel, because we'd need a BackendId to access the underlying
storage. So I think that ReadBufferWithoutRelcache can only accept
unlogged or permanent, and maybe the argument ought to be a Boolean
instead of a relpersistence value. I thought that this problem might
be only cosmetic, but I checked the code that actually does the copy,
and there's no filter there on relpersistence either. And I think
there should be.
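
A minimal sketch of the kind of relpersistence filter meant here, assuming a simplified, hypothetical SourceRel struct in place of whatever the patch builds while scanning the source pg_class. The three persistence codes are the values stored in pg_class.relpersistence; the function names are illustrative only and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Values stored in pg_class.relpersistence. */
#define RELPERSISTENCE_PERMANENT	'p'
#define RELPERSISTENCE_UNLOGGED		'u'
#define RELPERSISTENCE_TEMP			't'

/* Hypothetical, simplified view of one row scanned from pg_class. */
typedef struct SourceRel
{
	unsigned int relfilenode;
	char		relpersistence;
} SourceRel;

/*
 * Only permanent and unlogged relations can be read without a backend id,
 * so those are the only ones worth copying into the new database.
 */
static bool
rel_is_copyable(char relpersistence)
{
	return relpersistence == RELPERSISTENCE_PERMANENT ||
		relpersistence == RELPERSISTENCE_UNLOGGED;
}

/*
 * Compact the array in place so that it holds only copyable relations,
 * preserving their order.  Returns the number of entries kept.
 */
static size_t
filter_copyable_rels(SourceRel *rels, size_t nrels)
{
	size_t		kept = 0;

	assert(rels != NULL || nrels == 0);

	for (size_t i = 0; i < nrels; i++)
	{
		if (rel_is_copyable(rels[i].relpersistence))
			rels[kept++] = rels[i];
	}

	return kept;
}
```

Expressing the check as a yes/no predicate also matches the suggestion that the low-level argument could be a Boolean (permanent or not) rather than a full relpersistence value.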

--
Robert Haas
EDB: http://www.enterprisedb.com

#150

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Robert Haas (#149)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Fri, Mar 11, 2022 at 1:10 PM Robert Haas <robertmhaas@gmail.com> wrote:

I don't think you've adequately considered temporary relations here.
It seems to me that ReadBufferWithoutRelcache() could not be safe on a
temprel, because we'd need a BackendId to access the underlying
storage. So I think that ReadBufferWithoutRelcache can only accept
unlogged or permanent, and maybe the argument ought to be a Boolean
instead of a relpersistence value. I thought that this problem might
be only cosmetic, but I checked the code that actually does the copy,
and there's no filter there on relpersistence either. And I think
there should be.

I hit "send" too quickly there:

rhaas=# create database fudge;
CREATE DATABASE
rhaas=# \c fudge
You are now connected to database "fudge" as user "rhaas".
fudge=# create temp table q ();
CREATE TABLE
fudge=# ^Z
[2]+ Stopped psql
[rhaas Downloads]$ pg_ctl stop -mi
waiting for server to shut down.... done
server stopped
[rhaas Downloads]$ %%
psql
\c
You are now connected to database "fudge" as user "rhaas".
fudge=# select * from pg_class where relpersistence='t';
oid | relname | relnamespace | reltype | reloftype | relowner |
relam | relfilenode | reltablespace | relpages | reltuples |
relallvisible | reltoastrelid | relhasindex | relisshared |
relpersistence | relkind | relnatts | relchecks | relhasrules |
relhastriggers | relhassubclass | relrowsecurity | relforcerowsecurity
| relispopulated | relreplident | relispartition | relrewrite |
relfrozenxid | relminmxid | relacl | reloptions | relpartbound
-------+---------+--------------+---------+-----------+----------+-------+-------------+---------------+----------+-----------+---------------+---------------+-------------+-------------+----------------+---------+----------+-----------+-------------+----------------+----------------+----------------+---------------------+----------------+--------------+----------------+------------+--------------+------------+--------+------------+--------------
16388 | q | 16386 | 16390 | 0 | 10 |
2 | 16388 | 0 | 0 | -1 | 0
| 0 | f | f | t | r
| 0 | 0 | f | f | f
| f | f | t | d
| f | 0 | 721 | 1 | |
|
(1 row)

fudge=# \c rhaas
You are now connected to database "rhaas" as user "rhaas".
rhaas=# alter database fudge is_template true;
ALTER DATABASE
rhaas=# create database cookies template fudge;
CREATE DATABASE
rhaas=# \c cookies
You are now connected to database "cookies" as user "rhaas".
cookies=# select count(*) from pg_class where relpersistence='t';
count
-------
1
(1 row)

You have to be quick, because autovacuum will drop the orphaned temp
table when it notices it, but it is possible.

--
Robert Haas
EDB: http://www.enterprisedb.com

#151

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#145)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Fri, Mar 11, 2022 at 5:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Changes: 1) It takes Robert's patch as the first refactoring patch.
2) The other new relmapper APIs are rebased on top of that in 0002.
3) Some code refactoring in the main patch 0005, plus one fix: earlier
I had removed ForceSyncCommit() from the wal_log method, but IMHO it is
equally valid whether we use file_copy or wal_log, because in both
cases we are creating disk files. 4) Support for the strategy option in
the createdb tool, with a test case, as part of 0006.

I think there's something wrong with what this patch is doing with the
XLOG records. It adds XLOG_DBASE_CREATEDIR, but the only new
XLogInsert() calls in the patch are passing XLOG_DBASE_CREATE, and no
existing references are adjusted. Similarly with xl_dbase_create_rec
and xl_dbase_createdir_rec. Why would we introduce a new record type
and not use it?

Let's not call the functions for the different strategies
CopyDatabase() and CopyDatabaseWithWal() but rather something that
matches up to the strategy names e.g. create_database_using_wal_log()
and create_database_using_file_copy(). There's something a little
funny about the names wal_log and file_copy ... they're not quite
parallel grammatically. But it's probably OK.

The changes to createdb_failure_params make me a little nervous. I
think we'd be in real trouble if we failed before completing both
DropDatabaseBuffers() and ForgetDatabaseSyncRequests(). However, it
looks to me like those are both intended to be no-fail operations, so
I don't see an actual hazard here. But, hmm, what about on the
recovery side? Suppose that we start copying the database block by
block and then either (a) the standby is promoted before the copy is
finished or (b) the copy fails. Now the standby has data in
shared_buffers for a database that does not exist. If that's not bad,
then why does createdb_failure_params need to DropDatabaseBuffers()?
But I bet it is bad. I wonder if we should be using
RelationCopyStorage() rather than this new function
RelationCopyStorageUsingBuffer(). That would avoid having the buffers
in shared_buffers, dodging the problem. But then we have a problem
with checkpoint interlocking: we could begin replay from a checkpoint
and find that the pages that were supposed to get copied prior to the
checkpoint were actually not copied, because the checkpoint record
could be written after we've logged a page being copied and before we
actually write the page. Or, we could crash after writing a whole lot
of pages and a checkpoint record, but before RelationCopyStorage()
fsyncs the destination fork. It doesn't seem advisable to hold off
checkpoints for the time it takes to copy an entire relation fork, so
the solution is apparently to keep the data in shared buffers after
all. But that brings us right back to square one. Have you thought
through this whole problem carefully? It seems like a total mess to me
at the moment, but maybe I'm missing something.

There seems to be no reason to specify specific values for the members
of enum CreateDBStrategy.
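
A minimal sketch of what that could look like, assuming hypothetical member names (CREATEDB_WAL_LOG, CREATEDB_FILE_COPY) and a hypothetical parse_createdb_strategy() helper; the identifiers in the patch may differ. The sketch compares names case-sensitively, which is also an assumption. Since nothing depends on the numeric values, the compiler can assign them.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical strategy enum.  No explicit values are given because the
 * code only ever compares members against each other.
 */
typedef enum CreateDBStrategy
{
	CREATEDB_WAL_LOG,
	CREATEDB_FILE_COPY
} CreateDBStrategy;

/*
 * Map a strategy name to its enum member.  Returns 1 and sets *strategy
 * on success, or 0 (leaving *strategy untouched) for an unknown name.
 */
static int
parse_createdb_strategy(const char *name, CreateDBStrategy *strategy)
{
	assert(name != NULL && strategy != NULL);

	if (strcmp(name, "wal_log") == 0)
	{
		*strategy = CREATEDB_WAL_LOG;
		return 1;
	}
	if (strcmp(name, "file_copy") == 0)
	{
		*strategy = CREATEDB_FILE_COPY;
		return 1;
	}

	return 0;
}
```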

I think the naming of some of the new functions might need work, in
particular GetRelInfoFromTuple, GetRelListFromPage, and
GetDatabaseRelationList. The names seem kind of generic for what
they're doing. Maybe ScanSourceDatabasePgClass,
ScanSourceDatabasePgClassPage, ScanSourceDatabasePgClassTuple?

--
Robert Haas
EDB: http://www.enterprisedb.com

#152

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Robert Haas (#151)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Sat, Mar 12, 2022 at 1:55 AM Robert Haas <robertmhaas@gmail.com> wrote:

Responding to this specific issue..

The changes to createdb_failure_params make me a little nervous. I
think we'd be in real trouble if we failed before completing both
DropDatabaseBuffers() and ForgetDatabaseSyncRequests(). However, it
looks to me like those are both intended to be no-fail operations, so
I don't see an actual hazard here.

I might be missing something, but even without that I do not see a real
problem here. The reason we drop the database buffers and the pending
sync requests is that right after this we remove the underlying files.
If we removed the files without first dropping the buffers from the
buffer cache, the checkpointer would fail while trying to flush those
buffers.

But, hmm, what about on the recovery side? Suppose that we start
copying the database block by block and then either (a) the standby is
promoted before the copy is finished or (b) the copy fails.

I think the above logic will be valid in case of standby as well
because we are not really deleting the underlying files.

Now the standby has data in shared_buffers for a database that does not
exist. If that's not bad, then why does createdb_failure_params need to
DropDatabaseBuffers()? But I bet it is bad. I wonder if we should be
using RelationCopyStorage() rather than this new function
RelationCopyStorageUsingBuffer().

I am not sure how RelationCopyStorage() would help on the standby side.
Even then we would log the same WAL (XLOG_FPI) for each page, and the
standby would apply each FPI through a buffer. So if you think there is
a problem here, it would be the same with RelationCopyStorage(), at
least on the standby side.

In fact, when we rewrite a relation during VACUUM FULL, we also call
log_newpage() under RelationCopyStorage(), and if the standby gets
promoted during replay, there will be some buffers in the buffer pool
with the new relfilenode. So I think our case is the same.

So my position is that we need to drop the database buffers and remove
the pending sync requests only because we are deleting the underlying
files. If in some extreme case we do not delete the files, then there
is no need to drop the buffers or remove the pending sync requests
either, and the worst consequence would be wasted disk space.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#153

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Robert Haas (#150)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Fri, Mar 11, 2022 at 11:51 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Mar 11, 2022 at 1:10 PM Robert Haas <robertmhaas@gmail.com> wrote:

I don't think you've adequately considered temporary relations here.
It seems to me that ReadBufferWithoutRelcache() could not be safe on a
temprel, because we'd need a BackendId to access the underlying
storage. So I think that ReadBufferWithoutRelcache can only accept
unlogged or permanent, and maybe the argument ought to be a Boolean
instead of a relpersistence value. I thought that this problem might
be only cosmetic, but I checked the code that actually does the copy,
and there's no filter there on relpersistence either. And I think
there should be.

Yeah, right. For now, this API can only support unlogged or permanent
relations. I will fix this.

I hit "send" too quickly there:

rhaas=# create database fudge;
CREATE DATABASE
rhaas=# \c fudge
You are now connected to database "fudge" as user "rhaas".
fudge=# create temp table q ();
CREATE TABLE
fudge=# ^Z
[2]+ Stopped psql
[rhaas Downloads]$ pg_ctl stop -mi
waiting for server to shut down.... done
server stopped
[rhaas Downloads]$ %%
psql
\c
You are now connected to database "fudge" as user "rhaas".
fudge=# select * from pg_class where relpersistence='t';
oid | relname | relnamespace | reltype | reloftype | relowner |
relam | relfilenode | reltablespace | relpages | reltuples |
relallvisible | reltoastrelid | relhasindex | relisshared |
relpersistence | relkind | relnatts | relchecks | relhasrules |
relhastriggers | relhassubclass | relrowsecurity | relforcerowsecurity
| relispopulated | relreplident | relispartition | relrewrite |
relfrozenxid | relminmxid | relacl | reloptions | relpartbound
-------+---------+--------------+---------+-----------+----------+-------+-------------+---------------+----------+-----------+---------------+---------------+-------------+-------------+----------------+---------+----------+-----------+-------------+----------------+----------------+----------------+---------------------+----------------+--------------+----------------+------------+--------------+------------+--------+------------+--------------
16388 | q | 16386 | 16390 | 0 | 10 |
2 | 16388 | 0 | 0 | -1 | 0
| 0 | f | f | t | r
| 0 | 0 | f | f | f
| f | f | t | d
| f | 0 | 721 | 1 | |
|
(1 row)

fudge=# \c rhaas
You are now connected to database "rhaas" as user "rhaas".
rhaas=# alter database fudge is_template true;
ALTER DATABASE
rhaas=# create database cookies template fudge;
CREATE DATABASE
rhaas=# \c cookies
You are now connected to database "cookies" as user "rhaas".
cookies=# select count(*) from pg_class where relpersistence='t';
count
-------
1
(1 row)

I think this is not the right example to show the problem, no? You are
showing the pg_class entry, and pg_class itself is not a temp relation,
so even if we avoid copying the temp relation, pg_class will still be
copied, right? So you will still see this uncleaned temp relation
entry. I could reproduce exactly the same issue without my patch as
well.

So I agree we need to avoid copying temp relations, but the above
behavior will not change with that. Am I missing something?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#154

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#152)

6 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Sat, Mar 12, 2022 at 11:06 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

In fact, when we rewrite a relation during VACUUM FULL, we also call
log_newpage() under RelationCopyStorage(), and if the standby gets
promoted during replay, there will be some buffers in the buffer pool
with the new relfilenode. So I think our case is the same.

So my position is that we need to drop the database buffers and remove
the pending sync requests only because we are deleting the underlying
files. If in some extreme case we do not delete the files, then there
is no need to drop the buffers or remove the pending sync requests
either, and the worst consequence would be wasted disk space.

Other than this open point, I have addressed your other comments, which
include:

- Avoid copying temp relfilenodes
- Rename functions CopyDatabase* -> CreateDatabaseUsing*
- Rename GetDatabaseRelationList and friends to ScanSourceDatabasePgClass*
- Remove the unused structure and macro, because we use the same WAL
record both for copying the database with the old method and for
creating the directory and version file with the new method. Do you
think we should introduce a new WAL record for that instead of reusing
the same one?

Other than that, I have fixed some mistakes in comments and added tab
completion support for the new options.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v14-0002-Extend-relmap-interfaces.patchtext/x-patch; charset=US-ASCII; name=v14-0002-Extend-relmap-interfaces.patchDownload

From c64ee04a08ee1f9b06878d88b84db2a184a45f69 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 11 Mar 2022 10:09:42 +0530
Subject: [PATCH v14 2/6] Extend relmap interfaces

Support new interfaces in relmapper, 1) Support copying the
relmap file from one database path to the other database path.
2) And another interface for getting filenode from oid.  We already
have RelationMapOidToFilenode for the same purpose but that assumes
we are connected to the database for which we want to get the mapping.
So this new interface will do the same but instead, it will get the
mapping for the input database.

These interfaces are required for next patch, for supporting the
wal logged created database.
---
 src/backend/utils/cache/relmapper.c | 60 +++++++++++++++++++++++++++++++++++++
 src/include/utils/relmapper.h       |  4 ++-
 2 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index f172f61..f5a1964 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -252,6 +252,60 @@ RelationMapFilenodeToOid(Oid filenode, bool shared)
 }
 
 /*
+ * RelationMapOidToFilenodeForDatabase
+ *
+ * Same as RelationMapOidToFilenode, but instead of reading the mapping from
+ * the database we are connected to it will read the mapping from the input
+ * database.
+ */
+Oid
+RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId)
+{
+	RelMapFile	map;
+	int			i;
+
+	/* Read the relmap file from the source database. */
+	load_relmap_file(&map, dbpath, false, ERROR);
+
+	/* Iterate over the relmap entries to find the input relation oid. */
+	for (i = 0; i < map.num_mappings; i++)
+	{
+		if (relationId == map.mappings[i].mapoid)
+			return map.mappings[i].mapfilenode;
+	}
+
+	return InvalidOid;
+}
+
+/*
+ * RelationMapCopy
+ *
+ * Copy relmapfile from source db path to the destination db path and WAL log
+ * the operation.
+ */
+void
+RelationMapCopy(Oid dbid, Oid tsid, char *srcdbpath, char *dstdbpath)
+{
+	RelMapFile map;
+
+	/*
+	 * Read the relmap file from the source database.  This function is only
+	 * called during the create database, so elevel can be ERROR.
+	 */
+	load_relmap_file(&map, srcdbpath, false, ERROR);
+
+	/*
+	 * Write map contents into the destination database's relmap file. No
+	 * sinval needed because we are creating new file while creating a new
+	 * database so no one else must be accessing this file and for the same
+	 * reason we don't need to acquire the RelationMappingLock as well.  And,
+	 * we also don't need to preserve files because we are creating a new
+	 * database so in case of an error relation files will be deleted anyway.
+	 */
+	write_relmap_file(&map, true, false, false, dbid, tsid, dstdbpath);
+}
+
+/*
  * RelationMapUpdateMap
  *
  * Install a new relfilenode mapping for the specified relation.
@@ -1033,6 +1087,12 @@ relmap_redo(XLogReaderState *record)
 		 *
 		 * There shouldn't be anyone else updating relmaps during WAL replay,
 		 * but grab the lock to interlock against read_relmap_file().
+		 *
+		 * Note - this WAL is also written for copying the relmap file while
+		 * creating a database.  Therefore, it makes no sense to acquire a
+		 * relmap lock or send sinval.  But if we want to avoid that, then we
+		 * must set an extra flag in WAL.  So let it grab the lock and send
+		 * sinval because there is no harm in that.
 		 */
 		LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
 		write_relmap_file(&newmap, false, true, false,
diff --git a/src/include/utils/relmapper.h b/src/include/utils/relmapper.h
index 9fbb5a7..f10353e 100644
--- a/src/include/utils/relmapper.h
+++ b/src/include/utils/relmapper.h
@@ -38,7 +38,9 @@ typedef struct xl_relmap_update
 extern Oid	RelationMapOidToFilenode(Oid relationId, bool shared);
 
 extern Oid	RelationMapFilenodeToOid(Oid relationId, bool shared);
-
+extern Oid RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId);
+extern void RelationMapCopy(Oid dbid, Oid tsid, char *srcdbpath,
+							char *dstdbpath);
 extern void RelationMapUpdateMap(Oid relationId, Oid fileNode, bool shared,
 								 bool immediate);
 
-- 
1.8.3.1
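
The lookup loop at the top of this patch (the tail of RelationMapOidToFilenodeForDatabase) can be illustrated outside the backend. The following is only a minimal sketch: the types `SketchMapping` and `SketchMapFile`, the limit `SKETCH_MAX_MAPPINGS`, and the function `sketch_oid_to_filenode` are hypothetical stand-ins invented here, not the definitions in relmapper.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the PostgreSQL types; not the real definitions. */
typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)
#define SKETCH_MAX_MAPPINGS 8	/* arbitrary limit chosen for this sketch */

typedef struct SketchMapping
{
	Oid			mapoid;			/* OID of a catalog */
	Oid			mapfilenode;	/* its filenode identifier */
} SketchMapping;

typedef struct SketchMapFile
{
	int			num_mappings;	/* number of valid entries */
	SketchMapping mappings[SKETCH_MAX_MAPPINGS];
} SketchMapFile;

/*
 * Linear search of an already-loaded map: return the filenode mapped to
 * relationId, or InvalidOid if the relation is not a mapped one.
 */
static Oid
sketch_oid_to_filenode(const SketchMapFile *map, Oid relationId)
{
	int			i;

	assert(map != NULL);
	assert(map->num_mappings >= 0 && map->num_mappings <= SKETCH_MAX_MAPPINGS);

	for (i = 0; i < map->num_mappings; i++)
	{
		if (map->mappings[i].mapoid == relationId)
			return map->mappings[i].mapfilenode;
	}

	return InvalidOid;
}
```

A caller that gets InvalidOid back has to treat the relation as unmapped; the patch itself asserts that a valid filenode was found before using it.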

v14-0001-Refactor-relmap-load-and-relmap-write-functions.patch (text/x-patch; charset=US-ASCII)

From 90dbeafda334018677c4bea2831e4311950442e6 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 11 Mar 2022 09:03:09 +0530
Subject: [PATCH v14 1/6] Refactor relmap load and relmap write functions

Currently, the relmap reading and writing interfaces are tightly
coupled to the shared_map and local_map of the database the backend
is connected to.  The later patches in this set need interfaces that
can read a relmap file into caller-supplied memory and write out a
map passed in by the caller.

So this patch refactors the existing code to expose read and write
interfaces that are independent of shared_map and local_map, without
changing any logic.

Author: Robert Haas
---
 src/backend/utils/cache/relmapper.c | 147 ++++++++++++++++++------------------
 1 file changed, 74 insertions(+), 73 deletions(-)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 4f6811f..f172f61 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -136,9 +136,11 @@ static void apply_map_update(RelMapFile *map, Oid relationId, Oid fileNode,
 							 bool add_okay);
 static void merge_map_updates(RelMapFile *map, const RelMapFile *updates,
 							  bool add_okay);
-static void load_relmap_file(bool shared, bool lock_held);
-static void write_relmap_file(bool shared, RelMapFile *newmap,
-							  bool write_wal, bool send_sinval, bool preserve_files,
+static void read_relmap_file(bool shared, bool lock_held);
+static void load_relmap_file(RelMapFile *map, char *dbpath, bool lock_held,
+							 int elevel);
+static void write_relmap_file(RelMapFile *newmap, bool write_wal,
+							  bool send_sinval, bool preserve_files,
 							  Oid dbid, Oid tsid, const char *dbpath);
 static void perform_relmap_update(bool shared, const RelMapFile *updates);
 
@@ -405,12 +407,12 @@ RelationMapInvalidate(bool shared)
 	if (shared)
 	{
 		if (shared_map.magic == RELMAPPER_FILEMAGIC)
-			load_relmap_file(true, false);
+			read_relmap_file(true, false);
 	}
 	else
 	{
 		if (local_map.magic == RELMAPPER_FILEMAGIC)
-			load_relmap_file(false, false);
+			read_relmap_file(false, false);
 	}
 }
 
@@ -425,9 +427,9 @@ void
 RelationMapInvalidateAll(void)
 {
 	if (shared_map.magic == RELMAPPER_FILEMAGIC)
-		load_relmap_file(true, false);
+		read_relmap_file(true, false);
 	if (local_map.magic == RELMAPPER_FILEMAGIC)
-		load_relmap_file(false, false);
+		read_relmap_file(false, false);
 }
 
 /*
@@ -568,9 +570,9 @@ RelationMapFinishBootstrap(void)
 	Assert(pending_local_updates.num_mappings == 0);
 
 	/* Write the files; no WAL or sinval needed */
-	write_relmap_file(true, &shared_map, false, false, false,
-					  InvalidOid, GLOBALTABLESPACE_OID, NULL);
-	write_relmap_file(false, &local_map, false, false, false,
+	write_relmap_file(&shared_map, false, false, false,
+					  InvalidOid, GLOBALTABLESPACE_OID, "global");
+	write_relmap_file(&local_map, false, false, false,
 					  MyDatabaseId, MyDatabaseTableSpace, DatabasePath);
 }
 
@@ -612,7 +614,7 @@ RelationMapInitializePhase2(void)
 	/*
 	 * Load the shared map file, die on error.
 	 */
-	load_relmap_file(true, false);
+	read_relmap_file(true, false);
 }
 
 /*
@@ -633,7 +635,7 @@ RelationMapInitializePhase3(void)
 	/*
 	 * Load the local map file, die on error.
 	 */
-	load_relmap_file(false, false);
+	read_relmap_file(false, false);
 }
 
 /*
@@ -687,39 +689,48 @@ RestoreRelationMap(char *startAddress)
 }
 
 /*
- * load_relmap_file -- load data from the shared or local map file
+ * read_relmap_file -- load the shared or local map file
  *
- * Because the map file is essential for access to core system catalogs,
- * failure to read it is a fatal error.
+ * Because these files are essential for access to core system catalogs,
+ * failure to load either of them is a fatal error.
  *
  * Note that the local case requires DatabasePath to be set up.
  */
 static void
-load_relmap_file(bool shared, bool lock_held)
+read_relmap_file(bool shared, bool lock_held)
+{
+	if (shared)
+		load_relmap_file(&shared_map, "global", lock_held, FATAL);
+	else
+		load_relmap_file(&local_map, DatabasePath, lock_held, FATAL);
+}
+
+/*
+ * load_relmap_file -- load data from any relation mapper file
+ *
+ * dbpath must be the relevant database path, or "global" for shared relations.
+ *
+ * RelationMappingLock will be acquired and released unless lock_held = true.
+ *
+ * Errors will be reported at the indicated elevel, which should be at least
+ * ERROR.
+ */
+static void
+load_relmap_file(RelMapFile *map, char *dbpath, bool lock_held, int elevel)
 {
-	RelMapFile *map;
 	char		mapfilename[MAXPGPATH];
 	pg_crc32c	crc;
 	int			fd;
 	int			r;
 
-	if (shared)
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
-				 RELMAPPER_FILENAME);
-		map = &shared_map;
-	}
-	else
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 DatabasePath, RELMAPPER_FILENAME);
-		map = &local_map;
-	}
+	Assert(elevel >= ERROR);
 
-	/* Read data ... */
+	/* Open the target file. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s", dbpath,
+			 RELMAPPER_FILENAME);
 	fd = OpenTransientFile(mapfilename, O_RDONLY | PG_BINARY);
 	if (fd < 0)
-		ereport(FATAL,
+		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not open file \"%s\": %m",
 						mapfilename)));
@@ -734,16 +745,17 @@ load_relmap_file(bool shared, bool lock_held)
 	if (!lock_held)
 		LWLockAcquire(RelationMappingLock, LW_SHARED);
 
+	/* Now read the data. */
 	pgstat_report_wait_start(WAIT_EVENT_RELATION_MAP_READ);
 	r = read(fd, map, sizeof(RelMapFile));
 	if (r != sizeof(RelMapFile))
 	{
 		if (r < 0)
-			ereport(FATAL,
+			ereport(elevel,
 					(errcode_for_file_access(),
 					 errmsg("could not read file \"%s\": %m", mapfilename)));
 		else
-			ereport(FATAL,
+			ereport(elevel,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg("could not read file \"%s\": read %d of %zu",
 							mapfilename, r, sizeof(RelMapFile))));
@@ -754,7 +766,7 @@ load_relmap_file(bool shared, bool lock_held)
 		LWLockRelease(RelationMappingLock);
 
 	if (CloseTransientFile(fd) != 0)
-		ereport(FATAL,
+		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m",
 						mapfilename)));
@@ -763,7 +775,7 @@ load_relmap_file(bool shared, bool lock_held)
 	if (map->magic != RELMAPPER_FILEMAGIC ||
 		map->num_mappings < 0 ||
 		map->num_mappings > MAX_MAPPINGS)
-		ereport(FATAL,
+		ereport(elevel,
 				(errmsg("relation mapping file \"%s\" contains invalid data",
 						mapfilename)));
 
@@ -773,7 +785,7 @@ load_relmap_file(bool shared, bool lock_held)
 	FIN_CRC32C(crc);
 
 	if (!EQ_CRC32C(crc, map->crc))
-		ereport(FATAL,
+		ereport(elevel,
 				(errmsg("relation mapping file \"%s\" contains incorrect checksum",
 						mapfilename)));
 }
@@ -795,16 +807,16 @@ load_relmap_file(bool shared, bool lock_held)
  *
  * Because this may be called during WAL replay when MyDatabaseId,
  * DatabasePath, etc aren't valid, we require the caller to pass in suitable
- * values.  The caller is also responsible for being sure no concurrent
- * map update could be happening.
+ * values. Pass dbpath as "global" for the shared map.
+ *
+ * The caller is also responsible for being sure no concurrent map update
+ * could be happening.
  */
 static void
-write_relmap_file(bool shared, RelMapFile *newmap,
-				  bool write_wal, bool send_sinval, bool preserve_files,
-				  Oid dbid, Oid tsid, const char *dbpath)
+write_relmap_file(RelMapFile *newmap, bool write_wal, bool send_sinval,
+				  bool preserve_files, Oid dbid, Oid tsid, const char *dbpath)
 {
 	int			fd;
-	RelMapFile *realmap;
 	char		mapfilename[MAXPGPATH];
 
 	/*
@@ -822,19 +834,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 	 * Open the target file.  We prefer to do this before entering the
 	 * critical section, so that an open() failure need not force PANIC.
 	 */
-	if (shared)
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
-				 RELMAPPER_FILENAME);
-		realmap = &shared_map;
-	}
-	else
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 dbpath, RELMAPPER_FILENAME);
-		realmap = &local_map;
-	}
-
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 dbpath, RELMAPPER_FILENAME);
 	fd = OpenTransientFile(mapfilename, O_WRONLY | O_CREAT | PG_BINARY);
 	if (fd < 0)
 		ereport(ERROR,
@@ -934,16 +935,6 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		}
 	}
 
-	/*
-	 * Success, update permanent copy.  During bootstrap, we might be working
-	 * on the permanent copy itself, in which case skip the memcpy() to avoid
-	 * invoking nominally-undefined behavior.
-	 */
-	if (realmap != newmap)
-		memcpy(realmap, newmap, sizeof(RelMapFile));
-	else
-		Assert(!send_sinval);	/* must be bootstrapping */
-
 	/* Critical section done */
 	if (write_wal)
 		END_CRIT_SECTION();
@@ -975,7 +966,7 @@ perform_relmap_update(bool shared, const RelMapFile *updates)
 	LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
 
 	/* Be certain we see any other updates just made */
-	load_relmap_file(shared, true);
+	read_relmap_file(shared, true);
 
 	/* Prepare updated data in a local variable */
 	if (shared)
@@ -990,10 +981,19 @@ perform_relmap_update(bool shared, const RelMapFile *updates)
 	merge_map_updates(&newmap, updates, allowSystemTableMods);
 
 	/* Write out the updated map and do other necessary tasks */
-	write_relmap_file(shared, &newmap, true, true, true,
+	write_relmap_file(&newmap, true, true, true,
 					  (shared ? InvalidOid : MyDatabaseId),
 					  (shared ? GLOBALTABLESPACE_OID : MyDatabaseTableSpace),
-					  DatabasePath);
+					  (shared ? "global" : DatabasePath));
+
+	/*
+	 * We successfully wrote the updated file, so it's now safe to rely on the
+	 * new values in this process, too.
+	 */
+	if (shared)
+		memcpy(&shared_map, &newmap, sizeof(RelMapFile));
+	else
+		memcpy(&local_map, &newmap, sizeof(RelMapFile));
 
 	/* Now we can release the lock */
 	LWLockRelease(RelationMappingLock);
@@ -1021,8 +1021,10 @@ relmap_redo(XLogReaderState *record)
 				 xlrec->nbytes);
 		memcpy(&newmap, xlrec->data, sizeof(newmap));
 
-		/* We need to construct the pathname for this database */
-		dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
+		if (xlrec->dbid != InvalidOid)
+			dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
+		else
+			dbpath = pstrdup("global");
 
 		/*
 		 * Write out the new map and send sinval, but of course don't write a
@@ -1030,11 +1032,10 @@ relmap_redo(XLogReaderState *record)
 		 * preserve files, either.
 		 *
 		 * There shouldn't be anyone else updating relmaps during WAL replay,
-		 * but grab the lock to interlock against load_relmap_file().
+		 * but grab the lock to interlock against read_relmap_file().
 		 */
 		LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
-		write_relmap_file((xlrec->dbid == InvalidOid), &newmap,
-						  false, true, false,
+		write_relmap_file(&newmap, false, true, false,
 						  xlrec->dbid, xlrec->tsid, dbpath);
 		LWLockRelease(RelationMappingLock);
 
-- 
1.8.3.1
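
The shape of this refactoring can be shown in isolation: a general loader that takes the destination buffer and a directory path, plus a thin wrapper that keeps the old "shared or local" behaviour. This is only a sketch under stated assumptions. Every name below (`SketchMap`, `SKETCH_MAGIC`, `SKETCH_MAP_FILENAME`, `sketch_load_map`, `sketch_read_map`, the two static maps) is hypothetical, the file layout is invented, and errors are reported with a boolean instead of ereport(); the CRC check and locking of the real code are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical constants for this sketch; not the values used by relmapper.c. */
#define SKETCH_MAGIC		0x1234
#define SKETCH_MAX_MAPPINGS	8
#define SKETCH_MAP_FILENAME	"sketch_filenode.map"

typedef struct SketchMap
{
	int			magic;			/* always SKETCH_MAGIC in a valid file */
	int			num_mappings;	/* number of valid entries */
} SketchMap;

/* Stand-ins for the backend's global maps and current database path. */
static SketchMap sketch_shared_map;
static SketchMap sketch_local_map;
static const char *sketch_database_path = "base/5";

/* Build "<dbpath>/<file name>"; dbpath is "global" for the shared map. */
static void
sketch_map_path(char *out, size_t outlen, const char *dbpath)
{
	snprintf(out, outlen, "%s/%s", dbpath, SKETCH_MAP_FILENAME);
}

/* Sanity checks applied after reading, mirroring the magic/count checks. */
static bool
sketch_map_is_valid(const SketchMap *map)
{
	return map->magic == SKETCH_MAGIC &&
		map->num_mappings >= 0 &&
		map->num_mappings <= SKETCH_MAX_MAPPINGS;
}

/* General loader: reads any map file into caller-supplied memory. */
static bool
sketch_load_map(SketchMap *map, const char *dbpath)
{
	char		path[1024];
	FILE	   *fp;
	size_t		n;

	assert(map != NULL && dbpath != NULL);

	sketch_map_path(path, sizeof(path), dbpath);
	fp = fopen(path, "rb");
	if (fp == NULL)
		return false;
	n = fread(map, sizeof(SketchMap), 1, fp);
	fclose(fp);

	return n == 1 && sketch_map_is_valid(map);
}

/* Thin wrapper: preserves the old interface that targets the global maps. */
static bool
sketch_read_map(bool shared)
{
	if (shared)
		return sketch_load_map(&sketch_shared_map, "global");
	return sketch_load_map(&sketch_local_map, sketch_database_path);
}
```

With this split, code that needs the map of a database it is not connected to can call the general loader with that database's path and its own buffer, while existing callers keep using the wrapper.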

v14-0004-New-interface-to-lock-relation-id.patch (text/x-patch; charset=US-ASCII)

From 873185632684178bda8917080ffb6cb3f2f2ccb5 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 24 Sep 2021 18:29:17 +0530
Subject: [PATCH v14 4/6] New interface to lock relation id

Currently, we have LockRelationOid, which provides a mechanism to
lock a relation by OID, but only if we are connected to the database
to which the relation belongs.  This patch adds a new interface that
can lock a relation even if we are not connected to the containing
database.
---
 src/backend/storage/lmgr/lmgr.c | 28 ++++++++++++++++++++++++++++
 src/include/storage/lmgr.h      |  1 +
 2 files changed, 29 insertions(+)

diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 5ae52dd..1543da6 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -176,6 +176,34 @@ ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode)
 }
 
 /*
+ *		LockRelationId
+ *
+ * Lock, given a LockRelId.  Same as LockRelationOid but takes a LockRelId as
+ * input.
+ */
+void
+LockRelationId(LockRelId *relid, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+	LOCALLOCK  *locallock;
+	LockAcquireResult res;
+
+	SET_LOCKTAG_RELATION(tag, relid->dbId, relid->relId);
+
+	res = LockAcquireExtended(&tag, lockmode, false, false, true, &locallock);
+
+	/*
+	 * Now that we have the lock, check for invalidation messages; see notes
+	 * in LockRelationOid.
+	 */
+	if (res != LOCKACQUIRE_ALREADY_CLEAR)
+	{
+		AcceptInvalidationMessages();
+		MarkLockClear(locallock);
+	}
+}
+
+/*
  *		UnlockRelationId
  *
  * Unlock, given a LockRelId.  This is preferred over UnlockRelationOid
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 49edbcc..be1d2c9 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -38,6 +38,7 @@ extern void RelationInitLockInfo(Relation relation);
 
 /* Lock a relation */
 extern void LockRelationOid(Oid relid, LOCKMODE lockmode);
+extern void LockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern bool ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode);
 extern void UnlockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern void UnlockRelationOid(Oid relid, LOCKMODE lockmode);
-- 
1.8.3.1
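
Why a LockRelId-based interface is needed can be seen from how the lock tag is keyed. The sketch below is a simplified model, not lmgr.c: `SketchLockTag`, `SketchLockRelId`, `sketch_my_database_id` and the three functions are hypothetical names, and the model ignores shared relations. It only shows that a tag built from the connected database's id cannot name pg_class in some other database, whereas a tag built from an explicit (database, relation) pair can.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int Oid;

/* Hypothetical, simplified lock tag: just (database, relation). */
typedef struct SketchLockTag
{
	Oid			dbid;
	Oid			relid;
} SketchLockTag;

/* Hypothetical counterpart of a LockRelId: relation plus its database. */
typedef struct SketchLockRelId
{
	Oid			relId;
	Oid			dbId;
} SketchLockRelId;

/* Stand-in for the id of the database this backend is connected to. */
static Oid	sketch_my_database_id = 5;

/* OID-based tag: the database is implied by the current connection. */
static SketchLockTag
sketch_tag_for_oid(Oid relid)
{
	SketchLockTag tag;

	tag.dbid = sketch_my_database_id;
	tag.relid = relid;
	return tag;
}

/* LockRelId-based tag: the caller names the database explicitly. */
static SketchLockTag
sketch_tag_for_id(const SketchLockRelId *id)
{
	SketchLockTag tag;

	assert(id != NULL);
	tag.dbid = id->dbId;
	tag.relid = id->relId;
	return tag;
}

/* Two tags denote the same lockable object only if both fields match. */
static bool
sketch_tag_equal(SketchLockTag a, SketchLockTag b)
{
	return a.dbid == b.dbid && a.relid == b.relid;
}
```

In this model, locking relation 1259 by OID while connected to database 5 and locking (database 1, relation 1259) by id produce different tags, which is the distinction the new interface relies on.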

v14-0003-Allow-ReadBufferWithoutRelcache-to-support-unlog.patch (text/x-patch; charset=US-ASCII)

From 59cf52ec065445c0151f086b5627e48a08d97166 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Thu, 10 Feb 2022 15:55:33 +0530
Subject: [PATCH v14 3/6] Allow ReadBufferWithoutRelcache to support unlogged
 relpersistence

At present, this function may only be used on permanent relations,
because we only use it during XLOG replay.  As part of the bigger
patch set, we will also use it to read buffers from a database to
which we are not connected, so it needs to support unlogged
relations as well.
---
 src/backend/access/transam/xlogutils.c |  6 +++---
 src/backend/storage/buffer/bufmgr.c    | 18 ++++++++++--------
 src/include/storage/bufmgr.h           |  3 ++-
 3 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 54d5f20..a05bdd0 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -484,7 +484,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	{
 		/* page exists in file */
 		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-										   mode, NULL);
+										   mode, NULL, false);
 	}
 	else
 	{
@@ -509,7 +509,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				ReleaseBuffer(buffer);
 			}
 			buffer = ReadBufferWithoutRelcache(rnode, forknum,
-											   P_NEW, mode, NULL);
+											   P_NEW, mode, NULL, false);
 		}
 		while (BufferGetBlockNumber(buffer) < blkno);
 		/* Handle the corner case that P_NEW returns non-consecutive pages */
@@ -519,7 +519,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			ReleaseBuffer(buffer);
 			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-											   mode, NULL);
+											   mode, NULL, false);
 		}
 	}
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c6..3e4926f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -772,23 +772,25 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
  * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
  *		a relcache entry for the relation.
  *
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay.  If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
+ * The caller should pass 'isunlogged' as true for the unlogged relation and
+ * false for the regular relation.
+ *
+ * NB: At present, this function may only be used on unlogged and regular
+ * relations, which is OK, because we only use it during XLOG replay and while
+ * copying the database.  If in the future we want to use it on temporary
+ * relations, we could pass additional parameters.
  */
 Buffer
 ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
-						  BufferAccessStrategy strategy)
+						  BufferAccessStrategy strategy, bool isunlogged)
 {
 	bool		hit;
 
 	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
 
-	Assert(InRecovery);
-
-	return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
+	return ReadBuffer_common(smgr, isunlogged ? RELPERSISTENCE_UNLOGGED :
+							 RELPERSISTENCE_PERMANENT, forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index dd01841..699d06b 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -184,7 +184,8 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BufferAccessStrategy strategy);
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 										ForkNumber forkNum, BlockNumber blockNum,
-										ReadBufferMode mode, BufferAccessStrategy strategy);
+										ReadBufferMode mode, BufferAccessStrategy strategy,
+										bool isunlogged);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
-- 
1.8.3.1
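
The new boolean parameter boils down to a two-way choice of persistence level, with temporary relations still excluded. A minimal standalone sketch of that mapping follows; the macro and function names are hypothetical, and the character codes are assumed here to mirror the relpersistence codes stored in pg_class.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to mirror the pg_class.relpersistence codes; names are hypothetical. */
#define SKETCH_PERSISTENCE_PERMANENT	'p'
#define SKETCH_PERSISTENCE_UNLOGGED		'u'
#define SKETCH_PERSISTENCE_TEMP			't'

/* Map the caller's flag to the persistence level passed down to the read. */
static char
sketch_persistence_for_read(bool isunlogged)
{
	return isunlogged ? SKETCH_PERSISTENCE_UNLOGGED
		: SKETCH_PERSISTENCE_PERMANENT;
}

/* Inverse direction: which relations a caller may pass at all. */
static bool
sketch_readable_without_relcache(char persistence)
{
	assert(persistence == SKETCH_PERSISTENCE_PERMANENT ||
		   persistence == SKETCH_PERSISTENCE_UNLOGGED ||
		   persistence == SKETCH_PERSISTENCE_TEMP);

	/* Temporary relations are not supported by this interface. */
	return persistence != SKETCH_PERSISTENCE_TEMP;
}
```

A boolean is enough only because exactly two persistence levels are supported; adding temporary relations would need a richer parameter, as the updated comment in bufmgr.c notes.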

v14-0005-WAL-logged-CREATE-DATABASE.patch (text/x-patch; charset=US-ASCII)

From 59cf667fbd889691ad2347616bd5d2b4352bb6d7 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 14 Feb 2022 17:48:03 +0530
Subject: [PATCH v14 5/6] WAL logged CREATE DATABASE

Currently, CREATE DATABASE forces a checkpoint, then copies all the files,
then forces another checkpoint. The comments in the createdb() function
explain the reasons for this. The attached patch fixes this problem by making
CREATE DATABASE completely WAL-logged so that we can avoid the checkpoints.

The old way of creating a database is retained, and a new option selects the
strategy to use.  For the new method the user needs to specify
STRATEGY=WAL_LOG, and for the old method STRATEGY=FILE_COPY.  The default
strategy is WAL_LOG.
---
 doc/src/sgml/ref/create_database.sgml  |  23 +
 src/backend/commands/dbcommands.c      | 737 +++++++++++++++++++++++++++------
 src/backend/storage/buffer/bufmgr.c    | 153 +++++++
 src/bin/psql/tab-complete.c            |   4 +-
 src/include/commands/dbcommands_xlog.h |   7 +
 src/include/storage/bufmgr.h           |   3 +
 src/tools/pgindent/typedefs.list       |   1 +
 7 files changed, 800 insertions(+), 128 deletions(-)

diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index f70d0c7..b0c94e40 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -34,6 +34,7 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
            [ CONNECTION LIMIT [=] <replaceable class="parameter">connlimit</replaceable> ]
            [ IS_TEMPLATE [=] <replaceable class="parameter">istemplate</replaceable> ]
            [ OID [=] <replaceable class="parameter">oid</replaceable> ] ]
+           [ STRATEGY [=] <replaceable class="parameter">strategy</replaceable> ] ]
 </synopsis>
  </refsynopsisdiv>
 
@@ -240,6 +241,28 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
        </listitem>
       </varlistentry>
 
+      <varlistentry>
+       <term><replaceable class="parameter">strategy</replaceable></term>
+       <listitem>
+        <para>
+         The strategy to use when copying the database directory.
+         Currently, two strategies are available:
+         <literal>WAL_LOG</literal> and <literal>FILE_COPY</literal>.
+         With the <literal>WAL_LOG</literal> strategy, the database is
+         copied block by block and each copied block is WAL-logged.
+         With the <literal>FILE_COPY</literal> strategy, the copy is done
+         at the file system level and the individual operations are not
+         WAL-logged.  The default strategy is <literal>WAL_LOG</literal>.
+         The file system level copy requires a checkpoint both before and
+         after the copy; if there are many dirty buffers, these
+         checkpoints can be costly and may affect the performance of the
+         whole system.  On the other hand, WAL-logging each block may
+         make database creation take longer when the source database is
+         large.
+        </para>
+       </listitem>
+      </varlistentry>
+
     </variablelist>
 
   <para>
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index c37e3c9..c97fd14 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -63,13 +63,27 @@
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/pg_locale.h"
+#include "utils/relmapper.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
+/*
+ * Create database strategy.  CREATEDB_WAL_LOG copies the database at the
+ * block level and WAL-logs each copied block, whereas CREATEDB_FILE_COPY
+ * does a file system level copy of the database, so the individual
+ * operations are not WAL-logged.
+ */
+typedef enum CreateDBStrategy
+{
+	CREATEDB_WAL_LOG = 0,
+	CREATEDB_FILE_COPY = 1
+} CreateDBStrategy;
+
 typedef struct
 {
 	Oid			src_dboid;		/* source (template) DB */
 	Oid			dest_dboid;		/* DB we are trying to create */
+	CreateDBStrategy	strategy;	/* create db strategy */
 } createdb_failure_params;
 
 typedef struct
@@ -78,6 +92,20 @@ typedef struct
 	Oid			dest_tsoid;		/* tablespace we are trying to move to */
 } movedb_failure_params;
 
+/*
+ * When creating a database, we scan the pg_class of the source database to
+ * identify all the relations to be copied.  The structure is used for storing
+ * information about each relation of the source database.
+ */
+typedef struct CreateDBRelInfo
+{
+	RelFileNode		rnode;				/* physical relation identifier */
+	Oid				reloid;				/* relation oid */
+	bool			isunlogged;			/* is persistence level unlogged ?
+										   otherwise, permanent. */
+} CreateDBRelInfo;
+
+
 /* non-export function prototypes */
 static void createdb_failure_callback(int code, Datum arg);
 static void movedb(const char *dbname, const char *tblspcname);
@@ -92,7 +120,506 @@ static bool have_createdb_privilege(void);
 static void remove_dbtablespaces(Oid db_id);
 static bool check_db_file_conflict(Oid db_id);
 static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
+static void CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid,
+									bool isRedo);
+static CreateDBRelInfo *ScanSourceDatabasePgClassTuple(HeapTupleData *tuple,
+													   Oid tbid, Oid dbid,
+													   char *srcpath);
+static List *ScanSourceDatabasePgClassPage(Page page, Buffer buf, Oid tbid,
+										   Oid dbid, char *srcpath,
+										   List *rnodelist, Snapshot snapshot);
+static List *ScanSourceDatabasePgClass(Oid srctbid, Oid srcdbid, char *srcpath);
+static void CreateDatabaseUsingWalLog(Oid src_dboid, Oid dboid, Oid src_tsid,
+									  Oid dst_tsid);
+static void CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dboid, Oid src_tsid,
+										Oid dst_tsid);
+
+/*
+ * Create database directory and write out the PG_VERSION file in the database
+ * path.  If isRedo is true, it's okay for the database directory to exist
+ * already.  We can directly write PG_MAJORVERSION in the version file instead
+ * of copying from the source database file because these two must be the same.
+ */
+static void
+CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
+{
+	int		fd;
+	int		nbytes;
+	char	versionfile[MAXPGPATH];
+	char	buf[16];
+
+	/* Prepare version data before starting a critical section. */
+	sprintf(buf, "%s\n", PG_MAJORVERSION);
+	nbytes = strlen(PG_MAJORVERSION) + 1;
+
+	/* If we are not in WAL replay then write the WAL. */
+	if (!isRedo)
+	{
+		xl_dbase_create_rec xlrec;
+		XLogRecPtr	lsn;
+
+		START_CRIT_SECTION();
+
+		xlrec.db_id = dbid;
+		xlrec.tablespace_id = tsid;
+		xlrec.src_db_id = InvalidOid;
+		xlrec.src_tablespace_id = InvalidOid;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), sizeof(xl_dbase_create_rec));
+
+		lsn = XLogInsert(RM_DBASE_ID, XLOG_DBASE_CREATE);
+
+		/* As always, WAL must hit the disk before the data update does. */
+		XLogFlush(lsn);
+	}
+
+	/* Create database directory. */
+	if (MakePGDirectory(dbpath) < 0)
+	{
+		/* Failure other than already exists or not in WAL replay? */
+		if (errno != EEXIST || !isRedo)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create directory \"%s\": %m", dbpath)));
+	}
+
+	/*
+	 * Create PG_VERSION file in the database path.  If the file already exists
+	 * and we are in WAL replay then try again to open it in write mode.
+	 */
+	snprintf(versionfile, sizeof(versionfile), "%s/%s", dbpath, "PG_VERSION");
+
+	fd = OpenTransientFile(versionfile, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	if (fd < 0 && errno == EEXIST && isRedo)
+		fd = OpenTransientFile(versionfile, O_WRONLY | O_TRUNC | PG_BINARY);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m", versionfile)));
+
+	/* Write PG_MAJORVERSION in the PG_VERSION file. */
+	pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_WRITE);
+	errno = 0;
+	if ((int) write(fd, buf, nbytes) != nbytes)
+	{
+		/* If write didn't set errno, assume problem is no disk space. */
+		if (errno == 0)
+			errno = ENOSPC;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", versionfile)));
+	}
+	pgstat_report_wait_end();
+
+	/* Close the version file. */
+	CloseTransientFile(fd);
+
+	/* Critical section done. */
+	if (!isRedo)
+		END_CRIT_SECTION();
+}
+
+/*
+ * Helper function for ScanSourceDatabasePgClassPage to prepare a single
+ * CreateDBRelInfo element from the input pg_class tuple.
+ */
+static CreateDBRelInfo *
+ScanSourceDatabasePgClassTuple(HeapTupleData *tuple, Oid tbid, Oid dbid,
+							   char *srcpath)
+{
+	CreateDBRelInfo	   *relinfo;
+	Form_pg_class		classForm;
+	Oid					relfilenode = InvalidOid;
+
+	classForm = (Form_pg_class) GETSTRUCT(tuple);
+
+	/*
+	 * Nothing to do for a shared object, an object without storage, or a
+	 * temp relation, so just return.
+	 */
+	if (classForm->reltablespace == GLOBALTABLESPACE_OID ||
+		!RELKIND_HAS_STORAGE(classForm->relkind) ||
+		classForm->relpersistence == RELPERSISTENCE_TEMP)
+		return NULL;
+
+	/*
+	 * If relfilenode is valid then directly use it.  Otherwise,
+	 * consult the relmapper for the mapped relation.
+	 */
+	if (OidIsValid(classForm->relfilenode))
+		relfilenode = classForm->relfilenode;
+	else
+		relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+										classForm->oid);
+
+	/* We must have a valid relfilenode oid. */
+	Assert(OidIsValid(relfilenode));
+
+	/* Prepare a rel info element; the caller adds it to the list. */
+	relinfo = (CreateDBRelInfo *) palloc(sizeof(CreateDBRelInfo));
+	if (OidIsValid(classForm->reltablespace))
+		relinfo->rnode.spcNode = classForm->reltablespace;
+	else
+		relinfo->rnode.spcNode = tbid;
+
+	relinfo->rnode.dbNode = dbid;
+	relinfo->rnode.relNode = relfilenode;
+	relinfo->reloid = classForm->oid;
+
+	/* We should never reach here for the temp relations. */
+	Assert(classForm->relpersistence != RELPERSISTENCE_TEMP);
+	relinfo->isunlogged =
+		(classForm->relpersistence == RELPERSISTENCE_UNLOGGED) ? true : false;
+
+	return relinfo;
+}
+
+/*
+ * Helper function for ScanSourceDatabasePgClass to identify all the valid
+ * relfilenodes for the given page.
+ */
+static List *
+ScanSourceDatabasePgClassPage(Page page, Buffer buf, Oid tbid, Oid dbid,
+							  char *srcpath, List *rnodelist,
+							  Snapshot snapshot)
+{
+	BlockNumber		blkno = BufferGetBlockNumber(buf);
+	OffsetNumber	offnum;
+	OffsetNumber	maxoff;
+	HeapTupleData	tuple;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Iterate over each tuple of the page. */
+	for (offnum = FirstOffsetNumber;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid;
+
+		itemid = PageGetItemId(page, offnum);
+
+		/* Nothing to do if slot is empty or already dead. */
+		if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid) ||
+			ItemIdIsRedirected(itemid))
+			continue;
+
+		Assert(ItemIdIsNormal(itemid));
+		ItemPointerSet(&(tuple.t_self), blkno, offnum);
+
+		/* Initialize a HeapTupleData structure. */
+		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+		tuple.t_len = ItemIdGetLength(itemid);
+		tuple.t_tableOid = RelationRelationId;
+
+		/*
+		 * If the pg_class tuple is visible then prepare a CreateDBRelInfo and
+		 * append it to the list.
+		 */
+		if (HeapTupleSatisfiesVisibility(&tuple, snapshot, buf))
+		{
+			CreateDBRelInfo	   *relinfo;
+
+			relinfo = ScanSourceDatabasePgClassTuple(&tuple, tbid, dbid,
+													 srcpath);
+
+			/* Add it to the list. */
+			if (relinfo != NULL)
+				rnodelist = lappend(rnodelist, relinfo);
+		}
+	}
+
+	return rnodelist;
+}
+
+/*
+ * Identify all the valid relfilenodes from the source database so that we can
+ * copy them to the destination database.  In order to identify that, this
+ * function will iterate over each block of the pg_class relation of the source
+ * database.  From there, we will check all the visible tuples in order to get
+ * a list of all the valid relfilenodes in the source database.
+ */
+static List *
+ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
+{
+	SMgrRelation	rd_smgr;
+	RelFileNode		rnode;
+	BlockNumber		nblocks;
+	BlockNumber		blkno;
+	Buffer			buf;
+	Oid				relfilenode;
+	Page			page;
+	List		   *rnodelist = NIL;
+	LockRelId		relid;
+	Snapshot		snapshot;
+	BufferAccessStrategy bstrategy;
+
+	/* Get pg_class relfilenode. */
+	relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+													  RelationRelationId);
+	/*
+	 * We are going to read the buffers associated with the pg_class relation.
+	 * Thus, acquire the relation level lock before starting the scan.  As
+	 * we are not connected to the database, we cannot use relation_open
+	 * directly, so we have to lock using the relation id.
+	 */
+	relid.dbId = dbid;
+	relid.relId = RelationRelationId;
+	LockRelationId(&relid, AccessShareLock);
+
+	/* Prepare a relnode for pg_class relation. */
+	rnode.spcNode = tbid;
+	rnode.dbNode = dbid;
+	rnode.relNode = relfilenode;
+
+	/*
+	 * We are not connected to the source database so open the pg_class
+	 * relation at the smgr level and get the block count.
+	 */
+	rd_smgr = smgropen(rnode, InvalidBackendId);
+	nblocks = smgrnblocks(rd_smgr, MAIN_FORKNUM);
+
+	/*
+	 * We're going to read the whole of pg_class, so use the bulk-read buffer
+	 * access strategy.
+	 */
+	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	/* Get latest snapshot for scanning the pg_class. */
+	snapshot = GetLatestSnapshot();
+
+	/* Iterate over each block of the pg_class relation. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		/*
+		 * We are not connected to the source database so directly use the
+		 * lower level bufmgr interface which operates on the rnode.
+		 */
+		buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
+										RBM_NORMAL, bstrategy, false);
+
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buf);
+		if (PageIsNew(page) || PageIsEmpty(page))
+		{
+			UnlockReleaseBuffer(buf);
+			continue;
+		}
+
+		/*
+		 * Process pg_class tuples for the current page and add all the valid
+		 * relfilenode entries to the rnodelist.
+		 */
+		rnodelist = ScanSourceDatabasePgClassPage(page, buf, tbid, dbid,
+												  srcpath, rnodelist,
+												  snapshot);
+
+		/* Release the buffer lock. */
+		UnlockReleaseBuffer(buf);
+	}
+
+	/* Release the lock. */
+	UnlockRelationId(&relid, AccessShareLock);
+
+	return rnodelist;
+}
+
+/*
+ * Copy source database to the target using WAL.  Create target database
+ * directory and copy data files from the source database to the target
+ * database, block by block and WAL log all the operations.
+ */
+static void
+CreateDatabaseUsingWalLog(Oid src_dboid, Oid dst_dboid, Oid src_tsid, Oid dst_tsid)
+{
+	char	   *srcpath;
+	char	   *dstpath;
+	List	   *rnodelist = NIL;
+	ListCell   *cell;
+	LockRelId	relid;
+	RelFileNode	srcrnode;
+	RelFileNode	dstrnode;
+	CreateDBRelInfo	*relinfo;
+
+	/* Get the source database path. */
+	srcpath = GetDatabasePath(src_dboid, src_tsid);
+
+	/* Get the destination database path. */
+	dstpath = GetDatabasePath(dst_dboid, dst_tsid);
+
+	/* Create database directory and write PG_VERSION file. */
+	CreateDirAndVersionFile(dstpath, dst_dboid, dst_tsid, false);
+
+	/* Copy relmap file from source database to the destination database. */
+	RelationMapCopy(dst_dboid, dst_tsid, srcpath, dstpath);
+
+	/* Get list of all valid relnode from the source database. */
+	rnodelist = ScanSourceDatabasePgClass(src_tsid, src_dboid, srcpath);
+	Assert(rnodelist != NIL);
+
+	/*
+	 * The database id is the same for all relations, so set it before entering
+	 * the loop.
+	 */
+	relid.dbId = src_dboid;
+
+	/*
+	 * Iterate over each relfilenode and copy the relation data block by block
+	 * from source database to the destination database.
+	 */
+	foreach(cell, rnodelist)
+	{
+		relinfo = lfirst(cell);
+		srcrnode = relinfo->rnode;
+
+		/*
+		 * If the relation is from the source db's default tablespace then we
+		 * need to create it in the destination db's default tablespace.
+		 * Otherwise, we need to create it in the same tablespace as it is in
+		 * the source database.
+		 */
+		if (srcrnode.spcNode == src_tsid)
+			dstrnode.spcNode = dst_tsid;
+		else
+			dstrnode.spcNode = srcrnode.spcNode;
+
+		dstrnode.dbNode = dst_dboid;
+		dstrnode.relNode = srcrnode.relNode;
+
+		/* Acquire a lock on the relation before we start copying. */
+		relid.relId = relinfo->reloid;
+		LockRelationId(&relid, AccessShareLock);
+
+		/* Copy relation storage from source to the destination. */
+		CreateAndCopyRelationData(srcrnode, dstrnode, relinfo->isunlogged);
+
+		/* Release the lock. */
+		UnlockRelationId(&relid, AccessShareLock);
+	}
+
+	list_free_deep(rnodelist);
+}
+
+/*
+ * Copy source database directory to the destination directory using file
+ * system level copy operation.
+ */
+static void
+CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dst_dboid, Oid src_tsid,
+							Oid dst_tsid)
+{
+	TableScanDesc	scan;
+	Relation		rel;
+	HeapTuple		tuple;
+
+	/*
+	 * Force a checkpoint before starting the copy. This will force all
+	 * dirty buffers, including those of unlogged tables, out to disk, to
+	 * ensure source database is up-to-date on disk for the copy.
+	 * FlushDatabaseBuffers() would suffice for that, but we also want to
+	 * process any pending unlink requests. Otherwise, if a checkpoint
+	 * happened while we're copying files, a file might be deleted just
+	 * when we're about to copy it, causing the lstat() call in copydir()
+	 * to fail with ENOENT.
+	 */
+	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
+					  CHECKPOINT_WAIT | CHECKPOINT_FLUSH_ALL);
+
+	/*
+	 * Iterate through all tablespaces of the template database, and copy
+	 * each one to the new database.
+	 */
+	rel = table_open(TableSpaceRelationId, AccessShareLock);
+	scan = table_beginscan_catalog(rel, 0, NULL);
+	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	{
+		Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
+		Oid			srctablespace = spaceform->oid;
+		Oid			dsttablespace;
+		char	   *srcpath;
+		char	   *dstpath;
+		struct stat st;
+
+		/* No need to copy global tablespace */
+		if (srctablespace == GLOBALTABLESPACE_OID)
+			continue;
+
+		srcpath = GetDatabasePath(src_dboid, srctablespace);
+
+		if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
+			directory_is_empty(srcpath))
+		{
+			/* Assume we can ignore it */
+			pfree(srcpath);
+			continue;
+		}
+
+		if (srctablespace == src_tsid)
+			dsttablespace = dst_tsid;
+		else
+			dsttablespace = srctablespace;
+
+		dstpath = GetDatabasePath(dst_dboid, dsttablespace);
+
+		/*
+		 * Copy this subdirectory to the new location
+		 *
+		 * We don't need to copy subdirectories
+		 */
+		copydir(srcpath, dstpath, false);
+
+		/* Record the filesystem change in XLOG */
+		{
+			xl_dbase_create_rec xlrec;
+
+			xlrec.db_id = dst_dboid;
+			xlrec.tablespace_id = dsttablespace;
+			xlrec.src_db_id = src_dboid;
+			xlrec.src_tablespace_id = srctablespace;
 
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
+
+			(void) XLogInsert(RM_DBASE_ID,
+							  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
+		}
+	}
+	table_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	/*
+	 * We force a checkpoint before committing.  This effectively means
+	 * that committed XLOG_DBASE_CREATE operations will never need to be
+	 * replayed (at least not in ordinary crash recovery; we still have to
+	 * make the XLOG entry for the benefit of PITR operations). This
+	 * avoids two nasty scenarios:
+	 *
+	 * #1: When PITR is off, we don't XLOG the contents of newly created
+	 * indexes; therefore the drop-and-recreate-whole-directory behavior
+	 * of DBASE_CREATE replay would lose such indexes.
+	 *
+	 * #2: Since we have to recopy the source database during DBASE_CREATE
+	 * replay, we run the risk of copying changes in it that were
+	 * committed after the original CREATE DATABASE command but before the
+	 * system crash that led to the replay.  This is at least unexpected
+	 * and at worst could lead to inconsistencies, eg duplicate table
+	 * names.
+	 *
+	 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
+	 *
+	 * In PITR replay, the first of these isn't an issue, and the second
+	 * is only a risk if the CREATE DATABASE and subsequent template
+	 * database change both occur while a base backup is being taken.
+	 * There doesn't seem to be much we can do about that except document
+	 * it as a limitation.
+	 *
+	 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
+	 * we can avoid this.
+	 */
+	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+}
 
 /*
  * CREATE DATABASE
@@ -100,8 +627,6 @@ static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
 Oid
 createdb(ParseState *pstate, const CreatedbStmt *stmt)
 {
-	TableScanDesc scan;
-	Relation	rel;
 	Oid			src_dboid;
 	Oid			src_owner;
 	int			src_encoding = -1;
@@ -132,6 +657,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	DefElem    *dallowconnections = NULL;
 	DefElem    *dconnlimit = NULL;
 	DefElem    *dcollversion = NULL;
+	DefElem    *dstrategy = NULL;
 	char	   *dbname = stmt->dbname;
 	char	   *dbowner = NULL;
 	const char *dbtemplate = NULL;
@@ -145,6 +671,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	char	   *dbcollversion = NULL;
 	int			notherbackends;
 	int			npreparedxacts;
+	CreateDBStrategy	dbstrategy = CREATEDB_WAL_LOG;
 	createdb_failure_params fparms;
 
 	/* Extract options from the statement node tree */
@@ -250,6 +777,12 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 						(errcode(ERRCODE_INVALID_PARAMETER_VALUE)),
 						errmsg("OIDs less than %u are reserved for system objects", FirstNormalObjectId));
 		}
+		else if (strcmp(defel->defname, "strategy") == 0)
+		{
+			if (dstrategy)
+				errorConflictingDefElem(defel, pstate);
+			dstrategy = defel;
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -374,6 +907,23 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 							dbtemplate)));
 	}
 
+	/* Validate the database creation strategy. */
+	if (dstrategy && dstrategy->arg)
+	{
+		char	*strategy;
+
+		strategy = defGetString(dstrategy);
+		if (strcmp(strategy, "wal_log") == 0)
+			dbstrategy = CREATEDB_WAL_LOG;
+		else if (strcmp(strategy, "file_copy") == 0)
+			dbstrategy = CREATEDB_FILE_COPY;
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("invalid create database strategy %s", strategy),
+					 errhint("Valid strategies are \"wal_log\" and \"file_copy\".")));
+	}
+
 	/* If encoding or locales are defaulted, use source's setting */
 	if (encoding < 0)
 		encoding = src_encoding;
@@ -668,19 +1218,6 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	InvokeObjectPostCreateHook(DatabaseRelationId, dboid, 0);
 
 	/*
-	 * Force a checkpoint before starting the copy. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, if a checkpoint
-	 * happened while we're copying files, a file might be deleted just when
-	 * we're about to copy it, causing the lstat() call in copydir() to fail
-	 * with ENOENT.
-	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
-
-	/*
 	 * Once we start copying subdirectories, we need to be able to clean 'em
 	 * up if we fail.  Use an ENSURE block to make sure this happens.  (This
 	 * is not a 100% solution, because of the possibility of failure during
@@ -689,101 +1226,24 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	 */
 	fparms.src_dboid = src_dboid;
 	fparms.dest_dboid = dboid;
+	fparms.strategy = dbstrategy;
+
 	PG_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
 		/*
-		 * Iterate through all tablespaces of the template database, and copy
-		 * each one to the new database.
+		 * If the user has asked to create a database with WAL_LOG strategy
+		 * then call CreateDatabaseUsingWalLog, which will copy the database at the
+		 * block level and it will WAL log each copied block.  Otherwise,
+		 * call CreateDatabaseUsingFileCopy that will copy the database file by
+		 * file.
 		 */
-		rel = table_open(TableSpaceRelationId, AccessShareLock);
-		scan = table_beginscan_catalog(rel, 0, NULL);
-		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
-		{
-			Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
-			Oid			srctablespace = spaceform->oid;
-			Oid			dsttablespace;
-			char	   *srcpath;
-			char	   *dstpath;
-			struct stat st;
-
-			/* No need to copy global tablespace */
-			if (srctablespace == GLOBALTABLESPACE_OID)
-				continue;
-
-			srcpath = GetDatabasePath(src_dboid, srctablespace);
-
-			if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
-				directory_is_empty(srcpath))
-			{
-				/* Assume we can ignore it */
-				pfree(srcpath);
-				continue;
-			}
-
-			if (srctablespace == src_deftablespace)
-				dsttablespace = dst_deftablespace;
-			else
-				dsttablespace = srctablespace;
-
-			dstpath = GetDatabasePath(dboid, dsttablespace);
-
-			/*
-			 * Copy this subdirectory to the new location
-			 *
-			 * We don't need to copy subdirectories
-			 */
-			copydir(srcpath, dstpath, false);
-
-			/* Record the filesystem change in XLOG */
-			{
-				xl_dbase_create_rec xlrec;
-
-				xlrec.db_id = dboid;
-				xlrec.tablespace_id = dsttablespace;
-				xlrec.src_db_id = src_dboid;
-				xlrec.src_tablespace_id = srctablespace;
-
-				XLogBeginInsert();
-				XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-				(void) XLogInsert(RM_DBASE_ID,
-								  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-			}
-		}
-		table_endscan(scan);
-		table_close(rel, AccessShareLock);
-
-		/*
-		 * We force a checkpoint before committing.  This effectively means
-		 * that committed XLOG_DBASE_CREATE operations will never need to be
-		 * replayed (at least not in ordinary crash recovery; we still have to
-		 * make the XLOG entry for the benefit of PITR operations). This
-		 * avoids two nasty scenarios:
-		 *
-		 * #1: When PITR is off, we don't XLOG the contents of newly created
-		 * indexes; therefore the drop-and-recreate-whole-directory behavior
-		 * of DBASE_CREATE replay would lose such indexes.
-		 *
-		 * #2: Since we have to recopy the source database during DBASE_CREATE
-		 * replay, we run the risk of copying changes in it that were
-		 * committed after the original CREATE DATABASE command but before the
-		 * system crash that led to the replay.  This is at least unexpected
-		 * and at worst could lead to inconsistencies, eg duplicate table
-		 * names.
-		 *
-		 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
-		 *
-		 * In PITR replay, the first of these isn't an issue, and the second
-		 * is only a risk if the CREATE DATABASE and subsequent template
-		 * database change both occur while a base backup is being taken.
-		 * There doesn't seem to be much we can do about that except document
-		 * it as a limitation.
-		 *
-		 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
-		 * we can avoid this.
-		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+		if (dbstrategy == CREATEDB_WAL_LOG)
+			CreateDatabaseUsingWalLog(src_dboid, dboid, src_deftablespace,
+									  dst_deftablespace);
+		else
+			CreateDatabaseUsingFileCopy(src_dboid, dboid, src_deftablespace,
+										dst_deftablespace);
 
 		/*
 		 * Close pg_database, but keep lock till commit.
@@ -870,6 +1330,21 @@ createdb_failure_callback(int code, Datum arg)
 	createdb_failure_params *fparms = (createdb_failure_params *) DatumGetPointer(arg);
 
 	/*
+	 * If we were copying the database at the block level then drop pages for
+	 * the destination database that are in the shared buffer cache, and tell
+	 * the checkpointer to forget any pending fsync and unlink requests for
+	 * files in the database.  The reasoning is the same as explained in
+	 * dropdb().  But unlike dropdb() we don't need to call
+	 * pgstat_drop_database, because this database has not been created yet
+	 * so there should not be any stats for it.
+	 */
+	if (fparms->strategy == CREATEDB_WAL_LOG)
+	{
+		DropDatabaseBuffers(fparms->dest_dboid);
+		ForgetDatabaseSyncRequests(fparms->dest_dboid);
+	}
+
+	/*
 	 * Release lock on source database before doing recursive remove. This is
 	 * not essential but it seems desirable to release the lock as soon as
 	 * possible.
@@ -2387,32 +2862,40 @@ dbase_redo(XLogReaderState *record)
 		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
 
-		/*
-		 * Our theory for replaying a CREATE is to forcibly drop the target
-		 * subdirectory if present, then re-copy the source data. This may be
-		 * more work than needed, but it is simple to implement.
-		 */
-		if (stat(dst_path, &st) == 0 && S_ISDIR(st.st_mode))
+		if (!OidIsValid(xlrec->src_db_id))
 		{
-			if (!rmtree(dst_path, true))
-				/* If this failed, copydir() below is going to error. */
-				ereport(WARNING,
-						(errmsg("some useless files may be left behind in old database directory \"%s\"",
-								dst_path)));
+			CreateDirAndVersionFile(dst_path, xlrec->db_id, xlrec->tablespace_id,
+									true);
 		}
+		else
+		{
+			/*
+			 * Our theory for replaying a CREATE is to forcibly drop the target
+			 * subdirectory if present, then re-copy the source data. This may
+			 * be more work than needed, but it is simple to implement.
+			 */
+			if (stat(dst_path, &st) == 0 && S_ISDIR(st.st_mode))
+			{
+				if (!rmtree(dst_path, true))
+					/* If this failed, copydir() below is going to error. */
+					ereport(WARNING,
+							(errmsg("some useless files may be left behind in old database directory \"%s\"",
+									dst_path)));
+			}
 
-		/*
-		 * Force dirty buffers out to disk, to ensure source database is
-		 * up-to-date for the copy.
-		 */
-		FlushDatabaseBuffers(xlrec->src_db_id);
+			/*
+			 * Force dirty buffers out to disk, to ensure source database is
+			 * up-to-date for the copy.
+			 */
+			FlushDatabaseBuffers(xlrec->src_db_id);
 
-		/*
-		 * Copy this subdirectory to the new location
-		 *
-		 * We don't need to copy subdirectories
-		 */
-		copydir(src_path, dst_path, false);
+			/*
+			 * Copy this subdirectory to the new location
+			 *
+			 * We don't need to copy subdirectories
+			 */
+			copydir(src_path, dst_path, false);
+		}
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3e4926f..91aca23 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -38,6 +38,7 @@
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
 #include "executor/instrument.h"
 #include "lib/binaryheap.h"
 #include "miscadmin.h"
@@ -486,6 +487,9 @@ static void FindAndDropRelFileNodeBuffers(RelFileNode rnode,
 										  ForkNumber forkNum,
 										  BlockNumber nForkBlock,
 										  BlockNumber firstDelBlock);
+static void RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+										   ForkNumber forkNum,
+										   bool isunlogged);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int	rnode_comparator(const void *p1, const void *p2);
@@ -3679,6 +3683,155 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
 }
 
 /* ---------------------------------------------------------------------
+ *		RelationCopyStorageUsingBuffer
+ *
+ *		Copy fork's data using bufmgr.  Same as RelationCopyStorage but instead
+ *		of using smgrread and smgrextend this will copy using bufmgr APIs.
+ *
+ *		Refer to the comments atop CreateAndCopyRelationData() for details
+ *		about the isunlogged parameter.
+ * --------------------------------------------------------------------
+ */
+static void
+RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+							   ForkNumber forkNum, bool isunlogged)
+{
+	Buffer		srcBuf;
+	Buffer		dstBuf;
+	Page		srcPage;
+	Page		dstPage;
+	bool		use_wal;
+	bool		copying_initfork;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	BufferAccessStrategy bstrategy_src;
+	BufferAccessStrategy bstrategy_dst;
+
+	/* Refer to the comments in RelationCopyStorage. */
+	copying_initfork = isunlogged && (forkNum == INIT_FORKNUM);
+	use_wal = XLogIsNeeded() && (!isunlogged || copying_initfork);
+
+	/* Get number of blocks in the source relation. */
+	nblocks = smgrnblocks(src, forkNum);
+
+	/* Nothing to copy so directly exit. */
+	if (nblocks == 0)
+		return;
+
+	/*
+	 * We are going to copy whole relation from the source to the destination
+	 * so use BAS_BULKREAD strategy for the source relation and BAS_BULKWRITE
+	 * strategy for the destination relation.
+	 */
+	bstrategy_src = GetAccessStrategy(BAS_BULKREAD);
+	bstrategy_dst = GetAccessStrategy(BAS_BULKWRITE);
+
+	/* Iterate over each block of the source relation file. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		/* If we got a cancel signal during the copy of the data, quit */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Read block from source relation. */
+		srcBuf = ReadBufferWithoutRelcache(src->smgr_rnode.node, forkNum,
+										   blkno, RBM_NORMAL, bstrategy_src,
+										   isunlogged);
+		srcPage = BufferGetPage(srcBuf);
+		if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
+		{
+			ReleaseBuffer(srcBuf);
+			continue;
+		}
+
+		/* Use P_NEW to extend the relation. */
+		dstBuf = ReadBufferWithoutRelcache(dst->smgr_rnode.node, forkNum,
+										   P_NEW, RBM_NORMAL, bstrategy_dst,
+										   isunlogged);
+		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
+
+		START_CRIT_SECTION();
+
+		/* Initialize the page and write the data. */
+		dstPage = BufferGetPage(dstBuf);
+		PageInit(dstPage, BufferGetPageSize(dstBuf), 0);
+		memcpy(dstPage, srcPage, BLCKSZ);
+		MarkBufferDirty(dstBuf);
+
+		/* WAL-log the copied page. */
+		if (use_wal)
+			log_newpage_buffer(dstBuf, true);
+
+		END_CRIT_SECTION();
+
+		UnlockReleaseBuffer(dstBuf);
+		ReleaseBuffer(srcBuf);
+	}
+}
+
+/* ---------------------------------------------------------------------
+ *		CreateAndCopyRelationData
+ *
+ *		Create destination relation storage and copy all forks of the source
+ *		relation to the destination.
+ *
+ *		Currently this API is not supported for temporary relations.  So
+ *		pass isunlogged as true for an unlogged relation and false for a
+ *		regular relation.
+ * --------------------------------------------------------------------
+ */
+void
+CreateAndCopyRelationData(RelFileNode src_rnode, RelFileNode dst_rnode,
+						  bool isunlogged)
+{
+	SMgrRelation	src_smgr;
+	SMgrRelation	dst_smgr;
+	char			relpersistence;
+
+	relpersistence =
+			isunlogged ? RELPERSISTENCE_UNLOGGED : RELPERSISTENCE_PERMANENT;
+
+	/* Open the source relation at smgr level. */
+	src_smgr = smgropen(src_rnode, InvalidBackendId);
+
+	/*
+	 * Create and copy all forks of the relation.
+	 *
+	 * NOTE: any conflict in relfilenode value will be caught in
+	 * RelationCreateStorage().
+	 */
+	dst_smgr = RelationCreateStorage(dst_rnode, relpersistence);
+
+	/* copy main fork */
+	RelationCopyStorageUsingBuffer(src_smgr, dst_smgr, MAIN_FORKNUM,
+								   isunlogged);
+
+	/* copy those extra forks that exist */
+	for (ForkNumber forkNum = MAIN_FORKNUM + 1;
+		 forkNum <= MAX_FORKNUM; forkNum++)
+	{
+		if (smgrexists(src_smgr, forkNum))
+		{
+			smgrcreate(dst_smgr, forkNum, false);
+
+			/*
+			 * WAL log creation if the relation is persistent, or this is the
+			 * init fork of an unlogged relation.
+			 */
+			if (!isunlogged || forkNum == INIT_FORKNUM)
+				log_smgrcreate(&dst_rnode, forkNum);
+
+			/* Copy a fork's data, block by block. */
+			RelationCopyStorageUsingBuffer(src_smgr, dst_smgr, forkNum,
+										   isunlogged);
+		}
+	}
+
+	/* Close the smgr rel */
+	smgrclose(src_smgr);
+	smgrclose(dst_smgr);
+}
+
+/* ---------------------------------------------------------------------
  *		FlushDatabaseBuffers
  *
  *		This function writes all dirty pages of a database out to disk
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 1717282..d0e3755 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2738,10 +2738,12 @@ psql_completion(const char *text, int start, int end)
 		COMPLETE_WITH("OWNER", "TEMPLATE", "ENCODING", "TABLESPACE",
 					  "IS_TEMPLATE",
 					  "ALLOW_CONNECTIONS", "CONNECTION LIMIT",
-					  "LC_COLLATE", "LC_CTYPE", "LOCALE", "OID");
+					  "LC_COLLATE", "LC_CTYPE", "LOCALE", "OID", "STRATEGY");
 
 	else if (Matches("CREATE", "DATABASE", MatchAny, "TEMPLATE"))
 		COMPLETE_WITH_QUERY(Query_for_list_of_template_databases);
+	else if (Matches("CREATE", "DATABASE", MatchAny, "STRATEGY"))
+		COMPLETE_WITH("WAL_LOG", "FILE_COPY");
 
 	/* CREATE DOMAIN */
 	else if (Matches("CREATE", "DOMAIN", MatchAny))
diff --git a/src/include/commands/dbcommands_xlog.h b/src/include/commands/dbcommands_xlog.h
index 593a857..42f1d65 100644
--- a/src/include/commands/dbcommands_xlog.h
+++ b/src/include/commands/dbcommands_xlog.h
@@ -21,6 +21,13 @@
 #define XLOG_DBASE_CREATE		0x00
 #define XLOG_DBASE_DROP			0x10
 
+/*
+ * This record is used both for copying the database at the file system level
+ * and for the WAL-logged strategy.  With the WAL-logged strategy it is only
+ * used for creating the destination database directory; all other data is
+ * copied by individual WAL records, so in that case we don't need to store
+ * src_db_id and src_tablespace_id.
+ */
 typedef struct xl_dbase_create_rec
 {
 	/* Records copying of a single subdirectory incl. contents */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 699d06b..1aa3c9f 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -204,6 +204,9 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
 extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels);
+extern void CreateAndCopyRelationData(RelFileNode src_rnode,
+									  RelFileNode dst_rnode,
+									  bool isunlogged);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
 								   int nforks, BlockNumber *firstDelBlock);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index eaf3e7a..8d92c37 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -460,6 +460,7 @@ CoverPos
 CreateAmStmt
 CreateCastStmt
 CreateConversionStmt
+CreateDBRelInfo
 CreateDomainStmt
 CreateEnumStmt
 CreateEventTrigStmt
-- 
1.8.3.1

v14-0006-Support-create-database-strategy-in-createdb-too.patch (text/x-patch; charset=US-ASCII)

From 91bcffcfee678c0ce7114d201de323f7f81ee79b Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 11 Mar 2022 11:48:55 +0530
Subject: [PATCH v14 6/6] Support create database strategy in createdb tool

---
 doc/src/sgml/ref/createdb.sgml    | 16 ++++++++++++++++
 src/bin/scripts/createdb.c        | 10 +++++++++-
 src/bin/scripts/t/020_createdb.pl | 20 ++++++++++++++++++++
 3 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/createdb.sgml b/doc/src/sgml/ref/createdb.sgml
index 8647345..2a7beca 100644
--- a/doc/src/sgml/ref/createdb.sgml
+++ b/doc/src/sgml/ref/createdb.sgml
@@ -159,6 +159,22 @@ PostgreSQL documentation
      </varlistentry>
 
      <varlistentry>
+      <term><option>-S <replaceable class="parameter">strategy</replaceable></option></term>
+      <term><option>--strategy=<replaceable class="parameter">strategy</replaceable></option></term>
+      <listitem>
+       <para>
+        Specifies the database creation strategy.  There are currently two
+        strategies: <literal>WAL_LOG</literal> and
+        <literal>FILE_COPY</literal>.  With <literal>WAL_LOG</literal>, the
+        database is copied block by block and each copied block is
+        WAL-logged.  With <literal>FILE_COPY</literal>, a file-system-level
+        copy is performed, so individual blocks are not WAL-logged.  The
+        default strategy is <literal>WAL_LOG</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><option>-T <replaceable class="parameter">template</replaceable></option></term>
       <term><option>--template=<replaceable class="parameter">template</replaceable></option></term>
       <listitem>
diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c
index b0c6805..9d3c4ef 100644
--- a/src/bin/scripts/createdb.c
+++ b/src/bin/scripts/createdb.c
@@ -37,6 +37,7 @@ main(int argc, char *argv[])
 		{"lc-collate", required_argument, NULL, 1},
 		{"lc-ctype", required_argument, NULL, 2},
 		{"locale", required_argument, NULL, 'l'},
+		{"strategy", required_argument, NULL, 'S'},
 		{"maintenance-db", required_argument, NULL, 3},
 		{NULL, 0, NULL, 0}
 	};
@@ -61,6 +62,7 @@ main(int argc, char *argv[])
 	char	   *lc_collate = NULL;
 	char	   *lc_ctype = NULL;
 	char	   *locale = NULL;
+	char	   *strategy = NULL;
 
 	PQExpBufferData sql;
 
@@ -73,7 +75,7 @@ main(int argc, char *argv[])
 
 	handle_help_version_opts(argc, argv, "createdb", help);
 
-	while ((c = getopt_long(argc, argv, "h:p:U:wWeO:D:T:E:l:", long_options, &optindex)) != -1)
+	while ((c = getopt_long(argc, argv, "h:p:U:wWeO:D:T:E:l:S:", long_options, &optindex)) != -1)
 	{
 		switch (c)
 		{
@@ -119,6 +121,9 @@ main(int argc, char *argv[])
 			case 3:
 				maintenance_db = pg_strdup(optarg);
 				break;
+			case 'S':
+				strategy = pg_strdup(optarg);
+				break;
 			default:
 				fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
 				exit(1);
@@ -217,6 +222,8 @@ main(int argc, char *argv[])
 		appendPQExpBufferStr(&sql, " LC_CTYPE ");
 		appendStringLiteralConn(&sql, lc_ctype, conn);
 	}
+	if (strategy)
+		appendPQExpBuffer(&sql, " STRATEGY %s ", fmtId(strategy));
 
 	appendPQExpBufferChar(&sql, ';');
 
@@ -274,6 +281,7 @@ help(const char *progname)
 	printf(_("      --lc-collate=LOCALE      LC_COLLATE setting for the database\n"));
 	printf(_("      --lc-ctype=LOCALE        LC_CTYPE setting for the database\n"));
 	printf(_("  -O, --owner=OWNER            database user to own the new database\n"));
+	printf(_("  -S, --strategy=STRATEGY      database creation strategy wal_log or file_copy\n"));
 	printf(_("  -T, --template=TEMPLATE      template database to copy\n"));
 	printf(_("  -V, --version                output version information, then exit\n"));
 	printf(_("  -?, --help                   show this help, then exit\n"));
diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index 6392454..ccfbe17 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -76,4 +76,24 @@ $node->command_checks_all(
 	],
 	'createdb with incorrect --lc-ctype');
 
+$node->command_checks_all(
+	[ 'createdb', '--strategy', "foo", 'foobar2' ],
+	1,
+	[qr/^$/],
+	[
+		qr/^createdb: error: database creation failed: ERROR:  invalid create database strategy|^createdb: error: database creation failed: ERROR:  invalid create database strategy foo/s
+	],
+	'createdb with incorrect --strategy');
+
+# Check database creation strategy
+$node->issues_sql_like(
+	[ 'createdb', '-T', 'foobar2', 'foobar4', '-S', 'wal_log'],
+	qr/statement: CREATE DATABASE foobar4 TEMPLATE foobar2 STRATEGY wal_log/,
+	'create database with WAL_LOG strategy');
+
+$node->issues_sql_like(
+	[ 'createdb', '-T', 'foobar2', 'foobar5', '-S', 'file_copy'],
+	qr/statement: CREATE DATABASE foobar5 TEMPLATE foobar2 STRATEGY file_copy/,
+	'create database with FILE_COPY strategy');
+
 done_testing();
-- 
1.8.3.1

#155

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#152)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Sat, Mar 12, 2022 at 12:36 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

So my stand here is that we need to drop the database buffers and
remove the pending sync requests because we are deleting the underlying
files. If, in some extreme case, we do not delete the files, then there
is no need to drop the buffers or remove the pending sync requests, and
the worst consequence would be wasted disk space.

Hmm, I guess you're right.

On Mon, Mar 14, 2022 at 7:51 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

- Removed the unused structure and macro because we are using the same
WAL record both for copying the database with the old method and for
creating the directory and version file with the new method. Do you
think we should introduce a new WAL record type for that instead of
reusing the same one?

I think it would make sense to have two different WAL records e.g.
XLOG_DBASE_CREATE_WAL_LOG and XLOG_DBASE_CREATE_FILE_COPY. Then it's
easy to see how this could be generalized to other strategies in the
future.

--
Robert Haas
EDB: http://www.enterprisedb.com

#156

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#154)

1 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Mon, Mar 14, 2022 at 7:51 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Other than that, I have fixed some mistakes in comments and supported
tab completion for the new options.

I was looking at 0001 and 0002 again and realized that I swapped the
names load_relmap_file() and read_relmap_file() from what I should
have done. Here's a revised version. With this, read_relmap_file() and
write_relmap_file() become functions that just read and write the file
without touching any global variables, and load_relmap_file() is the
function that reads data from the file and puts it into a global
variable, which seems more sensible than the way I had it before.
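
The split described above can be sketched outside PostgreSQL as follows. This is only an illustration under stated assumptions: DemoMap, DEMO_MAGIC, DEMO_FILENAME, and the demo_* functions are hypothetical stand-ins, not the relmapper.c code, and locking, CRC checks, and WAL are left out. The read and write functions work only on the caller's buffer and path, and the load wrapper is the one place that touches the process-wide copies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, much-simplified stand-in for RelMapFile. */
typedef struct DemoMap
{
    int         magic;
    int         num_mappings;
} DemoMap;

#define DEMO_MAGIC    0x1234              /* arbitrary value for this sketch */
#define DEMO_FILENAME "demo_relmap.bin"   /* hypothetical file name */

/* Process-wide copies, playing the role of shared_map and local_map. */
static DemoMap demo_shared_map;
static DemoMap demo_local_map;

/* Pure I/O: write the caller's map to dir/DEMO_FILENAME; 0 on success. */
static int
demo_write_map(const DemoMap *map, const char *dir)
{
    char        path[1024];
    FILE       *fp;
    size_t      n;

    assert(map != NULL && dir != NULL);
    snprintf(path, sizeof(path), "%s/%s", dir, DEMO_FILENAME);
    fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;
    n = fwrite(map, sizeof(DemoMap), 1, fp);
    if (fclose(fp) != 0 || n != 1)
        return -1;
    return 0;
}

/* Pure I/O: read dir/DEMO_FILENAME into the caller's buffer; 0 on success. */
static int
demo_read_map(DemoMap *map, const char *dir)
{
    char        path[1024];
    FILE       *fp;
    DemoMap     tmp;
    size_t      n;

    assert(map != NULL && dir != NULL);
    snprintf(path, sizeof(path), "%s/%s", dir, DEMO_FILENAME);
    fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    n = fread(&tmp, sizeof(DemoMap), 1, fp);
    fclose(fp);

    /* Basic validation before handing the data back to the caller. */
    if (n != 1 || tmp.magic != DEMO_MAGIC || tmp.num_mappings < 0)
        return -1;
    *map = tmp;
    return 0;
}

/* Wrapper: the only function here that touches the global copies. */
static int
demo_load_map(bool shared, const char *shared_dir, const char *local_dir)
{
    if (shared)
        return demo_read_map(&demo_shared_map, shared_dir);
    return demo_read_map(&demo_local_map, local_dir);
}
```

With this shape, a caller that wants the map of some other database can call the read function with that database's path and its own buffer, without disturbing the global copies.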

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

relmap-rmh-refactor-v2.patch (application/octet-stream)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 4f6811f571..c3fef70a09 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -137,8 +137,10 @@ static void apply_map_update(RelMapFile *map, Oid relationId, Oid fileNode,
 static void merge_map_updates(RelMapFile *map, const RelMapFile *updates,
 							  bool add_okay);
 static void load_relmap_file(bool shared, bool lock_held);
-static void write_relmap_file(bool shared, RelMapFile *newmap,
-							  bool write_wal, bool send_sinval, bool preserve_files,
+static void read_relmap_file(RelMapFile *map, char *dbpath, bool lock_held,
+							 int elevel);
+static void write_relmap_file(RelMapFile *newmap, bool write_wal,
+							  bool send_sinval, bool preserve_files,
 							  Oid dbid, Oid tsid, const char *dbpath);
 static void perform_relmap_update(bool shared, const RelMapFile *updates);
 
@@ -568,9 +570,9 @@ RelationMapFinishBootstrap(void)
 	Assert(pending_local_updates.num_mappings == 0);
 
 	/* Write the files; no WAL or sinval needed */
-	write_relmap_file(true, &shared_map, false, false, false,
-					  InvalidOid, GLOBALTABLESPACE_OID, NULL);
-	write_relmap_file(false, &local_map, false, false, false,
+	write_relmap_file(&shared_map, false, false, false,
+					  InvalidOid, GLOBALTABLESPACE_OID, "global");
+	write_relmap_file(&local_map, false, false, false,
 					  MyDatabaseId, MyDatabaseTableSpace, DatabasePath);
 }
 
@@ -687,39 +689,48 @@ RestoreRelationMap(char *startAddress)
 }
 
 /*
- * load_relmap_file -- load data from the shared or local map file
+ * load_relmap_file -- load the shared or local map file
  *
- * Because the map file is essential for access to core system catalogs,
- * failure to read it is a fatal error.
+ * Because these files are essential for access to core system catalogs,
+ * failure to load either of them is a fatal error.
  *
  * Note that the local case requires DatabasePath to be set up.
  */
 static void
 load_relmap_file(bool shared, bool lock_held)
 {
-	RelMapFile *map;
+	if (shared)
+		read_relmap_file(&shared_map, "global", lock_held, FATAL);
+	else
+		read_relmap_file(&local_map, DatabasePath, lock_held, FATAL);
+}
+
+/*
+ * read_relmap_file -- load data from any relation mapper file
+ *
+ * dbpath must be the relevant database path, or "global" for shared relations.
+ *
+ * RelationMappingLock will be acquired and released unless lock_held = true.
+ *
+ * Errors will be reported at the indicated elevel, which should be at least
+ * ERROR.
+ */
+static void
+read_relmap_file(RelMapFile *map, char *dbpath, bool lock_held, int elevel)
+{
 	char		mapfilename[MAXPGPATH];
 	pg_crc32c	crc;
 	int			fd;
 	int			r;
 
-	if (shared)
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
-				 RELMAPPER_FILENAME);
-		map = &shared_map;
-	}
-	else
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 DatabasePath, RELMAPPER_FILENAME);
-		map = &local_map;
-	}
+	Assert(elevel >= ERROR);
 
-	/* Read data ... */
+	/* Open the target file. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s", dbpath,
+			 RELMAPPER_FILENAME);
 	fd = OpenTransientFile(mapfilename, O_RDONLY | PG_BINARY);
 	if (fd < 0)
-		ereport(FATAL,
+		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not open file \"%s\": %m",
 						mapfilename)));
@@ -734,16 +745,17 @@ load_relmap_file(bool shared, bool lock_held)
 	if (!lock_held)
 		LWLockAcquire(RelationMappingLock, LW_SHARED);
 
+	/* Now read the data. */
 	pgstat_report_wait_start(WAIT_EVENT_RELATION_MAP_READ);
 	r = read(fd, map, sizeof(RelMapFile));
 	if (r != sizeof(RelMapFile))
 	{
 		if (r < 0)
-			ereport(FATAL,
+			ereport(elevel,
 					(errcode_for_file_access(),
 					 errmsg("could not read file \"%s\": %m", mapfilename)));
 		else
-			ereport(FATAL,
+			ereport(elevel,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg("could not read file \"%s\": read %d of %zu",
 							mapfilename, r, sizeof(RelMapFile))));
@@ -754,7 +766,7 @@ load_relmap_file(bool shared, bool lock_held)
 		LWLockRelease(RelationMappingLock);
 
 	if (CloseTransientFile(fd) != 0)
-		ereport(FATAL,
+		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m",
 						mapfilename)));
@@ -763,7 +775,7 @@ load_relmap_file(bool shared, bool lock_held)
 	if (map->magic != RELMAPPER_FILEMAGIC ||
 		map->num_mappings < 0 ||
 		map->num_mappings > MAX_MAPPINGS)
-		ereport(FATAL,
+		ereport(elevel,
 				(errmsg("relation mapping file \"%s\" contains invalid data",
 						mapfilename)));
 
@@ -773,7 +785,7 @@ load_relmap_file(bool shared, bool lock_held)
 	FIN_CRC32C(crc);
 
 	if (!EQ_CRC32C(crc, map->crc))
-		ereport(FATAL,
+		ereport(elevel,
 				(errmsg("relation mapping file \"%s\" contains incorrect checksum",
 						mapfilename)));
 }
@@ -795,16 +807,16 @@ load_relmap_file(bool shared, bool lock_held)
  *
  * Because this may be called during WAL replay when MyDatabaseId,
  * DatabasePath, etc aren't valid, we require the caller to pass in suitable
- * values.  The caller is also responsible for being sure no concurrent
- * map update could be happening.
+ * values. Pass dbpath as "global" for the shared map.
+ *
+ * The caller is also responsible for being sure no concurrent map update
+ * could be happening.
  */
 static void
-write_relmap_file(bool shared, RelMapFile *newmap,
-				  bool write_wal, bool send_sinval, bool preserve_files,
-				  Oid dbid, Oid tsid, const char *dbpath)
+write_relmap_file(RelMapFile *newmap, bool write_wal, bool send_sinval,
+				  bool preserve_files, Oid dbid, Oid tsid, const char *dbpath)
 {
 	int			fd;
-	RelMapFile *realmap;
 	char		mapfilename[MAXPGPATH];
 
 	/*
@@ -822,19 +834,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 	 * Open the target file.  We prefer to do this before entering the
 	 * critical section, so that an open() failure need not force PANIC.
 	 */
-	if (shared)
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
-				 RELMAPPER_FILENAME);
-		realmap = &shared_map;
-	}
-	else
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 dbpath, RELMAPPER_FILENAME);
-		realmap = &local_map;
-	}
-
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 dbpath, RELMAPPER_FILENAME);
 	fd = OpenTransientFile(mapfilename, O_WRONLY | O_CREAT | PG_BINARY);
 	if (fd < 0)
 		ereport(ERROR,
@@ -934,16 +935,6 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		}
 	}
 
-	/*
-	 * Success, update permanent copy.  During bootstrap, we might be working
-	 * on the permanent copy itself, in which case skip the memcpy() to avoid
-	 * invoking nominally-undefined behavior.
-	 */
-	if (realmap != newmap)
-		memcpy(realmap, newmap, sizeof(RelMapFile));
-	else
-		Assert(!send_sinval);	/* must be bootstrapping */
-
 	/* Critical section done */
 	if (write_wal)
 		END_CRIT_SECTION();
@@ -990,10 +981,19 @@ perform_relmap_update(bool shared, const RelMapFile *updates)
 	merge_map_updates(&newmap, updates, allowSystemTableMods);
 
 	/* Write out the updated map and do other necessary tasks */
-	write_relmap_file(shared, &newmap, true, true, true,
+	write_relmap_file(&newmap, true, true, true,
 					  (shared ? InvalidOid : MyDatabaseId),
 					  (shared ? GLOBALTABLESPACE_OID : MyDatabaseTableSpace),
-					  DatabasePath);
+					  (shared ? "global" : DatabasePath));
+
+	/*
+	 * We successfully wrote the updated file, so it's now safe to rely on the
+	 * new values in this process, too.
+	 */
+	if (shared)
+		memcpy(&shared_map, &newmap, sizeof(RelMapFile));
+	else
+		memcpy(&local_map, &newmap, sizeof(RelMapFile));
 
 	/* Now we can release the lock */
 	LWLockRelease(RelationMappingLock);
@@ -1021,8 +1021,10 @@ relmap_redo(XLogReaderState *record)
 				 xlrec->nbytes);
 		memcpy(&newmap, xlrec->data, sizeof(newmap));
 
-		/* We need to construct the pathname for this database */
-		dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
+		if (xlrec->dbid != InvalidOid)
+			dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
+		else
+			dbpath = pstrdup("global");
 
 		/*
 		 * Write out the new map and send sinval, but of course don't write a
@@ -1033,8 +1035,7 @@ relmap_redo(XLogReaderState *record)
 		 * but grab the lock to interlock against load_relmap_file().
 		 */
 		LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
-		write_relmap_file((xlrec->dbid == InvalidOid), &newmap,
-						  false, true, false,
+		write_relmap_file(&newmap, false, true, false,
 						  xlrec->dbid, xlrec->tsid, dbpath);
 		LWLockRelease(RelationMappingLock);

#157

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Robert Haas (#156)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Mon, Mar 14, 2022 at 12:04 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Mar 14, 2022 at 7:51 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Other than that, I have fixed some mistakes in comments and supported
tab completion for the new options.

I was looking at 0001 and 0002 again and realized that I swapped the
names load_relmap_file() and read_relmap_file() from what I should
have done. Here's a revised version. With this, read_relmap_file() and
write_relmap_file() become functions that just read and write the file
without touching any global variables, and load_relmap_file() is the
function that reads data from the file and puts it into a global
variable, which seems more sensible than the way I had it before.

Regarding 0003 and 0005, I'm not a fan of 'bool isunlogged'. I think
'bool permanent' would be better (note BM_PERMANENT). This would
involve reversing true and false.
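
As a rough sketch of what that reversal means, assuming only the two persistence levels under discussion: the DEMO_ macros below are given the same character values as RELPERSISTENCE_PERMANENT and RELPERSISTENCE_UNLOGGED, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Same character values as RELPERSISTENCE_PERMANENT / RELPERSISTENCE_UNLOGGED. */
#define DEMO_RELPERSISTENCE_PERMANENT 'p'
#define DEMO_RELPERSISTENCE_UNLOGGED  'u'

/* Converting an existing call site: the old flag is simply negated. */
static bool
demo_permanent_from_isunlogged(bool isunlogged)
{
    return !isunlogged;
}

/* With 'permanent', true selects the common case, as BM_PERMANENT does. */
static char
demo_persistence_from_flag(bool permanent)
{
    return permanent ? DEMO_RELPERSISTENCE_PERMANENT
                     : DEMO_RELPERSISTENCE_UNLOGGED;
}
```

Existing WAL-replay callers, which only ever deal with permanent relations, would then pass true.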

Regarding 0004, I can't really see a reason for this function to take
a LockRelId as a parameter rather than two separate OIDs. I also can't
entirely see why it should be called LockRelationId. Maybe
LockRelationInDatabaseById(Oid dbid, Oid relid, LOCKMODE lockmode)?
Note that neither caller actually has a LockRelId available; both have
to construct one.

Regarding 0005:

+ CREATEDB_WAL_LOG = 0,
+ CREATEDB_FILE_COPY = 1

I still think you don't need = 0 and = 1 here.
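
For illustration, a C enum without explicit initializers already numbers its members 0, 1, and so on, so the values stay the same. The sketch below uses hypothetical Demo names rather than the patch's own type, and adds a small name parser only so there is something to exercise.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef enum DemoCreateDBStrategy
{
    DEMO_CREATEDB_WAL_LOG,      /* implicitly 0 */
    DEMO_CREATEDB_FILE_COPY     /* implicitly 1 */
} DemoCreateDBStrategy;

/* Case-insensitive ASCII comparison; returns 1 when the strings match. */
static int
demo_name_equals(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0')
    {
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return 0;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/* Returns 0 and sets *strategy on success, -1 for an unknown name. */
static int
demo_parse_strategy(const char *name, DemoCreateDBStrategy *strategy)
{
    assert(name != NULL && strategy != NULL);
    if (demo_name_equals(name, "wal_log"))
        *strategy = DEMO_CREATEDB_WAL_LOG;
    else if (demo_name_equals(name, "file_copy"))
        *strategy = DEMO_CREATEDB_FILE_COPY;
    else
        return -1;
    return 0;
}
```

An unknown name such as "foo" is rejected, which is the case the createdb test with an incorrect --strategy covers.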

I'll probably go through and do a pass over the comments once you post
the next version of this. There seems to be work needed in a bunch of
places, but it probably makes more sense for me to go through and
adjust the things that seem to need it rather than listing a bunch of
changes for you to make.

--
Robert Haas
EDB: http://www.enterprisedb.com

#158

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Robert Haas (#157)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Mon, Mar 14, 2022 at 10:04 PM Robert Haas <robertmhaas@gmail.com> wrote:

Regarding 0004, I can't really see a reason for this function to take
a LockRelId as a parameter rather than two separate OIDs. I also can't
entirely see why it should be called LockRelationId. Maybe
LockRelationInDatabaseById(Oid dbid, Oid relid, LOCKMODE lockmode)?
Note that neither caller actually has a LockRelId available; both have
to construct one.

Actually we already have an existing function
UnlockRelationId(LockRelId *relid, LOCKMODE lockmode) so it makes more
sense to have a parallel lock function. Do you still think we should
change?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#159

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#158)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Mon, Mar 14, 2022 at 12:44 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Mar 14, 2022 at 10:04 PM Robert Haas <robertmhaas@gmail.com> wrote:

Regarding 0004, I can't really see a reason for this function to take
a LockRelId as a parameter rather than two separate OIDs. I also can't
entirely see why it should be called LockRelationId. Maybe
LockRelationInDatabaseById(Oid dbid, Oid relid, LOCKMODE lockmode)?
Note that neither caller actually has a LockRelId available; both have
to construct one.

Actually we already have an existing function
UnlockRelationId(LockRelId *relid, LOCKMODE lockmode) so it makes more
sense to have a parallel lock function. Do you still think we should
change?

Oh! OK, well, then what you did makes sense, for consistency. Didn't
realize that.

--
Robert Haas
EDB: http://www.enterprisedb.com

#160

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Robert Haas (#157)

6 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Mon, Mar 14, 2022 at 10:04 PM Robert Haas <robertmhaas@gmail.com> wrote:

I think it would make sense to have two different WAL records e.g.
XLOG_DBASE_CREATE_WAL_LOG and XLOG_DBASE_CREATE_FILE_COPY. Then it's
easy to see how this could be generalized to other strategies in the
future.

Done that way. In dbase_desc(), for XLOG_DBASE_CREATE_FILE_COPY I
have kept the older description i.e. "copy dir" and for
XLOG_DBASE_CREATE_WAL_LOG it is "create dir", because logically the
first one is actually copying and the second one is just creating the
directory. Do you think we should be using "copy dir file_copy" and
"copy dir wal_log" in the description as well?

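A minimal sketch of the description choice discussed above, assuming two distinct record types. The DEMO_ info codes are illustrative values only, and demo_dbase_desc is a hypothetical stand-in for the relevant branch of dbase_desc().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative info codes; the values are assumptions for this sketch. */
#define DEMO_XLOG_DBASE_CREATE_FILE_COPY 0x00
#define DEMO_XLOG_DBASE_CREATE_WAL_LOG   0x10

/* Returns the leading text of the record description. */
static const char *
demo_dbase_desc(uint8_t info)
{
    switch (info)
    {
        case DEMO_XLOG_DBASE_CREATE_FILE_COPY:
            return "copy dir";      /* the whole directory is copied */
        case DEMO_XLOG_DBASE_CREATE_WAL_LOG:
            return "create dir";    /* only the directory is created */
        default:
            return "unknown";
    }
}
```

Keeping one case per record type means a further strategy would only need another info code and another case.
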
On Mon, Mar 14, 2022 at 12:04 PM Robert Haas <robertmhaas@gmail.com> wrote:

I was looking at 0001 and 0002 again and realized that I swapped the
names load_relmap_file() and read_relmap_file() from what I should
have done. Here's a revised version. With this, read_relmap_file() and
write_relmap_file() become functions that just read and write the file
without touching any global variables, and load_relmap_file() is the
function that reads data from the file and puts it into a global
variable, which seems more sensible than the way I had it before.

Okay, I have included this patch and rebased other patches on top of that.

Regarding 0003 and 0005, I'm not a fan of 'bool isunlogged'. I think
'bool permanent' would be better (note BM_PERMANENT). This would
involve reversing true and false.

Okay changed.

Regarding 0005:

+ CREATEDB_WAL_LOG = 0,
+ CREATEDB_FILE_COPY = 1

I still think you don't need = 0 and = 1 here.

Done

I'll probably go through and do a pass over the comments once you post
the next version of this. There seems to be work needed in a bunch of
places, but it probably makes more sense for me to go through and
adjust the things that seem to need it rather than listing a bunch of
changes for you to make.

Sure, thanks.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v15-0001-Refactor-relmap-load-and-relmap-write-functions.patch (text/x-patch; charset=US-ASCII)

From 9a4138bd3590d4df887dc09989a8c72715789b65 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Tue, 15 Mar 2022 09:13:21 +0530
Subject: [PATCH v15 1/6] Refactor relmap load and relmap write functions

Currently, the relmap reading and writing interfaces are tightly
coupled to the shared_map and local_map of the database the
backend is connected to.  The higher-level patches in this set
need interfaces that can read a relmap into any caller-supplied
memory and write out a map passed in by the caller.  They also
need to read the relmap file from a given database path instead
of assuming we are connected to that database.

So this patch refactors the existing code to expose read and
write interfaces that are independent of shared_map and
local_map, without changing any logic.

Author: Robert Haas
---
 src/backend/utils/cache/relmapper.c | 127 ++++++++++++++++++------------------
 1 file changed, 64 insertions(+), 63 deletions(-)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 4f6811f..c3fef70 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -137,8 +137,10 @@ static void apply_map_update(RelMapFile *map, Oid relationId, Oid fileNode,
 static void merge_map_updates(RelMapFile *map, const RelMapFile *updates,
 							  bool add_okay);
 static void load_relmap_file(bool shared, bool lock_held);
-static void write_relmap_file(bool shared, RelMapFile *newmap,
-							  bool write_wal, bool send_sinval, bool preserve_files,
+static void read_relmap_file(RelMapFile *map, char *dbpath, bool lock_held,
+							 int elevel);
+static void write_relmap_file(RelMapFile *newmap, bool write_wal,
+							  bool send_sinval, bool preserve_files,
 							  Oid dbid, Oid tsid, const char *dbpath);
 static void perform_relmap_update(bool shared, const RelMapFile *updates);
 
@@ -568,9 +570,9 @@ RelationMapFinishBootstrap(void)
 	Assert(pending_local_updates.num_mappings == 0);
 
 	/* Write the files; no WAL or sinval needed */
-	write_relmap_file(true, &shared_map, false, false, false,
-					  InvalidOid, GLOBALTABLESPACE_OID, NULL);
-	write_relmap_file(false, &local_map, false, false, false,
+	write_relmap_file(&shared_map, false, false, false,
+					  InvalidOid, GLOBALTABLESPACE_OID, "global");
+	write_relmap_file(&local_map, false, false, false,
 					  MyDatabaseId, MyDatabaseTableSpace, DatabasePath);
 }
 
@@ -687,39 +689,48 @@ RestoreRelationMap(char *startAddress)
 }
 
 /*
- * load_relmap_file -- load data from the shared or local map file
+ * load_relmap_file -- load the shared or local map file
  *
- * Because the map file is essential for access to core system catalogs,
- * failure to read it is a fatal error.
+ * Because these files are essential for access to core system catalogs,
+ * failure to load either of them is a fatal error.
  *
  * Note that the local case requires DatabasePath to be set up.
  */
 static void
 load_relmap_file(bool shared, bool lock_held)
 {
-	RelMapFile *map;
+	if (shared)
+		read_relmap_file(&shared_map, "global", lock_held, FATAL);
+	else
+		read_relmap_file(&local_map, DatabasePath, lock_held, FATAL);
+}
+
+/*
+ * read_relmap_file -- load data from any relation mapper file
+ *
+ * dbpath must be the relevant database path, or "global" for shared relations.
+ *
+ * RelationMappingLock will be acquired and released unless lock_held = true.
+ *
+ * Errors will be reported at the indicated elevel, which should be at least
+ * ERROR.
+ */
+static void
+read_relmap_file(RelMapFile *map, char *dbpath, bool lock_held, int elevel)
+{
 	char		mapfilename[MAXPGPATH];
 	pg_crc32c	crc;
 	int			fd;
 	int			r;
 
-	if (shared)
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
-				 RELMAPPER_FILENAME);
-		map = &shared_map;
-	}
-	else
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 DatabasePath, RELMAPPER_FILENAME);
-		map = &local_map;
-	}
+	Assert(elevel >= ERROR);
 
-	/* Read data ... */
+	/* Open the target file. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s", dbpath,
+			 RELMAPPER_FILENAME);
 	fd = OpenTransientFile(mapfilename, O_RDONLY | PG_BINARY);
 	if (fd < 0)
-		ereport(FATAL,
+		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not open file \"%s\": %m",
 						mapfilename)));
@@ -734,16 +745,17 @@ load_relmap_file(bool shared, bool lock_held)
 	if (!lock_held)
 		LWLockAcquire(RelationMappingLock, LW_SHARED);
 
+	/* Now read the data. */
 	pgstat_report_wait_start(WAIT_EVENT_RELATION_MAP_READ);
 	r = read(fd, map, sizeof(RelMapFile));
 	if (r != sizeof(RelMapFile))
 	{
 		if (r < 0)
-			ereport(FATAL,
+			ereport(elevel,
 					(errcode_for_file_access(),
 					 errmsg("could not read file \"%s\": %m", mapfilename)));
 		else
-			ereport(FATAL,
+			ereport(elevel,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg("could not read file \"%s\": read %d of %zu",
 							mapfilename, r, sizeof(RelMapFile))));
@@ -754,7 +766,7 @@ load_relmap_file(bool shared, bool lock_held)
 		LWLockRelease(RelationMappingLock);
 
 	if (CloseTransientFile(fd) != 0)
-		ereport(FATAL,
+		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m",
 						mapfilename)));
@@ -763,7 +775,7 @@ load_relmap_file(bool shared, bool lock_held)
 	if (map->magic != RELMAPPER_FILEMAGIC ||
 		map->num_mappings < 0 ||
 		map->num_mappings > MAX_MAPPINGS)
-		ereport(FATAL,
+		ereport(elevel,
 				(errmsg("relation mapping file \"%s\" contains invalid data",
 						mapfilename)));
 
@@ -773,7 +785,7 @@ load_relmap_file(bool shared, bool lock_held)
 	FIN_CRC32C(crc);
 
 	if (!EQ_CRC32C(crc, map->crc))
-		ereport(FATAL,
+		ereport(elevel,
 				(errmsg("relation mapping file \"%s\" contains incorrect checksum",
 						mapfilename)));
 }
@@ -795,16 +807,16 @@ load_relmap_file(bool shared, bool lock_held)
  *
  * Because this may be called during WAL replay when MyDatabaseId,
  * DatabasePath, etc aren't valid, we require the caller to pass in suitable
- * values.  The caller is also responsible for being sure no concurrent
- * map update could be happening.
+ * values. Pass dbpath as "global" for the shared map.
+ *
+ * The caller is also responsible for being sure no concurrent map update
+ * could be happening.
  */
 static void
-write_relmap_file(bool shared, RelMapFile *newmap,
-				  bool write_wal, bool send_sinval, bool preserve_files,
-				  Oid dbid, Oid tsid, const char *dbpath)
+write_relmap_file(RelMapFile *newmap, bool write_wal, bool send_sinval,
+				  bool preserve_files, Oid dbid, Oid tsid, const char *dbpath)
 {
 	int			fd;
-	RelMapFile *realmap;
 	char		mapfilename[MAXPGPATH];
 
 	/*
@@ -822,19 +834,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 	 * Open the target file.  We prefer to do this before entering the
 	 * critical section, so that an open() failure need not force PANIC.
 	 */
-	if (shared)
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
-				 RELMAPPER_FILENAME);
-		realmap = &shared_map;
-	}
-	else
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 dbpath, RELMAPPER_FILENAME);
-		realmap = &local_map;
-	}
-
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 dbpath, RELMAPPER_FILENAME);
 	fd = OpenTransientFile(mapfilename, O_WRONLY | O_CREAT | PG_BINARY);
 	if (fd < 0)
 		ereport(ERROR,
@@ -934,16 +935,6 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		}
 	}
 
-	/*
-	 * Success, update permanent copy.  During bootstrap, we might be working
-	 * on the permanent copy itself, in which case skip the memcpy() to avoid
-	 * invoking nominally-undefined behavior.
-	 */
-	if (realmap != newmap)
-		memcpy(realmap, newmap, sizeof(RelMapFile));
-	else
-		Assert(!send_sinval);	/* must be bootstrapping */
-
 	/* Critical section done */
 	if (write_wal)
 		END_CRIT_SECTION();
@@ -990,10 +981,19 @@ perform_relmap_update(bool shared, const RelMapFile *updates)
 	merge_map_updates(&newmap, updates, allowSystemTableMods);
 
 	/* Write out the updated map and do other necessary tasks */
-	write_relmap_file(shared, &newmap, true, true, true,
+	write_relmap_file(&newmap, true, true, true,
 					  (shared ? InvalidOid : MyDatabaseId),
 					  (shared ? GLOBALTABLESPACE_OID : MyDatabaseTableSpace),
-					  DatabasePath);
+					  (shared ? "global" : DatabasePath));
+
+	/*
+	 * We successfully wrote the updated file, so it's now safe to rely on the
+	 * new values in this process, too.
+	 */
+	if (shared)
+		memcpy(&shared_map, &newmap, sizeof(RelMapFile));
+	else
+		memcpy(&local_map, &newmap, sizeof(RelMapFile));
 
 	/* Now we can release the lock */
 	LWLockRelease(RelationMappingLock);
@@ -1021,8 +1021,10 @@ relmap_redo(XLogReaderState *record)
 				 xlrec->nbytes);
 		memcpy(&newmap, xlrec->data, sizeof(newmap));
 
-		/* We need to construct the pathname for this database */
-		dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
+		if (xlrec->dbid != InvalidOid)
+			dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
+		else
+			dbpath = pstrdup("global");
 
 		/*
 		 * Write out the new map and send sinval, but of course don't write a
@@ -1033,8 +1035,7 @@ relmap_redo(XLogReaderState *record)
 		 * but grab the lock to interlock against load_relmap_file().
 		 */
 		LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
-		write_relmap_file((xlrec->dbid == InvalidOid), &newmap,
-						  false, true, false,
+		write_relmap_file(&newmap, false, true, false,
 						  xlrec->dbid, xlrec->tsid, dbpath);
 		LWLockRelease(RelationMappingLock);
 
-- 
1.8.3.1

v15-0003-Allow-ReadBufferWithoutRelcache-to-support-unlog.patch (text/x-patch; charset=US-ASCII)

From 0c8aa4599d7dc0706b88297f2a459113f9154250 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Thu, 10 Feb 2022 15:55:33 +0530
Subject: [PATCH v15 3/6] Allow ReadBufferWithoutRelcache to support unlogged
 relpersistence

At present, this function may only be used on permanent relations,
because we only use it during XLOG replay.  But now as part of the
bigger patch set, we will be using this for reading the buffer from
the database to which we are not connected.  So now we need this
for the unlogged relations as well.
---
 src/backend/access/transam/xlogutils.c |  6 +++---
 src/backend/storage/buffer/bufmgr.c    | 18 ++++++++++--------
 src/include/storage/bufmgr.h           |  3 ++-
 3 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 54d5f20..6b10656 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -484,7 +484,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	{
 		/* page exists in file */
 		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-										   mode, NULL);
+										   mode, NULL, true);
 	}
 	else
 	{
@@ -509,7 +509,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				ReleaseBuffer(buffer);
 			}
 			buffer = ReadBufferWithoutRelcache(rnode, forknum,
-											   P_NEW, mode, NULL);
+											   P_NEW, mode, NULL, true);
 		}
 		while (BufferGetBlockNumber(buffer) < blkno);
 		/* Handle the corner case that P_NEW returns non-consecutive pages */
@@ -519,7 +519,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			ReleaseBuffer(buffer);
 			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-											   mode, NULL);
+											   mode, NULL, true);
 		}
 	}
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c6..3cadcd2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -772,23 +772,25 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
  * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
  *		a relcache entry for the relation.
  *
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay.  If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
+ * The caller should pass 'permanent' as true for the regular relation and
+ * false for the unlogged relation.
+ *
+ * NB: At present, this function may only be used on unlogged and regular
+ * relations, which is OK, because we only use it during XLOG replay and while
+ * copying the database.  If in the future we want to use it on temporary
+ * relations, we could pass additional parameters.
  */
 Buffer
 ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
-						  BufferAccessStrategy strategy)
+						  BufferAccessStrategy strategy, bool permanent)
 {
 	bool		hit;
 
 	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
 
-	Assert(InRecovery);
-
-	return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
+	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index dd01841..fd0452f 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -184,7 +184,8 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BufferAccessStrategy strategy);
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 										ForkNumber forkNum, BlockNumber blockNum,
-										ReadBufferMode mode, BufferAccessStrategy strategy);
+										ReadBufferMode mode, BufferAccessStrategy strategy,
+										bool permanent);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
-- 
1.8.3.1

v15-0005-WAL-logged-CREATE-DATABASE.patch (text/x-patch; charset=US-ASCII)

From 447b4d5d7e0ef7ba1691e853ea3d144b5de70f94 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Tue, 15 Mar 2022 09:41:20 +0530
Subject: [PATCH v15 5/6] WAL logged CREATE DATABASE

Currently, CREATE DATABASE forces a checkpoint, then copies all the files,
then forces another checkpoint. The comments in the createdb() function
explain the reasons for this. The attached patch fixes this problem by making
create database completely WAL logged so that we can avoid the checkpoints.

We also keep the old way of creating the database, and provide an option
to choose the strategy used for creating it.  For the new method the user
needs to specify STRATEGY=WAL_LOG, and for the old method
STRATEGY=FILE_COPY.  The default strategy is WAL_LOG.
---
 contrib/bloom/blinsert.c                 |   2 +-
 doc/src/sgml/ref/create_database.sgml    |  23 +
 src/backend/access/heap/heapam_handler.c |   2 +-
 src/backend/access/nbtree/nbtree.c       |   2 +-
 src/backend/access/rmgrdesc/dbasedesc.c  |  20 +-
 src/backend/commands/dbcommands.c        | 716 ++++++++++++++++++++++++++-----
 src/backend/storage/buffer/bufmgr.c      | 156 +++++++
 src/bin/pg_rewind/parsexlog.c            |   9 +-
 src/bin/psql/tab-complete.c              |   4 +-
 src/include/commands/dbcommands_xlog.h   |  24 +-
 src/include/storage/bufmgr.h             |   3 +
 src/tools/pgindent/typedefs.list         |   5 +-
 12 files changed, 838 insertions(+), 128 deletions(-)

diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index c94cf34..82378db 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -173,7 +173,7 @@ blbuildempty(Relation index)
 	 * Write the page and log it.  It might seem that an immediate sync would
 	 * be sufficient to guarantee that the file exists on disk, but recovery
 	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
+	 * XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
 	PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index f70d0c7..b0c94e40 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -34,6 +34,7 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
            [ CONNECTION LIMIT [=] <replaceable class="parameter">connlimit</replaceable> ]
            [ IS_TEMPLATE [=] <replaceable class="parameter">istemplate</replaceable> ]
            [ OID [=] <replaceable class="parameter">oid</replaceable> ] ]
+           [ STRATEGY [=] <replaceable class="parameter">strategy</replaceable> ] ]
 </synopsis>
  </refsynopsisdiv>
 
@@ -240,6 +241,28 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
        </listitem>
       </varlistentry>
 
+      <varlistentry>
+       <term><replaceable class="parameter">strategy</replaceable></term>
+       <listitem>
+        <para>
+         The strategy used to copy the source database.  Two strategies
+         are currently available: <literal>WAL_LOG</literal> and
+         <literal>FILE_COPY</literal>.  With the <literal>WAL_LOG</literal>
+         strategy, the database is copied block by block and each copied
+         block is also WAL-logged.  With the
+         <literal>FILE_COPY</literal> strategy, the database is copied at
+         the file system level and the individual operations are not
+         WAL-logged.  The default strategy is <literal>WAL_LOG</literal>.
+         The file system level copy has to issue a checkpoint both before
+         and after the copy; if there are many dirty buffers, these
+         checkpoints can be costly and may affect the performance of the
+         whole system.  On the other hand, WAL-logging each block can make
+         database creation take longer when the source database is
+         large.
+        </para>
+       </listitem>
+      </varlistentry>
+
     </variablelist>
 
   <para>
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 39ef8a0..2b70ca0 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -601,7 +601,7 @@ heapam_relation_set_new_filenode(Relation rel,
 	 * even if the page has been logged, because the write did not go through
 	 * shared_buffers and therefore a concurrent checkpoint may have moved the
 	 * redo pointer past our xlog record.  Recovery may as well remove it
-	 * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
+	 * while replaying, for example, XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE
 	 * record. Therefore, logging is necessary even if wal_level=minimal.
 	 */
 	if (persistence == RELPERSISTENCE_UNLOGGED)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c9b4964..dacf3f7 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -161,7 +161,7 @@ btbuildempty(Relation index)
 	 * Write the page and log it.  It might seem that an immediate sync would
 	 * be sufficient to guarantee that the file exists on disk, but recovery
 	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
+	 * XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
 	PageSetChecksumInplace(metapage, BTREE_METAPAGE);
diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index 03af3fd..523d0b3 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -24,14 +24,23 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 	char	   *rec = XLogRecGetData(record);
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
 
-	if (info == XLOG_DBASE_CREATE)
+	if (info == XLOG_DBASE_CREATE_FILE_COPY)
 	{
-		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;
+		xl_dbase_create_file_copy_rec *xlrec =
+		(xl_dbase_create_file_copy_rec *) rec;
 
 		appendStringInfo(buf, "copy dir %u/%u to %u/%u",
 						 xlrec->src_tablespace_id, xlrec->src_db_id,
 						 xlrec->tablespace_id, xlrec->db_id);
 	}
+	else if (info == XLOG_DBASE_CREATE_WAL_LOG)
+	{
+		xl_dbase_create_wal_log_rec *xlrec =
+		(xl_dbase_create_wal_log_rec *) rec;
+
+		appendStringInfo(buf, "create dir %u/%u",
+						 xlrec->tablespace_id, xlrec->db_id);
+	}
 	else if (info == XLOG_DBASE_DROP)
 	{
 		xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) rec;
@@ -51,8 +60,11 @@ dbase_identify(uint8 info)
 
 	switch (info & ~XLR_INFO_MASK)
 	{
-		case XLOG_DBASE_CREATE:
-			id = "CREATE";
+		case XLOG_DBASE_CREATE_FILE_COPY:
+			id = "CREATE_FILE_COPY";
+			break;
+		case XLOG_DBASE_CREATE_WAL_LOG:
+			id = "CREATE_WAL_LOG";
 			break;
 		case XLOG_DBASE_DROP:
 			id = "DROP";
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index c37e3c9..9636688 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -63,13 +63,27 @@
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/pg_locale.h"
+#include "utils/relmapper.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
+/*
+ * Create database strategy.  CREATEDB_WAL_LOG copies the database at the
+ * block level and WAL-logs each copied block.  CREATEDB_FILE_COPY instead
+ * copies the database at the file system level, so the individual
+ * operations are not WAL-logged.
+ */
+typedef enum CreateDBStrategy
+{
+	CREATEDB_WAL_LOG,
+	CREATEDB_FILE_COPY
+} CreateDBStrategy;
+
 typedef struct
 {
 	Oid			src_dboid;		/* source (template) DB */
 	Oid			dest_dboid;		/* DB we are trying to create */
+	CreateDBStrategy strategy;	/* create db strategy */
 } createdb_failure_params;
 
 typedef struct
@@ -78,6 +92,19 @@ typedef struct
 	Oid			dest_tsoid;		/* tablespace we are trying to move to */
 } movedb_failure_params;
 
+/*
+ * When creating a database, we scan the pg_class of the source database to
+ * identify all the relations to be copied.  The structure is used for storing
+ * information about each relation of the source database.
+ */
+typedef struct CreateDBRelInfo
+{
+	RelFileNode rnode;			/* physical relation identifier */
+	Oid			reloid;			/* relation oid */
+	bool		permanent;		/* relation is permanent or unlogged */
+} CreateDBRelInfo;
+
+
 /* non-export function prototypes */
 static void createdb_failure_callback(int code, Datum arg);
 static void movedb(const char *dbname, const char *tblspcname);
@@ -92,7 +119,507 @@ static bool have_createdb_privilege(void);
 static void remove_dbtablespaces(Oid db_id);
 static bool check_db_file_conflict(Oid db_id);
 static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
+static void CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid,
+									bool isRedo);
+static CreateDBRelInfo *ScanSourceDatabasePgClassTuple(HeapTupleData *tuple,
+													   Oid tbid, Oid dbid,
+													   char *srcpath);
+static List *ScanSourceDatabasePgClassPage(Page page, Buffer buf, Oid tbid,
+										   Oid dbid, char *srcpath,
+										   List *rnodelist, Snapshot snapshot);
+static List *ScanSourceDatabasePgClass(Oid srctbid, Oid srcdbid, char *srcpath);
+static void CreateDatabaseUsingWalLog(Oid src_dboid, Oid dboid, Oid src_tsid,
+									  Oid dst_tsid);
+static void CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dboid, Oid src_tsid,
+										Oid dst_tsid);
+
+/*
+ * Create database directory and write out the PG_VERSION file in the database
+ * path.  If isRedo is true, it's okay for the database directory to exist
+ * already.  We can directly write PG_MAJORVERSION in the version file instead
+ * of copying from the source database file because these two must be the same.
+ */
+static void
+CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
+{
+	int			fd;
+	int			nbytes;
+	char		versionfile[MAXPGPATH];
+	char		buf[16];
+
+	/* Prepare version data before starting a critical section. */
+	sprintf(buf, "%s\n", PG_MAJORVERSION);
+	nbytes = strlen(PG_MAJORVERSION) + 1;
+
+	/* If we are not in WAL replay then write the WAL. */
+	if (!isRedo)
+	{
+		xl_dbase_create_wal_log_rec xlrec;
+		XLogRecPtr	lsn;
+
+		START_CRIT_SECTION();
+
+		xlrec.db_id = dbid;
+		xlrec.tablespace_id = tsid;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec),
+						 sizeof(xl_dbase_create_wal_log_rec));
+
+		lsn = XLogInsert(RM_DBASE_ID, XLOG_DBASE_CREATE_WAL_LOG);
+
+		/* As always, WAL must hit the disk before the data update does. */
+		XLogFlush(lsn);
+	}
+
+	/* Create database directory. */
+	if (MakePGDirectory(dbpath) < 0)
+	{
+		/* Failure other than already exists or not in WAL replay? */
+		if (errno != EEXIST || !isRedo)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create directory \"%s\": %m", dbpath)));
+	}
+
+	/*
+	 * Create PG_VERSION file in the database path.  If the file already
+	 * exists and we are in WAL replay then try again to open it in write
+	 * mode.
+	 */
+	snprintf(versionfile, sizeof(versionfile), "%s/%s", dbpath, "PG_VERSION");
+
+	fd = OpenTransientFile(versionfile, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	if (fd < 0 && errno == EEXIST && isRedo)
+		fd = OpenTransientFile(versionfile, O_WRONLY | O_TRUNC | PG_BINARY);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m", versionfile)));
+
+	/* Write PG_MAJORVERSION in the PG_VERSION file. */
+	pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_WRITE);
+	errno = 0;
+	if ((int) write(fd, buf, nbytes) != nbytes)
+	{
+		/* If write didn't set errno, assume problem is no disk space. */
+		if (errno == 0)
+			errno = ENOSPC;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", versionfile)));
+	}
+	pgstat_report_wait_end();
+
+	/* Close the version file. */
+	CloseTransientFile(fd);
+
+	/* Critical section done. */
+	if (!isRedo)
+		END_CRIT_SECTION();
+}
+
+/*
+ * Helper function for ScanSourceDatabasePgClassPage to prepare a single
+ * CreateDBRelInfo element from the input pg_class tuple.
+ */
+static CreateDBRelInfo *
+ScanSourceDatabasePgClassTuple(HeapTupleData *tuple, Oid tbid, Oid dbid,
+							   char *srcpath)
+{
+	CreateDBRelInfo	   *relinfo;
+	Form_pg_class		classForm;
+	Oid					relfilenode = InvalidOid;
+
+	classForm = (Form_pg_class) GETSTRUCT(tuple);
+
+	/*
+	 * If this is a shared relation, a relation without storage, or a
+	 * temporary relation, there is nothing to copy, so just return.
+	 */
+	if (classForm->reltablespace == GLOBALTABLESPACE_OID ||
+		!RELKIND_HAS_STORAGE(classForm->relkind) ||
+		classForm->relpersistence == RELPERSISTENCE_TEMP)
+		return NULL;
+
+	/*
+	 * If relfilenode is valid then directly use it.  Otherwise, consult the
+	 * relmapper for the mapped relation.
+	 */
+	if (OidIsValid(classForm->relfilenode))
+		relfilenode = classForm->relfilenode;
+	else
+		relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+														  classForm->oid);
+
+	/* We must have a valid relfilenode oid. */
+	Assert(OidIsValid(relfilenode));
+
+	/* Prepare a rel info element to return to the caller. */
+	relinfo = (CreateDBRelInfo *) palloc(sizeof(CreateDBRelInfo));
+	if (OidIsValid(classForm->reltablespace))
+		relinfo->rnode.spcNode = classForm->reltablespace;
+	else
+		relinfo->rnode.spcNode = tbid;
+
+	relinfo->rnode.dbNode = dbid;
+	relinfo->rnode.relNode = relfilenode;
+	relinfo->reloid = classForm->oid;
+
+	/* We should never reach here for the temp relations. */
+	Assert(classForm->relpersistence != RELPERSISTENCE_TEMP);
+	relinfo->permanent =
+		(classForm->relpersistence == RELPERSISTENCE_PERMANENT);
+
+	return relinfo;
+}
+
+/*
+ * Helper function for ScanSourceDatabasePgClass to identify all the valid
+ * relfilenodes for the given page.
+ */
+static List *
+ScanSourceDatabasePgClassPage(Page page, Buffer buf, Oid tbid, Oid dbid,
+							  char *srcpath, List *rnodelist,
+							  Snapshot snapshot)
+{
+	BlockNumber		blkno = BufferGetBlockNumber(buf);
+	OffsetNumber	offnum;
+	OffsetNumber	maxoff;
+	HeapTupleData	tuple;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Iterate over each tuple of the page. */
+	for (offnum = FirstOffsetNumber;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid;
+
+		itemid = PageGetItemId(page, offnum);
+
+		/* Nothing to do if slot is empty or already dead. */
+		if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid) ||
+			ItemIdIsRedirected(itemid))
+			continue;
+
+		Assert(ItemIdIsNormal(itemid));
+		ItemPointerSet(&(tuple.t_self), blkno, offnum);
+
+		/* Initialize a HeapTupleData structure. */
+		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+		tuple.t_len = ItemIdGetLength(itemid);
+		tuple.t_tableOid = RelationRelationId;
+
+		/*
+		 * If the pg_class tuple is visible then prepare a CreateDBRelInfo and
+		 * append it to the list.
+		 */
+		if (HeapTupleSatisfiesVisibility(&tuple, snapshot, buf))
+		{
+			CreateDBRelInfo *relinfo;
+
+			relinfo = ScanSourceDatabasePgClassTuple(&tuple, tbid, dbid,
+													 srcpath);
+
+			/* Add it to the list. */
+			if (relinfo != NULL)
+				rnodelist = lappend(rnodelist, relinfo);
+		}
+	}
+
+	return rnodelist;
+}
+
+/*
+ * Identify all the valid relfilenodes from the source database so that we can
+ * copy them to the destination database.  In order to identify that, this
+ * function will iterate over each block of the pg_class relation of the source
+ * database.  From there, we will check all the visible tuples in order to get
+ * a list of all the valid relfilenodes in the source database.
+ */
+static List *
+ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
+{
+	RelFileNode rnode;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	Buffer		buf;
+	Oid			relfilenode;
+	Page		page;
+	List	   *rnodelist = NIL;
+	LockRelId	relid;
+	Snapshot	snapshot;
+	SMgrRelation rd_smgr;
+	BufferAccessStrategy bstrategy;
+
+	/* Get pg_class relfilenode. */
+	relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+													  RelationRelationId);
+
+	/*
+	 * We are going to read the buffers associated with the pg_class relation.
+	 * Thus, acquire the relation level lock before starting the scan.  As we
+	 * are not connected to the database, we cannot use relation_open
+	 * directly, so we have to lock using the relation id.
+	 */
+	relid.dbId = dbid;
+	relid.relId = RelationRelationId;
+	LockRelationId(&relid, AccessShareLock);
+
+	/* Prepare a relnode for pg_class relation. */
+	rnode.spcNode = tbid;
+	rnode.dbNode = dbid;
+	rnode.relNode = relfilenode;
+
+	/*
+	 * We are not connected to the source database so open the pg_class
+	 * relation at the smgr level and get the block count.
+	 */
+	rd_smgr = smgropen(rnode, InvalidBackendId);
+	nblocks = smgrnblocks(rd_smgr, MAIN_FORKNUM);
+
+	/*
+	 * We're going to read the whole of pg_class, so it is better to use the
+	 * bulk-read buffer access strategy.
+	 */
+	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	/* Get latest snapshot for scanning the pg_class. */
+	snapshot = GetLatestSnapshot();
+
+	/* Iterate over each block of the pg_class relation. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		/*
+		 * We are not connected to the source database so directly use the
+		 * lower level bufmgr interface which operates on the rnode.
+		 */
+		buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
+										RBM_NORMAL, bstrategy, true);
+
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buf);
+		if (PageIsNew(page) || PageIsEmpty(page))
+		{
+			UnlockReleaseBuffer(buf);
+			continue;
+		}
+
+		/*
+		 * Process pg_class tuples for the current page and add all the valid
+		 * relfilenode entries to the rnodelist.
+		 */
+		rnodelist = ScanSourceDatabasePgClassPage(page, buf, tbid, dbid,
+												  srcpath, rnodelist,
+												  snapshot);
+
+		/* Release the buffer lock. */
+		UnlockReleaseBuffer(buf);
+	}
+
+	/* Release the lock. */
+	UnlockRelationId(&relid, AccessShareLock);
+
+	return rnodelist;
+}
+
+/*
+ * Create a database using the WAL_LOG strategy.  Create the target database
+ * directory and copy the data files from the source database to the target
+ * database block by block, WAL-logging all the operations.
+ */
+static void
+CreateDatabaseUsingWalLog(Oid src_dboid, Oid dst_dboid, Oid src_tsid, Oid dst_tsid)
+{
+	char	   *srcpath;
+	char	   *dstpath;
+	List	   *rnodelist = NIL;
+	ListCell   *cell;
+	LockRelId	relid;
+	RelFileNode srcrnode;
+	RelFileNode dstrnode;
+	CreateDBRelInfo *relinfo;
+
+	/* Get the source database path. */
+	srcpath = GetDatabasePath(src_dboid, src_tsid);
+
+	/* Get the destination database path. */
+	dstpath = GetDatabasePath(dst_dboid, dst_tsid);
+
+	/* Create database directory and write PG_VERSION file. */
+	CreateDirAndVersionFile(dstpath, dst_dboid, dst_tsid, false);
+
+	/* Copy relmap file from source database to the destination database. */
+	RelationMapCopy(dst_dboid, dst_tsid, srcpath, dstpath);
+
+	/* Get the list of all valid relfilenodes from the source database. */
+	rnodelist = ScanSourceDatabasePgClass(src_tsid, src_dboid, srcpath);
+	Assert(rnodelist != NIL);
+
+	/*
+	 * The database id is common to all the relations, so set it before
+	 * entering the loop.
+	 */
+	relid.dbId = src_dboid;
+
+	/*
+	 * Iterate over each relfilenode and copy the relation data block by block
+	 * from source database to the destination database.
+	 */
+	foreach(cell, rnodelist)
+	{
+		relinfo = lfirst(cell);
+		srcrnode = relinfo->rnode;
+
+		/*
+		 * If the relation is in the source db's default tablespace then we
+		 * need to create it in the destination db's default tablespace.
+		 * Otherwise, we need to create it in the same tablespace as in the
+		 * source database.
+		 */
+		if (srcrnode.spcNode == src_tsid)
+			dstrnode.spcNode = dst_tsid;
+		else
+			dstrnode.spcNode = srcrnode.spcNode;
+
+		dstrnode.dbNode = dst_dboid;
+		dstrnode.relNode = srcrnode.relNode;
+
+		/* Acquire the lock on the relation before starting the copy. */
+		relid.relId = relinfo->reloid;
+		LockRelationId(&relid, AccessShareLock);
+
+		/* Copy relation storage from source to the destination. */
+		CreateAndCopyRelationData(srcrnode, dstrnode, relinfo->permanent);
 
+		/* Release the lock. */
+		UnlockRelationId(&relid, AccessShareLock);
+	}
+
+	list_free_deep(rnodelist);
+}
+
+/*
+ * Copy source database directory to the destination directory using file
+ * system level copy operation.
+ */
+static void
+CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dst_dboid, Oid src_tsid,
+							Oid dst_tsid)
+{
+	TableScanDesc scan;
+	Relation	rel;
+	HeapTuple	tuple;
+
+	/*
+	 * Force a checkpoint before starting the copy. This will force all dirty
+	 * buffers, including those of unlogged tables, out to disk, to ensure
+	 * source database is up-to-date on disk for the copy.
+	 * FlushDatabaseBuffers() would suffice for that, but we also want to
+	 * process any pending unlink requests. Otherwise, if a checkpoint
+	 * happened while we're copying files, a file might be deleted just when
+	 * we're about to copy it, causing the lstat() call in copydir() to fail
+	 * with ENOENT.
+	 */
+	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
+					  CHECKPOINT_WAIT | CHECKPOINT_FLUSH_ALL);
+
+	/*
+	 * Iterate through all tablespaces of the template database, and copy each
+	 * one to the new database.
+	 */
+	rel = table_open(TableSpaceRelationId, AccessShareLock);
+	scan = table_beginscan_catalog(rel, 0, NULL);
+	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	{
+		Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
+		Oid			srctablespace = spaceform->oid;
+		Oid			dsttablespace;
+		char	   *srcpath;
+		char	   *dstpath;
+		struct stat st;
+
+		/* No need to copy global tablespace */
+		if (srctablespace == GLOBALTABLESPACE_OID)
+			continue;
+
+		srcpath = GetDatabasePath(src_dboid, srctablespace);
+
+		if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
+			directory_is_empty(srcpath))
+		{
+			/* Assume we can ignore it */
+			pfree(srcpath);
+			continue;
+		}
+
+		if (srctablespace == src_tsid)
+			dsttablespace = dst_tsid;
+		else
+			dsttablespace = srctablespace;
+
+		dstpath = GetDatabasePath(dst_dboid, dsttablespace);
+
+		/*
+		 * Copy this subdirectory to the new location
+		 *
+		 * We don't need to copy subdirectories
+		 */
+		copydir(srcpath, dstpath, false);
+
+		/* Record the filesystem change in XLOG */
+		{
+			xl_dbase_create_file_copy_rec xlrec;
+
+			xlrec.db_id = dst_dboid;
+			xlrec.tablespace_id = dsttablespace;
+			xlrec.src_db_id = src_dboid;
+			xlrec.src_tablespace_id = srctablespace;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec,
+							 sizeof(xl_dbase_create_file_copy_rec));
+
+			(void) XLogInsert(RM_DBASE_ID,
+							  XLOG_DBASE_CREATE_FILE_COPY | XLR_SPECIAL_REL_UPDATE);
+		}
+	}
+	table_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	/*
+	 * We force a checkpoint before committing.  This effectively means that
+	 * committed XLOG_DBASE_CREATE_FILE_COPY operations will never need to be
+	 * replayed (at least not in ordinary crash recovery; we still have to
+	 * make the XLOG entry for the benefit of PITR operations). This avoids
+	 * two nasty scenarios:
+	 *
+	 * #1: When PITR is off, we don't XLOG the contents of newly created
+	 * indexes; therefore the drop-and-recreate-whole-directory behavior of
+	 * DBASE_CREATE replay would lose such indexes.
+	 *
+	 * #2: Since we have to recopy the source database during DBASE_CREATE
+	 * replay, we run the risk of copying changes in it that were committed
+	 * after the original CREATE DATABASE command but before the system crash
+	 * that led to the replay.  This is at least unexpected and at worst could
+	 * lead to inconsistencies, eg duplicate table names.
+	 *
+	 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
+	 *
+	 * In PITR replay, the first of these isn't an issue, and the second is
+	 * only a risk if the CREATE DATABASE and subsequent template database
+	 * change both occur while a base backup is being taken. There doesn't
+	 * seem to be much we can do about that except document it as a
+	 * limitation.
+	 *
+	 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way, we
+	 * can avoid this.
+	 */
+	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+}
 
 /*
  * CREATE DATABASE
@@ -100,8 +627,6 @@ static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
 Oid
 createdb(ParseState *pstate, const CreatedbStmt *stmt)
 {
-	TableScanDesc scan;
-	Relation	rel;
 	Oid			src_dboid;
 	Oid			src_owner;
 	int			src_encoding = -1;
@@ -132,6 +657,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	DefElem    *dallowconnections = NULL;
 	DefElem    *dconnlimit = NULL;
 	DefElem    *dcollversion = NULL;
+	DefElem    *dstrategy = NULL;
 	char	   *dbname = stmt->dbname;
 	char	   *dbowner = NULL;
 	const char *dbtemplate = NULL;
@@ -145,6 +671,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	char	   *dbcollversion = NULL;
 	int			notherbackends;
 	int			npreparedxacts;
+	CreateDBStrategy dbstrategy = CREATEDB_WAL_LOG;
 	createdb_failure_params fparms;
 
 	/* Extract options from the statement node tree */
@@ -250,6 +777,12 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 						(errcode(ERRCODE_INVALID_PARAMETER_VALUE)),
 						errmsg("OIDs less than %u are reserved for system objects", FirstNormalObjectId));
 		}
+		else if (strcmp(defel->defname, "strategy") == 0)
+		{
+			if (dstrategy)
+				errorConflictingDefElem(defel, pstate);
+			dstrategy = defel;
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -374,6 +907,23 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 							dbtemplate)));
 	}
 
+	/* Validate the database creation strategy. */
+	if (dstrategy && dstrategy->arg)
+	{
+		char	   *strategy;
+
+		strategy = defGetString(dstrategy);
+		if (strcmp(strategy, "wal_log") == 0)
+			dbstrategy = CREATEDB_WAL_LOG;
+		else if (strcmp(strategy, "file_copy") == 0)
+			dbstrategy = CREATEDB_FILE_COPY;
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("invalid create database strategy \"%s\"", strategy),
+					 errhint("Valid strategies are \"wal_log\" and \"file_copy\".")));
+	}
+
 	/* If encoding or locales are defaulted, use source's setting */
 	if (encoding < 0)
 		encoding = src_encoding;
@@ -668,19 +1218,6 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	InvokeObjectPostCreateHook(DatabaseRelationId, dboid, 0);
 
 	/*
-	 * Force a checkpoint before starting the copy. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, if a checkpoint
-	 * happened while we're copying files, a file might be deleted just when
-	 * we're about to copy it, causing the lstat() call in copydir() to fail
-	 * with ENOENT.
-	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
-
-	/*
 	 * Once we start copying subdirectories, we need to be able to clean 'em
 	 * up if we fail.  Use an ENSURE block to make sure this happens.  (This
 	 * is not a 100% solution, because of the possibility of failure during
@@ -689,101 +1226,24 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	 */
 	fparms.src_dboid = src_dboid;
 	fparms.dest_dboid = dboid;
+	fparms.strategy = dbstrategy;
+
 	PG_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
 		/*
-		 * Iterate through all tablespaces of the template database, and copy
-		 * each one to the new database.
-		 */
-		rel = table_open(TableSpaceRelationId, AccessShareLock);
-		scan = table_beginscan_catalog(rel, 0, NULL);
-		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
-		{
-			Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
-			Oid			srctablespace = spaceform->oid;
-			Oid			dsttablespace;
-			char	   *srcpath;
-			char	   *dstpath;
-			struct stat st;
-
-			/* No need to copy global tablespace */
-			if (srctablespace == GLOBALTABLESPACE_OID)
-				continue;
-
-			srcpath = GetDatabasePath(src_dboid, srctablespace);
-
-			if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
-				directory_is_empty(srcpath))
-			{
-				/* Assume we can ignore it */
-				pfree(srcpath);
-				continue;
-			}
-
-			if (srctablespace == src_deftablespace)
-				dsttablespace = dst_deftablespace;
-			else
-				dsttablespace = srctablespace;
-
-			dstpath = GetDatabasePath(dboid, dsttablespace);
-
-			/*
-			 * Copy this subdirectory to the new location
-			 *
-			 * We don't need to copy subdirectories
-			 */
-			copydir(srcpath, dstpath, false);
-
-			/* Record the filesystem change in XLOG */
-			{
-				xl_dbase_create_rec xlrec;
-
-				xlrec.db_id = dboid;
-				xlrec.tablespace_id = dsttablespace;
-				xlrec.src_db_id = src_dboid;
-				xlrec.src_tablespace_id = srctablespace;
-
-				XLogBeginInsert();
-				XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-				(void) XLogInsert(RM_DBASE_ID,
-								  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-			}
-		}
-		table_endscan(scan);
-		table_close(rel, AccessShareLock);
-
-		/*
-		 * We force a checkpoint before committing.  This effectively means
-		 * that committed XLOG_DBASE_CREATE operations will never need to be
-		 * replayed (at least not in ordinary crash recovery; we still have to
-		 * make the XLOG entry for the benefit of PITR operations). This
-		 * avoids two nasty scenarios:
-		 *
-		 * #1: When PITR is off, we don't XLOG the contents of newly created
-		 * indexes; therefore the drop-and-recreate-whole-directory behavior
-		 * of DBASE_CREATE replay would lose such indexes.
-		 *
-		 * #2: Since we have to recopy the source database during DBASE_CREATE
-		 * replay, we run the risk of copying changes in it that were
-		 * committed after the original CREATE DATABASE command but before the
-		 * system crash that led to the replay.  This is at least unexpected
-		 * and at worst could lead to inconsistencies, eg duplicate table
-		 * names.
-		 *
-		 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
-		 *
-		 * In PITR replay, the first of these isn't an issue, and the second
-		 * is only a risk if the CREATE DATABASE and subsequent template
-		 * database change both occur while a base backup is being taken.
-		 * There doesn't seem to be much we can do about that except document
-		 * it as a limitation.
-		 *
-		 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
-		 * we can avoid this.
+		 * If the user has asked to create a database with the WAL_LOG
+		 * strategy then call CreateDatabaseUsingWalLog, which will copy the
+		 * database at the block level and WAL-log each copied block.
+		 * Otherwise, call CreateDatabaseUsingFileCopy, which will copy the
+		 * database file by file.
 		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+		if (dbstrategy == CREATEDB_WAL_LOG)
+			CreateDatabaseUsingWalLog(src_dboid, dboid, src_deftablespace,
+									  dst_deftablespace);
+		else
+			CreateDatabaseUsingFileCopy(src_dboid, dboid, src_deftablespace,
+										dst_deftablespace);
 
 		/*
 		 * Close pg_database, but keep lock till commit.
@@ -870,6 +1330,21 @@ createdb_failure_callback(int code, Datum arg)
 	createdb_failure_params *fparms = (createdb_failure_params *) DatumGetPointer(arg);
 
 	/*
+	 * If we were copying the database at the block level then drop pages for
+	 * the destination database that are in the shared buffer cache, and tell
+	 * the checkpointer to forget any pending fsync and unlink requests for
+	 * files in the database.  The reasoning behind this is the same as
+	 * explained in dropdb().  But unlike dropdb() we don't need to call
+	 * pgstat_drop_database() because the database has not been created yet,
+	 * so there should not be any stats for it.
+	 */
+	if (fparms->strategy == CREATEDB_WAL_LOG)
+	{
+		DropDatabaseBuffers(fparms->dest_dboid);
+		ForgetDatabaseSyncRequests(fparms->dest_dboid);
+	}
+
+	/*
 	 * Release lock on source database before doing recursive remove. This is
 	 * not essential but it seems desirable to release the lock as soon as
 	 * possible.
@@ -1393,7 +1868,7 @@ movedb(const char *dbname, const char *tblspcname)
 		 * Record the filesystem change in XLOG
 		 */
 		{
-			xl_dbase_create_rec xlrec;
+			xl_dbase_create_file_copy_rec xlrec;
 
 			xlrec.db_id = db_id;
 			xlrec.tablespace_id = dst_tblspcoid;
@@ -1401,10 +1876,11 @@ movedb(const char *dbname, const char *tblspcname)
 			xlrec.src_tablespace_id = src_tblspcoid;
 
 			XLogBeginInsert();
-			XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
+			XLogRegisterData((char *) &xlrec,
+							 sizeof(xl_dbase_create_file_copy_rec));
 
 			(void) XLogInsert(RM_DBASE_ID,
-							  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
+							  XLOG_DBASE_CREATE_FILE_COPY | XLR_SPECIAL_REL_UPDATE);
 		}
 
 		/*
@@ -1440,9 +1916,10 @@ movedb(const char *dbname, const char *tblspcname)
 
 		/*
 		 * Force another checkpoint here.  As in CREATE DATABASE, this is to
-		 * ensure that we don't have to replay a committed XLOG_DBASE_CREATE
-		 * operation, which would cause us to lose any unlogged operations
-		 * done in the new DB tablespace before the next checkpoint.
+		 * ensure that we don't have to replay a committed
+		 * XLOG_DBASE_CREATE_FILE_COPY operation, which would cause us to lose
+		 * any unlogged operations done in the new DB tablespace before the
+		 * next checkpoint.
 		 */
 		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
 
@@ -2377,9 +2854,10 @@ dbase_redo(XLogReaderState *record)
 	/* Backup blocks are not used in dbase records */
 	Assert(!XLogRecHasAnyBlockRefs(record));
 
-	if (info == XLOG_DBASE_CREATE)
+	if (info == XLOG_DBASE_CREATE_FILE_COPY)
 	{
-		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
+		xl_dbase_create_file_copy_rec *xlrec =
+		(xl_dbase_create_file_copy_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
 		struct stat st;
@@ -2414,6 +2892,18 @@ dbase_redo(XLogReaderState *record)
 		 */
 		copydir(src_path, dst_path, false);
 	}
+	else if (info == XLOG_DBASE_CREATE_WAL_LOG)
+	{
+		xl_dbase_create_wal_log_rec *xlrec =
+		(xl_dbase_create_wal_log_rec *) XLogRecGetData(record);
+		char	   *dbpath;
+
+		dbpath = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+
+		/* Create the database directory with the version file. */
+		CreateDirAndVersionFile(dbpath, xlrec->db_id, xlrec->tablespace_id,
+								true);
+	}
 	else if (info == XLOG_DBASE_DROP)
 	{
 		xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) XLogRecGetData(record);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3cadcd2..b1cebc4 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -38,6 +38,7 @@
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
 #include "executor/instrument.h"
 #include "lib/binaryheap.h"
 #include "miscadmin.h"
@@ -486,6 +487,9 @@ static void FindAndDropRelFileNodeBuffers(RelFileNode rnode,
 										  ForkNumber forkNum,
 										  BlockNumber nForkBlock,
 										  BlockNumber firstDelBlock);
+static void RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+										   ForkNumber forkNum,
+										   bool permanent);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int	rnode_comparator(const void *p1, const void *p2);
@@ -3679,6 +3683,158 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
 }
 
 /* ---------------------------------------------------------------------
+ *		RelationCopyStorageUsingBuffer
+ *
+ *		Copy fork's data using bufmgr.  Same as RelationCopyStorage but instead
+ *		of using smgrread and smgrextend this will copy using bufmgr APIs.
+ *
+ *		Refer to the comments atop CreateAndCopyRelationData() for details
+ *		about the 'permanent' parameter.
+ * --------------------------------------------------------------------
+ */
+static void
+RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+							   ForkNumber forkNum, bool permanent)
+{
+	Buffer		srcBuf;
+	Buffer		dstBuf;
+	Page		srcPage;
+	Page		dstPage;
+	bool		use_wal;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	BufferAccessStrategy bstrategy_src;
+	BufferAccessStrategy bstrategy_dst;
+
+	/*
+	 * We need to log the copied data in WAL iff WAL archiving/streaming is
+	 * enabled and the relation is persistent, or this is the init fork of an
+	 * unlogged relation.
+	 */
+	use_wal = XLogIsNeeded() && (permanent || forkNum == INIT_FORKNUM);
+
+	/* Get number of blocks in the source relation. */
+	nblocks = smgrnblocks(src, forkNum);
+
+	/* Nothing to copy; just return. */
+	if (nblocks == 0)
+		return;
+
+	/*
+	 * We are going to copy whole relation from the source to the destination
+	 * so use BAS_BULKREAD strategy for the source relation and BAS_BULKWRITE
+	 * strategy for the destination relation.
+	 */
+	bstrategy_src = GetAccessStrategy(BAS_BULKREAD);
+	bstrategy_dst = GetAccessStrategy(BAS_BULKWRITE);
+
+	/* Iterate over each block of the source relation file. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		/* If we got a cancel signal during the copy of the data, quit */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Read block from source relation. */
+		srcBuf = ReadBufferWithoutRelcache(src->smgr_rnode.node, forkNum,
+										   blkno, RBM_NORMAL, bstrategy_src,
+										   permanent);
+		srcPage = BufferGetPage(srcBuf);
+		if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
+		{
+			ReleaseBuffer(srcBuf);
+			continue;
+		}
+
+		/* Use P_NEW to extend the relation. */
+		dstBuf = ReadBufferWithoutRelcache(dst->smgr_rnode.node, forkNum,
+										   P_NEW, RBM_NORMAL, bstrategy_dst,
+										   permanent);
+		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
+
+		START_CRIT_SECTION();
+
+		/* Initialize the page and write the data. */
+		dstPage = BufferGetPage(dstBuf);
+		PageInit(dstPage, BufferGetPageSize(dstBuf), 0);
+		memcpy(dstPage, srcPage, BLCKSZ);
+		MarkBufferDirty(dstBuf);
+
+		/* WAL-log the copied page. */
+		if (use_wal)
+			log_newpage_buffer(dstBuf, true);
+
+		END_CRIT_SECTION();
+
+		UnlockReleaseBuffer(dstBuf);
+		ReleaseBuffer(srcBuf);
+	}
+}
+
+/* ---------------------------------------------------------------------
+ *		CreateAndCopyRelationData
+ *
+ *		Create destination relation storage and copy the data of all forks
+ *		of the source relation to the destination.
+ *
+ *		Currently this API is not supported for temporary relations.  So
+ *		pass permanent as true for a regular relation and false for an
+ *		unlogged relation.
+ * --------------------------------------------------------------------
+ */
+void
+CreateAndCopyRelationData(RelFileNode src_rnode, RelFileNode dst_rnode,
+						  bool permanent)
+{
+	SMgrRelation	src_smgr;
+	SMgrRelation	dst_smgr;
+	char			relpersistence;
+
+	/* Open the source relation at smgr level. */
+	src_smgr = smgropen(src_rnode, InvalidBackendId);
+
+	/* Set the relpersistence. */
+	relpersistence = permanent ?
+		RELPERSISTENCE_PERMANENT : RELPERSISTENCE_UNLOGGED;
+
+	/*
+	 * Create and copy all forks of the relation.
+	 *
+	 * NOTE: any conflict in relfilenode value will be caught in
+	 * RelationCreateStorage().
+	 */
+	dst_smgr = RelationCreateStorage(dst_rnode, relpersistence);
+
+	/* copy main fork */
+	RelationCopyStorageUsingBuffer(src_smgr, dst_smgr, MAIN_FORKNUM,
+								   permanent);
+
+	/* copy those extra forks that exist */
+	for (ForkNumber forkNum = MAIN_FORKNUM + 1;
+		 forkNum <= MAX_FORKNUM; forkNum++)
+	{
+		if (smgrexists(src_smgr, forkNum))
+		{
+			smgrcreate(dst_smgr, forkNum, false);
+
+			/*
+			 * WAL log creation if the relation is persistent, or this is the
+			 * init fork of an unlogged relation.
+			 */
+			if (permanent || forkNum == INIT_FORKNUM)
+				log_smgrcreate(&dst_rnode, forkNum);
+
+			/* Copy a fork's data, block by block. */
+			RelationCopyStorageUsingBuffer(src_smgr, dst_smgr, forkNum,
+										   permanent);
+		}
+	}
+
+	/* Close the smgr rel */
+	smgrclose(src_smgr);
+	smgrclose(dst_smgr);
+}
+
+/* ---------------------------------------------------------------------
  *		FlushDatabaseBuffers
  *
  *		This function writes all dirty pages of a database out to disk
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 56df08c..d5cf9ed 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -370,7 +370,7 @@ extractPageInfo(XLogReaderState *record)
 
 	/* Is this a special record type that I recognize? */
 
-	if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_CREATE)
+	if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_CREATE_FILE_COPY)
 	{
 		/*
 		 * New databases can be safely ignored. It won't be present in the
@@ -382,6 +382,13 @@ extractPageInfo(XLogReaderState *record)
 		 * overwriting the database created in the target system.
 		 */
 	}
+	else if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_CREATE_WAL_LOG)
+	{
+		/*
+		 * New databases can be safely ignored. It won't be present in the
+		 * source system, so it will be deleted.
+		 */
+	}
 	else if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_DROP)
 	{
 		/*
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 1717282..d0e3755 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2738,10 +2738,12 @@ psql_completion(const char *text, int start, int end)
 		COMPLETE_WITH("OWNER", "TEMPLATE", "ENCODING", "TABLESPACE",
 					  "IS_TEMPLATE",
 					  "ALLOW_CONNECTIONS", "CONNECTION LIMIT",
-					  "LC_COLLATE", "LC_CTYPE", "LOCALE", "OID");
+					  "LC_COLLATE", "LC_CTYPE", "LOCALE", "OID", "STRATEGY");
 
 	else if (Matches("CREATE", "DATABASE", MatchAny, "TEMPLATE"))
 		COMPLETE_WITH_QUERY(Query_for_list_of_template_databases);
+	else if (Matches("CREATE", "DATABASE", MatchAny, "STRATEGY"))
+		COMPLETE_WITH("WAL_LOG", "FILE_COPY");
 
 	/* CREATE DOMAIN */
 	else if (Matches("CREATE", "DOMAIN", MatchAny))
diff --git a/src/include/commands/dbcommands_xlog.h b/src/include/commands/dbcommands_xlog.h
index 593a857..077a000 100644
--- a/src/include/commands/dbcommands_xlog.h
+++ b/src/include/commands/dbcommands_xlog.h
@@ -18,17 +18,31 @@
 #include "lib/stringinfo.h"
 
 /* record types */
-#define XLOG_DBASE_CREATE		0x00
-#define XLOG_DBASE_DROP			0x10
+#define XLOG_DBASE_CREATE_FILE_COPY		0x00
+#define XLOG_DBASE_CREATE_WAL_LOG		0x10
+#define XLOG_DBASE_DROP					0x20
 
-typedef struct xl_dbase_create_rec
+/*
+ * Records copying of a single subdirectory incl. contents, while creating a
+ * database using FILE COPY strategy.
+ */
+typedef struct xl_dbase_create_file_copy_rec
 {
-	/* Records copying of a single subdirectory incl. contents */
 	Oid			db_id;
 	Oid			tablespace_id;
 	Oid			src_db_id;
 	Oid			src_tablespace_id;
-} xl_dbase_create_rec;
+} xl_dbase_create_file_copy_rec;
+
+/*
+ * Records creating a database directory with version file, while creating a
+ * database using WAL LOG strategy.
+ */
+typedef struct xl_dbase_create_wal_log_rec
+{
+	Oid			db_id;
+	Oid			tablespace_id;
+} xl_dbase_create_wal_log_rec;
 
 typedef struct xl_dbase_drop_rec
 {
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index fd0452f..a6b657f 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -204,6 +204,9 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
 extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels);
+extern void CreateAndCopyRelationData(RelFileNode src_rnode,
+									  RelFileNode dst_rnode,
+									  bool permanent);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
 								   int nforks, BlockNumber *firstDelBlock);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index eaf3e7a..0f01356 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -460,6 +460,8 @@ CoverPos
 CreateAmStmt
 CreateCastStmt
 CreateConversionStmt
+CreateDBRelInfo
+CreateDBStrategy
 CreateDomainStmt
 CreateEnumStmt
 CreateEventTrigStmt
@@ -3694,7 +3696,8 @@ xl_btree_update
 xl_btree_vacuum
 xl_clog_truncate
 xl_commit_ts_truncate
-xl_dbase_create_rec
+xl_dbase_create_file_copy_rec
+xl_dbase_create_wal_log_rec
 xl_dbase_drop_rec
 xl_end_of_recovery
 xl_hash_add_ovfl_page
-- 
1.8.3.1
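
The WAL-logging decision made in RelationCopyStorageUsingBuffer() in the patch above can be sketched in isolation. This is only an illustrative model, not PostgreSQL code: the `Sketch*` names are hypothetical, and the `xlog_needed` argument stands in for the result of XLogIsNeeded().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for PostgreSQL's ForkNumber values. */
typedef enum SketchFork
{
	SKETCH_MAIN_FORK,
	SKETCH_FSM_FORK,
	SKETCH_VM_FORK,
	SKETCH_INIT_FORK
} SketchFork;

/*
 * Mirrors the condition used in the patch:
 *   use_wal = XLogIsNeeded() && (permanent || forkNum == INIT_FORKNUM)
 *
 * A copied block is WAL-logged only when WAL is needed at all and the
 * relation is permanent, or when the block belongs to the init fork of
 * an unlogged relation.
 */
bool
sketch_copy_needs_wal(bool xlog_needed, bool permanent, SketchFork fork)
{
	return xlog_needed && (permanent || fork == SKETCH_INIT_FORK);
}
```

Under this model, the main fork of an unlogged relation is copied without WAL, while its init fork is still logged so that the relation can be reset after a crash.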

v15-0004-New-interface-to-lock-relation-id.patch (text/x-patch; charset=US-ASCII)

From 1e3e2996bb42cc6b3b7c49b2494b54251497535b Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 24 Sep 2021 18:29:17 +0530
Subject: [PATCH v15 4/6] New interface to lock relation id

Currently, we have LockRelationOid, which provides a mechanism to
lock a relation by OID, but we must be connected to the database
to which the relation belongs.  This patch provides a new interface
that can lock the relation even if we are not connected to the
containing database.
---
 src/backend/storage/lmgr/lmgr.c | 28 ++++++++++++++++++++++++++++
 src/include/storage/lmgr.h      |  1 +
 2 files changed, 29 insertions(+)

diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 5ae52dd..1543da6 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -176,6 +176,34 @@ ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode)
 }
 
 /*
+ *		LockRelationId
+ *
+ * Lock, given a LockRelId.  Same as LockRelationOid but takes a LockRelId as
+ * input.
+ */
+void
+LockRelationId(LockRelId *relid, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+	LOCALLOCK  *locallock;
+	LockAcquireResult res;
+
+	SET_LOCKTAG_RELATION(tag, relid->dbId, relid->relId);
+
+	res = LockAcquireExtended(&tag, lockmode, false, false, true, &locallock);
+
+	/*
+	 * Now that we have the lock, check for invalidation messages; see notes
+	 * in LockRelationOid.
+	 */
+	if (res != LOCKACQUIRE_ALREADY_CLEAR)
+	{
+		AcceptInvalidationMessages();
+		MarkLockClear(locallock);
+	}
+}
+
+/*
  *		UnlockRelationId
  *
  * Unlock, given a LockRelId.  This is preferred over UnlockRelationOid
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 49edbcc..be1d2c9 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -38,6 +38,7 @@ extern void RelationInitLockInfo(Relation relation);
 
 /* Lock a relation */
 extern void LockRelationOid(Oid relid, LOCKMODE lockmode);
+extern void LockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern bool ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode);
 extern void UnlockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern void UnlockRelationOid(Oid relid, LOCKMODE lockmode);
-- 
1.8.3.1
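
Why an explicit database OID matters for the lock can be sketched with a simplified tag. This is a rough model under stated assumptions: `SketchLockTag` is hypothetical and keeps only the two fields that LockRelationId() in the patch above passes to SET_LOCKTAG_RELATION; the real LOCKTAG carries more fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t SketchOid;

/* Hypothetical, simplified lock tag: database OID plus relation OID. */
typedef struct SketchLockTag
{
	SketchOid	db_id;
	SketchOid	rel_id;
} SketchLockTag;

/* Build a tag from an explicit (database, relation) pair. */
SketchLockTag
sketch_relation_tag(SketchOid db_id, SketchOid rel_id)
{
	SketchLockTag tag = {db_id, rel_id};

	return tag;
}

/* Two tags name the same lockable object only if both fields match. */
bool
sketch_same_lock_object(SketchLockTag a, SketchLockTag b)
{
	return a.db_id == b.db_id && a.rel_id == b.rel_id;
}
```

In this model, pg_class (OID 1259) in the source database and pg_class in the database the backend is connected to are different lock objects, which is why the caller has to supply the database OID instead of relying on the current database.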

v15-0002-Extend-relmap-interfaces.patch (text/x-patch; charset=US-ASCII)

From 83a0e0539c2d5affa16addd273ba159dca557dd1 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Tue, 15 Mar 2022 09:18:52 +0530
Subject: [PATCH v15 2/6] Extend relmap interfaces

Support new interfaces in relmapper: 1) an interface for copying the
relmap file from one database path to another database path, and
2) an interface for getting the filenode from an OID.  We already
have RelationMapOidToFilenode for the same purpose, but that assumes
we are connected to the database for which we want to get the mapping.
The new interface does the same, but gets the mapping for the input
database instead.

These interfaces are required by the next patch, which supports
WAL-logged database creation.
---
 src/backend/utils/cache/relmapper.c | 60 +++++++++++++++++++++++++++++++++++++
 src/include/utils/relmapper.h       |  4 ++-
 2 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index c3fef70..d2e7890 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -252,6 +252,60 @@ RelationMapFilenodeToOid(Oid filenode, bool shared)
 }
 
 /*
+ * RelationMapOidToFilenodeForDatabase
+ *
+ * Same as RelationMapOidToFilenode, but instead of reading the mapping from
+ * the database we are connected to it will read the mapping from the input
+ * database.
+ */
+Oid
+RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId)
+{
+	RelMapFile	map;
+	int			i;
+
+	/* Read the relmap file from the source database. */
+	read_relmap_file(&map, dbpath, false, ERROR);
+
+	/* Iterate over the relmap entries to find the input relation oid. */
+	for (i = 0; i < map.num_mappings; i++)
+	{
+		if (relationId == map.mappings[i].mapoid)
+			return map.mappings[i].mapfilenode;
+	}
+
+	return InvalidOid;
+}
+
+/*
+ * RelationMapCopy
+ *
+ * Copy relmapfile from source db path to the destination db path and WAL log
+ * the operation.
+ */
+void
+RelationMapCopy(Oid dbid, Oid tsid, char *srcdbpath, char *dstdbpath)
+{
+	RelMapFile map;
+
+	/*
+	 * Read the relmap file from the source database.  This function is only
+	 * called during the create database, so elevel can be ERROR.
+	 */
+	read_relmap_file(&map, srcdbpath, false, ERROR);
+
+	/*
+	 * Write map contents into the destination database's relmap file.  No
+	 * sinval is needed: we are creating a new file for a new database, so
+	 * no one else can be accessing it, and for the same reason we don't
+	 * need to acquire RelationMappingLock.  We also don't need to preserve
+	 * files because, if an error occurs while creating the database, the
+	 * relation files will be deleted anyway.
+	 */
+	write_relmap_file(&map, true, false, false, dbid, tsid, dstdbpath);
+}
+
+/*
  * RelationMapUpdateMap
  *
  * Install a new relfilenode mapping for the specified relation.
@@ -1033,6 +1087,12 @@ relmap_redo(XLogReaderState *record)
 		 *
 		 * There shouldn't be anyone else updating relmaps during WAL replay,
 		 * but grab the lock to interlock against load_relmap_file().
+		 *
+		 * Note - this WAL record is also written when copying the relmap
+		 * file while creating a database.  In that case there is no need to
+		 * acquire the relmap lock or send sinval, but avoiding them would
+		 * require an extra flag in the WAL record.  So just grab the lock
+		 * and send sinval, because there is no harm in that.
 		 */
 		LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
 		write_relmap_file(&newmap, false, true, false,
diff --git a/src/include/utils/relmapper.h b/src/include/utils/relmapper.h
index 9fbb5a7..f10353e 100644
--- a/src/include/utils/relmapper.h
+++ b/src/include/utils/relmapper.h
@@ -38,7 +38,9 @@ typedef struct xl_relmap_update
 extern Oid	RelationMapOidToFilenode(Oid relationId, bool shared);
 
 extern Oid	RelationMapFilenodeToOid(Oid relationId, bool shared);
-
+extern Oid RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId);
+extern void RelationMapCopy(Oid dbid, Oid tsid, char *srcdbpath,
+							char *dstdbpath);
 extern void RelationMapUpdateMap(Oid relationId, Oid fileNode, bool shared,
 								 bool immediate);
 
-- 
1.8.3.1
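
The lookup that RelationMapOidToFilenodeForDatabase() performs in the patch above, once the map has been read, is a linear scan over the mapping entries. A minimal standalone sketch follows; the `Sketch*` types are hypothetical simplifications of RelMapFile, and the fixed array size of 8 is chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t SketchOid;

#define SKETCH_INVALID_OID ((SketchOid) 0)

/* One entry: catalog OID and the filenode currently assigned to it. */
typedef struct SketchMapping
{
	SketchOid	mapoid;
	SketchOid	mapfilenode;
} SketchMapping;

/* Hypothetical in-memory image of a relation map file. */
typedef struct SketchMapFile
{
	int			num_mappings;
	SketchMapping mappings[8];
} SketchMapFile;

/*
 * Return the filenode mapped to relation_id, or SKETCH_INVALID_OID when
 * the relation is not a mapped one.
 */
SketchOid
sketch_oid_to_filenode(const SketchMapFile *map, SketchOid relation_id)
{
	for (int i = 0; i < map->num_mappings; i++)
	{
		if (map->mappings[i].mapoid == relation_id)
			return map->mappings[i].mapfilenode;
	}

	return SKETCH_INVALID_OID;
}
```

The difference from RelationMapOidToFilenode() is only in where the map comes from: here the caller supplies it, so nothing depends on the database the backend is connected to.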

v15-0006-Support-create-database-strategy-in-createdb-too.patch (text/x-patch; charset=US-ASCII)

From 32ba4920455f2bfa092a034bcf6268bce35a03e3 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 11 Mar 2022 11:48:55 +0530
Subject: [PATCH v15 6/6] Support create database strategy in createdb tool

---
 doc/src/sgml/ref/createdb.sgml    | 16 ++++++++++++++++
 src/bin/scripts/createdb.c        | 10 +++++++++-
 src/bin/scripts/t/020_createdb.pl | 20 ++++++++++++++++++++
 3 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/createdb.sgml b/doc/src/sgml/ref/createdb.sgml
index 8647345..2a7beca 100644
--- a/doc/src/sgml/ref/createdb.sgml
+++ b/doc/src/sgml/ref/createdb.sgml
@@ -159,6 +159,22 @@ PostgreSQL documentation
      </varlistentry>
 
      <varlistentry>
+      <term><option>-S <replaceable class="parameter">strategy</replaceable></option></term>
+      <term><option>--strategy=<replaceable class="parameter">strategy</replaceable></option></term>
+      <listitem>
+       <para>
+        Specifies the database creation strategy.  Currently, there are two
+        strategies: <literal>WAL_LOG</literal> and
+        <literal>FILE_COPY</literal>.  If the <literal>WAL_LOG</literal>
+        strategy is used, the database is copied block by block and each
+        copied block is WAL-logged.  If the <literal>FILE_COPY</literal>
+        strategy is used, a file-system-level copy is made, so individual
+        blocks are not WAL-logged.  The default strategy is <literal>WAL_LOG</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><option>-T <replaceable class="parameter">template</replaceable></option></term>
       <term><option>--template=<replaceable class="parameter">template</replaceable></option></term>
       <listitem>
diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c
index b0c6805..9d3c4ef 100644
--- a/src/bin/scripts/createdb.c
+++ b/src/bin/scripts/createdb.c
@@ -37,6 +37,7 @@ main(int argc, char *argv[])
 		{"lc-collate", required_argument, NULL, 1},
 		{"lc-ctype", required_argument, NULL, 2},
 		{"locale", required_argument, NULL, 'l'},
+		{"strategy", required_argument, NULL, 'S'},
 		{"maintenance-db", required_argument, NULL, 3},
 		{NULL, 0, NULL, 0}
 	};
@@ -61,6 +62,7 @@ main(int argc, char *argv[])
 	char	   *lc_collate = NULL;
 	char	   *lc_ctype = NULL;
 	char	   *locale = NULL;
+	char	   *strategy = NULL;
 
 	PQExpBufferData sql;
 
@@ -73,7 +75,7 @@ main(int argc, char *argv[])
 
 	handle_help_version_opts(argc, argv, "createdb", help);
 
-	while ((c = getopt_long(argc, argv, "h:p:U:wWeO:D:T:E:l:", long_options, &optindex)) != -1)
+	while ((c = getopt_long(argc, argv, "h:p:U:wWeO:D:T:E:l:S:", long_options, &optindex)) != -1)
 	{
 		switch (c)
 		{
@@ -119,6 +121,9 @@ main(int argc, char *argv[])
 			case 3:
 				maintenance_db = pg_strdup(optarg);
 				break;
+			case 'S':
+				strategy = pg_strdup(optarg);
+				break;
 			default:
 				fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
 				exit(1);
@@ -217,6 +222,8 @@ main(int argc, char *argv[])
 		appendPQExpBufferStr(&sql, " LC_CTYPE ");
 		appendStringLiteralConn(&sql, lc_ctype, conn);
 	}
+	if (strategy)
+		appendPQExpBuffer(&sql, " STRATEGY %s ", fmtId(strategy));
 
 	appendPQExpBufferChar(&sql, ';');
 
@@ -274,6 +281,7 @@ help(const char *progname)
 	printf(_("      --lc-collate=LOCALE      LC_COLLATE setting for the database\n"));
 	printf(_("      --lc-ctype=LOCALE        LC_CTYPE setting for the database\n"));
 	printf(_("  -O, --owner=OWNER            database user to own the new database\n"));
+	printf(_("  -S, --strategy=STRATEGY      database creation strategy wal_log or file_copy\n"));
 	printf(_("  -T, --template=TEMPLATE      template database to copy\n"));
 	printf(_("  -V, --version                output version information, then exit\n"));
 	printf(_("  -?, --help                   show this help, then exit\n"));
diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index 6392454..ccfbe17 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -76,4 +76,24 @@ $node->command_checks_all(
 	],
 	'createdb with incorrect --lc-ctype');
 
+$node->command_checks_all(
+	[ 'createdb', '--strategy', "foo", 'foobar2' ],
+	1,
+	[qr/^$/],
+	[
+		qr/^createdb: error: database creation failed: ERROR:  invalid create database strategy|^createdb: error: database creation failed: ERROR:  invalid create database strategy foo/s
+	],
+	'createdb with incorrect --strategy');
+
+# Check database creation strategy
+$node->issues_sql_like(
+	[ 'createdb', '-T', 'foobar2', 'foobar4', '-S', 'wal_log'],
+	qr/statement: CREATE DATABASE foobar4 TEMPLATE foobar2 STRATEGY wal_log/,
+	'create database with WAL_LOG strategy');
+
+$node->issues_sql_like(
+	[ 'createdb', '-T', 'foobar2', 'foobar5', '-S', 'file_copy'],
+	qr/statement: CREATE DATABASE foobar5 TEMPLATE foobar2 STRATEGY file_copy/,
+	'create database with FILE_COPY strategy');
+
 done_testing();
-- 
1.8.3.1
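
How the createdb tool change above turns the new option into SQL can be sketched with plain snprintf(). This is an approximation under stated assumptions: the real tool uses PQExpBuffer and quotes identifiers with fmtId(), neither of which is modelled here, and `sketch_build_createdb_sql` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "CREATE DATABASE name [TEMPLATE t] [STRATEGY s];" into buf.
 * template_db and strategy may be NULL, in which case the clause is
 * omitted.  Returns 0 on success, -1 if buf is too small.
 */
int
sketch_build_createdb_sql(char *buf, size_t buflen, const char *dbname,
						  const char *template_db, const char *strategy)
{
	size_t		used;
	int			n;

	n = snprintf(buf, buflen, "CREATE DATABASE %s", dbname);
	if (n < 0 || (size_t) n >= buflen)
		return -1;
	used = (size_t) n;

	if (template_db != NULL)
	{
		n = snprintf(buf + used, buflen - used, " TEMPLATE %s", template_db);
		if (n < 0 || (size_t) n >= buflen - used)
			return -1;
		used += (size_t) n;
	}

	/* The clause is emitted only when a strategy was given on the command line. */
	if (strategy != NULL)
	{
		n = snprintf(buf + used, buflen - used, " STRATEGY %s", strategy);
		if (n < 0 || (size_t) n >= buflen - used)
			return -1;
		used += (size_t) n;
	}

	n = snprintf(buf + used, buflen - used, ";");
	if (n < 0 || (size_t) n >= buflen - used)
		return -1;

	return 0;
}
```

Because the clause is skipped when no strategy is given, the server-side default applies unless the user passes -S explicitly.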

#161

Ashutosh Sharma

ashu.coek88@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#160)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Few comments on the latest patch:

-               /* We need to construct the pathname for this database */
-               dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
+               if (xlrec->dbid != InvalidOid)
+                       dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
+               else
+                       dbpath = pstrdup("global");

Do we really need this change? Is GetDatabasePath() alone not capable
of handling it?

+static CreateDBRelInfo *ScanSourceDatabasePgClassTuple(HeapTupleData *tuple,
+                                                       Oid tbid, Oid dbid,
+                                                       char *srcpath);
+static List *ScanSourceDatabasePgClassPage(Page page, Buffer buf, Oid tbid,
+                                           Oid dbid, char *srcpath,
+                                           List *rnodelist, Snapshot snapshot);
+static List *ScanSourceDatabasePgClass(Oid srctbid, Oid srcdbid, char *srcpath);

I think we can shorten these function names to probably
ScanSourceDBPgClassRel(), ScanSourceDBPgClassTuple() and likewise?

--
With Regards,
Ashutosh Sharma.

On Tue, Mar 15, 2022 at 3:24 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Mar 14, 2022 at 10:04 PM Robert Haas <robertmhaas@gmail.com> wrote:

I think it would make sense to have two different WAL records e.g.
XLOG_DBASE_CREATE_WAL_LOG and XLOG_DBASE_CREATE_FILE_COPY. Then it's
easy to see how this could be generalized to other strategies in the
future.
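
A rough sketch of why one record type per strategy generalizes well: the redo and description code can simply switch on the record type. The code below is illustrative only. The three record-type values are the ones defined in the v15 patch, "copy dir" and "create dir" are the descriptions mentioned in this thread for dbase_desc(), and the flag mask value 0x0F, the "drop" label, and all `SKETCH_*` names are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed: the low four bits of the info byte are reserved for flags. */
#define SKETCH_XLR_INFO_MASK				0x0F

/* Record types as defined in the v15 patch. */
#define SKETCH_XLOG_DBASE_CREATE_FILE_COPY	0x00
#define SKETCH_XLOG_DBASE_CREATE_WAL_LOG	0x10
#define SKETCH_XLOG_DBASE_DROP				0x20

/* Map an info byte to a short description of the record. */
const char *
sketch_dbase_desc(uint8_t xl_info)
{
	/* Strip flag bits before comparing with the record types. */
	uint8_t		info = (uint8_t) (xl_info & ~SKETCH_XLR_INFO_MASK);

	switch (info)
	{
		case SKETCH_XLOG_DBASE_CREATE_FILE_COPY:
			return "copy dir";
		case SKETCH_XLOG_DBASE_CREATE_WAL_LOG:
			return "create dir";
		case SKETCH_XLOG_DBASE_DROP:
			return "drop";		/* placeholder label, not from the patch */
		default:
			return "unknown";
	}
}
```

Adding another strategy later would then mean adding one more record type and one more case, without touching the existing ones.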

Done that way. In dbase_desc(), for XLOG_DBASE_CREATE_FILE_COPY I
have kept the older description i.e. "copy dir" and for
XLOG_DBASE_CREATE_WAL_LOG it is "create dir", because logically the
first one is actually copying and the second one is just creating the
directory. Do you think we should be using "copy dir file_copy" and
"copy dir wal_log" in the description as well?

On Mon, Mar 14, 2022 at 12:04 PM Robert Haas <robertmhaas@gmail.com> wrote:

I was looking at 0001 and 0002 again and realized that I swapped the
names load_relmap_file() and read_relmap_file() from what I should
have done. Here's a revised version. With this, read_relmap_file() and
write_relmap_file() become functions that just read and write the file
without touching any global variables, and load_relmap_file() is the
function that reads data from the file and puts it into a global
variable, which seems more sensible than the way I had it before.
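
The split described here can be modelled in a few lines: a reader that only fills the caller's buffer, and a loader that picks the path and the global variable. This is a simplified sketch, not the relmapper code. No file is actually opened, the `sketch_*` names and the "base/16384" path are hypothetical, and "pg_filenode.map" is assumed to be the map file name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_MAP_FILENAME "pg_filenode.map"

/* Hypothetical stand-in for RelMapFile; records which file was "read". */
typedef struct SketchMap
{
	char		source[256];
	int			num_mappings;
} SketchMap;

/* Global state, touched only by the loader. */
SketchMap	sketch_shared_map;
SketchMap	sketch_local_map;
const char *sketch_my_database_path = "base/16384";

/* Reader: fills the caller's map and touches no global variables. */
void
sketch_read_relmap(SketchMap *map, const char *dbpath)
{
	snprintf(map->source, sizeof(map->source), "%s/%s",
			 dbpath, SKETCH_MAP_FILENAME);
	map->num_mappings = 0;		/* a real reader would parse the file here */
}

/* Loader: chooses the path and stores the result in a global. */
void
sketch_load_relmap(bool shared)
{
	if (shared)
		sketch_read_relmap(&sketch_shared_map, "global");
	else
		sketch_read_relmap(&sketch_local_map, sketch_my_database_path);
}
```

With this shape, code that needs the map of some other database, as CREATE DATABASE does for the source database, can call the reader directly with that database's path.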

Okay, I have included this patch and rebased other patches on top of that.

Regarding 0003 and 0005, I'm not a fan of 'bool isunlogged'. I think
'bool permanent' would be better (note BM_PERMANENT). This would
involve reversing true and false.

Okay changed.

Regarding 0005:

+ CREATEDB_WAL_LOG = 0,
+ CREATEDB_FILE_COPY = 1

I still think you don't need = 0 and = 1 here.
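
For reference, C assigns 0 to the first enumerator and the previous value plus one to each following enumerator, so the explicit values are redundant. A small sketch, with hypothetical `Sketch`-prefixed names rather than the patch's own enum:

```c
#include <assert.h>

/* No explicit values: the compiler assigns 0 and 1. */
typedef enum SketchCreateDBStrategy
{
	SKETCH_CREATEDB_WAL_LOG,
	SKETCH_CREATEDB_FILE_COPY
} SketchCreateDBStrategy;

/* The implicit values are dense, so they can index an array directly. */
const char *
sketch_strategy_name(SketchCreateDBStrategy strategy)
{
	static const char *const names[] = {"wal_log", "file_copy"};

	return names[strategy];
}
```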

Done

I'll probably go through and do a pass over the comments once you post
the next version of this. There seems to be work needed in a bunch of
places, but it probably makes more sense for me to go through and
adjust the things that seem to need it rather than listing a bunch of
changes for you to make.

Sure, thanks.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#162

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Ashutosh Sharma (#161)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Tue, Mar 15, 2022 at 12:30 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Few comments on the latest patch:

-               /* We need to construct the pathname for this database */
-               dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
+               if (xlrec->dbid != InvalidOid)
+                       dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
+               else
+                       dbpath = pstrdup("global");

Do we really need this change? Is GetDatabasePath() alone not capable
of handling it?

Well, I mean, that function has a special case for
GLOBALTABLESPACE_OID, but GLOBALTABLESPACE_OID is 1664, and InvalidOid
is 0.

I think we can shorten these function names to probably
ScanSourceDBPgClassRel(), ScanSourceDBPgClassTuple() and likewise?

We could, but I don't think it's an improvement.

--
Robert Haas
EDB: http://www.enterprisedb.com

#163

Ashutosh Sharma

ashu.coek88@gmail.com

almost 4 years ago

In reply to: Robert Haas (#162)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Tue, Mar 15, 2022 at 10:17 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 15, 2022 at 12:30 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Few comments on the latest patch:

-               /* We need to construct the pathname for this database */
-               dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
+               if (xlrec->dbid != InvalidOid)
+                       dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
+               else
+                       dbpath = pstrdup("global");

Do we really need this change? Is GetDatabasePath() alone not capable
of handling it?

Well, I mean, that function has a special case for
GLOBALTABLESPACE_OID, but GLOBALTABLESPACE_OID is 1664, and InvalidOid
is 0.

Wouldn't this be true only in case of a shared map file (when dbOid is
Invalid and tblspcOid is globaltablespace_oid) or am I missing
something?
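
The point can be illustrated with a standalone model of the path selection. This is a sketch under assumptions, not the actual GetDatabasePath(): the branch structure follows my understanding of that function, 1664 is the GLOBALTABLESPACE_OID value quoted in this thread, 1663 is assumed for the default tablespace, and the version directory string is a placeholder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint32_t SketchOid;

#define SKETCH_INVALID_OID				0
#define SKETCH_DEFAULTTABLESPACE_OID	1663	/* assumed value */
#define SKETCH_GLOBALTABLESPACE_OID		1664
#define SKETCH_VERSION_DIR				"PG_VERSION_DIR"	/* placeholder */

/* Write the directory for (db_oid, spc_oid) into buf. */
void
sketch_database_path(char *buf, size_t buflen,
					 SketchOid db_oid, SketchOid spc_oid)
{
	if (spc_oid == SKETCH_GLOBALTABLESPACE_OID)
	{
		/* Shared relations: the choice is driven by the tablespace OID. */
		assert(db_oid == SKETCH_INVALID_OID);
		snprintf(buf, buflen, "global");
	}
	else if (spc_oid == SKETCH_DEFAULTTABLESPACE_OID)
		snprintf(buf, buflen, "base/%u", (unsigned) db_oid);
	else
		snprintf(buf, buflen, "pg_tblspc/%u/%s/%u",
				 (unsigned) spc_oid, SKETCH_VERSION_DIR, (unsigned) db_oid);
}
```

If an invalid database OID only ever appears together with the global tablespace OID, as for the shared map file, the first branch already yields "global" and the caller needs no special case of its own.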

--
With Regards,
Ashutosh Sharma.

#164

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Ashutosh Sharma (#163)

1 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Tue, Mar 15, 2022 at 1:26 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

On Tue, Mar 15, 2022 at 12:30 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Few comments on the latest patch:

-               /* We need to construct the pathname for this database */
-               dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
+               if (xlrec->dbid != InvalidOid)
+                       dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
+               else
+                       dbpath = pstrdup("global");

Do we really need this change? Is GetDatabasePath() alone not capable
of handling it?

Well, I mean, that function has a special case for
GLOBALTABLESPACE_OID, but GLOBALTABLESPACE_OID is 1664, and InvalidOid
is 0.

Wouldn't this be true only in case of a shared map file (when dbOid is
Invalid and tblspcOid is globaltablespace_oid) or am I missing
something?

*facepalm*

Good catch, sorry that I'm slow on the uptake today.

v3 attached.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

relmap-refactor-rmh-v3.patchapplication/octet-stream; name=relmap-refactor-rmh-v3.patchDownload

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 4f6811f571..4d0718f001 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -137,8 +137,10 @@ static void apply_map_update(RelMapFile *map, Oid relationId, Oid fileNode,
 static void merge_map_updates(RelMapFile *map, const RelMapFile *updates,
 							  bool add_okay);
 static void load_relmap_file(bool shared, bool lock_held);
-static void write_relmap_file(bool shared, RelMapFile *newmap,
-							  bool write_wal, bool send_sinval, bool preserve_files,
+static void read_relmap_file(RelMapFile *map, char *dbpath, bool lock_held,
+							 int elevel);
+static void write_relmap_file(RelMapFile *newmap, bool write_wal,
+							  bool send_sinval, bool preserve_files,
 							  Oid dbid, Oid tsid, const char *dbpath);
 static void perform_relmap_update(bool shared, const RelMapFile *updates);
 
@@ -568,9 +570,9 @@ RelationMapFinishBootstrap(void)
 	Assert(pending_local_updates.num_mappings == 0);
 
 	/* Write the files; no WAL or sinval needed */
-	write_relmap_file(true, &shared_map, false, false, false,
-					  InvalidOid, GLOBALTABLESPACE_OID, NULL);
-	write_relmap_file(false, &local_map, false, false, false,
+	write_relmap_file(&shared_map, false, false, false,
+					  InvalidOid, GLOBALTABLESPACE_OID, "global");
+	write_relmap_file(&local_map, false, false, false,
 					  MyDatabaseId, MyDatabaseTableSpace, DatabasePath);
 }
 
@@ -687,39 +689,48 @@ RestoreRelationMap(char *startAddress)
 }
 
 /*
- * load_relmap_file -- load data from the shared or local map file
+ * load_relmap_file -- load the shared or local map file
  *
- * Because the map file is essential for access to core system catalogs,
- * failure to read it is a fatal error.
+ * Because these files are essential for access to core system catalogs,
+ * failure to load either of them is a fatal error.
  *
  * Note that the local case requires DatabasePath to be set up.
  */
 static void
 load_relmap_file(bool shared, bool lock_held)
 {
-	RelMapFile *map;
+	if (shared)
+		read_relmap_file(&shared_map, "global", lock_held, FATAL);
+	else
+		read_relmap_file(&local_map, DatabasePath, lock_held, FATAL);
+}
+
+/*
+ * read_relmap_file -- load data from any relation mapper file
+ *
+ * dbpath must be the relevant database path, or "global" for shared relations.
+ *
+ * RelationMappingLock will be acquired and released unless lock_held = true.
+ *
+ * Errors will be reported at the indicated elevel, which should be at least
+ * ERROR.
+ */
+static void
+read_relmap_file(RelMapFile *map, char *dbpath, bool lock_held, int elevel)
+{
 	char		mapfilename[MAXPGPATH];
 	pg_crc32c	crc;
 	int			fd;
 	int			r;
 
-	if (shared)
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
-				 RELMAPPER_FILENAME);
-		map = &shared_map;
-	}
-	else
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 DatabasePath, RELMAPPER_FILENAME);
-		map = &local_map;
-	}
+	Assert(elevel >= ERROR);
 
-	/* Read data ... */
+	/* Open the target file. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s", dbpath,
+			 RELMAPPER_FILENAME);
 	fd = OpenTransientFile(mapfilename, O_RDONLY | PG_BINARY);
 	if (fd < 0)
-		ereport(FATAL,
+		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not open file \"%s\": %m",
 						mapfilename)));
@@ -734,16 +745,17 @@ load_relmap_file(bool shared, bool lock_held)
 	if (!lock_held)
 		LWLockAcquire(RelationMappingLock, LW_SHARED);
 
+	/* Now read the data. */
 	pgstat_report_wait_start(WAIT_EVENT_RELATION_MAP_READ);
 	r = read(fd, map, sizeof(RelMapFile));
 	if (r != sizeof(RelMapFile))
 	{
 		if (r < 0)
-			ereport(FATAL,
+			ereport(elevel,
 					(errcode_for_file_access(),
 					 errmsg("could not read file \"%s\": %m", mapfilename)));
 		else
-			ereport(FATAL,
+			ereport(elevel,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg("could not read file \"%s\": read %d of %zu",
 							mapfilename, r, sizeof(RelMapFile))));
@@ -754,7 +766,7 @@ load_relmap_file(bool shared, bool lock_held)
 		LWLockRelease(RelationMappingLock);
 
 	if (CloseTransientFile(fd) != 0)
-		ereport(FATAL,
+		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m",
 						mapfilename)));
@@ -763,7 +775,7 @@ load_relmap_file(bool shared, bool lock_held)
 	if (map->magic != RELMAPPER_FILEMAGIC ||
 		map->num_mappings < 0 ||
 		map->num_mappings > MAX_MAPPINGS)
-		ereport(FATAL,
+		ereport(elevel,
 				(errmsg("relation mapping file \"%s\" contains invalid data",
 						mapfilename)));
 
@@ -773,7 +785,7 @@ load_relmap_file(bool shared, bool lock_held)
 	FIN_CRC32C(crc);
 
 	if (!EQ_CRC32C(crc, map->crc))
-		ereport(FATAL,
+		ereport(elevel,
 				(errmsg("relation mapping file \"%s\" contains incorrect checksum",
 						mapfilename)));
 }
@@ -795,16 +807,16 @@ load_relmap_file(bool shared, bool lock_held)
  *
  * Because this may be called during WAL replay when MyDatabaseId,
  * DatabasePath, etc aren't valid, we require the caller to pass in suitable
- * values.  The caller is also responsible for being sure no concurrent
- * map update could be happening.
+ * values. Pass dbpath as "global" for the shared map.
+ *
+ * The caller is also responsible for being sure no concurrent map update
+ * could be happening.
  */
 static void
-write_relmap_file(bool shared, RelMapFile *newmap,
-				  bool write_wal, bool send_sinval, bool preserve_files,
-				  Oid dbid, Oid tsid, const char *dbpath)
+write_relmap_file(RelMapFile *newmap, bool write_wal, bool send_sinval,
+				  bool preserve_files, Oid dbid, Oid tsid, const char *dbpath)
 {
 	int			fd;
-	RelMapFile *realmap;
 	char		mapfilename[MAXPGPATH];
 
 	/*
@@ -822,19 +834,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 	 * Open the target file.  We prefer to do this before entering the
 	 * critical section, so that an open() failure need not force PANIC.
 	 */
-	if (shared)
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
-				 RELMAPPER_FILENAME);
-		realmap = &shared_map;
-	}
-	else
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 dbpath, RELMAPPER_FILENAME);
-		realmap = &local_map;
-	}
-
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 dbpath, RELMAPPER_FILENAME);
 	fd = OpenTransientFile(mapfilename, O_WRONLY | O_CREAT | PG_BINARY);
 	if (fd < 0)
 		ereport(ERROR,
@@ -934,16 +935,6 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		}
 	}
 
-	/*
-	 * Success, update permanent copy.  During bootstrap, we might be working
-	 * on the permanent copy itself, in which case skip the memcpy() to avoid
-	 * invoking nominally-undefined behavior.
-	 */
-	if (realmap != newmap)
-		memcpy(realmap, newmap, sizeof(RelMapFile));
-	else
-		Assert(!send_sinval);	/* must be bootstrapping */
-
 	/* Critical section done */
 	if (write_wal)
 		END_CRIT_SECTION();
@@ -990,10 +981,19 @@ perform_relmap_update(bool shared, const RelMapFile *updates)
 	merge_map_updates(&newmap, updates, allowSystemTableMods);
 
 	/* Write out the updated map and do other necessary tasks */
-	write_relmap_file(shared, &newmap, true, true, true,
+	write_relmap_file(&newmap, true, true, true,
 					  (shared ? InvalidOid : MyDatabaseId),
 					  (shared ? GLOBALTABLESPACE_OID : MyDatabaseTableSpace),
-					  DatabasePath);
+					  (shared ? "global" : DatabasePath));
+
+	/*
+	 * We successfully wrote the updated file, so it's now safe to rely on the
+	 * new values in this process, too.
+	 */
+	if (shared)
+		memcpy(&shared_map, &newmap, sizeof(RelMapFile));
+	else
+		memcpy(&local_map, &newmap, sizeof(RelMapFile));
 
 	/* Now we can release the lock */
 	LWLockRelease(RelationMappingLock);
@@ -1033,8 +1033,7 @@ relmap_redo(XLogReaderState *record)
 		 * but grab the lock to interlock against load_relmap_file().
 		 */
 		LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
-		write_relmap_file((xlrec->dbid == InvalidOid), &newmap,
-						  false, true, false,
+		write_relmap_file(&newmap, false, true, false,
 						  xlrec->dbid, xlrec->tsid, dbpath);
 		LWLockRelease(RelationMappingLock);

#165

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Robert Haas (#164)

6 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Tue, Mar 15, 2022 at 11:09 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 15, 2022 at 1:26 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
On Tue, Mar 15, 2022 at 12:30 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Few comments on the latest patch:
-               /* We need to construct the pathname for this database */
-               dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
+               if (xlrec->dbid != InvalidOid)
+                       dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
+               else
+                       dbpath = pstrdup("global");
Do we really need this change? Is GetDatabasePath() alone not capable
of handling it?
Well, I mean, that function has a special case for
GLOBALTABLESPACE_OID, but GLOBALTABLESPACE_OID is 1664, and InvalidOid
is 0.
Wouldn't this be true only in case of a shared map file (when dbOid is
Invalid and tblspcOid is globaltablespace_oid) or am I missing
something?
*facepalm*

Good catch, sorry that I'm slow on the uptake today.

v3 attached.

Thanks, Ashutosh and Robert. The other patches applied cleanly on top of
this patch, but I have still generated a new version so that all the
patches can be found together. There are no other changes.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v16-0002-Extend-relmap-interfaces.patch (text/x-patch; charset=US-ASCII)

From cd0fe403cd54e5dbeb7e17b321bdf0434b509162 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Tue, 15 Mar 2022 09:18:52 +0530
Subject: [PATCH v16 2/6] Extend relmap interfaces

Add two new interfaces to the relmapper:

1) RelationMapCopy, which copies the relmap file from one database path
to another.
2) RelationMapOidToFilenodeForDatabase, which gets the filenode for an
OID.  We already have RelationMapOidToFilenode for the same purpose, but
it assumes we are connected to the database whose mapping we want.  The
new interface does the same thing for a database given by its path.

These interfaces are required by the next patch, which adds support for
WAL-logged CREATE DATABASE.
---
 src/backend/utils/cache/relmapper.c | 60 +++++++++++++++++++++++++++++++++++++
 src/include/utils/relmapper.h       |  4 ++-
 2 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 4d0718f..5b22dbb 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -252,6 +252,60 @@ RelationMapFilenodeToOid(Oid filenode, bool shared)
 }
 
 /*
+ * RelationMapOidToFilenodeForDatabase
+ *
+ * Same as RelationMapOidToFilenode, but instead of reading the mapping from
+ * the database we are connected to it will read the mapping from the input
+ * database.
+ */
+Oid
+RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId)
+{
+	RelMapFile	map;
+	int			i;
+
+	/* Read the relmap file from the source database. */
+	read_relmap_file(&map, dbpath, false, ERROR);
+
+	/* Iterate over the relmap entries to find the input relation oid. */
+	for (i = 0; i < map.num_mappings; i++)
+	{
+		if (relationId == map.mappings[i].mapoid)
+			return map.mappings[i].mapfilenode;
+	}
+
+	return InvalidOid;
+}
+
+/*
+ * RelationMapCopy
+ *
+ * Copy relmapfile from source db path to the destination db path and WAL log
+ * the operation.
+ */
+void
+RelationMapCopy(Oid dbid, Oid tsid, char *srcdbpath, char *dstdbpath)
+{
+	RelMapFile map;
+
+	/*
+	 * Read the relmap file from the source database.  This function is only
+	 * called during CREATE DATABASE, so elevel can be ERROR.
+	 */
+	read_relmap_file(&map, srcdbpath, false, ERROR);
+
+	/*
+	 * Write the map contents into the destination database's relmap file.
+	 * No sinval is needed because the file is new and belongs to a database
+	 * that is still being created, so no one else can be accessing it; for
+	 * the same reason we don't need to acquire RelationMappingLock.  We also
+	 * don't need to preserve files, because if an error occurs the new
+	 * database's relation files will be deleted anyway.
+	 */
+	write_relmap_file(&map, true, false, false, dbid, tsid, dstdbpath);
+}
+
+/*
  * RelationMapUpdateMap
  *
  * Install a new relfilenode mapping for the specified relation.
@@ -1031,6 +1085,12 @@ relmap_redo(XLogReaderState *record)
 		 *
 		 * There shouldn't be anyone else updating relmaps during WAL replay,
 		 * but grab the lock to interlock against load_relmap_file().
+		 *
+		 * Note - this WAL record is also written when copying the relmap
+		 * file while creating a database, in which case there is no need to
+		 * acquire the relmap lock or send sinval.  But avoiding that would
+		 * require an extra flag in the WAL record, so we just grab the lock
+		 * and send sinval anyway, because there is no harm in that.
 		 */
 		LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
 		write_relmap_file(&newmap, false, true, false,
diff --git a/src/include/utils/relmapper.h b/src/include/utils/relmapper.h
index 9fbb5a7..f10353e 100644
--- a/src/include/utils/relmapper.h
+++ b/src/include/utils/relmapper.h
@@ -38,7 +38,9 @@ typedef struct xl_relmap_update
 extern Oid	RelationMapOidToFilenode(Oid relationId, bool shared);
 
 extern Oid	RelationMapFilenodeToOid(Oid relationId, bool shared);
-
+extern Oid RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId);
+extern void RelationMapCopy(Oid dbid, Oid tsid, char *srcdbpath,
+							char *dstdbpath);
 extern void RelationMapUpdateMap(Oid relationId, Oid fileNode, bool shared,
 								 bool immediate);
 
-- 
1.8.3.1
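
As a rough illustration of the lookup that RelationMapOidToFilenodeForDatabase performs once the map has been read, here is a minimal, self-contained sketch. The Sketch* types, SKETCH_MAX_MAPPINGS and the filenode values in the usage note are assumptions made for this example and are much simpler than the real RelMapFile; only the linear scan and the InvalidOid fallback mirror the patch above.

```c
#include <assert.h>

typedef unsigned int Oid;

#define InvalidOid			((Oid) 0)
#define SKETCH_MAX_MAPPINGS	8	/* illustrative only, not the real limit */

/* One (catalog OID -> filenode) pair, as stored in a relmap file. */
typedef struct SketchMapping
{
	Oid			mapoid;
	Oid			mapfilenode;
} SketchMapping;

/* Cut-down stand-in for RelMapFile: no magic number, no CRC. */
typedef struct SketchMapFile
{
	int			num_mappings;
	SketchMapping mappings[SKETCH_MAX_MAPPINGS];
} SketchMapFile;

/*
 * Scan the already-loaded map for relationId.  Returns the mapped filenode,
 * or InvalidOid if the relation is not a mapped one.
 */
static Oid
sketch_oid_to_filenode(const SketchMapFile *map, Oid relationId)
{
	assert(map->num_mappings >= 0 &&
		   map->num_mappings <= SKETCH_MAX_MAPPINGS);

	for (int i = 0; i < map->num_mappings; i++)
	{
		if (map->mappings[i].mapoid == relationId)
			return map->mappings[i].mapfilenode;
	}

	return InvalidOid;
}
```

With a map holding the pair (1259, 16384), where 1259 is the OID of pg_class and 16384 is a made-up filenode, looking up 1259 returns 16384 and looking up an unmapped OID returns InvalidOid. The map is small, so a linear scan is sufficient.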

v16-0001-Refactor-relmap-load-and-relmap-write-functions.patch (text/x-patch; charset=US-ASCII)

From cfe9b1cece03e3704902b375d2a18efc288bbe38 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Wed, 16 Mar 2022 09:53:26 +0530
Subject: [PATCH v16 1/6] Refactor relmap load and relmap write functions

Currently, the relmap reading and writing interfaces are tightly
coupled to the shared_map and local_map of the database the backend
is connected to.  The higher-level patches in this set need
interfaces that can read a relmap into any caller-supplied memory
and write out a map passed in by the caller.  They also need to read
the relmap file from a given database path, instead of assuming that
we are connected to that database.

So this patch refactors the existing code to expose read and write
interfaces that are independent of shared_map and local_map, without
changing any logic.

Author: Robert Haas
---
 src/backend/utils/cache/relmapper.c | 121 ++++++++++++++++++------------------
 1 file changed, 60 insertions(+), 61 deletions(-)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 4f6811f..4d0718f 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -137,8 +137,10 @@ static void apply_map_update(RelMapFile *map, Oid relationId, Oid fileNode,
 static void merge_map_updates(RelMapFile *map, const RelMapFile *updates,
 							  bool add_okay);
 static void load_relmap_file(bool shared, bool lock_held);
-static void write_relmap_file(bool shared, RelMapFile *newmap,
-							  bool write_wal, bool send_sinval, bool preserve_files,
+static void read_relmap_file(RelMapFile *map, char *dbpath, bool lock_held,
+							 int elevel);
+static void write_relmap_file(RelMapFile *newmap, bool write_wal,
+							  bool send_sinval, bool preserve_files,
 							  Oid dbid, Oid tsid, const char *dbpath);
 static void perform_relmap_update(bool shared, const RelMapFile *updates);
 
@@ -568,9 +570,9 @@ RelationMapFinishBootstrap(void)
 	Assert(pending_local_updates.num_mappings == 0);
 
 	/* Write the files; no WAL or sinval needed */
-	write_relmap_file(true, &shared_map, false, false, false,
-					  InvalidOid, GLOBALTABLESPACE_OID, NULL);
-	write_relmap_file(false, &local_map, false, false, false,
+	write_relmap_file(&shared_map, false, false, false,
+					  InvalidOid, GLOBALTABLESPACE_OID, "global");
+	write_relmap_file(&local_map, false, false, false,
 					  MyDatabaseId, MyDatabaseTableSpace, DatabasePath);
 }
 
@@ -687,39 +689,48 @@ RestoreRelationMap(char *startAddress)
 }
 
 /*
- * load_relmap_file -- load data from the shared or local map file
+ * load_relmap_file -- load the shared or local map file
  *
- * Because the map file is essential for access to core system catalogs,
- * failure to read it is a fatal error.
+ * Because these files are essential for access to core system catalogs,
+ * failure to load either of them is a fatal error.
  *
  * Note that the local case requires DatabasePath to be set up.
  */
 static void
 load_relmap_file(bool shared, bool lock_held)
 {
-	RelMapFile *map;
+	if (shared)
+		read_relmap_file(&shared_map, "global", lock_held, FATAL);
+	else
+		read_relmap_file(&local_map, DatabasePath, lock_held, FATAL);
+}
+
+/*
+ * read_relmap_file -- load data from any relation mapper file
+ *
+ * dbpath must be the relevant database path, or "global" for shared relations.
+ *
+ * RelationMappingLock will be acquired and released unless lock_held = true.
+ *
+ * Errors will be reported at the indicated elevel, which should be at least
+ * ERROR.
+ */
+static void
+read_relmap_file(RelMapFile *map, char *dbpath, bool lock_held, int elevel)
+{
 	char		mapfilename[MAXPGPATH];
 	pg_crc32c	crc;
 	int			fd;
 	int			r;
 
-	if (shared)
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
-				 RELMAPPER_FILENAME);
-		map = &shared_map;
-	}
-	else
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 DatabasePath, RELMAPPER_FILENAME);
-		map = &local_map;
-	}
+	Assert(elevel >= ERROR);
 
-	/* Read data ... */
+	/* Open the target file. */
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s", dbpath,
+			 RELMAPPER_FILENAME);
 	fd = OpenTransientFile(mapfilename, O_RDONLY | PG_BINARY);
 	if (fd < 0)
-		ereport(FATAL,
+		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not open file \"%s\": %m",
 						mapfilename)));
@@ -734,16 +745,17 @@ load_relmap_file(bool shared, bool lock_held)
 	if (!lock_held)
 		LWLockAcquire(RelationMappingLock, LW_SHARED);
 
+	/* Now read the data. */
 	pgstat_report_wait_start(WAIT_EVENT_RELATION_MAP_READ);
 	r = read(fd, map, sizeof(RelMapFile));
 	if (r != sizeof(RelMapFile))
 	{
 		if (r < 0)
-			ereport(FATAL,
+			ereport(elevel,
 					(errcode_for_file_access(),
 					 errmsg("could not read file \"%s\": %m", mapfilename)));
 		else
-			ereport(FATAL,
+			ereport(elevel,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg("could not read file \"%s\": read %d of %zu",
 							mapfilename, r, sizeof(RelMapFile))));
@@ -754,7 +766,7 @@ load_relmap_file(bool shared, bool lock_held)
 		LWLockRelease(RelationMappingLock);
 
 	if (CloseTransientFile(fd) != 0)
-		ereport(FATAL,
+		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m",
 						mapfilename)));
@@ -763,7 +775,7 @@ load_relmap_file(bool shared, bool lock_held)
 	if (map->magic != RELMAPPER_FILEMAGIC ||
 		map->num_mappings < 0 ||
 		map->num_mappings > MAX_MAPPINGS)
-		ereport(FATAL,
+		ereport(elevel,
 				(errmsg("relation mapping file \"%s\" contains invalid data",
 						mapfilename)));
 
@@ -773,7 +785,7 @@ load_relmap_file(bool shared, bool lock_held)
 	FIN_CRC32C(crc);
 
 	if (!EQ_CRC32C(crc, map->crc))
-		ereport(FATAL,
+		ereport(elevel,
 				(errmsg("relation mapping file \"%s\" contains incorrect checksum",
 						mapfilename)));
 }
@@ -795,16 +807,16 @@ load_relmap_file(bool shared, bool lock_held)
  *
  * Because this may be called during WAL replay when MyDatabaseId,
  * DatabasePath, etc aren't valid, we require the caller to pass in suitable
- * values.  The caller is also responsible for being sure no concurrent
- * map update could be happening.
+ * values. Pass dbpath as "global" for the shared map.
+ *
+ * The caller is also responsible for being sure no concurrent map update
+ * could be happening.
  */
 static void
-write_relmap_file(bool shared, RelMapFile *newmap,
-				  bool write_wal, bool send_sinval, bool preserve_files,
-				  Oid dbid, Oid tsid, const char *dbpath)
+write_relmap_file(RelMapFile *newmap, bool write_wal, bool send_sinval,
+				  bool preserve_files, Oid dbid, Oid tsid, const char *dbpath)
 {
 	int			fd;
-	RelMapFile *realmap;
 	char		mapfilename[MAXPGPATH];
 
 	/*
@@ -822,19 +834,8 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 	 * Open the target file.  We prefer to do this before entering the
 	 * critical section, so that an open() failure need not force PANIC.
 	 */
-	if (shared)
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "global/%s",
-				 RELMAPPER_FILENAME);
-		realmap = &shared_map;
-	}
-	else
-	{
-		snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
-				 dbpath, RELMAPPER_FILENAME);
-		realmap = &local_map;
-	}
-
+	snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
+			 dbpath, RELMAPPER_FILENAME);
 	fd = OpenTransientFile(mapfilename, O_WRONLY | O_CREAT | PG_BINARY);
 	if (fd < 0)
 		ereport(ERROR,
@@ -934,16 +935,6 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		}
 	}
 
-	/*
-	 * Success, update permanent copy.  During bootstrap, we might be working
-	 * on the permanent copy itself, in which case skip the memcpy() to avoid
-	 * invoking nominally-undefined behavior.
-	 */
-	if (realmap != newmap)
-		memcpy(realmap, newmap, sizeof(RelMapFile));
-	else
-		Assert(!send_sinval);	/* must be bootstrapping */
-
 	/* Critical section done */
 	if (write_wal)
 		END_CRIT_SECTION();
@@ -990,10 +981,19 @@ perform_relmap_update(bool shared, const RelMapFile *updates)
 	merge_map_updates(&newmap, updates, allowSystemTableMods);
 
 	/* Write out the updated map and do other necessary tasks */
-	write_relmap_file(shared, &newmap, true, true, true,
+	write_relmap_file(&newmap, true, true, true,
 					  (shared ? InvalidOid : MyDatabaseId),
 					  (shared ? GLOBALTABLESPACE_OID : MyDatabaseTableSpace),
-					  DatabasePath);
+					  (shared ? "global" : DatabasePath));
+
+	/*
+	 * We successfully wrote the updated file, so it's now safe to rely on the
+	 * new values in this process, too.
+	 */
+	if (shared)
+		memcpy(&shared_map, &newmap, sizeof(RelMapFile));
+	else
+		memcpy(&local_map, &newmap, sizeof(RelMapFile));
 
 	/* Now we can release the lock */
 	LWLockRelease(RelationMappingLock);
@@ -1033,8 +1033,7 @@ relmap_redo(XLogReaderState *record)
 		 * but grab the lock to interlock against load_relmap_file().
 		 */
 		LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
-		write_relmap_file((xlrec->dbid == InvalidOid), &newmap,
-						  false, true, false,
+		write_relmap_file(&newmap, false, true, false,
 						  xlrec->dbid, xlrec->tsid, dbpath);
 		LWLockRelease(RelationMappingLock);
 
-- 
1.8.3.1
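
The core of this refactoring is that the path and the error level become parameters, so the same routine serves the shared map ("global"), the local map, and any other database directory. A minimal, self-contained sketch of the two pieces that no longer depend on shared_map or local_map follows. The sketch_* names are hypothetical, and the SKETCH_* constants are what I believe relmapper.c used at the time (treat them as assumptions); the real code reports failures through ereport(elevel, ...) rather than returning a bool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_FILENAME		"pg_filenode.map"	/* assumed RELMAPPER_FILENAME */
#define SKETCH_MAGIC		0x592717			/* assumed RELMAPPER_FILEMAGIC */
#define SKETCH_MAX_MAPPINGS	62					/* assumed MAX_MAPPINGS */

/*
 * Build the map file name from a caller-supplied directory.  Passing
 * "global" selects the shared map; any database path selects that
 * database's local map.
 */
static int
sketch_relmap_path(char *buf, size_t len, const char *dbpath)
{
	assert(dbpath != NULL);
	return snprintf(buf, len, "%s/%s", dbpath, SKETCH_FILENAME);
}

/*
 * Header sanity check corresponding to the magic/num_mappings test in
 * read_relmap_file().
 */
static bool
sketch_header_ok(int magic, int num_mappings)
{
	return magic == SKETCH_MAGIC &&
		num_mappings >= 0 &&
		num_mappings <= SKETCH_MAX_MAPPINGS;
}
```

Because neither helper touches process-global state, a caller such as load_relmap_file() can keep using FATAL, while a caller that reads another database's map can choose ERROR.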

v16-0003-Allow-ReadBufferWithoutRelcache-to-support-unlog.patch (text/x-patch; charset=US-ASCII)

From 328c353539b4acf5ec9b8c802801a2321dfc2e03 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Thu, 10 Feb 2022 15:55:33 +0530
Subject: [PATCH v16 3/6] Allow ReadBufferWithoutRelcache to support unlogged
 relpersistence

At present, this function may only be used on permanent relations,
because we only use it during XLOG replay.  As part of the bigger
patch set, we will also use it to read buffers from a database to
which we are not connected, so it needs to support unlogged
relations as well.
---
 src/backend/access/transam/xlogutils.c |  6 +++---
 src/backend/storage/buffer/bufmgr.c    | 18 ++++++++++--------
 src/include/storage/bufmgr.h           |  3 ++-
 3 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 54d5f20..6b10656 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -484,7 +484,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	{
 		/* page exists in file */
 		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-										   mode, NULL);
+										   mode, NULL, true);
 	}
 	else
 	{
@@ -509,7 +509,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				ReleaseBuffer(buffer);
 			}
 			buffer = ReadBufferWithoutRelcache(rnode, forknum,
-											   P_NEW, mode, NULL);
+											   P_NEW, mode, NULL, true);
 		}
 		while (BufferGetBlockNumber(buffer) < blkno);
 		/* Handle the corner case that P_NEW returns non-consecutive pages */
@@ -519,7 +519,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			ReleaseBuffer(buffer);
 			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-											   mode, NULL);
+											   mode, NULL, true);
 		}
 	}
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c6..3cadcd2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -772,23 +772,25 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
  * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
  *		a relcache entry for the relation.
  *
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay.  If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
+ * The caller should pass 'permanent' as true for a permanent relation and
+ * false for an unlogged relation.
+ *
+ * NB: At present, this function may only be used on permanent and unlogged
+ * relations, which is OK, because we only use it during XLOG replay and while
+ * copying a database.  If in the future we want to use it on temporary
+ * relations, we could pass additional parameters.
  */
 Buffer
 ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
-						  BufferAccessStrategy strategy)
+						  BufferAccessStrategy strategy, bool permanent)
 {
 	bool		hit;
 
 	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
 
-	Assert(InRecovery);
-
-	return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
+	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index dd01841..fd0452f 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -184,7 +184,8 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BufferAccessStrategy strategy);
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 										ForkNumber forkNum, BlockNumber blockNum,
-										ReadBufferMode mode, BufferAccessStrategy strategy);
+										ReadBufferMode mode, BufferAccessStrategy strategy,
+										bool permanent);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
-- 
1.8.3.1
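
To make the new 'permanent' argument concrete, here is a minimal, self-contained sketch of the mapping in both directions. sketch_persistence_for mirrors the conditional added to ReadBufferWithoutRelcache above. sketch_permanent_flag is a hypothetical caller-side helper (an assumption about how a caller might derive the flag from a relpersistence value, not code from this patch). The three persistence codes are the usual pg_class values.

```c
#include <assert.h>
#include <stdbool.h>

#define RELPERSISTENCE_PERMANENT	'p'
#define RELPERSISTENCE_UNLOGGED		'u'
#define RELPERSISTENCE_TEMP			't'

/* Same choice as the patched ReadBufferWithoutRelcache() makes internally. */
static char
sketch_persistence_for(bool permanent)
{
	return permanent ? RELPERSISTENCE_PERMANENT : RELPERSISTENCE_UNLOGGED;
}

/*
 * Hypothetical caller-side helper: turn a relpersistence code into the
 * boolean argument.  Temporary relations are not supported by this
 * interface, so they are rejected here.
 */
static bool
sketch_permanent_flag(char relpersistence)
{
	assert(relpersistence != RELPERSISTENCE_TEMP);
	return relpersistence == RELPERSISTENCE_PERMANENT;
}
```

The existing recovery callers simply pass true, which preserves the old behavior; only a caller that reads another database's relations needs to compute the flag.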

v16-0004-New-interface-to-lock-relation-id.patch (text/x-patch; charset=US-ASCII)

From 96a4aa7be6d8a789ebd7deb5fe8f1e107ddaa19b Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 24 Sep 2021 18:29:17 +0530
Subject: [PATCH v16 4/6] New interface to lock relation id

Currently, LockRelationOid provides a mechanism to lock a relation
by OID, but we must be connected to the database to which the
relation belongs.  This patch provides a new interface that can
lock a relation even if we are not connected to the containing
database.
---
 src/backend/storage/lmgr/lmgr.c | 28 ++++++++++++++++++++++++++++
 src/include/storage/lmgr.h      |  1 +
 2 files changed, 29 insertions(+)

diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 5ae52dd..1543da6 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -176,6 +176,34 @@ ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode)
 }
 
 /*
+ *		LockRelationId
+ *
+ * Lock, given a LockRelId.  Same as LockRelationOid, but takes a LockRelId
+ * as input.
+ */
+void
+LockRelationId(LockRelId *relid, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+	LOCALLOCK  *locallock;
+	LockAcquireResult res;
+
+	SET_LOCKTAG_RELATION(tag, relid->dbId, relid->relId);
+
+	res = LockAcquireExtended(&tag, lockmode, false, false, true, &locallock);
+
+	/*
+	 * Now that we have the lock, check for invalidation messages; see notes
+	 * in LockRelationOid.
+	 */
+	if (res != LOCKACQUIRE_ALREADY_CLEAR)
+	{
+		AcceptInvalidationMessages();
+		MarkLockClear(locallock);
+	}
+}
+
+/*
  *		UnlockRelationId
  *
  * Unlock, given a LockRelId.  This is preferred over UnlockRelationOid
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 49edbcc..be1d2c9 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -38,6 +38,7 @@ extern void RelationInitLockInfo(Relation relation);
 
 /* Lock a relation */
 extern void LockRelationOid(Oid relid, LOCKMODE lockmode);
+extern void LockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern bool ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode);
 extern void UnlockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern void UnlockRelationOid(Oid relid, LOCKMODE lockmode);
-- 
1.8.3.1
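
The difference between the two locking interfaces comes down to where the database OID in the lock tag comes from. The following self-contained sketch shows only that difference. The Sketch* types, sketch_my_database_id and both functions are simplified stand-ins invented for this example, not the real LOCKTAG machinery; the real code also acquires the lock and processes invalidation messages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int Oid;

#define InvalidOid	((Oid) 0)

/* Cut-down lock tag: just the two fields that identify a relation. */
typedef struct SketchLockTag
{
	Oid			dbOid;
	Oid			relOid;
} SketchLockTag;

/* Same field layout idea as LockRelId: relation OID plus database OID. */
typedef struct SketchLockRelId
{
	Oid			relId;
	Oid			dbId;
} SketchLockRelId;

/* Stand-in for MyDatabaseId; the value is arbitrary. */
static Oid	sketch_my_database_id = 5;

/* LockRelationOid style: the database is inferred from the session. */
static SketchLockTag
sketch_tag_from_oid(Oid relid, bool is_shared)
{
	SketchLockTag tag;

	tag.dbOid = is_shared ? InvalidOid : sketch_my_database_id;
	tag.relOid = relid;
	return tag;
}

/* LockRelationId style: the caller names the database explicitly. */
static SketchLockTag
sketch_tag_from_relid(const SketchLockRelId *relid)
{
	SketchLockTag tag;

	assert(relid != NULL);
	tag.dbOid = relid->dbId;
	tag.relOid = relid->relId;
	return tag;
}
```

The second form is what lets a backend lock a relation that lives in a database other than the one it is connected to.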

v16-0005-WAL-logged-CREATE-DATABASE.patch (text/x-patch; charset=US-ASCII)

From 05d089efccb2f8e60812bfd826ef36d1c9d70d93 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Tue, 15 Mar 2022 09:41:20 +0530
Subject: [PATCH v16 5/6] WAL logged CREATE DATABASE

Currently, CREATE DATABASE forces a checkpoint, then copies all the files,
then forces another checkpoint. The comments in the createdb() function
explain the reasons for this. The attached patch fixes this problem by making
create database completely WAL logged so that we can avoid the checkpoints.

We also retain the old way of creating a database, and provide an option
to choose the strategy.  For the new method the user needs to specify
STRATEGY=WAL_LOG, and for the old method STRATEGY=FILE_COPY.  The default
strategy is WAL_LOG.
---
 contrib/bloom/blinsert.c                 |   2 +-
 doc/src/sgml/ref/create_database.sgml    |  23 +
 src/backend/access/heap/heapam_handler.c |   2 +-
 src/backend/access/nbtree/nbtree.c       |   2 +-
 src/backend/access/rmgrdesc/dbasedesc.c  |  20 +-
 src/backend/commands/dbcommands.c        | 716 ++++++++++++++++++++++++++-----
 src/backend/storage/buffer/bufmgr.c      | 156 +++++++
 src/bin/pg_rewind/parsexlog.c            |   9 +-
 src/bin/psql/tab-complete.c              |   4 +-
 src/include/commands/dbcommands_xlog.h   |  24 +-
 src/include/storage/bufmgr.h             |   3 +
 src/tools/pgindent/typedefs.list         |   5 +-
 12 files changed, 838 insertions(+), 128 deletions(-)

diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index c94cf34..82378db 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -173,7 +173,7 @@ blbuildempty(Relation index)
 	 * Write the page and log it.  It might seem that an immediate sync would
 	 * be sufficient to guarantee that the file exists on disk, but recovery
 	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
+	 * XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
 	PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index f70d0c7..b0c94e40 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -34,6 +34,7 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
            [ CONNECTION LIMIT [=] <replaceable class="parameter">connlimit</replaceable> ]
            [ IS_TEMPLATE [=] <replaceable class="parameter">istemplate</replaceable> ]
            [ OID [=] <replaceable class="parameter">oid</replaceable> ] ]
+           [ STRATEGY [=] <replaceable class="parameter">strategy</replaceable> ] ]
 </synopsis>
  </refsynopsisdiv>
 
@@ -240,6 +241,28 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
        </listitem>
       </varlistentry>
 
+      <varlistentry>
+       <term><replaceable class="parameter">strategy</replaceable></term>
+       <listitem>
+        <para>
+         The strategy used to copy the database directory.  Two
+         strategies are available: <literal>WAL_LOG</literal> and
+         <literal>FILE_COPY</literal>.  With the <literal>WAL_LOG</literal>
+         strategy, the database is copied block by block and each copied
+         block is WAL-logged.  With the <literal>FILE_COPY</literal>
+         strategy, the database is copied at the file system level and
+         the individual operations are not WAL-logged.  The default
+         strategy is <literal>WAL_LOG</literal>.  The file system level
+         copy has to issue a checkpoint both before and after performing
+         the copy; if there are a lot of dirty buffers, performing those
+         checkpoints can be costly and may impact the performance of the
+         whole system.  On the other hand, WAL-logging each block may
+         make database creation take more time if the source database is
+         large.
+        </para>
+       </listitem>
+      </varlistentry>
+
     </variablelist>
 
   <para>
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 39ef8a0..2b70ca0 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -601,7 +601,7 @@ heapam_relation_set_new_filenode(Relation rel,
 	 * even if the page has been logged, because the write did not go through
 	 * shared_buffers and therefore a concurrent checkpoint may have moved the
 	 * redo pointer past our xlog record.  Recovery may as well remove it
-	 * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
+	 * while replaying, for example, XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE
 	 * record. Therefore, logging is necessary even if wal_level=minimal.
 	 */
 	if (persistence == RELPERSISTENCE_UNLOGGED)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c9b4964..dacf3f7 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -161,7 +161,7 @@ btbuildempty(Relation index)
 	 * Write the page and log it.  It might seem that an immediate sync would
 	 * be sufficient to guarantee that the file exists on disk, but recovery
 	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
+	 * XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
 	PageSetChecksumInplace(metapage, BTREE_METAPAGE);
diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index 03af3fd..523d0b3 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -24,14 +24,23 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 	char	   *rec = XLogRecGetData(record);
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
 
-	if (info == XLOG_DBASE_CREATE)
+	if (info == XLOG_DBASE_CREATE_FILE_COPY)
 	{
-		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;
+		xl_dbase_create_file_copy_rec *xlrec =
+		(xl_dbase_create_file_copy_rec *) rec;
 
 		appendStringInfo(buf, "copy dir %u/%u to %u/%u",
 						 xlrec->src_tablespace_id, xlrec->src_db_id,
 						 xlrec->tablespace_id, xlrec->db_id);
 	}
+	else if (info == XLOG_DBASE_CREATE_WAL_LOG)
+	{
+		xl_dbase_create_wal_log_rec *xlrec =
+		(xl_dbase_create_wal_log_rec *) rec;
+
+		appendStringInfo(buf, "create dir %u/%u",
+						 xlrec->tablespace_id, xlrec->db_id);
+	}
 	else if (info == XLOG_DBASE_DROP)
 	{
 		xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) rec;
@@ -51,8 +60,11 @@ dbase_identify(uint8 info)
 
 	switch (info & ~XLR_INFO_MASK)
 	{
-		case XLOG_DBASE_CREATE:
-			id = "CREATE";
+		case XLOG_DBASE_CREATE_FILE_COPY:
+			id = "CREATE_FILE_COPY";
+			break;
+		case XLOG_DBASE_CREATE_WAL_LOG:
+			id = "CREATE_WAL_LOG";
 			break;
 		case XLOG_DBASE_DROP:
 			id = "DROP";
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index c37e3c9..9636688 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -63,13 +63,27 @@
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/pg_locale.h"
+#include "utils/relmapper.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
+/*
+ * Create database strategy.  CREATEDB_WAL_LOG copies the database at the
+ * block level and WAL-logs each copied block.  CREATEDB_FILE_COPY instead
+ * copies the database at the file system level, so the individual
+ * operations are not WAL-logged.
+ */
+typedef enum CreateDBStrategy
+{
+	CREATEDB_WAL_LOG,
+	CREATEDB_FILE_COPY
+} CreateDBStrategy;
+
 typedef struct
 {
 	Oid			src_dboid;		/* source (template) DB */
 	Oid			dest_dboid;		/* DB we are trying to create */
+	CreateDBStrategy strategy;	/* create db strategy */
 } createdb_failure_params;
 
 typedef struct
@@ -78,6 +92,19 @@ typedef struct
 	Oid			dest_tsoid;		/* tablespace we are trying to move to */
 } movedb_failure_params;
 
+/*
+ * When creating a database, we scan the pg_class of the source database to
+ * identify all the relations to be copied.  This structure stores the
+ * information we need about each relation of the source database.
+ */
+typedef struct CreateDBRelInfo
+{
+	RelFileNode rnode;			/* physical relation identifier */
+	Oid			reloid;			/* relation oid */
+	bool		permanent;		/* true if permanent, false if unlogged */
+} CreateDBRelInfo;
+
+
 /* non-export function prototypes */
 static void createdb_failure_callback(int code, Datum arg);
 static void movedb(const char *dbname, const char *tblspcname);
@@ -92,7 +119,507 @@ static bool have_createdb_privilege(void);
 static void remove_dbtablespaces(Oid db_id);
 static bool check_db_file_conflict(Oid db_id);
 static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
+static void CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid,
+									bool isRedo);
+static CreateDBRelInfo *ScanSourceDatabasePgClassTuple(HeapTupleData *tuple,
+													   Oid tbid, Oid dbid,
+													   char *srcpath);
+static List *ScanSourceDatabasePgClassPage(Page page, Buffer buf, Oid tbid,
+										   Oid dbid, char *srcpath,
+										   List *rnodelist, Snapshot snapshot);
+static List *ScanSourceDatabasePgClass(Oid srctbid, Oid srcdbid, char *srcpath);
+static void CreateDatabaseUsingWalLog(Oid src_dboid, Oid dboid, Oid src_tsid,
+									  Oid dst_tsid);
+static void CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dboid, Oid src_tsid,
+										Oid dst_tsid);
+
+/*
+ * Create database directory and write out the PG_VERSION file in the database
+ * path.  If isRedo is true, it's okay for the database directory to exist
+ * already.  We can directly write PG_MAJORVERSION in the version file instead
+ * of copying from the source database file because these two must be the same.
+ */
+static void
+CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
+{
+	int			fd;
+	int			nbytes;
+	char		versionfile[MAXPGPATH];
+	char		buf[16];
+
+	/* Prepare version data before starting a critical section. */
+	sprintf(buf, "%s\n", PG_MAJORVERSION);
+	nbytes = strlen(PG_MAJORVERSION) + 1;
+
+	/* If we are not in WAL replay then write the WAL. */
+	if (!isRedo)
+	{
+		xl_dbase_create_wal_log_rec xlrec;
+		XLogRecPtr	lsn;
+
+		START_CRIT_SECTION();
+
+		xlrec.db_id = dbid;
+		xlrec.tablespace_id = tsid;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec),
+						 sizeof(xl_dbase_create_wal_log_rec));
+
+		lsn = XLogInsert(RM_DBASE_ID, XLOG_DBASE_CREATE_WAL_LOG);
+
+		/* As always, WAL must hit the disk before the data update does. */
+		XLogFlush(lsn);
+	}
+
+	/* Create database directory. */
+	if (MakePGDirectory(dbpath) < 0)
+	{
+		/* Failure other than already exists or not in WAL replay? */
+		if (errno != EEXIST || !isRedo)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create directory \"%s\": %m", dbpath)));
+	}
+
+	/*
+	 * Create PG_VERSION file in the database path.  If the file already
+	 * exists and we are in WAL replay then try again to open it in write
+	 * mode.
+	 */
+	snprintf(versionfile, sizeof(versionfile), "%s/%s", dbpath, "PG_VERSION");
+
+	fd = OpenTransientFile(versionfile, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	if (fd < 0 && errno == EEXIST && isRedo)
+		fd = OpenTransientFile(versionfile, O_WRONLY | O_TRUNC | PG_BINARY);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m", versionfile)));
+
+	/* Write PG_MAJORVERSION in the PG_VERSION file. */
+	pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_WRITE);
+	errno = 0;
+	if ((int) write(fd, buf, nbytes) != nbytes)
+	{
+		/* If write didn't set errno, assume problem is no disk space. */
+		if (errno == 0)
+			errno = ENOSPC;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", versionfile)));
+	}
+	pgstat_report_wait_end();
+
+	/* Close the version file. */
+	CloseTransientFile(fd);
+
+	/* Critical section done. */
+	if (!isRedo)
+		END_CRIT_SECTION();
+}
+
+/*
+ * Helper function for ScanSourceDatabasePgClassPage to prepare a single
+ * CreateDBRelInfo element from the input pg_class tuple.
+ */
+static CreateDBRelInfo *
+ScanSourceDatabasePgClassTuple(HeapTupleData *tuple, Oid tbid, Oid dbid,
+							   char *srcpath)
+{
+	CreateDBRelInfo	   *relinfo;
+	Form_pg_class		classForm;
+	Oid					relfilenode = InvalidOid;
+
+	classForm = (Form_pg_class) GETSTRUCT(tuple);
+
+	/*
+	 * Nothing to do for a shared object, a relation without storage, or a
+	 * temp relation, so just return.
+	 */
+	if (classForm->reltablespace == GLOBALTABLESPACE_OID ||
+		!RELKIND_HAS_STORAGE(classForm->relkind) ||
+		classForm->relpersistence == RELPERSISTENCE_TEMP)
+		return NULL;
+
+	/*
+	 * If relfilenode is valid then directly use it.  Otherwise, consult the
+	 * relmapper for the mapped relation.
+	 */
+	if (OidIsValid(classForm->relfilenode))
+		relfilenode = classForm->relfilenode;
+	else
+		relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+														  classForm->oid);
+
+	/* We must have a valid relfilenode oid. */
+	Assert(OidIsValid(relfilenode));
+
+	/* Prepare a rel info element; the caller adds it to the list. */
+	relinfo = (CreateDBRelInfo *) palloc(sizeof(CreateDBRelInfo));
+	if (OidIsValid(classForm->reltablespace))
+		relinfo->rnode.spcNode = classForm->reltablespace;
+	else
+		relinfo->rnode.spcNode = tbid;
+
+	relinfo->rnode.dbNode = dbid;
+	relinfo->rnode.relNode = relfilenode;
+	relinfo->reloid = classForm->oid;
+
+	/* We should never reach here for the temp relations. */
+	Assert(classForm->relpersistence != RELPERSISTENCE_TEMP);
+	relinfo->permanent =
+		(classForm->relpersistence == RELPERSISTENCE_PERMANENT);
+
+	return relinfo;
+}
+
+/*
+ * Helper function for ScanSourceDatabasePgClass to identify all the valid
+ * relfilenodes for the given page.
+ */
+static List *
+ScanSourceDatabasePgClassPage(Page page, Buffer buf, Oid tbid, Oid dbid,
+							  char *srcpath, List *rnodelist,
+							  Snapshot snapshot)
+{
+	BlockNumber		blkno = BufferGetBlockNumber(buf);
+	OffsetNumber	offnum;
+	OffsetNumber	maxoff;
+	HeapTupleData	tuple;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Iterate over each tuple of the page. */
+	for (offnum = FirstOffsetNumber;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid;
+
+		itemid = PageGetItemId(page, offnum);
+
+		/* Nothing to do if slot is empty, dead, or redirected. */
+		if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid) ||
+			ItemIdIsRedirected(itemid))
+			continue;
+
+		Assert(ItemIdIsNormal(itemid));
+		ItemPointerSet(&(tuple.t_self), blkno, offnum);
+
+		/* Initialize a HeapTupleData structure. */
+		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+		tuple.t_len = ItemIdGetLength(itemid);
+		tuple.t_tableOid = RelationRelationId;
+
+		/*
+		 * If the pg_class tuple is visible then prepare a CreateDBRelInfo and
+		 * append it to the list.
+		 */
+		if (HeapTupleSatisfiesVisibility(&tuple, snapshot, buf))
+		{
+			CreateDBRelInfo *relinfo;
+
+			relinfo = ScanSourceDatabasePgClassTuple(&tuple, tbid, dbid,
+													 srcpath);
+
+			/* Add it to the list. */
+			if (relinfo != NULL)
+				rnodelist = lappend(rnodelist, relinfo);
+		}
+	}
+
+	return rnodelist;
+}
+
+/*
+ * Identify all the valid relfilenodes from the source database so that we can
+ * copy them to the destination database.  In order to identify that, this
+ * function will iterate over each block of the pg_class relation of the source
+ * database.  From there, we will check all the visible tuples in order to get
+ * a list of all the valid relfilenodes in the source database.
+ */
+static List *
+ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
+{
+	RelFileNode rnode;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	Buffer		buf;
+	Oid			relfilenode;
+	Page		page;
+	List	   *rnodelist = NIL;
+	LockRelId	relid;
+	Snapshot	snapshot;
+	SMgrRelation rd_smgr;
+	BufferAccessStrategy bstrategy;
+
+	/* Get pg_class relfilenode. */
+	relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+													  RelationRelationId);
+
+	/*
+	 * We are going to read the buffers associated with the pg_class relation,
+	 * so acquire a relation-level lock before starting the scan.  As we are
+	 * not connected to the database, we cannot use relation_open directly, so
+	 * we have to lock using the relation id.
+	 */
+	relid.dbId = dbid;
+	relid.relId = RelationRelationId;
+	LockRelationId(&relid, AccessShareLock);
+
+	/* Prepare a relnode for pg_class relation. */
+	rnode.spcNode = tbid;
+	rnode.dbNode = dbid;
+	rnode.relNode = relfilenode;
+
+	/*
+	 * We are not connected to the source database so open the pg_class
+	 * relation at the smgr level and get the block count.
+	 */
+	rd_smgr = smgropen(rnode, InvalidBackendId);
+	nblocks = smgrnblocks(rd_smgr, MAIN_FORKNUM);
+
+	/*
+	 * We're going to read the whole of pg_class, so use the bulk-read buffer
+	 * access strategy.
+	 */
+	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	/* Get latest snapshot for scanning the pg_class. */
+	snapshot = GetLatestSnapshot();
+
+	/* Iterate over each block of the pg_class relation. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		/*
+		 * We are not connected to the source database so directly use the
+		 * lower level bufmgr interface which operates on the rnode.
+		 */
+		buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
+										RBM_NORMAL, bstrategy, false);
+
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buf);
+		if (PageIsNew(page) || PageIsEmpty(page))
+		{
+			UnlockReleaseBuffer(buf);
+			continue;
+		}
+
+		/*
+		 * Process pg_class tuples for the current page and add all the valid
+		 * relfilenode entries to the rnodelist.
+		 */
+		rnodelist = ScanSourceDatabasePgClassPage(page, buf, tbid, dbid,
+												  srcpath, rnodelist,
+												  snapshot);
+
+		/* Release the buffer lock and pin. */
+		UnlockReleaseBuffer(buf);
+	}
+
+	/* Release the relation lock. */
+	UnlockRelationId(&relid, AccessShareLock);
+
+	return rnodelist;
+}
+
+/*
+ * Copy source database to the target using WAL.  Create target database
+ * directory and copy data files from the source database to the target
+ * database, block by block and WAL log all the operations.
+ */
+static void
+CreateDatabaseUsingWalLog(Oid src_dboid, Oid dst_dboid, Oid src_tsid, Oid dst_tsid)
+{
+	char	   *srcpath;
+	char	   *dstpath;
+	List	   *rnodelist = NIL;
+	ListCell   *cell;
+	LockRelId	relid;
+	RelFileNode srcrnode;
+	RelFileNode dstrnode;
+	CreateDBRelInfo *relinfo;
+
+	/* Get the source database path. */
+	srcpath = GetDatabasePath(src_dboid, src_tsid);
+
+	/* Get the destination database path. */
+	dstpath = GetDatabasePath(dst_dboid, dst_tsid);
+
+	/* Create database directory and write PG_VERSION file. */
+	CreateDirAndVersionFile(dstpath, dst_dboid, dst_tsid, false);
+
+	/* Copy relmap file from source database to the destination database. */
+	RelationMapCopy(dst_dboid, dst_tsid, srcpath, dstpath);
+
+	/* Get the list of all valid relfilenodes from the source database. */
+	rnodelist = ScanSourceDatabasePgClass(src_tsid, src_dboid, srcpath);
+	Assert(rnodelist != NIL);
+
+	/*
+	 * The database id is common to all the relations, so set it before
+	 * entering the loop.
+	 */
+	relid.dbId = src_dboid;
+
+	/*
+	 * Iterate over each relfilenode and copy the relation data block by block
+	 * from source database to the destination database.
+	 */
+	foreach(cell, rnodelist)
+	{
+		relinfo = lfirst(cell);
+		srcrnode = relinfo->rnode;
+
+		/*
+		 * If the relation is in the source db's default tablespace then we
+		 * need to create it in the destination db's default tablespace.
+		 * Otherwise, we need to create it in the same tablespace as in the
+		 * source database.
+		 */
+		if (srcrnode.spcNode == src_tsid)
+			dstrnode.spcNode = dst_tsid;
+		else
+			dstrnode.spcNode = srcrnode.spcNode;
+
+		dstrnode.dbNode = dst_dboid;
+		dstrnode.relNode = srcrnode.relNode;
+
+		/* Acquire a lock on the relation before copying it. */
+		relid.relId = relinfo->reloid;
+		LockRelationId(&relid, AccessShareLock);
+
+		/* Copy relation storage from source to the destination. */
+		CreateAndCopyRelationData(srcrnode, dstrnode, relinfo->permanent);
 
+		/* Release the lock. */
+		UnlockRelationId(&relid, AccessShareLock);
+	}
+
+	list_free_deep(rnodelist);
+}
+
+/*
+ * Copy source database directory to the destination directory using file
+ * system level copy operation.
+ */
+static void
+CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dst_dboid, Oid src_tsid,
+							Oid dst_tsid)
+{
+	TableScanDesc scan;
+	Relation	rel;
+	HeapTuple	tuple;
+
+	/*
+	 * Force a checkpoint before starting the copy. This will force all dirty
+	 * buffers, including those of unlogged tables, out to disk, to ensure
+	 * source database is up-to-date on disk for the copy.
+	 * FlushDatabaseBuffers() would suffice for that, but we also want to
+	 * process any pending unlink requests. Otherwise, if a checkpoint
+	 * happened while we're copying files, a file might be deleted just when
+	 * we're about to copy it, causing the lstat() call in copydir() to fail
+	 * with ENOENT.
+	 */
+	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
+					  CHECKPOINT_WAIT | CHECKPOINT_FLUSH_ALL);
+
+	/*
+	 * Iterate through all tablespaces of the template database, and copy each
+	 * one to the new database.
+	 */
+	rel = table_open(TableSpaceRelationId, AccessShareLock);
+	scan = table_beginscan_catalog(rel, 0, NULL);
+	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	{
+		Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
+		Oid			srctablespace = spaceform->oid;
+		Oid			dsttablespace;
+		char	   *srcpath;
+		char	   *dstpath;
+		struct stat st;
+
+		/* No need to copy global tablespace */
+		if (srctablespace == GLOBALTABLESPACE_OID)
+			continue;
+
+		srcpath = GetDatabasePath(src_dboid, srctablespace);
+
+		if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
+			directory_is_empty(srcpath))
+		{
+			/* Assume we can ignore it */
+			pfree(srcpath);
+			continue;
+		}
+
+		if (srctablespace == src_tsid)
+			dsttablespace = dst_tsid;
+		else
+			dsttablespace = srctablespace;
+
+		dstpath = GetDatabasePath(dst_dboid, dsttablespace);
+
+		/*
+		 * Copy this subdirectory to the new location
+		 *
+		 * We don't need to copy subdirectories
+		 */
+		copydir(srcpath, dstpath, false);
+
+		/* Record the filesystem change in XLOG */
+		{
+			xl_dbase_create_file_copy_rec xlrec;
+
+			xlrec.db_id = dst_dboid;
+			xlrec.tablespace_id = dsttablespace;
+			xlrec.src_db_id = src_dboid;
+			xlrec.src_tablespace_id = srctablespace;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec,
+							 sizeof(xl_dbase_create_file_copy_rec));
+
+			(void) XLogInsert(RM_DBASE_ID,
+							  XLOG_DBASE_CREATE_FILE_COPY | XLR_SPECIAL_REL_UPDATE);
+		}
+	}
+	table_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	/*
+	 * We force a checkpoint before committing.  This effectively means that
+	 * committed XLOG_DBASE_CREATE_FILE_COPY operations will never need to be
+	 * replayed (at least not in ordinary crash recovery; we still have to
+	 * make the XLOG entry for the benefit of PITR operations). This avoids
+	 * two nasty scenarios:
+	 *
+	 * #1: When PITR is off, we don't XLOG the contents of newly created
+	 * indexes; therefore the drop-and-recreate-whole-directory behavior of
+	 * DBASE_CREATE replay would lose such indexes.
+	 *
+	 * #2: Since we have to recopy the source database during DBASE_CREATE
+	 * replay, we run the risk of copying changes in it that were committed
+	 * after the original CREATE DATABASE command but before the system crash
+	 * that led to the replay.  This is at least unexpected and at worst could
+	 * lead to inconsistencies, eg duplicate table names.
+	 *
+	 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
+	 *
+	 * In PITR replay, the first of these isn't an issue, and the second is
+	 * only a risk if the CREATE DATABASE and subsequent template database
+	 * change both occur while a base backup is being taken. There doesn't
+	 * seem to be much we can do about that except document it as a
+	 * limitation.
+	 *
+	 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way, we
+	 * can avoid this.
+	 */
+	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+}
 
 /*
  * CREATE DATABASE
@@ -100,8 +627,6 @@ static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
 Oid
 createdb(ParseState *pstate, const CreatedbStmt *stmt)
 {
-	TableScanDesc scan;
-	Relation	rel;
 	Oid			src_dboid;
 	Oid			src_owner;
 	int			src_encoding = -1;
@@ -132,6 +657,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	DefElem    *dallowconnections = NULL;
 	DefElem    *dconnlimit = NULL;
 	DefElem    *dcollversion = NULL;
+	DefElem    *dstrategy = NULL;
 	char	   *dbname = stmt->dbname;
 	char	   *dbowner = NULL;
 	const char *dbtemplate = NULL;
@@ -145,6 +671,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	char	   *dbcollversion = NULL;
 	int			notherbackends;
 	int			npreparedxacts;
+	CreateDBStrategy dbstrategy = CREATEDB_WAL_LOG;
 	createdb_failure_params fparms;
 
 	/* Extract options from the statement node tree */
@@ -250,6 +777,12 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 						(errcode(ERRCODE_INVALID_PARAMETER_VALUE)),
 						errmsg("OIDs less than %u are reserved for system objects", FirstNormalObjectId));
 		}
+		else if (strcmp(defel->defname, "strategy") == 0)
+		{
+			if (dstrategy)
+				errorConflictingDefElem(defel, pstate);
+			dstrategy = defel;
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -374,6 +907,23 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 							dbtemplate)));
 	}
 
+	/* Validate the database creation strategy. */
+	if (dstrategy && dstrategy->arg)
+	{
+		char	   *strategy;
+
+		strategy = defGetString(dstrategy);
+		if (strcmp(strategy, "wal_log") == 0)
+			dbstrategy = CREATEDB_WAL_LOG;
+		else if (strcmp(strategy, "file_copy") == 0)
+			dbstrategy = CREATEDB_FILE_COPY;
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("invalid create database strategy \"%s\"", strategy),
+					 errhint("Valid strategies are \"wal_log\" and \"file_copy\".")));
+	}
+
 	/* If encoding or locales are defaulted, use source's setting */
 	if (encoding < 0)
 		encoding = src_encoding;
@@ -668,19 +1218,6 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	InvokeObjectPostCreateHook(DatabaseRelationId, dboid, 0);
 
 	/*
-	 * Force a checkpoint before starting the copy. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, if a checkpoint
-	 * happened while we're copying files, a file might be deleted just when
-	 * we're about to copy it, causing the lstat() call in copydir() to fail
-	 * with ENOENT.
-	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
-
-	/*
 	 * Once we start copying subdirectories, we need to be able to clean 'em
 	 * up if we fail.  Use an ENSURE block to make sure this happens.  (This
 	 * is not a 100% solution, because of the possibility of failure during
@@ -689,101 +1226,24 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	 */
 	fparms.src_dboid = src_dboid;
 	fparms.dest_dboid = dboid;
+	fparms.strategy = dbstrategy;
+
 	PG_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
 		/*
-		 * Iterate through all tablespaces of the template database, and copy
-		 * each one to the new database.
-		 */
-		rel = table_open(TableSpaceRelationId, AccessShareLock);
-		scan = table_beginscan_catalog(rel, 0, NULL);
-		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
-		{
-			Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
-			Oid			srctablespace = spaceform->oid;
-			Oid			dsttablespace;
-			char	   *srcpath;
-			char	   *dstpath;
-			struct stat st;
-
-			/* No need to copy global tablespace */
-			if (srctablespace == GLOBALTABLESPACE_OID)
-				continue;
-
-			srcpath = GetDatabasePath(src_dboid, srctablespace);
-
-			if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
-				directory_is_empty(srcpath))
-			{
-				/* Assume we can ignore it */
-				pfree(srcpath);
-				continue;
-			}
-
-			if (srctablespace == src_deftablespace)
-				dsttablespace = dst_deftablespace;
-			else
-				dsttablespace = srctablespace;
-
-			dstpath = GetDatabasePath(dboid, dsttablespace);
-
-			/*
-			 * Copy this subdirectory to the new location
-			 *
-			 * We don't need to copy subdirectories
-			 */
-			copydir(srcpath, dstpath, false);
-
-			/* Record the filesystem change in XLOG */
-			{
-				xl_dbase_create_rec xlrec;
-
-				xlrec.db_id = dboid;
-				xlrec.tablespace_id = dsttablespace;
-				xlrec.src_db_id = src_dboid;
-				xlrec.src_tablespace_id = srctablespace;
-
-				XLogBeginInsert();
-				XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-				(void) XLogInsert(RM_DBASE_ID,
-								  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-			}
-		}
-		table_endscan(scan);
-		table_close(rel, AccessShareLock);
-
-		/*
-		 * We force a checkpoint before committing.  This effectively means
-		 * that committed XLOG_DBASE_CREATE operations will never need to be
-		 * replayed (at least not in ordinary crash recovery; we still have to
-		 * make the XLOG entry for the benefit of PITR operations). This
-		 * avoids two nasty scenarios:
-		 *
-		 * #1: When PITR is off, we don't XLOG the contents of newly created
-		 * indexes; therefore the drop-and-recreate-whole-directory behavior
-		 * of DBASE_CREATE replay would lose such indexes.
-		 *
-		 * #2: Since we have to recopy the source database during DBASE_CREATE
-		 * replay, we run the risk of copying changes in it that were
-		 * committed after the original CREATE DATABASE command but before the
-		 * system crash that led to the replay.  This is at least unexpected
-		 * and at worst could lead to inconsistencies, eg duplicate table
-		 * names.
-		 *
-		 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
-		 *
-		 * In PITR replay, the first of these isn't an issue, and the second
-		 * is only a risk if the CREATE DATABASE and subsequent template
-		 * database change both occur while a base backup is being taken.
-		 * There doesn't seem to be much we can do about that except document
-		 * it as a limitation.
-		 *
-		 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
-		 * we can avoid this.
+		 * If the WAL_LOG strategy is in use (it is the default), call
+		 * CreateDatabaseUsingWalLog, which copies the database at the block
+		 * level and WAL-logs each copied block.  Otherwise, call
+		 * CreateDatabaseUsingFileCopy, which copies the database file by
+		 * file.
 		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+		if (dbstrategy == CREATEDB_WAL_LOG)
+			CreateDatabaseUsingWalLog(src_dboid, dboid, src_deftablespace,
+									  dst_deftablespace);
+		else
+			CreateDatabaseUsingFileCopy(src_dboid, dboid, src_deftablespace,
+										dst_deftablespace);
 
 		/*
 		 * Close pg_database, but keep lock till commit.
@@ -870,6 +1330,21 @@ createdb_failure_callback(int code, Datum arg)
 	createdb_failure_params *fparms = (createdb_failure_params *) DatumGetPointer(arg);
 
 	/*
+	 * If we were copying the database at the block level then drop pages for
+	 * the destination database that are in the shared buffer cache, and tell
+	 * the checkpointer to forget any pending fsync and unlink requests for
+	 * files in the database.  The reasoning is the same as explained in the
+	 * dropdb function.  Unlike dropdb, we don't need to call
+	 * pgstat_drop_database, because the database has not been created yet
+	 * and so there should not be any stats for it.
+	 */
+	if (fparms->strategy == CREATEDB_WAL_LOG)
+	{
+		DropDatabaseBuffers(fparms->dest_dboid);
+		ForgetDatabaseSyncRequests(fparms->dest_dboid);
+	}
+
+	/*
 	 * Release lock on source database before doing recursive remove. This is
 	 * not essential but it seems desirable to release the lock as soon as
 	 * possible.
@@ -1393,7 +1868,7 @@ movedb(const char *dbname, const char *tblspcname)
 		 * Record the filesystem change in XLOG
 		 */
 		{
-			xl_dbase_create_rec xlrec;
+			xl_dbase_create_file_copy_rec xlrec;
 
 			xlrec.db_id = db_id;
 			xlrec.tablespace_id = dst_tblspcoid;
@@ -1401,10 +1876,11 @@ movedb(const char *dbname, const char *tblspcname)
 			xlrec.src_tablespace_id = src_tblspcoid;
 
 			XLogBeginInsert();
-			XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
+			XLogRegisterData((char *) &xlrec,
+							 sizeof(xl_dbase_create_file_copy_rec));
 
 			(void) XLogInsert(RM_DBASE_ID,
-							  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
+							  XLOG_DBASE_CREATE_FILE_COPY | XLR_SPECIAL_REL_UPDATE);
 		}
 
 		/*
@@ -1440,9 +1916,10 @@ movedb(const char *dbname, const char *tblspcname)
 
 		/*
 		 * Force another checkpoint here.  As in CREATE DATABASE, this is to
-		 * ensure that we don't have to replay a committed XLOG_DBASE_CREATE
-		 * operation, which would cause us to lose any unlogged operations
-		 * done in the new DB tablespace before the next checkpoint.
+		 * ensure that we don't have to replay a committed
+		 * XLOG_DBASE_CREATE_FILE_COPY operation, which would cause us to lose
+		 * any unlogged operations done in the new DB tablespace before the
+		 * next checkpoint.
 		 */
 		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
 
@@ -2377,9 +2854,10 @@ dbase_redo(XLogReaderState *record)
 	/* Backup blocks are not used in dbase records */
 	Assert(!XLogRecHasAnyBlockRefs(record));
 
-	if (info == XLOG_DBASE_CREATE)
+	if (info == XLOG_DBASE_CREATE_FILE_COPY)
 	{
-		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
+		xl_dbase_create_file_copy_rec *xlrec =
+		(xl_dbase_create_file_copy_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
 		struct stat st;
@@ -2414,6 +2892,18 @@ dbase_redo(XLogReaderState *record)
 		 */
 		copydir(src_path, dst_path, false);
 	}
+	else if (info == XLOG_DBASE_CREATE_WAL_LOG)
+	{
+		xl_dbase_create_wal_log_rec *xlrec =
+		(xl_dbase_create_wal_log_rec *) XLogRecGetData(record);
+		char	   *dbpath;
+
+		dbpath = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+
+		/* Create the database directory with the version file. */
+		CreateDirAndVersionFile(dbpath, xlrec->db_id, xlrec->tablespace_id,
+								true);
+	}
 	else if (info == XLOG_DBASE_DROP)
 	{
 		xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) XLogRecGetData(record);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3cadcd2..b1cebc4 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -38,6 +38,7 @@
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
 #include "executor/instrument.h"
 #include "lib/binaryheap.h"
 #include "miscadmin.h"
@@ -486,6 +487,9 @@ static void FindAndDropRelFileNodeBuffers(RelFileNode rnode,
 										  ForkNumber forkNum,
 										  BlockNumber nForkBlock,
 										  BlockNumber firstDelBlock);
+static void RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+										   ForkNumber forkNum,
+										   bool permanent);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int	rnode_comparator(const void *p1, const void *p2);
@@ -3679,6 +3683,158 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
 }
 
 /* ---------------------------------------------------------------------
+ *		RelationCopyStorageUsingBuffer
+ *
+ *		Copy fork's data using bufmgr.  Same as RelationCopyStorage but instead
+ *		of using smgrread and smgrextend this will copy using bufmgr APIs.
+ *
+ *		Refer to the comments atop CreateAndCopyRelationData() for details
+ *		about the 'permanent' parameter.
+ * --------------------------------------------------------------------
+ */
+static void
+RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+							   ForkNumber forkNum, bool permanent)
+{
+	Buffer		srcBuf;
+	Buffer		dstBuf;
+	Page		srcPage;
+	Page		dstPage;
+	bool		use_wal;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	BufferAccessStrategy bstrategy_src;
+	BufferAccessStrategy bstrategy_dst;
+
+	/*
+	 * We need to log the copied data in WAL iff WAL archiving/streaming is
+	 * enabled and the relation is persistent, or this is the init fork of an
+	 * unlogged relation.
+	 */
+	use_wal = XLogIsNeeded() && (permanent || forkNum == INIT_FORKNUM);
+
+	/* Get number of blocks in the source relation. */
+	nblocks = smgrnblocks(src, forkNum);
+
+	/* Nothing to copy; just return. */
+	if (nblocks == 0)
+		return;
+
+	/*
+	 * We are going to copy whole relation from the source to the destination
+	 * so use BAS_BULKREAD strategy for the source relation and BAS_BULKWRITE
+	 * strategy for the destination relation.
+	 */
+	bstrategy_src = GetAccessStrategy(BAS_BULKREAD);
+	bstrategy_dst = GetAccessStrategy(BAS_BULKWRITE);
+
+	/* Iterate over each block of the source relation file. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		/* If we got a cancel signal during the copy of the data, quit */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Read block from source relation. */
+		srcBuf = ReadBufferWithoutRelcache(src->smgr_rnode.node, forkNum,
+										   blkno, RBM_NORMAL, bstrategy_src,
+										   permanent);
+		srcPage = BufferGetPage(srcBuf);
+		if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
+		{
+			ReleaseBuffer(srcBuf);
+			continue;
+		}
+
+		/* Use P_NEW to extend the relation. */
+		dstBuf = ReadBufferWithoutRelcache(dst->smgr_rnode.node, forkNum,
+										   P_NEW, RBM_NORMAL, bstrategy_dst,
+										   permanent);
+		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
+
+		START_CRIT_SECTION();
+
+		/* Initialize the page and write the data. */
+		dstPage = BufferGetPage(dstBuf);
+		PageInit(dstPage, BufferGetPageSize(dstBuf), 0);
+		memcpy(dstPage, srcPage, BLCKSZ);
+		MarkBufferDirty(dstBuf);
+
+		/* WAL-log the copied page. */
+		if (use_wal)
+			log_newpage_buffer(dstBuf, true);
+
+		END_CRIT_SECTION();
+
+		UnlockReleaseBuffer(dstBuf);
+		ReleaseBuffer(srcBuf);
+	}
+}
+
+/* ---------------------------------------------------------------------
+ *		CreateAndCopyRelationData
+ *
+ *		Create destination relation storage and copy the data of all forks of
+ *		the source relation to the destination.
+ *
+ *		Currently this API is not supported for temporary relations.  So
+ *		pass permanent as true for a regular relation and false for an
+ *		unlogged relation.
+ * --------------------------------------------------------------------
+ */
+void
+CreateAndCopyRelationData(RelFileNode src_rnode, RelFileNode dst_rnode,
+						  bool permanent)
+{
+	SMgrRelation	src_smgr;
+	SMgrRelation	dst_smgr;
+	char			relpersistence;
+
+	/* Open the source relation at smgr level. */
+	src_smgr = smgropen(src_rnode, InvalidBackendId);
+
+	/* Set the relpersistence. */
+	relpersistence = permanent ?
+		RELPERSISTENCE_PERMANENT : RELPERSISTENCE_UNLOGGED;
+
+	/*
+	 * Create and copy all forks of the relation.
+	 *
+	 * NOTE: any conflict in relfilenode value will be caught in
+	 * RelationCreateStorage().
+	 */
+	dst_smgr = RelationCreateStorage(dst_rnode, relpersistence);
+
+	/* copy main fork */
+	RelationCopyStorageUsingBuffer(src_smgr, dst_smgr, MAIN_FORKNUM,
+								   permanent);
+
+	/* copy those extra forks that exist */
+	for (ForkNumber forkNum = MAIN_FORKNUM + 1;
+		 forkNum <= MAX_FORKNUM; forkNum++)
+	{
+		if (smgrexists(src_smgr, forkNum))
+		{
+			smgrcreate(dst_smgr, forkNum, false);
+
+			/*
+			 * WAL log creation if the relation is persistent, or this is the
+			 * init fork of an unlogged relation.
+			 */
+			if (permanent || forkNum == INIT_FORKNUM)
+				log_smgrcreate(&dst_rnode, forkNum);
+
+			/* Copy a fork's data, block by block. */
+			RelationCopyStorageUsingBuffer(src_smgr, dst_smgr, forkNum,
+										   permanent);
+		}
+	}
+
+	/* Close the smgr rel */
+	smgrclose(src_smgr);
+	smgrclose(dst_smgr);
+}
+
+/* ---------------------------------------------------------------------
  *		FlushDatabaseBuffers
  *
  *		This function writes all dirty pages of a database out to disk
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 56df08c..d5cf9ed 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -370,7 +370,7 @@ extractPageInfo(XLogReaderState *record)
 
 	/* Is this a special record type that I recognize? */
 
-	if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_CREATE)
+	if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_CREATE_FILE_COPY)
 	{
 		/*
 		 * New databases can be safely ignored. It won't be present in the
@@ -382,6 +382,13 @@ extractPageInfo(XLogReaderState *record)
 		 * overwriting the database created in the target system.
 		 */
 	}
+	else if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_CREATE_WAL_LOG)
+	{
+		/*
+		 * New databases can be safely ignored. It won't be present in the
+		 * source system, so it will be deleted.
+		 */
+	}
 	else if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_DROP)
 	{
 		/*
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 1717282..d0e3755 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2738,10 +2738,12 @@ psql_completion(const char *text, int start, int end)
 		COMPLETE_WITH("OWNER", "TEMPLATE", "ENCODING", "TABLESPACE",
 					  "IS_TEMPLATE",
 					  "ALLOW_CONNECTIONS", "CONNECTION LIMIT",
-					  "LC_COLLATE", "LC_CTYPE", "LOCALE", "OID");
+					  "LC_COLLATE", "LC_CTYPE", "LOCALE", "OID", "STRATEGY");
 
 	else if (Matches("CREATE", "DATABASE", MatchAny, "TEMPLATE"))
 		COMPLETE_WITH_QUERY(Query_for_list_of_template_databases);
+	else if (Matches("CREATE", "DATABASE", MatchAny, "STRATEGY"))
+		COMPLETE_WITH("WAL_LOG", "FILE_COPY");
 
 	/* CREATE DOMAIN */
 	else if (Matches("CREATE", "DOMAIN", MatchAny))
diff --git a/src/include/commands/dbcommands_xlog.h b/src/include/commands/dbcommands_xlog.h
index 593a857..077a000 100644
--- a/src/include/commands/dbcommands_xlog.h
+++ b/src/include/commands/dbcommands_xlog.h
@@ -18,17 +18,31 @@
 #include "lib/stringinfo.h"
 
 /* record types */
-#define XLOG_DBASE_CREATE		0x00
-#define XLOG_DBASE_DROP			0x10
+#define XLOG_DBASE_CREATE_FILE_COPY		0x00
+#define XLOG_DBASE_CREATE_WAL_LOG		0x10
+#define XLOG_DBASE_DROP					0x20
 
-typedef struct xl_dbase_create_rec
+/*
+ * Records copying of a single subdirectory incl. contents, while creating a
+ * database using FILE COPY strategy.
+ */
+typedef struct xl_dbase_create_file_copy_rec
 {
-	/* Records copying of a single subdirectory incl. contents */
 	Oid			db_id;
 	Oid			tablespace_id;
 	Oid			src_db_id;
 	Oid			src_tablespace_id;
-} xl_dbase_create_rec;
+} xl_dbase_create_file_copy_rec;
+
+/*
+ * Records creating a database directory with version file, while creating a
+ * database using WAL LOG strategy.
+ */
+typedef struct xl_dbase_create_wal_log_rec
+{
+	Oid			db_id;
+	Oid			tablespace_id;
+} xl_dbase_create_wal_log_rec;
 
 typedef struct xl_dbase_drop_rec
 {
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index fd0452f..a6b657f 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -204,6 +204,9 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
 extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels);
+extern void CreateAndCopyRelationData(RelFileNode src_rnode,
+									  RelFileNode dst_rnode,
+									  bool permanent);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
 								   int nforks, BlockNumber *firstDelBlock);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index eaf3e7a..0f01356 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -460,6 +460,8 @@ CoverPos
 CreateAmStmt
 CreateCastStmt
 CreateConversionStmt
+CreateDBRelInfo
+CreateDBStrategy
 CreateDomainStmt
 CreateEnumStmt
 CreateEventTrigStmt
@@ -3694,7 +3696,8 @@ xl_btree_update
 xl_btree_vacuum
 xl_clog_truncate
 xl_commit_ts_truncate
-xl_dbase_create_rec
+xl_dbase_create_file_copy_rec
+xl_dbase_create_wal_log_rec
 xl_dbase_drop_rec
 xl_end_of_recovery
 xl_hash_add_ovfl_page
-- 
1.8.3.1

v16-0006-Support-create-database-strategy-in-createdb-too.patch (text/x-patch; charset=US-ASCII)

From a3da9f080552f072b14dcc05d4c2d5fe5e0d02ba Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 11 Mar 2022 11:48:55 +0530
Subject: [PATCH v16 6/6] Support create database strategy in createdb tool

---
 doc/src/sgml/ref/createdb.sgml    | 16 ++++++++++++++++
 src/bin/scripts/createdb.c        | 10 +++++++++-
 src/bin/scripts/t/020_createdb.pl | 20 ++++++++++++++++++++
 3 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/createdb.sgml b/doc/src/sgml/ref/createdb.sgml
index 8647345..2a7beca 100644
--- a/doc/src/sgml/ref/createdb.sgml
+++ b/doc/src/sgml/ref/createdb.sgml
@@ -159,6 +159,22 @@ PostgreSQL documentation
      </varlistentry>
 
      <varlistentry>
+      <term><option>-S <replaceable class="parameter">strategy</replaceable></option></term>
+      <term><option>--strategy=<replaceable class="parameter">strategy</replaceable></option></term>
+      <listitem>
+       <para>
+        Specifies the database creation strategy.  Two strategies are
+        available: <literal>WAL_LOG</literal> and
+        <literal>FILE_COPY</literal>.  With the <literal>WAL_LOG</literal>
+        strategy, the database is copied block by block and each copied
+        block is WAL-logged.  With the <literal>FILE_COPY</literal>
+        strategy, a file-system-level copy is performed, so individual
+        blocks are not WAL-logged.  The default is <literal>WAL_LOG</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><option>-T <replaceable class="parameter">template</replaceable></option></term>
       <term><option>--template=<replaceable class="parameter">template</replaceable></option></term>
       <listitem>
diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c
index b0c6805..9d3c4ef 100644
--- a/src/bin/scripts/createdb.c
+++ b/src/bin/scripts/createdb.c
@@ -37,6 +37,7 @@ main(int argc, char *argv[])
 		{"lc-collate", required_argument, NULL, 1},
 		{"lc-ctype", required_argument, NULL, 2},
 		{"locale", required_argument, NULL, 'l'},
+		{"strategy", required_argument, NULL, 'S'},
 		{"maintenance-db", required_argument, NULL, 3},
 		{NULL, 0, NULL, 0}
 	};
@@ -61,6 +62,7 @@ main(int argc, char *argv[])
 	char	   *lc_collate = NULL;
 	char	   *lc_ctype = NULL;
 	char	   *locale = NULL;
+	char	   *strategy = NULL;
 
 	PQExpBufferData sql;
 
@@ -73,7 +75,7 @@ main(int argc, char *argv[])
 
 	handle_help_version_opts(argc, argv, "createdb", help);
 
-	while ((c = getopt_long(argc, argv, "h:p:U:wWeO:D:T:E:l:", long_options, &optindex)) != -1)
+	while ((c = getopt_long(argc, argv, "h:p:U:wWeO:D:T:E:l:S:", long_options, &optindex)) != -1)
 	{
 		switch (c)
 		{
@@ -119,6 +121,9 @@ main(int argc, char *argv[])
 			case 3:
 				maintenance_db = pg_strdup(optarg);
 				break;
+			case 'S':
+				strategy = pg_strdup(optarg);
+				break;
 			default:
 				fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
 				exit(1);
@@ -217,6 +222,8 @@ main(int argc, char *argv[])
 		appendPQExpBufferStr(&sql, " LC_CTYPE ");
 		appendStringLiteralConn(&sql, lc_ctype, conn);
 	}
+	if (strategy)
+		appendPQExpBuffer(&sql, " STRATEGY %s ", fmtId(strategy));
 
 	appendPQExpBufferChar(&sql, ';');
 
@@ -274,6 +281,7 @@ help(const char *progname)
 	printf(_("      --lc-collate=LOCALE      LC_COLLATE setting for the database\n"));
 	printf(_("      --lc-ctype=LOCALE        LC_CTYPE setting for the database\n"));
 	printf(_("  -O, --owner=OWNER            database user to own the new database\n"));
+	printf(_("  -S, --strategy=STRATEGY      database creation strategy wal_log or file_copy\n"));
 	printf(_("  -T, --template=TEMPLATE      template database to copy\n"));
 	printf(_("  -V, --version                output version information, then exit\n"));
 	printf(_("  -?, --help                   show this help, then exit\n"));
diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index 6392454..ccfbe17 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -76,4 +76,24 @@ $node->command_checks_all(
 	],
 	'createdb with incorrect --lc-ctype');
 
+$node->command_checks_all(
+	[ 'createdb', '--strategy', "foo", 'foobar2' ],
+	1,
+	[qr/^$/],
+	[
+		qr/^createdb: error: database creation failed: ERROR:  invalid create database strategy|^createdb: error: database creation failed: ERROR:  invalid create database strategy foo/s
+	],
+	'createdb with incorrect --strategy');
+
+# Check database creation strategy
+$node->issues_sql_like(
+	[ 'createdb', '-T', 'foobar2', 'foobar4', '-S', 'wal_log'],
+	qr/statement: CREATE DATABASE foobar4 TEMPLATE foobar2 STRATEGY wal_log/,
+	'create database with WAL_LOG strategy');
+
+$node->issues_sql_like(
+	[ 'createdb', '-T', 'foobar2', 'foobar5', '-S', 'file_copy'],
+	qr/statement: CREATE DATABASE foobar5 TEMPLATE foobar2 STRATEGY file_copy/,
+	'create database with FILE_COPY strategy');
+
 done_testing();
-- 
1.8.3.1

#166

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#165)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Mar 16, 2022 at 12:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Thanks Ashutosh and Robert. The other patches applied cleanly on top
of this patch, but I still generated a new version so that all the
patches can be found together. There are no other changes.

I committed my v3 of my refactoring patch, here 0001.

I'm working over the comments in the rest of the patch series and will
post an updated version when I get done. I think I will likely merge
all the remaining patches together just to make it simpler to manage;
we can split things out again if we need to do that.

One question that occurred to me when looking this over is whether, or
why, it's safe against concurrent smgr invalidations. It seems to me
that every loop in the new CREATE DATABASE code needs to
CHECK_FOR_INTERRUPTS() -- some do already -- and when they do that, I
think we might receive an invalidation message that causes us to
smgrclose() some or all of the things where we previously did
smgropen(). I don't quite see why that can't cause problems here. I
tried running the src/bin/scripts regression tests with
debug_discard_caches=1 and none of the tests failed, so there may very
well be a reason why this is actually totally fine, but I don't know
what it is. On the other hand, it may be that things went horribly
wrong and the tests are just not smart enough to catch it, or maybe
there's a problematic scenario which those tests just don't hit. I
don't know. Thoughts?

--
Robert Haas
EDB: http://www.enterprisedb.com

#167

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Robert Haas (#166)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Fri, Mar 18, 2022 at 1:44 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 16, 2022 at 12:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Thanks Ashutosh and Robert. The other patches applied cleanly on top
of this patch, but I still generated a new version so that all the
patches can be found together. There are no other changes.

I committed my v3 of my refactoring patch, here 0001.

I'm working over the comments in the rest of the patch series and will
post an updated version when I get done. I think I will likely merge
all the remaining patches together just to make it simpler to manage;
we can split things out again if we need to do that.

Thanks for the effort.

One question that occurred to me when looking this over is whether, or
why, it's safe against concurrent smgr invalidations.

We are only accessing the smgr of the source database and the
destination database. And there is no one else that can be connected
to the source db and the destination db is not visible to anyone. So
do we really need to worry about the concurrent smgr invalidation?
What am I missing?

It seems to me

that every loop in the new CREATE DATABASE code needs to
CHECK_FOR_INTERRUPTS() -- some do already -- and when they do that,

Yes, the pg_class reading code is missing this check, so we need to
add it. But the copying code already has it:
CreateDatabaseUsingWalLog() has it inside the deepest loop, in
RelationCopyStorageUsingBuffer(), and similarly
CreateDatabaseUsingFileCopy() has it in copydir(). Maybe we should
put it in every loop so that we do not skip the check due to some
condition.
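
A minimal sketch of the "check in every loop" idea, not code from the patch: the loop polls for a pending interrupt at the top of each iteration, so entries that are skipped with `continue` cannot bypass the check. `ToyInterruptState`, `toy_interrupt_pending()` and `toy_copy_nonempty()` are hypothetical names for illustration; in the server the poll would be CHECK_FOR_INTERRUPTS().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the backend's pending-interrupt state. */
typedef struct ToyInterruptState
{
    int         polls;          /* how many times the loop has polled */
    int         cancel_after;   /* cancel once polls exceeds this; -1 = never */
} ToyInterruptState;

/* Toy analogue of an interrupt check: count the poll, report cancellation. */
bool
toy_interrupt_pending(ToyInterruptState *intr)
{
    assert(intr != NULL);
    intr->polls++;
    return intr->cancel_after >= 0 && intr->polls > intr->cancel_after;
}

/*
 * Copy the non-zero entries of src into dst.  Returns the number of entries
 * copied, or -1 if the copy was cancelled part way through.
 */
long
toy_copy_nonempty(const int *src, size_t n, int *dst, ToyInterruptState *intr)
{
    long        copied = 0;

    for (size_t i = 0; i < n; i++)
    {
        /* Poll first, so that the skip below cannot bypass the check. */
        if (toy_interrupt_pending(intr))
            return -1;

        /* Skip "empty" entries, much like a copy loop skipping empty pages. */
        if (src[i] == 0)
            continue;

        dst[copied++] = src[i];
    }
    return copied;
}
```

If the poll were placed after the skip instead, a long run of empty entries would be processed without any chance to cancel.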

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#168

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#167)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Fri, Mar 18, 2022 at 12:39 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

One question that occurred to me when looking this over is whether, or
why, it's safe against concurrent smgr invalidations.

We are only accessing the smgr of the source database and the
destination database. And there is no one else that can be connected
to the source db and the destination db is not visible to anyone. So
do we really need to worry about the concurrent smgr invalidation?
What am I missing?

A sinval reset can occur at any moment due to an overflow of the
queue. That acts as a universal reset of everything. So you can't
reason on the basis of what somebody might be sending.
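
A toy model of that point, under stated assumptions: this is not PostgreSQL's sinval implementation, just a small ring buffer in which a reader that falls too far behind loses messages and has to treat the situation as "discard everything". `ToyInvalQueue`, `ToyReader`, `toy_send()` and `toy_receive()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_QUEUE_SIZE 4        /* deliberately tiny so overflow is easy to hit */

/* Hypothetical stand-in for a shared invalidation queue. */
typedef struct ToyInvalQueue
{
    unsigned long written;              /* total messages ever appended */
    int           msgs[TOY_QUEUE_SIZE]; /* ring buffer of message payloads */
} ToyInvalQueue;

/* Per-reader state: how far this reader has consumed the queue. */
typedef struct ToyReader
{
    unsigned long next_read;
} ToyReader;

/* Append a message, overwriting the oldest slot once the ring is full. */
void
toy_send(ToyInvalQueue *q, int msg)
{
    q->msgs[q->written % TOY_QUEUE_SIZE] = msg;
    q->written++;
}

/*
 * Fetch up to 'max' pending messages into 'out'.  Returns the number of
 * messages copied, or -1 if unread entries were overwritten.  On -1 the
 * caller cannot know what it missed and must discard all cached state,
 * regardless of which messages anyone actually sent.
 */
int
toy_receive(ToyInvalQueue *q, ToyReader *r, int *out, int max)
{
    int         n = 0;

    assert(max >= 0);

    if (q->written - r->next_read > TOY_QUEUE_SIZE)
    {
        r->next_read = q->written;  /* catch up; nothing left to replay */
        return -1;                  /* universal reset */
    }

    while (r->next_read < q->written && n < max)
        out[n++] = q->msgs[r->next_read++ % TOY_QUEUE_SIZE];

    return n;
}
```

The reset branch is the case that defeats any argument of the form "nobody will send an invalidation for my relation": unrelated traffic alone can trigger it.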

--
Robert Haas
EDB: http://www.enterprisedb.com

#169

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Robert Haas (#168)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Sun, Mar 20, 2022 at 12:03 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Mar 18, 2022 at 12:39 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

One question that occurred to me when looking this over is whether, or
why, it's safe against concurrent smgr invalidations.

We are only accessing the smgr of the source database and the
destination database. And there is no one else that can be connected
to the source db and the destination db is not visible to anyone. So
do we really need to worry about the concurrent smgr invalidation?
What am I missing?

A sinval reset can occur at any moment due to an overflow of the
queue. That acts as a universal reset of everything. So you can't
reason on the basis of what somebody might be sending.

I thought that way because IIUC, when we are locking the database
tuple we are ensuring that we are calling
ReceiveSharedInvalidMessages() right? And IIUC
ReceiveSharedInvalidMessages() is designed in such a way that it will
consume all the outstanding messages and that's the reason it loops
multiple times if it identifies that the queue is full. And if my
assumption here is correct then I think it is also correct that now we
only need to worry about anyone generating new invalidations and that
is not possible in this case.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#170

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#169)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Sun, Mar 20, 2022 at 1:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I thought that way because IIUC, when we are locking the database
tuple we are ensuring that we are calling
ReceiveSharedInvalidMessages() right? And IIUC
ReceiveSharedInvalidMessages() is designed in such a way that it will
consume all the outstanding messages and that's the reason it loops
multiple times if it identifies that the queue is full. And if my
assumption here is correct then I think it is also correct that now we
only need to worry about anyone generating new invalidations and that
is not possible in this case.

Well, I don't see how that chain of logic addresses my concern about
sinval reset.

Mind you, I'm not sure there's an actual problem here, because I tried
testing the patch with debug_discard_caches=1 and nothing failed. But
I still don't understand WHY nothing failed.

--
Robert Haas
EDB: http://www.enterprisedb.com

#171

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Robert Haas (#170)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Mon, Mar 21, 2022 at 7:07 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, Mar 20, 2022 at 1:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I thought that way because IIUC, when we are locking the database
tuple we are ensuring that we are calling
ReceiveSharedInvalidMessages() right? And IIUC
ReceiveSharedInvalidMessages() is designed in such a way that it will
consume all the outstanding messages and that's the reason it loops
multiple times if it identifies that the queue is full. And if my
assumption here is correct then I think it is also correct that now we
only need to worry about anyone generating new invalidations and that
is not possible in this case.

Well, I don't see how that chain of logic addresses my concern about
sinval reset.

Mind you, I'm not sure there's an actual problem here, because I tried
testing the patch with debug_discard_caches=1 and nothing failed. But
I still don't understand WHY nothing failed.

Okay, I see what you are saying. Yeah this looks like a problem to me
as well. I will try to reproduce this issue.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#172

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#171)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Mon, Mar 21, 2022 at 8:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Mar 21, 2022 at 7:07 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, Mar 20, 2022 at 1:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I thought that way because IIUC, when we are locking the database
tuple we are ensuring that we are calling
ReceiveSharedInvalidMessages() right? And IIUC
ReceiveSharedInvalidMessages() is designed in such a way that it will
consume all the outstanding messages and that's the reason it loops
multiple times if it identifies that the queue is full. And if my
assumption here is correct then I think it is also correct that now we
only need to worry about anyone generating new invalidations and that
is not possible in this case.

Well, I don't see how that chain of logic addresses my concern about
sinval reset.

Mind you, I'm not sure there's an actual problem here, because I tried
testing the patch with debug_discard_caches=1 and nothing failed. But
I still don't understand WHY nothing failed.

Okay, I see what you are saying. Yeah this looks like a problem to me
as well. I will try to reproduce this issue.

I tried to debug the case, but I realized that CHECK_FOR_INTERRUPTS()
does not call AcceptInvalidationMessages(), and I could not find such
a call path in the code either. While debugging I noticed that
AcceptInvalidationMessages() is called multiple times, but only
through LockRelationId(), and by the time we lock the next relation we
have already closed the previous smgr, because we keep only one smgr
open at a time. That is why we are not hitting the issue we thought we
could. Is there any condition under which AcceptInvalidationMessages()
is called through CHECK_FOR_INTERRUPTS()? I could not see one while
debugging or in the code.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#173

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#172)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Mon, Mar 21, 2022 at 11:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I tried to debug the case, but I realized that CHECK_FOR_INTERRUPTS()
does not call AcceptInvalidationMessages(), and I could not find such
a call path in the code either. While debugging I noticed that
AcceptInvalidationMessages() is called multiple times, but only
through LockRelationId(), and by the time we lock the next relation we
have already closed the previous smgr, because we keep only one smgr
open at a time. That is why we are not hitting the issue we thought we
could. Is there any condition under which AcceptInvalidationMessages()
is called through CHECK_FOR_INTERRUPTS()? I could not see one while
debugging or in the code.

Yeah, I think the reason you can't find it is that it's not there. I
was confused in what I wrote earlier. I think we only process sinval
catchups when we're idle, not at every CHECK_FOR_INTERRUPTS(). And I
think the reason for that is precisely that it would be hard to write
correct code otherwise, since invalidations might then get processed
in a lot more places. So ... I guess all we really need to do here is
avoid assuming that the results of smgropen() are valid across any
code that might acquire relation locks. Which I think the code already
does.
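
A minimal sketch of that rule, assuming a generation counter as a stand-in for cache invalidation; none of these names exist in PostgreSQL. Acquiring a lock may process invalidations, so a handle opened before the lock can be stale afterwards, while a handle opened after the lock is safe until the next lock acquisition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cache: bumping 'generation' invalidates every open handle. */
typedef struct ToyCache
{
    unsigned    generation;
} ToyCache;

/* Hypothetical handle, loosely analogous to the result of smgropen(). */
typedef struct ToyHandle
{
    int         relnode;
    unsigned    generation;     /* cache generation at open time */
} ToyHandle;

ToyHandle
toy_open(const ToyCache *cache, int relnode)
{
    ToyHandle   h;

    assert(cache != NULL);
    h.relnode = relnode;
    h.generation = cache->generation;
    return h;
}

/* A handle is usable only if no invalidation happened since it was opened. */
bool
toy_handle_valid(const ToyCache *cache, const ToyHandle *h)
{
    return h->generation == cache->generation;
}

/*
 * Toy lock acquisition.  Like a real relation lock, it is a point where
 * pending invalidations get processed; 'inval_pending' simulates that.
 */
void
toy_lock_relation(ToyCache *cache, bool inval_pending)
{
    if (inval_pending)
        cache->generation++;
}

/* Safe ordering: take the lock first, and only then open the handle. */
ToyHandle
toy_lock_then_open(ToyCache *cache, int relnode, bool inval_pending)
{
    toy_lock_relation(cache, inval_pending);
    return toy_open(cache, relnode);
}
```

This mirrors the observation earlier in the thread that the previous smgr is already closed by the time the next relation is locked.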

But on a related note, why doesn't CreateDatabaseUsingWalLog() acquire
locks on both the source and destination relations? It looks like
you're only taking locks for the source template database ... but I
thought the intention here was to make sure that we didn't pull pages
into shared_buffers without holding a lock on the relation and/or the
database? I suppose the point is that while the template database
might be concurrently dropped, nobody can be doing anything
concurrently to the target database because nobody knows that it
exists yet. Still, I think that this would be the only case where we
let pages into shared_buffers without a relation or database lock,
though maybe I'm confused about this point, too. If not, perhaps we
should consider locking the target database OID and each relation OID
as we are copying it?

I guess I'm imagining that there might be more code pathways in the
future that want to ensure that there are no remaining buffers for
some particular database or relation OID. It seems natural to want to
be able to take some lock that prevents buffers from being added, and
then go and get rid of all the ones that are there already. But I
admit I can't quite think of a concrete case where we'd want to do
something like this where the patch as coded would be a problem. I'm
just thinking perhaps taking locks is fairly harmless and might avoid
some hypothetical problem later.
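
The hypothetical future use can be sketched as a toy, assuming a single relation and simple try-locks; this is not PostgreSQL's lock manager or buffer manager, and every name here is made up. Pages may enter the pool only while a shared lock is held, so whoever obtains the exclusive lock knows that no new pages can appear while it drops the existing ones.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NBUFFERS 4

/* Hypothetical try-lock covering one relation OID. */
typedef struct ToyRelLock
{
    int         shared_holders;
    bool        exclusive_held;
} ToyRelLock;

/* Tiny buffer pool: owner[i] is the relation OID cached in slot i, 0 if free. */
typedef struct ToyPool
{
    unsigned    owner[TOY_NBUFFERS];
} ToyPool;

bool
toy_lock_shared(ToyRelLock *lock)
{
    if (lock->exclusive_held)
        return false;
    lock->shared_holders++;
    return true;
}

void
toy_unlock_shared(ToyRelLock *lock)
{
    assert(lock->shared_holders > 0);
    lock->shared_holders--;
}

bool
toy_lock_exclusive(ToyRelLock *lock)
{
    if (lock->exclusive_held || lock->shared_holders > 0)
        return false;
    lock->exclusive_held = true;
    return true;
}

/* Cache a page of 'reloid'; refused unless a shared lock is currently held. */
bool
toy_read_page(ToyPool *pool, const ToyRelLock *lock, unsigned reloid)
{
    if (lock->shared_holders == 0)
        return false;
    for (int i = 0; i < TOY_NBUFFERS; i++)
    {
        if (pool->owner[i] == 0)
        {
            pool->owner[i] = reloid;
            return true;
        }
    }
    return false;               /* pool is full */
}

/* Evict every page of 'reloid'.  Returns the count, or -1 without the lock. */
int
toy_drop_buffers(ToyPool *pool, const ToyRelLock *lock, unsigned reloid)
{
    int         dropped = 0;

    if (!lock->exclusive_held)
        return -1;
    for (int i = 0; i < TOY_NBUFFERS; i++)
    {
        if (pool->owner[i] == reloid)
        {
            pool->owner[i] = 0;
            dropped++;
        }
    }
    return dropped;
}
```

If some code path could add pages without holding any lock, the guarantee behind `toy_drop_buffers()` would no longer hold, which is the concern behind locking the target database and relation OIDs as well.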

Thoughts?

--
Robert Haas
EDB: http://www.enterprisedb.com

#174

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Robert Haas (#173)

1 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Mon, Mar 21, 2022 at 11:53 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Mar 21, 2022 at 11:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I tried to debug the case, but I realized that CHECK_FOR_INTERRUPTS()
does not call AcceptInvalidationMessages(), and I could not find such
a call path in the code either. While debugging I noticed that
AcceptInvalidationMessages() is called multiple times, but only
through LockRelationId(), and by the time we lock the next relation we
have already closed the previous smgr, because we keep only one smgr
open at a time. That is why we are not hitting the issue we thought we
could. Is there any condition under which AcceptInvalidationMessages()
is called through CHECK_FOR_INTERRUPTS()? I could not see one while
debugging or in the code.

Yeah, I think the reason you can't find it is that it's not there. I
was confused in what I wrote earlier. I think we only process sinval
catchups when we're idle, not at every CHECK_FOR_INTERRUPTS(). And I
think the reason for that is precisely that it would be hard to write
correct code otherwise, since invalidations might then get processed
in a lot more places. So ... I guess all we really need to do here is
avoid assuming that the results of smgropen() are valid across any
code that might acquire relation locks. Which I think the code already
does.

But on a related note, why doesn't CreateDatabaseUsingWalLog() acquire
locks on both the source and destination relations? It looks like
you're only taking locks for the source template database ... but I
thought the intention here was to make sure that we didn't pull pages
into shared_buffers without holding a lock on the relation and/or the
database? I suppose the point is that while the template database
might be concurrently dropped, nobody can be doing anything
concurrently to the target database because nobody knows that it
exists yet. Still, I think that this would be the only case where we
let pages into shared_buffers without a relation or database lock,
though maybe I'm confused about this point, too. If not, perhaps we
should consider locking the target database OID and each relation OID
as we are copying it?

I guess I'm imagining that there might be more code pathways in the
future that want to ensure that there are no remaining buffers for
some particular database or relation OID. It seems natural to want to
be able to take some lock that prevents buffers from being added, and
then go and get rid of all the ones that are there already. But I
admit I can't quite think of a concrete case where we'd want to do
something like this where the patch as coded would be a problem. I'm
just thinking perhaps taking locks is fairly harmless and might avoid
some hypothetical problem later.

Thoughts?
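
The "take a lock that prevents buffers from being added, then get rid of the ones already there" idea described above can be sketched as a small standalone C model. Everything here is an assumption made for illustration: the model_* names, the fixed-size pool and the single database-level lock are hypothetical and much simpler than PostgreSQL's lock manager and buffer manager.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_POOL_SIZE 8
#define MODEL_INVALID_DB 0

/* One slot of a toy buffer pool, tagged with the owning database. */
typedef struct ModelBuffer
{
	unsigned	dboid;			/* owning database, or MODEL_INVALID_DB */
	unsigned	blkno;			/* block number within that database */
} ModelBuffer;

static ModelBuffer model_pool[MODEL_POOL_SIZE];
static unsigned model_share_holders;	/* holders of the shared lock */
static bool model_exclusive_held;		/* is the exclusive lock held? */

/* Shared lock: refused while someone holds the exclusive lock. */
static bool
model_lock_share(void)
{
	if (model_exclusive_held)
		return false;
	model_share_holders++;
	return true;
}

static void
model_unlock_share(void)
{
	assert(model_share_holders > 0);
	model_share_holders--;
}

/* Exclusive lock: refused while any other lock is held. */
static bool
model_lock_exclusive(void)
{
	if (model_exclusive_held || model_share_holders > 0)
		return false;
	model_exclusive_held = true;
	return true;
}

/* Pull a page into the pool; refused unless a shared lock is held. */
static bool
model_read_buffer(unsigned dboid, unsigned blkno)
{
	size_t		i;

	if (model_share_holders == 0)
		return false;
	for (i = 0; i < MODEL_POOL_SIZE; i++)
	{
		if (model_pool[i].dboid == MODEL_INVALID_DB)
		{
			model_pool[i].dboid = dboid;
			model_pool[i].blkno = blkno;
			return true;
		}
	}
	return false;				/* pool is full */
}

/* Discard all buffers of one database; needs the exclusive lock. */
static int
model_drop_database_buffers(unsigned dboid)
{
	int			dropped = 0;
	size_t		i;

	assert(model_exclusive_held);
	for (i = 0; i < MODEL_POOL_SIZE; i++)
	{
		if (model_pool[i].dboid == dboid)
		{
			model_pool[i].dboid = MODEL_INVALID_DB;
			dropped++;
		}
	}
	return dropped;
}
```

The point of the model is that the purge can only be trusted if every path that adds buffers holds some lock: once the exclusive lock is granted, no new buffers for that database can appear while the old ones are being removed.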

I think this makes sense. I haven't changed the original patch because
you said you were improving some of the comments, so to avoid conflicts
I have created this add-on patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

lock_destination_db_and_rel.patch (text/x-patch; charset=US-ASCII)

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 9636688..5d0750f 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -460,12 +460,6 @@ CreateDatabaseUsingWalLog(Oid src_dboid, Oid dst_dboid, Oid src_tsid, Oid dst_ts
 	Assert(rnodelist != NIL);
 
 	/*
-	 * Database id is common for all the relation so set it before entering to
-	 * the loop.
-	 */
-	relid.dbId = src_dboid;
-
-	/*
 	 * Iterate over each relfilenode and copy the relation data block by block
 	 * from source database to the destination database.
 	 */
@@ -488,7 +482,15 @@ CreateDatabaseUsingWalLog(Oid src_dboid, Oid dst_dboid, Oid src_tsid, Oid dst_ts
 		dstrnode.dbNode = dst_dboid;
 		dstrnode.relNode = srcrnode.relNode;
 
-		/* Acquire the lock on relation before start copying. */
+		/*
+		 * Acquire relation lock on the source and the destination relation id
+		 * before start copying.
+		 */
+		relid.dbId = src_dboid;
+		relid.relId = relinfo->reloid;
+		LockRelationId(&relid, AccessShareLock);
+
+		relid.dbId = dst_dboid;
 		relid.relId = relinfo->reloid;
 		LockRelationId(&relid, AccessShareLock);
 
@@ -1218,6 +1220,17 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	InvokeObjectPostCreateHook(DatabaseRelationId, dboid, 0);
 
 	/*
+	 * Acquire a lock on the target database, although this is a new database
+	 * and no one else should be able to see it.  But if we are using wal log
+	 * strategy then we are going to access the relation pages using shared
+	 * buffers.  Therefore, it would be better to take the database lock.  And,
+	 * later we would acquire the relation lock as and when we would access the
+	 * individual relations' shared buffers.
+	 */
+	if (dbstrategy == CREATEDB_WAL_LOG)
+		LockSharedObject(DatabaseRelationId, dboid, 0, ExclusiveLock);
+
+	/*
 	 * Once we start copying subdirectories, we need to be able to clean 'em
 	 * up if we fail.  Use an ENSURE block to make sure this happens.  (This
 	 * is not a 100% solution, because of the possibility of failure during
@@ -1342,6 +1355,10 @@ createdb_failure_callback(int code, Datum arg)
 	{
 		DropDatabaseBuffers(fparms->dest_dboid);
 		ForgetDatabaseSyncRequests(fparms->dest_dboid);
+
+		/* Release lock on the target database. */
+		UnlockSharedObject(DatabaseRelationId, fparms->src_dboid, 0,
+						   ExclusiveLock);
 	}
 
 	/*

#175

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#174)

1 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Tue, Mar 22, 2022 at 10:28 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I think this makes sense. I haven't changed the original patch because
you said you were improving some of the comments, so to avoid conflicts
I have created this add-on patch.

In my previous patch I mistakenly used src_dboid instead of
dest_dboid; that is fixed in this version. For the destination
database I have used AccessShareLock as the lock mode. Logically, we
don't want anyone else to access that database, but that is already
guaranteed because it is not yet visible to anyone else. So I think
AccessShareLock is correct here: we are taking this lock only because
we are accessing pages of this database's relations through shared
buffers.
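
To make the choice of lock mode concrete, here is a standalone C sketch of the standard conflict table for PostgreSQL's eight heavyweight lock modes, as documented for table-level locks. The enum and function names are hypothetical and the code does not use PostgreSQL's own headers; it only shows that AccessShareLock conflicts with nothing except AccessExclusiveLock, whereas ExclusiveLock (used in the earlier version of the add-on patch) conflicts with every mode except AccessShareLock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the lock modes, weakest first. */
typedef enum ModelLockMode
{
	MODEL_ACCESS_SHARE,
	MODEL_ROW_SHARE,
	MODEL_ROW_EXCLUSIVE,
	MODEL_SHARE_UPDATE_EXCLUSIVE,
	MODEL_SHARE,
	MODEL_SHARE_ROW_EXCLUSIVE,
	MODEL_EXCLUSIVE,
	MODEL_ACCESS_EXCLUSIVE,
	MODEL_NUM_MODES
} ModelLockMode;

#define BIT(mode) (1u << (mode))

/* For each mode, the set of modes it conflicts with. */
static const unsigned model_conflicts[MODEL_NUM_MODES] = {
	[MODEL_ACCESS_SHARE] = BIT(MODEL_ACCESS_EXCLUSIVE),
	[MODEL_ROW_SHARE] = BIT(MODEL_EXCLUSIVE) | BIT(MODEL_ACCESS_EXCLUSIVE),
	[MODEL_ROW_EXCLUSIVE] = BIT(MODEL_SHARE) | BIT(MODEL_SHARE_ROW_EXCLUSIVE) |
		BIT(MODEL_EXCLUSIVE) | BIT(MODEL_ACCESS_EXCLUSIVE),
	[MODEL_SHARE_UPDATE_EXCLUSIVE] = BIT(MODEL_SHARE_UPDATE_EXCLUSIVE) |
		BIT(MODEL_SHARE) | BIT(MODEL_SHARE_ROW_EXCLUSIVE) |
		BIT(MODEL_EXCLUSIVE) | BIT(MODEL_ACCESS_EXCLUSIVE),
	[MODEL_SHARE] = BIT(MODEL_ROW_EXCLUSIVE) |
		BIT(MODEL_SHARE_UPDATE_EXCLUSIVE) | BIT(MODEL_SHARE_ROW_EXCLUSIVE) |
		BIT(MODEL_EXCLUSIVE) | BIT(MODEL_ACCESS_EXCLUSIVE),
	[MODEL_SHARE_ROW_EXCLUSIVE] = BIT(MODEL_ROW_EXCLUSIVE) |
		BIT(MODEL_SHARE_UPDATE_EXCLUSIVE) | BIT(MODEL_SHARE) |
		BIT(MODEL_SHARE_ROW_EXCLUSIVE) | BIT(MODEL_EXCLUSIVE) |
		BIT(MODEL_ACCESS_EXCLUSIVE),
	[MODEL_EXCLUSIVE] = BIT(MODEL_ROW_SHARE) | BIT(MODEL_ROW_EXCLUSIVE) |
		BIT(MODEL_SHARE_UPDATE_EXCLUSIVE) | BIT(MODEL_SHARE) |
		BIT(MODEL_SHARE_ROW_EXCLUSIVE) | BIT(MODEL_EXCLUSIVE) |
		BIT(MODEL_ACCESS_EXCLUSIVE),
	[MODEL_ACCESS_EXCLUSIVE] = BIT(MODEL_ACCESS_SHARE) | BIT(MODEL_ROW_SHARE) |
		BIT(MODEL_ROW_EXCLUSIVE) | BIT(MODEL_SHARE_UPDATE_EXCLUSIVE) |
		BIT(MODEL_SHARE) | BIT(MODEL_SHARE_ROW_EXCLUSIVE) |
		BIT(MODEL_EXCLUSIVE) | BIT(MODEL_ACCESS_EXCLUSIVE)
};

/* Would a request for "wanted" have to wait behind a holder of "held"? */
static bool
model_modes_conflict(ModelLockMode held, ModelLockMode wanted)
{
	assert(held < MODEL_NUM_MODES && wanted < MODEL_NUM_MODES);
	return (model_conflicts[held] & BIT(wanted)) != 0;
}

/* Sanity check on the table: the conflict relation must be symmetric. */
static bool
model_conflicts_are_symmetric(void)
{
	int			a;
	int			b;

	for (a = 0; a < MODEL_NUM_MODES; a++)
		for (b = 0; b < MODEL_NUM_MODES; b++)
			if (model_modes_conflict(a, b) != model_modes_conflict(b, a))
				return false;
	return true;
}
```

Under this table, holding the weakest mode on the new database only gets in the way of someone asking for the strongest one, which fits the reasoning that the lock exists to follow the shared-buffers convention rather than to keep other sessions out.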

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

lock_destination_db_and_rel_v1.patch (text/x-patch; charset=US-ASCII)

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 9636688..49fe104 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -460,12 +460,6 @@ CreateDatabaseUsingWalLog(Oid src_dboid, Oid dst_dboid, Oid src_tsid, Oid dst_ts
 	Assert(rnodelist != NIL);
 
 	/*
-	 * Database id is common for all the relation so set it before entering to
-	 * the loop.
-	 */
-	relid.dbId = src_dboid;
-
-	/*
 	 * Iterate over each relfilenode and copy the relation data block by block
 	 * from source database to the destination database.
 	 */
@@ -488,7 +482,15 @@ CreateDatabaseUsingWalLog(Oid src_dboid, Oid dst_dboid, Oid src_tsid, Oid dst_ts
 		dstrnode.dbNode = dst_dboid;
 		dstrnode.relNode = srcrnode.relNode;
 
-		/* Acquire the lock on relation before start copying. */
+		/*
+		 * Acquire relation lock on the source and the destination relation id
+		 * before start copying.
+		 */
+		relid.dbId = src_dboid;
+		relid.relId = relinfo->reloid;
+		LockRelationId(&relid, AccessShareLock);
+
+		relid.dbId = dst_dboid;
 		relid.relId = relinfo->reloid;
 		LockRelationId(&relid, AccessShareLock);
 
@@ -1218,6 +1220,17 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	InvokeObjectPostCreateHook(DatabaseRelationId, dboid, 0);
 
 	/*
+	 * Acquire a lock on the target database, although this is a new database
+	 * and no one else should be able to see it.  But if we are using wal log
+	 * strategy then we are going to access the relation pages using shared
+	 * buffers.  Therefore, it would be better to take the database lock.  And,
+	 * later we would acquire the relation lock as and when we would access the
+	 * individual relations' shared buffers.
+	 */
+	if (dbstrategy == CREATEDB_WAL_LOG)
+		LockSharedObject(DatabaseRelationId, dboid, 0, AccessShareLock);
+
+	/*
 	 * Once we start copying subdirectories, we need to be able to clean 'em
 	 * up if we fail.  Use an ENSURE block to make sure this happens.  (This
 	 * is not a 100% solution, because of the possibility of failure during
@@ -1342,6 +1355,10 @@ createdb_failure_callback(int code, Datum arg)
 	{
 		DropDatabaseBuffers(fparms->dest_dboid);
 		ForgetDatabaseSyncRequests(fparms->dest_dboid);
+
+		/* Release lock on the target database. */
+		UnlockSharedObject(DatabaseRelationId, fparms->dest_dboid, 0,
+						   AccessShareLock);
 	}
 
 	/*

#176

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#175)

1 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Tue, Mar 22, 2022 at 5:00 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

In my previous patch I mistakenly used src_dboid instead of
dest_dboid; that is fixed in this version. For the destination
database I have used AccessShareLock as the lock mode. Logically, we
don't want anyone else to access that database, but that is already
guaranteed because it is not yet visible to anyone else. So I think
AccessShareLock is correct here: we are taking this lock only because
we are accessing pages of this database's relations through shared
buffers.

Here's my worked-over version of your previous patch. I haven't tried
to incorporate your incremental patch that you just posted.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

v1-0001-Add-new-block-by-block-strategy-for-CREATE-DATABA.patch (application/octet-stream)

From 116bcdb6174a750b7ef7ae05ef6f39cebaf9bcf5 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 22 Mar 2022 11:22:26 -0400
Subject: [PATCH v1] Add new block-by-block strategy for CREATE DATABASE.

Because this strategy logs changes on a block-by-block basis, it
avoids the need to checkpoint before and after the operation.
However, because it logs each changed block individually, it might
generate a lot of extra write-ahead logging if the template database
is large. Therefore, the older strategy remains available via a new
STRATEGY parameter to CREATE DATABASE, and a corresponding --strategy
option to createdb.

Somewhat controversially, this patch assembles the list of relations
to be copied to the new database by reading the pg_class relation of
the template database. Cross-database access like this isn't normally
possible, but it can be made to work here because there can't be any
connections to the database being copied, nor can it contain any
in-doubt transactions. Even so, we have to use lower-level interfaces
than normal, since the table scan and relcache interfaces will not
work for a database to which we're not connected. The advantage of
this approach is that we do not need to rely on the filesystem to
determine what ought to be copied, but instead on PostgreSQL's own
knowledge of the database structure. This avoids, for example,
copying stray files that happen to be located in the source database
directory.

Dilip Kumar, with a fairly large number of cosmetic changes by me.
---
 contrib/bloom/blinsert.c                 |   2 +-
 doc/src/sgml/ref/create_database.sgml    |  22 +
 doc/src/sgml/ref/createdb.sgml           |  11 +
 src/backend/access/heap/heapam_handler.c |   2 +-
 src/backend/access/nbtree/nbtree.c       |   2 +-
 src/backend/access/rmgrdesc/dbasedesc.c  |  20 +-
 src/backend/access/transam/xlogutils.c   |   6 +-
 src/backend/commands/dbcommands.c        | 746 +++++++++++++++++++----
 src/backend/storage/buffer/bufmgr.c      | 168 ++++-
 src/backend/storage/lmgr/lmgr.c          |  28 +
 src/backend/utils/cache/relmapper.c      |  64 ++
 src/bin/pg_rewind/parsexlog.c            |   9 +-
 src/bin/psql/tab-complete.c              |   4 +-
 src/bin/scripts/createdb.c               |  10 +-
 src/bin/scripts/t/020_createdb.pl        |  20 +
 src/include/commands/dbcommands_xlog.h   |  25 +-
 src/include/storage/bufmgr.h             |   6 +-
 src/include/storage/lmgr.h               |   1 +
 src/include/utils/relmapper.h            |   4 +-
 src/tools/pgindent/typedefs.list         |   5 +-
 20 files changed, 1013 insertions(+), 142 deletions(-)

diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index c94cf34e69..82378db441 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -173,7 +173,7 @@ blbuildempty(Relation index)
 	 * Write the page and log it.  It might seem that an immediate sync would
 	 * be sufficient to guarantee that the file exists on disk, but recovery
 	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
+	 * XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
 	PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index 5ae785ab95..255ad3a1ce 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -25,6 +25,7 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
     [ [ WITH ] [ OWNER [=] <replaceable class="parameter">user_name</replaceable> ]
            [ TEMPLATE [=] <replaceable class="parameter">template</replaceable> ]
            [ ENCODING [=] <replaceable class="parameter">encoding</replaceable> ]
+           [ STRATEGY [=] <replaceable class="parameter">strategy</replaceable> ]
            [ LOCALE [=] <replaceable class="parameter">locale</replaceable> ]
            [ LC_COLLATE [=] <replaceable class="parameter">lc_collate</replaceable> ]
            [ LC_CTYPE [=] <replaceable class="parameter">lc_ctype</replaceable> ]
@@ -118,6 +119,27 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
        </para>
       </listitem>
      </varlistentry>
+     <varlistentry id="create-database-strategy" xreflabel="CREATE DATABASE STRATEGY">
+      <term><replaceable class="parameter">strategy</replaceable></term>
+      <listitem>
+       <para>
+        Strategy to be used in creating the new database.  If
+        the <literal>WAL_LOG</literal> strategy is used, the database will be
+        copied block by block and each block will be separately written
+        to the write-ahead log. This is the most efficient strategy in
+        cases where the template database is small, and therefore it is the
+        default. The older <literal>FILE_COPY</literal> strategy is also
+        available. This strategy writes a small record to the write-ahead log
+        for each tablespace used by the target database. Each such record
+        represents copying an entire directory to a new location at the
+        filesystem level. While this does reduce the write-ahead
+        log volume substantially, especially if the template database is large,
+        it also forces the system to perform a checkpoint both before and
+        after the creation of the new database. In some situations, this may
+        have a noticeable negative impact on overall system performance.
+       </para>
+      </listitem>
+     </varlistentry>
      <varlistentry>
       <term><replaceable class="parameter">locale</replaceable></term>
       <listitem>
diff --git a/doc/src/sgml/ref/createdb.sgml b/doc/src/sgml/ref/createdb.sgml
index be42e502d6..671cd362d9 100644
--- a/doc/src/sgml/ref/createdb.sgml
+++ b/doc/src/sgml/ref/createdb.sgml
@@ -177,6 +177,17 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>-S <replaceable class="parameter">strategy</replaceable></option></term>
+      <term><option>--strategy=<replaceable class="parameter">strategy</replaceable></option></term>
+      <listitem>
+       <para>
+        Specifies the database creation strategy.  See
+        <xref linkend="create-database-strategy" /> for more details.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>-T <replaceable class="parameter">template</replaceable></option></term>
       <term><option>--template=<replaceable class="parameter">template</replaceable></option></term>
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 39ef8a0b77..2b70ca0596 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -601,7 +601,7 @@ heapam_relation_set_new_filenode(Relation rel,
 	 * even if the page has been logged, because the write did not go through
 	 * shared_buffers and therefore a concurrent checkpoint may have moved the
 	 * redo pointer past our xlog record.  Recovery may as well remove it
-	 * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
+	 * while replaying, for example, XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE
 	 * record. Therefore, logging is necessary even if wal_level=minimal.
 	 */
 	if (persistence == RELPERSISTENCE_UNLOGGED)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c9b4964c1e..dacf3f7a58 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -161,7 +161,7 @@ btbuildempty(Relation index)
 	 * Write the page and log it.  It might seem that an immediate sync would
 	 * be sufficient to guarantee that the file exists on disk, but recovery
 	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
+	 * XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
 	PageSetChecksumInplace(metapage, BTREE_METAPAGE);
diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index 03af3fdbcf..523d0b3c1d 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -24,14 +24,23 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 	char	   *rec = XLogRecGetData(record);
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
 
-	if (info == XLOG_DBASE_CREATE)
+	if (info == XLOG_DBASE_CREATE_FILE_COPY)
 	{
-		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;
+		xl_dbase_create_file_copy_rec *xlrec =
+		(xl_dbase_create_file_copy_rec *) rec;
 
 		appendStringInfo(buf, "copy dir %u/%u to %u/%u",
 						 xlrec->src_tablespace_id, xlrec->src_db_id,
 						 xlrec->tablespace_id, xlrec->db_id);
 	}
+	else if (info == XLOG_DBASE_CREATE_WAL_LOG)
+	{
+		xl_dbase_create_wal_log_rec *xlrec =
+		(xl_dbase_create_wal_log_rec *) rec;
+
+		appendStringInfo(buf, "create dir %u/%u",
+						 xlrec->tablespace_id, xlrec->db_id);
+	}
 	else if (info == XLOG_DBASE_DROP)
 	{
 		xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) rec;
@@ -51,8 +60,11 @@ dbase_identify(uint8 info)
 
 	switch (info & ~XLR_INFO_MASK)
 	{
-		case XLOG_DBASE_CREATE:
-			id = "CREATE";
+		case XLOG_DBASE_CREATE_FILE_COPY:
+			id = "CREATE_FILE_COPY";
+			break;
+		case XLOG_DBASE_CREATE_WAL_LOG:
+			id = "CREATE_WAL_LOG";
 			break;
 		case XLOG_DBASE_DROP:
 			id = "DROP";
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 511f2f186f..a4dedc58b7 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -484,7 +484,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	{
 		/* page exists in file */
 		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-										   mode, NULL);
+										   mode, NULL, true);
 	}
 	else
 	{
@@ -509,7 +509,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				ReleaseBuffer(buffer);
 			}
 			buffer = ReadBufferWithoutRelcache(rnode, forknum,
-											   P_NEW, mode, NULL);
+											   P_NEW, mode, NULL, true);
 		}
 		while (BufferGetBlockNumber(buffer) < blkno);
 		/* Handle the corner case that P_NEW returns non-consecutive pages */
@@ -519,7 +519,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			ReleaseBuffer(buffer);
 			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-											   mode, NULL);
+											   mode, NULL, true);
 		}
 	}
 
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 623e5ec778..9f96753eea 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -63,13 +63,31 @@
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/pg_locale.h"
+#include "utils/relmapper.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
+/*
+ * Create database strategy.
+ *
+ * CREATEDB_WAL_LOG will copy the database at the block level and WAL log each
+ * copied block.
+ *
+ * CREATEDB_FILE_COPY will simply perform a file system level copy of the
+ * database and log a single record for each tablespace copied. To make this
+ * safe, it also triggers checkpoints before and after the operation.
+ */
+typedef enum CreateDBStrategy
+{
+	CREATEDB_WAL_LOG,
+	CREATEDB_FILE_COPY
+} CreateDBStrategy;
+
 typedef struct
 {
 	Oid			src_dboid;		/* source (template) DB */
 	Oid			dest_dboid;		/* DB we are trying to create */
+	CreateDBStrategy strategy;	/* create db strategy */
 } createdb_failure_params;
 
 typedef struct
@@ -78,6 +96,17 @@ typedef struct
 	Oid			dest_tsoid;		/* tablespace we are trying to move to */
 } movedb_failure_params;
 
+/*
+ * Information about a relation to be copied when creating a database.
+ */
+typedef struct CreateDBRelInfo
+{
+	RelFileNode rnode;			/* physical relation identifier */
+	Oid			reloid;			/* relation oid */
+	bool		permanent;		/* relation is permanent or unlogged */
+} CreateDBRelInfo;
+
+
 /* non-export function prototypes */
 static void createdb_failure_callback(int code, Datum arg);
 static void movedb(const char *dbname, const char *tblspcname);
@@ -93,7 +122,535 @@ static bool have_createdb_privilege(void);
 static void remove_dbtablespaces(Oid db_id);
 static bool check_db_file_conflict(Oid db_id);
 static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
+static void CreateDatabaseUsingWalLog(Oid src_dboid, Oid dboid, Oid src_tsid,
+									  Oid dst_tsid);
+static List *ScanSourceDatabasePgClass(Oid srctbid, Oid srcdbid, char *srcpath);
+static List *ScanSourceDatabasePgClassPage(Page page, Buffer buf, Oid tbid,
+										   Oid dbid, char *srcpath,
+										   List *rnodelist, Snapshot snapshot);
+static CreateDBRelInfo *ScanSourceDatabasePgClassTuple(HeapTupleData *tuple,
+													   Oid tbid, Oid dbid,
+													   char *srcpath);
+static void CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid,
+									bool isRedo);
+static void CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dboid, Oid src_tsid,
+										Oid dst_tsid);
+
+/*
+ * Create a new database using the WAL_LOG strategy.
+ *
+ * Each copied block is separately written to the write-ahead log.
+ */
+static void
+CreateDatabaseUsingWalLog(Oid src_dboid, Oid dst_dboid,
+						  Oid src_tsid, Oid dst_tsid)
+{
+	char	   *srcpath;
+	char	   *dstpath;
+	List	   *rnodelist = NULL;
+	ListCell   *cell;
+	LockRelId	relid;
+	RelFileNode srcrnode;
+	RelFileNode dstrnode;
+	CreateDBRelInfo *relinfo;
+
+	/* Get source and destination database paths. */
+	srcpath = GetDatabasePath(src_dboid, src_tsid);
+	dstpath = GetDatabasePath(dst_dboid, dst_tsid);
+
+	/* Create database directory and write PG_VERSION file. */
+	CreateDirAndVersionFile(dstpath, dst_dboid, dst_tsid, false);
+
+	/* Copy relmap file from source database to the destination database. */
+	RelationMapCopy(dst_dboid, dst_tsid, srcpath, dstpath);
+
+	/* Get list of relfilenodes to copy from the source database. */
+	rnodelist = ScanSourceDatabasePgClass(src_tsid, src_dboid, srcpath);
+	Assert(rnodelist != NIL);
+
+	/*
+	 * The database id is the same for all relations, so set it before
+	 * entering the loop.
+	 */
+	relid.dbId = src_dboid;
+
+	/* Loop over our list of relfilenodes and copy each one. */
+	foreach(cell, rnodelist)
+	{
+		relinfo = lfirst(cell);
+		srcrnode = relinfo->rnode;
+
+		/*
+		 * If the relation is from the source db's default tablespace then we
+		 * need to create it in the destination db's default tablespace.
+		 * Otherwise, we need to create it in the same tablespace as it is in
+		 * the source database.
+		 */
+		if (srcrnode.spcNode == src_tsid)
+			dstrnode.spcNode = dst_tsid;
+		else
+			dstrnode.spcNode = srcrnode.spcNode;
+
+		dstrnode.dbNode = dst_dboid;
+		dstrnode.relNode = srcrnode.relNode;
+
+		/* Acquire locks on source and target relations before copying. */
+		relid.relId = relinfo->reloid;
+		LockRelationId(&relid, AccessShareLock);
+		/* XXX lock target */
+
+		/* Copy relation storage from source to the destination. */
+		CreateAndCopyRelationData(srcrnode, dstrnode, relinfo->permanent);
+
+		/* Release the locks. */
+		UnlockRelationId(&relid, AccessShareLock);
+		/* XXX unlock target */
+	}
+
+	list_free_deep(rnodelist);
+}
+
+/*
+ * Scan the pg_class table in the source database to identify the relations
+ * that need to be copied to the destination database.
+ *
+ * This is an exception to the usual rule that cross-database access is
+ * not possible. We can make it work here because we know that there are no
+ * connections to the source database and (since there can't be prepared
+ * transactions touching that database) no in-doubt tuples either. This
+ * means that we don't need to worry about pruning removing anything from
+ * under us, and we don't need to be too picky about our snapshot either.
+ * As long as it sees all previously-committed XIDs as committed and all
+ * aborted XIDs as aborted, we should be fine: nothing else is possible
+ * here.
+ *
+ * We can't rely on the relcache for anything here, because that only knows
+ * about the database to which we are connected, and can't handle access to
+ * other databases. That also means we can't rely on the heap scan
+ * infrastructure, which would be a bad idea anyway since it might try
+ * to do things like HOT pruning which we definitely can't do safely in
+ * a database to which we're not even connected.
+ */
+static List *
+ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
+{
+	RelFileNode rnode;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	Buffer		buf;
+	Oid			relfilenode;
+	Page		page;
+	List	   *rnodelist = NIL;
+	LockRelId	relid;
+	Snapshot	snapshot;
+	SMgrRelation rd_smgr;
+	BufferAccessStrategy bstrategy;
+
+	/* Get pg_class relfilenode. */
+	relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+													  RelationRelationId);
+
+	/*
+	 * The system elsewhere assumes that we only read data for a relation
+	 * into shared_buffers while holding some sort of a lock on a relation,
+	 * so lock the source database's pg_class before we do anything else.
+	 */
+	relid.dbId = dbid;
+	relid.relId = RelationRelationId;
+	LockRelationId(&relid, AccessShareLock);
+
+	/*
+	 * Open the source database's pg_class at the smgr level and get the
+	 * block count.
+	 */
+	rnode.spcNode = tbid;
+	rnode.dbNode = dbid;
+	rnode.relNode = relfilenode;
+	rd_smgr = smgropen(rnode, InvalidBackendId);
+	nblocks = smgrnblocks(rd_smgr, MAIN_FORKNUM);
+
+	/* Use a buffer access strategy since this is a bulk read operation. */
+	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	/*
+	 * As explained in the function header comments, we need a snapshot that
+	 * will see all committed transactions as committed, and our transaction
+	 * snapshot - or the active snapshot - might not be new enough for that,
+	 * but the return value of GetLatestSnapshot() should work fine.
+	 */
+	snapshot = GetLatestSnapshot();
+
+	/* Process the relation block by block. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		/* XXX Shouldn't we CHECK_FOR_INTERRUPTS() here? */
+
+		buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
+										RBM_NORMAL, bstrategy, false);
+
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buf);
+		if (PageIsNew(page) || PageIsEmpty(page))
+		{
+			UnlockReleaseBuffer(buf);
+			continue;
+		}
+
+		/* Append relevant pg_class tuples for current page to rnodelist. */
+		rnodelist = ScanSourceDatabasePgClassPage(page, buf, tbid, dbid,
+												  srcpath, rnodelist,
+												  snapshot);
+
+		UnlockReleaseBuffer(buf);
+	}
+
+	/* Release relation lock. */
+	UnlockRelationId(&relid, AccessShareLock);
+
+	return rnodelist;
+}
+
+/*
+ * Scan one page of the source database's pg_class relation and add relevant
+ * entries to rnodelist. The return value is the updated list.
+ */
+static List *
+ScanSourceDatabasePgClassPage(Page page, Buffer buf, Oid tbid, Oid dbid,
+							  char *srcpath, List *rnodelist,
+							  Snapshot snapshot)
+{
+	BlockNumber		blkno = BufferGetBlockNumber(buf);
+	OffsetNumber	offnum;
+	OffsetNumber	maxoff;
+	HeapTupleData	tuple;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Loop over offsets. */
+	for (offnum = FirstOffsetNumber;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid;
+
+		itemid = PageGetItemId(page, offnum);
+
+		/* Nothing to do if slot is empty or already dead. */
+		if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid) ||
+			ItemIdIsRedirected(itemid))
+			continue;
+
+		Assert(ItemIdIsNormal(itemid));
+		ItemPointerSet(&(tuple.t_self), blkno, offnum);
+
+		/* Initialize a HeapTupleData structure. */
+		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+		tuple.t_len = ItemIdGetLength(itemid);
+		tuple.t_tableOid = RelationRelationId;
+
+		/* Skip tuples that are not visible to this snapshot. */
+		if (HeapTupleSatisfiesVisibility(&tuple, snapshot, buf))
+		{
+			CreateDBRelInfo *relinfo;
+
+			/*
+			 * ScanSourceDatabasePgClassTuple is in charge of constructing
+			 * a CreateDBRelInfo object for this tuple, but can also decide
+			 * that this tuple isn't something we need to copy. If we do need
+			 * to copy the relation, add it to the list.
+			 */
+			relinfo = ScanSourceDatabasePgClassTuple(&tuple, tbid, dbid,
+													 srcpath);
+			if (relinfo != NULL)
+				rnodelist = lappend(rnodelist, relinfo);
+		}
+	}
+
+	return rnodelist;
+}
+
+/*
+ * Decide whether a certain pg_class tuple represents something that
+ * needs to be copied from the source database to the destination database,
+ * and if so, construct a CreateDBRelInfo for it.
+ *
+ * Visibility checks are handled by the caller, so our job here is just
+ * to assess the data stored in the tuple.
+ */
+static CreateDBRelInfo *
+ScanSourceDatabasePgClassTuple(HeapTupleData *tuple, Oid tbid, Oid dbid,
+							   char *srcpath)
+{
+	CreateDBRelInfo	   *relinfo;
+	Form_pg_class		classForm;
+	Oid					relfilenode = InvalidOid;
+
+	classForm = (Form_pg_class) GETSTRUCT(tuple);
+
+	/*
+	 * Return NULL if this object does not need to be copied.
+	 *
+	 * Shared objects don't need to be copied, because they are shared.
+	 * Objects without storage can't be copied, because there's nothing to
+	 * copy. Temporary relations don't need to be copied either, because
+	 * they are inaccessible outside of the session that created them,
+	 * which must be gone already, and couldn't connect to a different database
+	 * if it still existed. autovacuum will eventually remove the pg_class
+	 * entries as well.
+	 */
+	if (classForm->reltablespace == GLOBALTABLESPACE_OID ||
+		!RELKIND_HAS_STORAGE(classForm->relkind) ||
+		classForm->relpersistence == RELPERSISTENCE_TEMP)
+		return NULL;
+
+	/*
+	 * If relfilenode is valid then directly use it.  Otherwise, consult the
+	 * relmap.
+	 */
+	if (OidIsValid(classForm->relfilenode))
+		relfilenode = classForm->relfilenode;
+	else
+		relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+														  classForm->oid);
+
+	/* We must have a valid relfilenode oid. */
+	if (!OidIsValid(relfilenode))
+		elog(ERROR, "relation with OID %u does not have a valid relfilenode",
+			 classForm->oid);
+
+	/* Prepare a rel info element; the caller adds it to the list. */
+	relinfo = (CreateDBRelInfo *) palloc(sizeof(CreateDBRelInfo));
+	if (OidIsValid(classForm->reltablespace))
+		relinfo->rnode.spcNode = classForm->reltablespace;
+	else
+		relinfo->rnode.spcNode = tbid;
+
+	relinfo->rnode.dbNode = dbid;
+	relinfo->rnode.relNode = relfilenode;
+	relinfo->reloid = classForm->oid;
+
+	/* Temporary relations were rejected above. */
+	Assert(classForm->relpersistence != RELPERSISTENCE_TEMP);
+	relinfo->permanent =
+		(classForm->relpersistence == RELPERSISTENCE_PERMANENT) ? true : false;
+
+	return relinfo;
+}
+
+/*
+ * Create database directory and write out the PG_VERSION file in the database
+ * path.  If isRedo is true, it's okay for the database directory to exist
+ * already.
+ */
+static void
+CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
+{
+	int			fd;
+	int			nbytes;
+	char		versionfile[MAXPGPATH];
+	char		buf[16];
+
+	/*
+	 * Prepare version data before starting a critical section.
+	 *
+	 * Note that we don't have to copy this from the source database; there's
+	 * only one legal value.
+	 */
+	sprintf(buf, "%s\n", PG_MAJORVERSION);
+	nbytes = strlen(PG_MAJORVERSION) + 1;
+
+	/* If we are not in WAL replay then write the WAL. */
+	if (!isRedo)
+	{
+		xl_dbase_create_wal_log_rec xlrec;
+		XLogRecPtr	lsn;
+
+		START_CRIT_SECTION();
+
+		xlrec.db_id = dbid;
+		xlrec.tablespace_id = tsid;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec),
+						 sizeof(xl_dbase_create_wal_log_rec));
+
+		lsn = XLogInsert(RM_DBASE_ID, XLOG_DBASE_CREATE_WAL_LOG);
+
+		/* As always, WAL must hit the disk before the data update does. */
+		XLogFlush(lsn);
+	}
+
+	/* Create database directory. */
+	if (MakePGDirectory(dbpath) < 0)
+	{
+		/* Failure other than already exists or not in WAL replay? */
+		if (errno != EEXIST || !isRedo)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create directory \"%s\": %m", dbpath)));
+	}
+
+	/*
+	 * Create PG_VERSION file in the database path.  If the file already
+	 * exists and we are in WAL replay then try again to open it in write
+	 * mode.
+	 */
+	snprintf(versionfile, sizeof(versionfile), "%s/%s", dbpath, "PG_VERSION");
+
+	fd = OpenTransientFile(versionfile, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	if (fd < 0 && errno == EEXIST && isRedo)
+		fd = OpenTransientFile(versionfile, O_WRONLY | O_TRUNC | PG_BINARY);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m", versionfile)));
+
+	/* Write PG_MAJORVERSION in the PG_VERSION file. */
+	/* XXX create a new wait event for this */
+	pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_WRITE);
+	errno = 0;
+	if ((int) write(fd, buf, nbytes) != nbytes)
+	{
+		/* If write didn't set errno, assume problem is no disk space. */
+		if (errno == 0)
+			errno = ENOSPC;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", versionfile)));
+	}
+	pgstat_report_wait_end();
 
+	/* Close the version file. */
+	CloseTransientFile(fd);
+
+	/* Critical section done. */
+	if (!isRedo)
+		END_CRIT_SECTION();
+}
+
+/*
+ * Create a new database using the FILE_COPY strategy.
+ *
+ * Copy each tablespace at the filesystem level, and log a single WAL record
+ * for each tablespace copied.  This requires a checkpoint before and after the
+ * copy, which may be expensive, but it does greatly reduce WAL generation
+ * if the copied database is large.
+ */
+static void
+CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dst_dboid, Oid src_tsid,
+							Oid dst_tsid)
+{
+	TableScanDesc scan;
+	Relation	rel;
+	HeapTuple	tuple;
+
+	/*
+	 * Force a checkpoint before starting the copy. This will force all dirty
+	 * buffers, including those of unlogged tables, out to disk, to ensure
+	 * source database is up-to-date on disk for the copy.
+	 * FlushDatabaseBuffers() would suffice for that, but we also want to
+	 * process any pending unlink requests. Otherwise, if a checkpoint
+	 * happened while we're copying files, a file might be deleted just when
+	 * we're about to copy it, causing the lstat() call in copydir() to fail
+	 * with ENOENT.
+	 */
+	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
+					  CHECKPOINT_WAIT | CHECKPOINT_FLUSH_ALL);
+
+	/*
+	 * Iterate through all tablespaces of the template database, and copy each
+	 * one to the new database.
+	 */
+	rel = table_open(TableSpaceRelationId, AccessShareLock);
+	scan = table_beginscan_catalog(rel, 0, NULL);
+	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	{
+		Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
+		Oid			srctablespace = spaceform->oid;
+		Oid			dsttablespace;
+		char	   *srcpath;
+		char	   *dstpath;
+		struct stat st;
+
+		/* No need to copy global tablespace */
+		if (srctablespace == GLOBALTABLESPACE_OID)
+			continue;
+
+		srcpath = GetDatabasePath(src_dboid, srctablespace);
+
+		if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
+			directory_is_empty(srcpath))
+		{
+			/* Assume we can ignore it */
+			pfree(srcpath);
+			continue;
+		}
+
+		if (srctablespace == src_tsid)
+			dsttablespace = dst_tsid;
+		else
+			dsttablespace = srctablespace;
+
+		dstpath = GetDatabasePath(dst_dboid, dsttablespace);
+
+		/*
+		 * Copy this subdirectory to the new location
+		 *
+		 * We don't need to copy subdirectories
+		 */
+		copydir(srcpath, dstpath, false);
+
+		/* Record the filesystem change in XLOG */
+		{
+			xl_dbase_create_file_copy_rec xlrec;
+
+			xlrec.db_id = dst_dboid;
+			xlrec.tablespace_id = dsttablespace;
+			xlrec.src_db_id = src_dboid;
+			xlrec.src_tablespace_id = srctablespace;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec,
+							 sizeof(xl_dbase_create_file_copy_rec));
+
+			(void) XLogInsert(RM_DBASE_ID,
+							  XLOG_DBASE_CREATE_FILE_COPY | XLR_SPECIAL_REL_UPDATE);
+		}
+	}
+	table_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	/*
+	 * We force a checkpoint before committing.  This effectively means that
+	 * committed XLOG_DBASE_CREATE_FILE_COPY operations will never need to be
+	 * replayed (at least not in ordinary crash recovery; we still have to
+	 * make the XLOG entry for the benefit of PITR operations). This avoids
+	 * two nasty scenarios:
+	 *
+	 * #1: When PITR is off, we don't XLOG the contents of newly created
+	 * indexes; therefore the drop-and-recreate-whole-directory behavior of
+	 * DBASE_CREATE replay would lose such indexes.
+	 *
+	 * #2: Since we have to recopy the source database during DBASE_CREATE
+	 * replay, we run the risk of copying changes in it that were committed
+	 * after the original CREATE DATABASE command but before the system crash
+	 * that led to the replay.  This is at least unexpected and at worst could
+	 * lead to inconsistencies, eg duplicate table names.
+	 *
+	 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
+	 *
+	 * In PITR replay, the first of these isn't an issue, and the second is
+	 * only a risk if the CREATE DATABASE and subsequent template database
+	 * change both occur while a base backup is being taken. There doesn't
+	 * seem to be much we can do about that except document it as a
+	 * limitation.
+	 *
+	 * See CreateDatabaseUsingWalLog() for a less cheesy CREATE DATABASE
+	 * strategy that avoids these problems.
+	 */
+	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+}
 
 /*
  * CREATE DATABASE
@@ -101,8 +658,6 @@ static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
 Oid
 createdb(ParseState *pstate, const CreatedbStmt *stmt)
 {
-	TableScanDesc scan;
-	Relation	rel;
 	Oid			src_dboid;
 	Oid			src_owner;
 	int			src_encoding = -1;
@@ -137,6 +692,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	DefElem    *dallowconnections = NULL;
 	DefElem    *dconnlimit = NULL;
 	DefElem    *dcollversion = NULL;
+	DefElem    *dstrategy = NULL;
 	char	   *dbname = stmt->dbname;
 	char	   *dbowner = NULL;
 	const char *dbtemplate = NULL;
@@ -152,6 +708,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	char	   *dbcollversion = NULL;
 	int			notherbackends;
 	int			npreparedxacts;
+	CreateDBStrategy dbstrategy = CREATEDB_WAL_LOG;
 	createdb_failure_params fparms;
 
 	/* Extract options from the statement node tree */
@@ -269,6 +826,12 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 						(errcode(ERRCODE_INVALID_PARAMETER_VALUE)),
 						errmsg("OIDs less than %u are reserved for system objects", FirstNormalObjectId));
 		}
+		else if (strcmp(defel->defname, "strategy") == 0)
+		{
+			if (dstrategy)
+				errorConflictingDefElem(defel, pstate);
+			dstrategy = defel;
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -413,6 +976,23 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 							dbtemplate)));
 	}
 
+	/* Validate the database creation strategy. */
+	if (dstrategy && dstrategy->arg)
+	{
+		char	   *strategy;
+
+		strategy = defGetString(dstrategy);
+		if (strcmp(strategy, "wal_log") == 0)
+			dbstrategy = CREATEDB_WAL_LOG;
+		else if (strcmp(strategy, "file_copy") == 0)
+			dbstrategy = CREATEDB_FILE_COPY;
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("invalid create database strategy %s", strategy),
+					 errhint("Valid strategies are \"wal_log\" and \"file_copy\".")));
+	}
+
 	/* If encoding or locales are defaulted, use source's setting */
 	if (encoding < 0)
 		encoding = src_encoding;
@@ -752,19 +1332,6 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	/* Post creation hook for new database */
 	InvokeObjectPostCreateHook(DatabaseRelationId, dboid, 0);
 
-	/*
-	 * Force a checkpoint before starting the copy. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, if a checkpoint
-	 * happened while we're copying files, a file might be deleted just when
-	 * we're about to copy it, causing the lstat() call in copydir() to fail
-	 * with ENOENT.
-	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
-
 	/*
 	 * Once we start copying subdirectories, we need to be able to clean 'em
 	 * up if we fail.  Use an ENSURE block to make sure this happens.  (This
@@ -774,101 +1341,24 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	 */
 	fparms.src_dboid = src_dboid;
 	fparms.dest_dboid = dboid;
+	fparms.strategy = dbstrategy;
+
 	PG_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
 		/*
-		 * Iterate through all tablespaces of the template database, and copy
-		 * each one to the new database.
-		 */
-		rel = table_open(TableSpaceRelationId, AccessShareLock);
-		scan = table_beginscan_catalog(rel, 0, NULL);
-		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
-		{
-			Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
-			Oid			srctablespace = spaceform->oid;
-			Oid			dsttablespace;
-			char	   *srcpath;
-			char	   *dstpath;
-			struct stat st;
-
-			/* No need to copy global tablespace */
-			if (srctablespace == GLOBALTABLESPACE_OID)
-				continue;
-
-			srcpath = GetDatabasePath(src_dboid, srctablespace);
-
-			if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
-				directory_is_empty(srcpath))
-			{
-				/* Assume we can ignore it */
-				pfree(srcpath);
-				continue;
-			}
-
-			if (srctablespace == src_deftablespace)
-				dsttablespace = dst_deftablespace;
-			else
-				dsttablespace = srctablespace;
-
-			dstpath = GetDatabasePath(dboid, dsttablespace);
-
-			/*
-			 * Copy this subdirectory to the new location
-			 *
-			 * We don't need to copy subdirectories
-			 */
-			copydir(srcpath, dstpath, false);
-
-			/* Record the filesystem change in XLOG */
-			{
-				xl_dbase_create_rec xlrec;
-
-				xlrec.db_id = dboid;
-				xlrec.tablespace_id = dsttablespace;
-				xlrec.src_db_id = src_dboid;
-				xlrec.src_tablespace_id = srctablespace;
-
-				XLogBeginInsert();
-				XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-				(void) XLogInsert(RM_DBASE_ID,
-								  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-			}
-		}
-		table_endscan(scan);
-		table_close(rel, AccessShareLock);
-
-		/*
-		 * We force a checkpoint before committing.  This effectively means
-		 * that committed XLOG_DBASE_CREATE operations will never need to be
-		 * replayed (at least not in ordinary crash recovery; we still have to
-		 * make the XLOG entry for the benefit of PITR operations). This
-		 * avoids two nasty scenarios:
-		 *
-		 * #1: When PITR is off, we don't XLOG the contents of newly created
-		 * indexes; therefore the drop-and-recreate-whole-directory behavior
-		 * of DBASE_CREATE replay would lose such indexes.
-		 *
-		 * #2: Since we have to recopy the source database during DBASE_CREATE
-		 * replay, we run the risk of copying changes in it that were
-		 * committed after the original CREATE DATABASE command but before the
-		 * system crash that led to the replay.  This is at least unexpected
-		 * and at worst could lead to inconsistencies, eg duplicate table
-		 * names.
-		 *
-		 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
-		 *
-		 * In PITR replay, the first of these isn't an issue, and the second
-		 * is only a risk if the CREATE DATABASE and subsequent template
-		 * database change both occur while a base backup is being taken.
-		 * There doesn't seem to be much we can do about that except document
-		 * it as a limitation.
-		 *
-		 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
-		 * we can avoid this.
+		 * If the WAL_LOG strategy was requested, or is in effect by default,
+		 * call CreateDatabaseUsingWalLog(), which copies the database at the
+		 * block level and WAL-logs each copied block.  Otherwise, call
+		 * CreateDatabaseUsingFileCopy(), which copies the database file by
+		 * file.
 		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+		if (dbstrategy == CREATEDB_WAL_LOG)
+			CreateDatabaseUsingWalLog(src_dboid, dboid, src_deftablespace,
+									  dst_deftablespace);
+		else
+			CreateDatabaseUsingFileCopy(src_dboid, dboid, src_deftablespace,
+										dst_deftablespace);
 
 		/*
 		 * Close pg_database, but keep lock till commit.
@@ -954,6 +1444,21 @@ createdb_failure_callback(int code, Datum arg)
 {
 	createdb_failure_params *fparms = (createdb_failure_params *) DatumGetPointer(arg);
 
+	/*
+	 * If we were copying the database at the block level, drop any pages for
+	 * the destination database that are in the shared buffer cache, and tell
+	 * the checkpointer to forget any pending fsync and unlink requests for
+	 * files in that database.  The reasoning is the same as explained in
+	 * dropdb().  Unlike dropdb(), we don't need to call
+	 * pgstat_drop_database(), because the database has not been created yet
+	 * and so there should not be any stats for it.
+	 */
+	if (fparms->strategy == CREATEDB_WAL_LOG)
+	{
+		DropDatabaseBuffers(fparms->dest_dboid);
+		ForgetDatabaseSyncRequests(fparms->dest_dboid);
+	}
+
 	/*
 	 * Release lock on source database before doing recursive remove. This is
 	 * not essential but it seems desirable to release the lock as soon as
@@ -1478,7 +1983,7 @@ movedb(const char *dbname, const char *tblspcname)
 		 * Record the filesystem change in XLOG
 		 */
 		{
-			xl_dbase_create_rec xlrec;
+			xl_dbase_create_file_copy_rec xlrec;
 
 			xlrec.db_id = db_id;
 			xlrec.tablespace_id = dst_tblspcoid;
@@ -1486,10 +1991,11 @@ movedb(const char *dbname, const char *tblspcname)
 			xlrec.src_tablespace_id = src_tblspcoid;
 
 			XLogBeginInsert();
-			XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
+			XLogRegisterData((char *) &xlrec,
+							 sizeof(xl_dbase_create_file_copy_rec));
 
 			(void) XLogInsert(RM_DBASE_ID,
-							  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
+							  XLOG_DBASE_CREATE_FILE_COPY | XLR_SPECIAL_REL_UPDATE);
 		}
 
 		/*
@@ -1525,9 +2031,10 @@ movedb(const char *dbname, const char *tblspcname)
 
 		/*
 		 * Force another checkpoint here.  As in CREATE DATABASE, this is to
-		 * ensure that we don't have to replay a committed XLOG_DBASE_CREATE
-		 * operation, which would cause us to lose any unlogged operations
-		 * done in the new DB tablespace before the next checkpoint.
+		 * ensure that we don't have to replay a committed
+		 * XLOG_DBASE_CREATE_FILE_COPY operation, which would cause us to lose
+		 * any unlogged operations done in the new DB tablespace before the
+		 * next checkpoint.
 		 */
 		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
 
@@ -2478,9 +2985,10 @@ dbase_redo(XLogReaderState *record)
 	/* Backup blocks are not used in dbase records */
 	Assert(!XLogRecHasAnyBlockRefs(record));
 
-	if (info == XLOG_DBASE_CREATE)
+	if (info == XLOG_DBASE_CREATE_FILE_COPY)
 	{
-		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
+		xl_dbase_create_file_copy_rec *xlrec =
+		(xl_dbase_create_file_copy_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
 		struct stat st;
@@ -2515,6 +3023,18 @@ dbase_redo(XLogReaderState *record)
 		 */
 		copydir(src_path, dst_path, false);
 	}
+	else if (info == XLOG_DBASE_CREATE_WAL_LOG)
+	{
+		xl_dbase_create_wal_log_rec *xlrec =
+		(xl_dbase_create_wal_log_rec *) XLogRecGetData(record);
+		char	   *dbpath;
+
+		dbpath = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+
+		/* Create the database directory with the version file. */
+		CreateDirAndVersionFile(dbpath, xlrec->db_id, xlrec->tablespace_id,
+								true);
+	}
 	else if (info == XLOG_DBASE_DROP)
 	{
 		xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) XLogRecGetData(record);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c68f8..0ad8ef4bbd 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -38,6 +38,7 @@
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
 #include "executor/instrument.h"
 #include "lib/binaryheap.h"
 #include "miscadmin.h"
@@ -486,6 +487,9 @@ static void FindAndDropRelFileNodeBuffers(RelFileNode rnode,
 										  ForkNumber forkNum,
 										  BlockNumber nForkBlock,
 										  BlockNumber firstDelBlock);
+static void RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+										   ForkNumber forkNum,
+										   bool permanent);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int	rnode_comparator(const void *p1, const void *p2);
@@ -772,23 +776,23 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
  * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
  *		a relcache entry for the relation.
  *
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay.  If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
+ * Pass permanent = true for a RELPERSISTENCE_PERMANENT relation, and
+ * permanent = false for a RELPERSISTENCE_UNLOGGED relation. This function
+ * cannot be used for temporary relations (and making that work might be
+ * difficult, unless we only want to read temporary relations for our own
+ * BackendId).
  */
 Buffer
 ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
-						  BufferAccessStrategy strategy)
+						  BufferAccessStrategy strategy, bool permanent)
 {
 	bool		hit;
 
 	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
 
-	Assert(InRecovery);
-
-	return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
+	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -3676,6 +3680,154 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
 	pfree(srels);
 }
 
+/* ---------------------------------------------------------------------
+ *		RelationCopyStorageUsingBuffer
+ *
+ *		Copy fork's data using bufmgr.  Same as RelationCopyStorage but instead
+ *		of using smgrread and smgrextend this will copy using bufmgr APIs.
+ *
+ *		Refer to the comments atop CreateAndCopyRelationData() for details
+ *		about the 'permanent' parameter.
+ * --------------------------------------------------------------------
+ */
+static void
+RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+							   ForkNumber forkNum, bool permanent)
+{
+	Buffer		srcBuf;
+	Buffer		dstBuf;
+	Page		srcPage;
+	Page		dstPage;
+	bool		use_wal;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	BufferAccessStrategy bstrategy_src;
+	BufferAccessStrategy bstrategy_dst;
+
+	/*
+	 * In general, we want to write WAL whenever wal_level > 'minimal', but
+	 * we can skip it when copying any fork of an unlogged relation other
+	 * than the init fork.
+	 */
+	use_wal = XLogIsNeeded() && (permanent || forkNum == INIT_FORKNUM);
+
+	/* Get number of blocks in the source relation. */
+	nblocks = smgrnblocks(src, forkNum);
+
+	/* Nothing to copy; just return. */
+	if (nblocks == 0)
+		return;
+
+	/* This is a bulk operation, so use buffer access strategies. */
+	bstrategy_src = GetAccessStrategy(BAS_BULKREAD);
+	bstrategy_dst = GetAccessStrategy(BAS_BULKWRITE);
+
+	/* Iterate over each block of the source relation file. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		/* Read block from source relation. */
+		srcBuf = ReadBufferWithoutRelcache(src->smgr_rnode.node, forkNum,
+										   blkno, RBM_NORMAL, bstrategy_src,
+										   permanent);
+		srcPage = BufferGetPage(srcBuf);
+		if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
+		{
+			ReleaseBuffer(srcBuf);
+			continue;
+		}
+
+		/* Use P_NEW to extend the destination relation. */
+		dstBuf = ReadBufferWithoutRelcache(dst->smgr_rnode.node, forkNum,
+										   P_NEW, RBM_NORMAL, bstrategy_dst,
+										   permanent);
+		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
+
+		START_CRIT_SECTION();
+
+		/* Initialize the page and write the data. */
+		dstPage = BufferGetPage(dstBuf);
+		/* XXX. What's the point of calling PageInit here? */
+		PageInit(dstPage, BufferGetPageSize(dstBuf), 0);
+		memcpy(dstPage, srcPage, BLCKSZ);
+		MarkBufferDirty(dstBuf);
+
+		/* WAL-log the copied page. */
+		if (use_wal)
+			log_newpage_buffer(dstBuf, true);
+
+		END_CRIT_SECTION();
+
+		UnlockReleaseBuffer(dstBuf);
+		ReleaseBuffer(srcBuf);
+	}
+}
+
+/* ---------------------------------------------------------------------
+ *		CreateAndCopyRelationData
+ *
+ *		Create destination relation storage and copy all forks from the
+ *		source relation to the destination.
+ *
+ *		Pass permanent as true for permanent relations and false for
+ *		unlogged relations.  Currently this API is not supported for
+ *		temporary relations.
+ * --------------------------------------------------------------------
+ */
+void
+CreateAndCopyRelationData(RelFileNode src_rnode, RelFileNode dst_rnode,
+						  bool permanent)
+{
+	SMgrRelation	src_smgr;
+	SMgrRelation	dst_smgr;
+	char			relpersistence;
+
+	/* Open the source relation at smgr level. */
+	src_smgr = smgropen(src_rnode, InvalidBackendId);
+
+	/* Set the relpersistence. */
+	relpersistence = permanent ?
+		RELPERSISTENCE_PERMANENT : RELPERSISTENCE_UNLOGGED;
+
+	/*
+	 * Create and copy all forks of the relation.
+	 *
+	 * NOTE: any conflict in relfilenode value will be caught in
+	 * RelationCreateStorage().
+	 */
+	dst_smgr = RelationCreateStorage(dst_rnode, relpersistence);
+
+	/* copy main fork */
+	RelationCopyStorageUsingBuffer(src_smgr, dst_smgr, MAIN_FORKNUM,
+								   permanent);
+
+	/* copy those extra forks that exist */
+	for (ForkNumber forkNum = MAIN_FORKNUM + 1;
+		 forkNum <= MAX_FORKNUM; forkNum++)
+	{
+		if (smgrexists(src_smgr, forkNum))
+		{
+			smgrcreate(dst_smgr, forkNum, false);
+
+			/*
+			 * WAL log creation if the relation is persistent, or this is the
+			 * init fork of an unlogged relation.
+			 */
+			if (permanent || forkNum == INIT_FORKNUM)
+				log_smgrcreate(&dst_rnode, forkNum);
+
+			/* Copy a fork's data, block by block. */
+			RelationCopyStorageUsingBuffer(src_smgr, dst_smgr, forkNum,
+										   permanent);
+		}
+	}
+
+	/* Close the smgr rel */
+	smgrclose(src_smgr);
+	smgrclose(dst_smgr);
+}
+
 /* ---------------------------------------------------------------------
  *		FlushDatabaseBuffers
  *
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 5ae52dd14d..1543da6162 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -175,6 +175,34 @@ ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode)
 	return true;
 }
 
+/*
+ *		LockRelationId
+ *
+ * Lock, given a LockRelId.  Same as LockRelationOid but takes a LockRelId as
+ * input.
+ */
+void
+LockRelationId(LockRelId *relid, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+	LOCALLOCK  *locallock;
+	LockAcquireResult res;
+
+	SET_LOCKTAG_RELATION(tag, relid->dbId, relid->relId);
+
+	res = LockAcquireExtended(&tag, lockmode, false, false, true, &locallock);
+
+	/*
+	 * Now that we have the lock, check for invalidation messages; see notes
+	 * in LockRelationOid.
+	 */
+	if (res != LOCKACQUIRE_ALREADY_CLEAR)
+	{
+		AcceptInvalidationMessages();
+		MarkLockClear(locallock);
+	}
+}
+
 /*
  *		UnlockRelationId
  *
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 4d0718f001..dee3387d02 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -251,6 +251,63 @@ RelationMapFilenodeToOid(Oid filenode, bool shared)
 	return InvalidOid;
 }
 
+/*
+ * RelationMapOidToFilenodeForDatabase
+ *
+ * Like RelationMapOidToFilenode, but reads the mapping from the indicated
+ * path instead of using the one for the current database.
+ */
+Oid
+RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId)
+{
+	RelMapFile	map;
+	int			i;
+
+	/* Read the relmap file from the source database. */
+	read_relmap_file(&map, dbpath, false, ERROR);
+
+	/* Iterate over the relmap entries to find the input relation OID. */
+	for (i = 0; i < map.num_mappings; i++)
+	{
+		if (relationId == map.mappings[i].mapoid)
+			return map.mappings[i].mapfilenode;
+	}
+
+	return InvalidOid;
+}
+
+/*
+ * RelationMapCopy
+ *
+ * Copy relmapfile from source db path to the destination db path and WAL log
+ * the operation. This is intended for use in creating a new relmap file
+ * for a database that doesn't have one yet, not for replacing an existing
+ * relmap file.
+ */
+void
+RelationMapCopy(Oid dbid, Oid tsid, char *srcdbpath, char *dstdbpath)
+{
+	RelMapFile map;
+
+	/*
+	 * Read the relmap file from the source database.
+	 */
+	read_relmap_file(&map, srcdbpath, false, ERROR);
+
+	/*
+	 * Write the same data into the destination database's relmap file.
+	 *
+	 * No sinval is needed because no one can be connected to the destination
+	 * database yet. For the same reason, there is no need to acquire
+	 * RelationMappingLock.
+	 *
+	 * There's no point in trying to preserve files here. The new database
+	 * isn't usable yet anyway, and won't ever be if we can't install a
+	 * relmap file.
+	 */
+	write_relmap_file(&map, true, false, false, dbid, tsid, dstdbpath);
+}
+
 /*
  * RelationMapUpdateMap
  *
@@ -1031,6 +1088,13 @@ relmap_redo(XLogReaderState *record)
 		 *
 		 * There shouldn't be anyone else updating relmaps during WAL replay,
 		 * but grab the lock to interlock against load_relmap_file().
+		 *
+		 * Note that we use the same WAL record for updating the relmap of
+		 * an existing database as we do for creating a new database. In
+		 * the latter case, taking the relmap lock and sending sinval messages
+		 * is unnecessary, but harmless. If we wanted to avoid it, we could
+		 * add a flag to the WAL record to indicate which operation is being
+		 * performed.
 		 */
 		LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
 		write_relmap_file(&newmap, false, true, false,
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 7cfa169e9b..bd1ec42ac6 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -370,7 +370,7 @@ extractPageInfo(XLogReaderState *record)
 
 	/* Is this a special record type that I recognize? */
 
-	if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_CREATE)
+	if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_CREATE_FILE_COPY)
 	{
 		/*
 		 * New databases can be safely ignored. It won't be present in the
@@ -382,6 +382,13 @@ extractPageInfo(XLogReaderState *record)
 		 * overwriting the database created in the target system.
 		 */
 	}
+	else if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_CREATE_WAL_LOG)
+	{
+		/*
+		 * A new database can be safely ignored. It won't be present in the
+		 * source system, so it will be deleted.
+		 */
+	}
 	else if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_DROP)
 	{
 		/*
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 183abcc275..ee06b0f0a4 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2773,13 +2773,15 @@ psql_completion(const char *text, int start, int end)
 	/* CREATE DATABASE */
 	else if (Matches("CREATE", "DATABASE", MatchAny))
 		COMPLETE_WITH("OWNER", "TEMPLATE", "ENCODING", "TABLESPACE",
-					  "IS_TEMPLATE",
+					  "IS_TEMPLATE", "STRATEGY",
 					  "ALLOW_CONNECTIONS", "CONNECTION LIMIT",
 					  "LC_COLLATE", "LC_CTYPE", "LOCALE", "OID",
 					  "LOCALE_PROVIDER", "ICU_LOCALE");
 
 	else if (Matches("CREATE", "DATABASE", MatchAny, "TEMPLATE"))
 		COMPLETE_WITH_QUERY(Query_for_list_of_template_databases);
+	else if (Matches("CREATE", "DATABASE", MatchAny, "STRATEGY"))
+		COMPLETE_WITH("WAL_LOG", "FILE_COPY");
 
 	/* CREATE DOMAIN */
 	else if (Matches("CREATE", "DOMAIN", MatchAny))
diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c
index 6f612abf7c..883752099c 100644
--- a/src/bin/scripts/createdb.c
+++ b/src/bin/scripts/createdb.c
@@ -34,6 +34,7 @@ main(int argc, char *argv[])
 		{"tablespace", required_argument, NULL, 'D'},
 		{"template", required_argument, NULL, 'T'},
 		{"encoding", required_argument, NULL, 'E'},
+		{"strategy", required_argument, NULL, 'S'},
 		{"lc-collate", required_argument, NULL, 1},
 		{"lc-ctype", required_argument, NULL, 2},
 		{"locale", required_argument, NULL, 'l'},
@@ -60,6 +61,7 @@ main(int argc, char *argv[])
 	char	   *tablespace = NULL;
 	char	   *template = NULL;
 	char	   *encoding = NULL;
+	char	   *strategy = NULL;
 	char	   *lc_collate = NULL;
 	char	   *lc_ctype = NULL;
 	char	   *locale = NULL;
@@ -77,7 +79,7 @@ main(int argc, char *argv[])
 
 	handle_help_version_opts(argc, argv, "createdb", help);
 
-	while ((c = getopt_long(argc, argv, "h:p:U:wWeO:D:T:E:l:", long_options, &optindex)) != -1)
+	while ((c = getopt_long(argc, argv, "h:p:U:wWeO:D:T:E:l:S:", long_options, &optindex)) != -1)
 	{
 		switch (c)
 		{
@@ -111,6 +113,9 @@ main(int argc, char *argv[])
 			case 'E':
 				encoding = pg_strdup(optarg);
 				break;
+			case 'S':
+				strategy = pg_strdup(optarg);
+				break;
 			case 1:
 				lc_collate = pg_strdup(optarg);
 				break;
@@ -215,6 +220,8 @@ main(int argc, char *argv[])
 		appendPQExpBufferStr(&sql, " ENCODING ");
 		appendStringLiteralConn(&sql, encoding, conn);
 	}
+	if (strategy)
+		appendPQExpBuffer(&sql, " STRATEGY %s", fmtId(strategy));
 	if (template)
 		appendPQExpBuffer(&sql, " TEMPLATE %s", fmtId(template));
 	if (lc_collate)
@@ -294,6 +301,7 @@ help(const char *progname)
 	printf(_("      --locale-provider={libc|icu}\n"
 			 "                               locale provider for the database's default collation\n"));
 	printf(_("  -O, --owner=OWNER            database user to own the new database\n"));
+	printf(_("  -S, --strategy=STRATEGY      database creation strategy wal_log or file_copy\n"));
 	printf(_("  -T, --template=TEMPLATE      template database to copy\n"));
 	printf(_("  -V, --version                output version information, then exit\n"));
 	printf(_("  -?, --help                   show this help, then exit\n"));
diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index 35deec9a92..c662c61c2f 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -104,4 +104,24 @@ $node->command_checks_all(
 	],
 	'createdb with incorrect --lc-ctype');
 
+$node->command_checks_all(
+	[ 'createdb', '--strategy', "foo", 'foobar2' ],
+	1,
+	[qr/^$/],
+	[
+		qr/^createdb: error: database creation failed: ERROR:  invalid create database strategy|^createdb: error: database creation failed: ERROR:  invalid create database strategy foo/s
+	],
+	'createdb with incorrect --strategy');
+
+# Check database creation strategy
+$node->issues_sql_like(
+	[ 'createdb', '-T', 'foobar2', 'foobar4', '-S', 'wal_log'],
+	qr/statement: CREATE DATABASE foobar4 STRATEGY wal_log\s+TEMPLATE foobar2/,
+	'create database with WAL_LOG strategy');
+
+$node->issues_sql_like(
+	[ 'createdb', '-T', 'foobar2', 'foobar5', '-S', 'file_copy'],
+	qr/statement: CREATE DATABASE foobar5 STRATEGY file_copy\s+TEMPLATE foobar2/,
+	'create database with FILE_COPY strategy');
+
 done_testing();
diff --git a/src/include/commands/dbcommands_xlog.h b/src/include/commands/dbcommands_xlog.h
index 593a8578a4..0ee2452feb 100644
--- a/src/include/commands/dbcommands_xlog.h
+++ b/src/include/commands/dbcommands_xlog.h
@@ -18,17 +18,32 @@
 #include "lib/stringinfo.h"
 
 /* record types */
-#define XLOG_DBASE_CREATE		0x00
-#define XLOG_DBASE_DROP			0x10
+#define XLOG_DBASE_CREATE_FILE_COPY		0x00
+#define XLOG_DBASE_CREATE_WAL_LOG		0x10
+#define XLOG_DBASE_DROP					0x20
 
-typedef struct xl_dbase_create_rec
+/*
+ * Single WAL record for an entire CREATE DATABASE operation. This is used
+ * by the FILE_COPY strategy.
+ */
+typedef struct xl_dbase_create_file_copy_rec
 {
-	/* Records copying of a single subdirectory incl. contents */
 	Oid			db_id;
 	Oid			tablespace_id;
 	Oid			src_db_id;
 	Oid			src_tablespace_id;
-} xl_dbase_create_rec;
+} xl_dbase_create_file_copy_rec;
+
+/*
+ * WAL record for the beginning of a CREATE DATABASE operation, when the
+ * WAL_LOG strategy is used. Each individual block will be logged separately
+ * afterward.
+ */
+typedef struct xl_dbase_create_wal_log_rec
+{
+	Oid			db_id;
+	Oid			tablespace_id;
+} xl_dbase_create_wal_log_rec;
 
 typedef struct xl_dbase_drop_rec
 {
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index dd01841c30..a6b657f0ba 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -184,7 +184,8 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BufferAccessStrategy strategy);
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 										ForkNumber forkNum, BlockNumber blockNum,
-										ReadBufferMode mode, BufferAccessStrategy strategy);
+										ReadBufferMode mode, BufferAccessStrategy strategy,
+										bool permanent);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
@@ -203,6 +204,9 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
 extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels);
+extern void CreateAndCopyRelationData(RelFileNode src_rnode,
+									  RelFileNode dst_rnode,
+									  bool permanent);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
 								   int nforks, BlockNumber *firstDelBlock);
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 49edbcc81b..be1d2c99a9 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -38,6 +38,7 @@ extern void RelationInitLockInfo(Relation relation);
 
 /* Lock a relation */
 extern void LockRelationOid(Oid relid, LOCKMODE lockmode);
+extern void LockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern bool ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode);
 extern void UnlockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern void UnlockRelationOid(Oid relid, LOCKMODE lockmode);
diff --git a/src/include/utils/relmapper.h b/src/include/utils/relmapper.h
index 9fbb5a7f9b..f10353e139 100644
--- a/src/include/utils/relmapper.h
+++ b/src/include/utils/relmapper.h
@@ -38,7 +38,9 @@ typedef struct xl_relmap_update
 extern Oid	RelationMapOidToFilenode(Oid relationId, bool shared);
 
 extern Oid	RelationMapFilenodeToOid(Oid relationId, bool shared);
-
+extern Oid RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId);
+extern void RelationMapCopy(Oid dbid, Oid tsid, char *srcdbpath,
+							char *dstdbpath);
 extern void RelationMapUpdateMap(Oid relationId, Oid fileNode, bool shared,
 								 bool immediate);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 93d5190508..07472055dd 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -460,6 +460,8 @@ CoverPos
 CreateAmStmt
 CreateCastStmt
 CreateConversionStmt
+CreateDBRelInfo
+CreateDBStrategy
 CreateDomainStmt
 CreateEnumStmt
 CreateEventTrigStmt
@@ -3701,7 +3703,8 @@ xl_btree_update
 xl_btree_vacuum
 xl_clog_truncate
 xl_commit_ts_truncate
-xl_dbase_create_rec
+xl_dbase_create_file_copy_rec
+xl_dbase_create_wal_log_rec
 xl_dbase_drop_rec
 xl_end_of_recovery
 xl_hash_add_ovfl_page
-- 
2.24.3 (Apple Git-128)

#177

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Robert Haas (#176)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Tue, Mar 22, 2022 at 11:23 AM Robert Haas <robertmhaas@gmail.com> wrote:

Here's my worked-over version of your previous patch. I haven't tried
to incorporate your incremental patch that you just posted.

Also, please have a look at the XXX comments that I added in a few
places where I think you need to make further changes.

--
Robert Haas
EDB: http://www.enterprisedb.com

#178

Andres Freund

andres@anarazel.de

almost 4 years ago

In reply to: Robert Haas (#176)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hi,

On 2022-03-22 11:23:16 -0400, Robert Haas wrote:

From 116bcdb6174a750b7ef7ae05ef6f39cebaf9bcf5 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 22 Mar 2022 11:22:26 -0400
Subject: [PATCH v1] Add new block-by-block strategy for CREATE DATABASE.

I might have missed it because I just skimmed the patch. But I still think it
should contain a comment detailing why accessing catalogs from another
database is safe in this instance, and perhaps a comment or three in places
that could break it (e.g. snapshot computation, horizon stuff).

Greetings,

Andres Freund

#179

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Andres Freund (#178)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Tue, Mar 22, 2022 at 11:42 AM Andres Freund <andres@anarazel.de> wrote:

I might have missed it because I just skimmed the patch. But I still think it
should contain a comment detailing why accessing catalogs from another
database is safe in this instance, and perhaps a comment or three in places
that could break it (e.g. snapshot computation, horizon stuff).

Please see the function header comment for ScanSourceDatabasePgClass.
I don't quite see how changes in those places would break this, but if
you want to be more specific perhaps I will see the light?

--
Robert Haas
EDB: http://www.enterprisedb.com

#180

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Robert Haas (#173)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Mon, Mar 21, 2022 at 2:23 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Mar 21, 2022 at 11:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I tried to debug the case, but I realized that CHECK_FOR_INTERRUPTS()
does not call AcceptInvalidationMessages(), and I could not find such
a call while looking into the code either. While debugging I noticed
that AcceptInvalidationMessages() is called multiple times, but only
through LockRelationId(), and by the time we lock the relation we have
already closed the previous smgr, because we keep only one smgr open
at a time. That's the reason it is not hitting the issue we thought it
could. Is there any condition under which AcceptInvalidationMessages()
will be called through CHECK_FOR_INTERRUPTS()? I could not see one,
either while debugging or in the code.

Yeah, I think the reason you can't find it is that it's not there. I
was confused in what I wrote earlier. I think we only process sinval
catchups when we're idle, not at every CHECK_FOR_INTERRUPTS(). And I
think the reason for that is precisely that it would be hard to write
correct code otherwise, since invalidations might then get processed
in a lot more places. So ... I guess all we really need to do here is
avoid assuming that the results of smgropen() are valid across any
code that might acquire relation locks. Which I think the code already
does.

So I talked to Andres and Thomas about this and they told me that I
was right to worry about this problem. Over on the thread about "wrong
fds used for refilenodes after pg_upgrade relfilenode changes
Reply-To:" there is a plan to use ProcSignalBarrier to make smgr
objects disappear, and ProcSignalBarrier can be processed at any
CHECK_FOR_INTERRUPTS(), so then we'd have a problem here. Commit
f10f0ae420ee62400876ab34dca2c09c20dcd030 established a policy that you
should always re-fetch the smgr object instead of reusing one you've
already got, and even before that it was known to be unsafe to keep
them around for any period of time, because anything that opened a
relation, including a syscache lookup, could potentially accept
invalidations. So most of our code is already hardened against the
possibility of smgr objects disappearing. I have a feeling there may
be some that isn't, but it would be good if this patch didn't
introduce more such code at the same time that patch is trying to
introduce more ways to get rid of smgr objects. It was suggested to me
that what this patch ought to be doing is calling
CreateFakeRelcacheEntry() and then using RelationGetSmgr(fakerel)
every time we need the SmgrRelation, without ever keeping it around
for any amount of code. That way, if the smgr relation gets closed out
from under us at a CHECK_FOR_INTERRUPTS(), we'll just recreate it at
the next RelationGetSmgr() call.
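
The re-fetch pattern described above can be sketched with a small
standalone toy model. This is not PostgreSQL code: ToyHandle, ToyRel,
toy_get_handle(), toy_check_for_interrupts() and toy_copy_blocks() are
all hypothetical names made up for illustration, standing in roughly
for the smgr object, the fake relcache entry, RelationGetSmgr() and
CHECK_FOR_INTERRUPTS(). The sketch assumes the worst case, where every
interrupt check closes the handle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an smgr object: a cached storage handle. */
typedef struct ToyHandle
{
	bool		valid;			/* false once the handle has been closed */
	int			nblocks;		/* pretend size of the relation */
} ToyHandle;

/* Hypothetical stand-in for a fake relcache entry that owns the handle. */
typedef struct ToyRel
{
	ToyHandle	storage;		/* backing object, reused when reopened */
	ToyHandle  *handle;			/* NULL whenever the handle is closed */
	int			reopen_count;	/* how many times we had to reopen */
} ToyRel;

/* Plays the role of RelationGetSmgr(): reopen on demand, then return. */
static ToyHandle *
toy_get_handle(ToyRel *rel)
{
	if (rel->handle == NULL)
	{
		rel->storage.valid = true;
		rel->handle = &rel->storage;
		rel->reopen_count++;
	}
	return rel->handle;
}

/* Plays the role of an interrupt check that closes the handle under us. */
static void
toy_check_for_interrupts(ToyRel *rel)
{
	if (rel->handle != NULL)
	{
		rel->handle->valid = false;
		rel->handle = NULL;
	}
}

/* Safe pattern: never keep the handle across an interrupt check. */
static int
toy_copy_blocks(ToyRel *rel)
{
	int			copied = 0;
	int			nblocks = toy_get_handle(rel)->nblocks;

	for (int blkno = 0; blkno < nblocks; blkno++)
	{
		toy_check_for_interrupts(rel);	/* handle may vanish here */

		/* Re-fetch instead of reusing a pointer saved before the check. */
		ToyHandle  *h = toy_get_handle(rel);

		assert(h->valid);
		copied++;
	}
	return copied;
}
```

In this model, a pointer saved before the loop would be marked invalid
after the first interrupt check, whereas looking it up again on each
use simply reopens it.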

Andres also noted that he thinks the patch performs redundant cleanup,
because of the fact that it uses RelationCreateStorage. That will
arrange to remove files on abort, but createdb() also has its own
mechanism for that. It doesn't seem like a thing to do twice in two
different ways.

--
Robert Haas
EDB: http://www.enterprisedb.com

#181

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Robert Haas (#176)

1 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Tue, Mar 22, 2022 at 8:53 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 22, 2022 at 5:00 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

In my previous patch I mistakenly used src_dboid instead of
dest_dboid. Fixed in this version. For the destination db I have used
AccessShareLock as the lock mode. Logically, we don't want anyone else
to be accessing that db, but that is already guaranteed because it is
not visible to anyone else. So I think AccessShareLock should be
correct here, because we are taking this lock only because we are
accessing pages in shared buffers from this database's relations.

Here's my worked-over version of your previous patch. I haven't tried
to incorporate your incremental patch that you just posted.

Thanks for working on the comments. Please find the updated version,
which includes the following changes:
- Worked on the XXX comments added by you.
- Added database level lock for the target database as well.
- Used a fake relcache entry and removed direct access to the smgr. I
think it was not really necessary in ScanSourceDatabasePgClass(),
because we use it for a very short period of time, but I have changed
it anyway; let me know if you think that it is unnecessary to create
the fake relcache entry here.
- Removed extra space in createdb.c and fixed test case.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v2-0001-Add-new-block-by-block-strategy-for-CREATE-DATABA.patch (text/x-patch; charset=US-ASCII)

From 0305c910ab144d183b252611fa26c210f1cb0af2 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 22 Mar 2022 11:22:26 -0400
Subject: [PATCH v2] Add new block-by-block strategy for CREATE DATABASE.

Because this strategy logs changes on a block-by-block basis, it
avoids the need to checkpoint before and after the operation.
However, because it logs each changed block individually, it might
generate a lot of extra write-ahead logging if the template database
is large. Therefore, the older strategy remains available via a new
STRATEGY parameter to CREATE DATABASE, and a corresponding --strategy
option to createdb.

Somewhat controversially, this patch assembles the list of relations
to be copied to the new database by reading the pg_class relation of
the template database. Cross-database access like this isn't normally
possible, but it can be made to work here because there can't be any
connections to the database being copied, nor can it contain any
in-doubt transactions. Even so, we have to use lower-level interfaces
than normal, since the table scan and relcache interfaces will not
work for a database to which we're not connected. The advantage of
this approach is that we do not need to rely on the filesystem to
determine what ought to be copied, but instead on PostgreSQL's own
knowledge of the database structure. This avoids, for example,
copying stray files that happen to be located in the source database
directory.

Dilip Kumar, with a fairly large number of cosmetic changes by me.
---
 contrib/bloom/blinsert.c                 |   2 +-
 doc/src/sgml/ref/create_database.sgml    |  22 +
 doc/src/sgml/ref/createdb.sgml           |  11 +
 src/backend/access/heap/heapam_handler.c |   2 +-
 src/backend/access/nbtree/nbtree.c       |   2 +-
 src/backend/access/rmgrdesc/dbasedesc.c  |  20 +-
 src/backend/access/transam/xlogutils.c   |   6 +-
 src/backend/commands/dbcommands.c        | 761 ++++++++++++++++++++++++++-----
 src/backend/storage/buffer/bufmgr.c      | 172 ++++++-
 src/backend/storage/lmgr/lmgr.c          |  28 ++
 src/backend/utils/activity/wait_event.c  |   3 +
 src/backend/utils/cache/relmapper.c      |  64 +++
 src/bin/pg_rewind/parsexlog.c            |   9 +-
 src/bin/psql/tab-complete.c              |   4 +-
 src/bin/scripts/createdb.c               |  10 +-
 src/bin/scripts/t/020_createdb.pl        |  20 +
 src/include/commands/dbcommands_xlog.h   |  25 +-
 src/include/storage/bufmgr.h             |   6 +-
 src/include/storage/lmgr.h               |   1 +
 src/include/utils/relmapper.h            |   4 +-
 src/include/utils/wait_event.h           |   1 +
 src/tools/pgindent/typedefs.list         |   5 +-
 22 files changed, 1039 insertions(+), 139 deletions(-)

diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index c94cf34..82378db 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -173,7 +173,7 @@ blbuildempty(Relation index)
 	 * Write the page and log it.  It might seem that an immediate sync would
 	 * be sufficient to guarantee that the file exists on disk, but recovery
 	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
+	 * XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
 	PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index 5ae785a..255ad3a 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -25,6 +25,7 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
     [ [ WITH ] [ OWNER [=] <replaceable class="parameter">user_name</replaceable> ]
            [ TEMPLATE [=] <replaceable class="parameter">template</replaceable> ]
            [ ENCODING [=] <replaceable class="parameter">encoding</replaceable> ]
+           [ STRATEGY [=] <replaceable class="parameter">strategy</replaceable> ]
            [ LOCALE [=] <replaceable class="parameter">locale</replaceable> ]
            [ LC_COLLATE [=] <replaceable class="parameter">lc_collate</replaceable> ]
            [ LC_CTYPE [=] <replaceable class="parameter">lc_ctype</replaceable> ]
@@ -118,6 +119,27 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
        </para>
       </listitem>
      </varlistentry>
+     <varlistentry id="create-database-strategy" xreflabel="CREATE DATABASE STRATEGY">
+      <term><replaceable class="parameter">strategy</replaceable></term>
+      <listitem>
+       <para>
+        Strategy to be used in creating the new database.  If
+        the <literal>WAL_LOG</literal> strategy is used, the database will be
+        copied block by block and each block will be separately written
+        to the write-ahead log. This is the most efficient strategy in
+        cases where the template database is small, and therefore it is the
+        default. The older <literal>FILE_COPY</literal> strategy is also
+        available. This strategy writes a small record to the write-ahead log
+        for each tablespace used by the target database. Each such record
+        represents copying an entire directory to a new location at the
+        filesystem level. While this does reduce the write-ahead
+        log volume substantially, especially if the template database is large,
+        it also forces the system to perform a checkpoint both before and
+        after the creation of the new database. In some situations, this may
+        have a noticeable negative impact on overall system performance.
+       </para>
+      </listitem>
+     </varlistentry>
      <varlistentry>
       <term><replaceable class="parameter">locale</replaceable></term>
       <listitem>
diff --git a/doc/src/sgml/ref/createdb.sgml b/doc/src/sgml/ref/createdb.sgml
index be42e50..671cd362 100644
--- a/doc/src/sgml/ref/createdb.sgml
+++ b/doc/src/sgml/ref/createdb.sgml
@@ -178,6 +178,17 @@ PostgreSQL documentation
      </varlistentry>
 
      <varlistentry>
+      <term><option>-S <replaceable class="parameter">strategy</replaceable></option></term>
+      <term><option>--strategy=<replaceable class="parameter">strategy</replaceable></option></term>
+      <listitem>
+       <para>
+        Specifies the database creation strategy.  See
+        <xref linkend="create-database-strategy" /> for more details.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><option>-T <replaceable class="parameter">template</replaceable></option></term>
       <term><option>--template=<replaceable class="parameter">template</replaceable></option></term>
       <listitem>
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 39ef8a0..2b70ca0 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -601,7 +601,7 @@ heapam_relation_set_new_filenode(Relation rel,
 	 * even if the page has been logged, because the write did not go through
 	 * shared_buffers and therefore a concurrent checkpoint may have moved the
 	 * redo pointer past our xlog record.  Recovery may as well remove it
-	 * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
+	 * while replaying, for example, XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE
 	 * record. Therefore, logging is necessary even if wal_level=minimal.
 	 */
 	if (persistence == RELPERSISTENCE_UNLOGGED)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c9b4964..dacf3f7 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -161,7 +161,7 @@ btbuildempty(Relation index)
 	 * Write the page and log it.  It might seem that an immediate sync would
 	 * be sufficient to guarantee that the file exists on disk, but recovery
 	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
+	 * XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
 	PageSetChecksumInplace(metapage, BTREE_METAPAGE);
diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index 03af3fd..523d0b3 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -24,14 +24,23 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 	char	   *rec = XLogRecGetData(record);
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
 
-	if (info == XLOG_DBASE_CREATE)
+	if (info == XLOG_DBASE_CREATE_FILE_COPY)
 	{
-		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;
+		xl_dbase_create_file_copy_rec *xlrec =
+		(xl_dbase_create_file_copy_rec *) rec;
 
 		appendStringInfo(buf, "copy dir %u/%u to %u/%u",
 						 xlrec->src_tablespace_id, xlrec->src_db_id,
 						 xlrec->tablespace_id, xlrec->db_id);
 	}
+	else if (info == XLOG_DBASE_CREATE_WAL_LOG)
+	{
+		xl_dbase_create_wal_log_rec *xlrec =
+		(xl_dbase_create_wal_log_rec *) rec;
+
+		appendStringInfo(buf, "create dir %u/%u",
+						 xlrec->tablespace_id, xlrec->db_id);
+	}
 	else if (info == XLOG_DBASE_DROP)
 	{
 		xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) rec;
@@ -51,8 +60,11 @@ dbase_identify(uint8 info)
 
 	switch (info & ~XLR_INFO_MASK)
 	{
-		case XLOG_DBASE_CREATE:
-			id = "CREATE";
+		case XLOG_DBASE_CREATE_FILE_COPY:
+			id = "CREATE_FILE_COPY";
+			break;
+		case XLOG_DBASE_CREATE_WAL_LOG:
+			id = "CREATE_WAL_LOG";
 			break;
 		case XLOG_DBASE_DROP:
 			id = "DROP";
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 511f2f1..a4dedc5 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -484,7 +484,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	{
 		/* page exists in file */
 		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-										   mode, NULL);
+										   mode, NULL, true);
 	}
 	else
 	{
@@ -509,7 +509,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				ReleaseBuffer(buffer);
 			}
 			buffer = ReadBufferWithoutRelcache(rnode, forknum,
-											   P_NEW, mode, NULL);
+											   P_NEW, mode, NULL, true);
 		}
 		while (BufferGetBlockNumber(buffer) < blkno);
 		/* Handle the corner case that P_NEW returns non-consecutive pages */
@@ -519,7 +519,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			ReleaseBuffer(buffer);
 			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-											   mode, NULL);
+											   mode, NULL, true);
 		}
 	}
 
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 623e5ec..02a096c 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -63,13 +63,31 @@
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/pg_locale.h"
+#include "utils/relmapper.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
+/*
+ * Create database strategy.
+ *
+ * CREATEDB_WAL_LOG will copy the database at the block level and WAL log each
+ * copied block.
+ *
+ * CREATEDB_FILE_COPY will simply perform a file system level copy of the
+ * database and log a single record for each tablespace copied. To make this
+ * safe, it also triggers checkpoints before and after the operation.
+ */
+typedef enum CreateDBStrategy
+{
+	CREATEDB_WAL_LOG,
+	CREATEDB_FILE_COPY
+} CreateDBStrategy;
+
 typedef struct
 {
 	Oid			src_dboid;		/* source (template) DB */
 	Oid			dest_dboid;		/* DB we are trying to create */
+	CreateDBStrategy strategy;	/* create db strategy */
 } createdb_failure_params;
 
 typedef struct
@@ -78,6 +96,17 @@ typedef struct
 	Oid			dest_tsoid;		/* tablespace we are trying to move to */
 } movedb_failure_params;
 
+/*
+ * Information about a relation to be copied when creating a database.
+ */
+typedef struct CreateDBRelInfo
+{
+	RelFileNode rnode;			/* physical relation identifier */
+	Oid			reloid;			/* relation oid */
+	bool		permanent;		/* relation is permanent or unlogged */
+} CreateDBRelInfo;
+
+
 /* non-export function prototypes */
 static void createdb_failure_callback(int code, Datum arg);
 static void movedb(const char *dbname, const char *tblspcname);
@@ -93,7 +122,540 @@ static bool have_createdb_privilege(void);
 static void remove_dbtablespaces(Oid db_id);
 static bool check_db_file_conflict(Oid db_id);
 static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
+static void CreateDatabaseUsingWalLog(Oid src_dboid, Oid dboid, Oid src_tsid,
+									  Oid dst_tsid);
+static List *ScanSourceDatabasePgClass(Oid srctbid, Oid srcdbid, char *srcpath);
+static List *ScanSourceDatabasePgClassPage(Page page, Buffer buf, Oid tbid,
+										   Oid dbid, char *srcpath,
+										   List *rnodelist, Snapshot snapshot);
+static CreateDBRelInfo *ScanSourceDatabasePgClassTuple(HeapTupleData *tuple,
+													   Oid tbid, Oid dbid,
+													   char *srcpath);
+static void CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid,
+									bool isRedo);
+static void CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dboid, Oid src_tsid,
+										Oid dst_tsid);
+
+/*
+ * Create a new database using the WAL_LOG strategy.
+ *
+ * Each copied block is separately written to the write-ahead log.
+ */
+static void
+CreateDatabaseUsingWalLog(Oid src_dboid, Oid dst_dboid,
+						  Oid src_tsid, Oid dst_tsid)
+{
+	char	   *srcpath;
+	char	   *dstpath;
+	List	   *rnodelist = NULL;
+	ListCell   *cell;
+	LockRelId	srcrelid;
+	LockRelId	dstrelid;
+	RelFileNode srcrnode;
+	RelFileNode dstrnode;
+	CreateDBRelInfo *relinfo;
+
+	/* Get source and destination database paths. */
+	srcpath = GetDatabasePath(src_dboid, src_tsid);
+	dstpath = GetDatabasePath(dst_dboid, dst_tsid);
+
+	/* Create database directory and write PG_VERSION file. */
+	CreateDirAndVersionFile(dstpath, dst_dboid, dst_tsid, false);
+
+	/* Copy relmap file from source database to the destination database. */
+	RelationMapCopy(dst_dboid, dst_tsid, srcpath, dstpath);
+
+	/* Get list of relfilenodes to copy from the source database. */
+	rnodelist = ScanSourceDatabasePgClass(src_tsid, src_dboid, srcpath);
+	Assert(rnodelist != NIL);
+
+	/*
+	 * Database IDs are common to all the relations, so set them before
+	 * entering the loop.
+	 */
+	srcrelid.dbId = src_dboid;
+	dstrelid.dbId = dst_dboid;
+
+	/* Loop over our list of relfilenodes and copy each one. */
+	foreach(cell, rnodelist)
+	{
+		relinfo = lfirst(cell);
+		srcrnode = relinfo->rnode;
+
+		/*
+		 * If the relation is from the source db's default tablespace then we
+		 * need to create it in the destination db's default tablespace.
+		 * Otherwise, we need to create in the same tablespace as it is in the
+		 * source database.
+		 */
+		if (srcrnode.spcNode == src_tsid)
+			dstrnode.spcNode = dst_tsid;
+		else
+			dstrnode.spcNode = srcrnode.spcNode;
 
+		dstrnode.dbNode = dst_dboid;
+		dstrnode.relNode = srcrnode.relNode;
+
+		/* Acquire locks on source and target relations before copying. */
+		dstrelid.relId = srcrelid.relId = relinfo->reloid;
+		LockRelationId(&srcrelid, AccessShareLock);
+		LockRelationId(&dstrelid, AccessShareLock);
+
+		/* Copy relation storage from source to the destination. */
+		CreateAndCopyRelationData(srcrnode, dstrnode, relinfo->permanent);
+
+		/* Release the locks. */
+		UnlockRelationId(&srcrelid, AccessShareLock);
+		UnlockRelationId(&dstrelid, AccessShareLock);
+	}
+
+	list_free_deep(rnodelist);
+}
+
+/*
+ * Scan the pg_class table in the source database to identify the relations
+ * that need to be copied to the destination database.
+ *
+ * This is an exception to the usual rule that cross-database access is
+ * not possible. We can make it work here because we know that there are no
+ * connections to the source database and (since there can't be prepared
+ * transactions touching that database) no in-doubt tuples either. This
+ * means that we don't need to worry about pruning removing anything from
+ * under us, and we don't need to be too picky about our snapshot either.
+ * As long as it sees all previously-committed XIDs as committed and all
+ * aborted XIDs as aborted, we should be fine: nothing else is possible
+ * here.
+ *
+ * We can't rely on the relcache for anything here, because that only knows
+ * about the database to which we are connected, and can't handle access to
+ * other databases. That also means we can't rely on the heap scan
+ * infrastructure, which would be a bad idea anyway since it might try
+ * to do things like HOT pruning which we definitely can't do safely in
+ * a database to which we're not even connected.
+ */
+static List *
+ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
+{
+	RelFileNode rnode;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	Buffer		buf;
+	Oid			relfilenode;
+	Page		page;
+	List	   *rnodelist = NIL;
+	LockRelId	relid;
+	Relation	rel;
+	Snapshot	snapshot;
+	BufferAccessStrategy bstrategy;
+
+	/* Get pg_class relfilenode. */
+	relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+													  RelationRelationId);
+
+	/*
+	 * The system elsewhere assumes that we only read data for a relation
+	 * into shared_buffers while holding some sort of a lock on a relation,
+	 * so lock the source database's pg_class before we do anything else.
+	 */
+	relid.dbId = dbid;
+	relid.relId = RelationRelationId;
+	LockRelationId(&relid, AccessShareLock);
+
+	/* Prepare a RelFileNode for the pg_class relation. */
+	rnode.spcNode = tbid;
+	rnode.dbNode = dbid;
+	rnode.relNode = relfilenode;
+
+	/*
+	 * Create a fake relcache entry for the pg_class relation and get the
+	 * number of blocks.  Refer to the comments in CreateAndCopyRelationData()
+	 * for the rationale behind using the fake relcache entry.
+	 */
+	rel = CreateFakeRelcacheEntry(rnode);
+	nblocks = smgrnblocks(RelationGetSmgr(rel), MAIN_FORKNUM);
+	FreeFakeRelcacheEntry(rel);
+
+	/* Use a buffer access strategy since this is a bulk read operation. */
+	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	/*
+	 * As explained in the function header comments, we need a snapshot that
+	 * will see all committed transactions as committed, and our transaction
+	 * snapshot - or the active snapshot - might not be new enough for that,
+	 * but the return value of GetLatestSnapshot() should work fine.
+	 */
+	snapshot = GetLatestSnapshot();
+
+	/* Process the relation block by block. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
+										RBM_NORMAL, bstrategy, false);
+
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buf);
+		if (PageIsNew(page) || PageIsEmpty(page))
+		{
+			UnlockReleaseBuffer(buf);
+			continue;
+		}
+
+		/* Append relevant pg_class tuples for current page to rnodelist. */
+		rnodelist = ScanSourceDatabasePgClassPage(page, buf, tbid, dbid,
+												  srcpath, rnodelist,
+												  snapshot);
+
+		UnlockReleaseBuffer(buf);
+	}
+
+	/* Release relation lock. */
+	UnlockRelationId(&relid, AccessShareLock);
+
+	return rnodelist;
+}
+
+/*
+ * Scan one page of the source database's pg_class relation and add relevant
+ * entries to rnodelist. The return value is the updated list.
+ */
+static List *
+ScanSourceDatabasePgClassPage(Page page, Buffer buf, Oid tbid, Oid dbid,
+							  char *srcpath, List *rnodelist,
+							  Snapshot snapshot)
+{
+	BlockNumber		blkno = BufferGetBlockNumber(buf);
+	OffsetNumber	offnum;
+	OffsetNumber	maxoff;
+	HeapTupleData	tuple;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Loop over offsets. */
+	for (offnum = FirstOffsetNumber;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid;
+
+		itemid = PageGetItemId(page, offnum);
+
+		/* Nothing to do if slot is empty or already dead. */
+		if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid) ||
+			ItemIdIsRedirected(itemid))
+			continue;
+
+		Assert(ItemIdIsNormal(itemid));
+		ItemPointerSet(&(tuple.t_self), blkno, offnum);
+
+		/* Initialize a HeapTupleData structure. */
+		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+		tuple.t_len = ItemIdGetLength(itemid);
+		tuple.t_tableOid = RelationRelationId;
+
+		/* Skip tuples that are not visible to this snapshot. */
+		if (HeapTupleSatisfiesVisibility(&tuple, snapshot, buf))
+		{
+			CreateDBRelInfo *relinfo;
+
+			/*
+			 * ScanSourceDatabasePgClassTuple is in charge of constructing
+			 * a CreateDBRelInfo object for this tuple, but can also decide
+			 * that this tuple isn't something we need to copy. If we do need
+			 * to copy the relation, add it to the list.
+			 */
+			relinfo = ScanSourceDatabasePgClassTuple(&tuple, tbid, dbid,
+													 srcpath);
+			if (relinfo != NULL)
+				rnodelist = lappend(rnodelist, relinfo);
+		}
+	}
+
+	return rnodelist;
+}
+
+/*
+ * Decide whether a certain pg_class tuple represents something that
+ * needs to be copied from the source database to the destination database,
+ * and if so, construct a CreateDBRelInfo for it.
+ *
+ * Visibility checks are handled by the caller, so our job here is just
+ * to assess the data stored in the tuple.
+ */
+CreateDBRelInfo *
+ScanSourceDatabasePgClassTuple(HeapTupleData *tuple, Oid tbid, Oid dbid,
+							   char *srcpath)
+{
+	CreateDBRelInfo	   *relinfo;
+	Form_pg_class		classForm;
+	Oid					relfilenode = InvalidOid;
+
+	classForm = (Form_pg_class) GETSTRUCT(tuple);
+
+	/*
+	 * Return NULL if this object does not need to be copied.
+	 *
+	 * Shared objects don't need to be copied, because they are shared.
+	 * Objects without storage can't be copied, because there's nothing to
+	 * copy. Temporary relations don't need to be copied either, because
+	 * they are inaccessible outside of the session that created them,
+	 * which must be gone already, and couldn't connect to a different database
+	 * if it still existed. autovacuum will eventually remove the pg_class
+	 * entries as well.
+	 */
+	if (classForm->reltablespace == GLOBALTABLESPACE_OID ||
+		!RELKIND_HAS_STORAGE(classForm->relkind) ||
+		classForm->relpersistence == RELPERSISTENCE_TEMP)
+		return NULL;
+
+	/*
+	 * If relfilenode is valid then directly use it.  Otherwise, consult the
+	 * relmap.
+	 */
+	if (OidIsValid(classForm->relfilenode))
+		relfilenode = classForm->relfilenode;
+	else
+		relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+														  classForm->oid);
+
+	/* We must have a valid relfilenode oid. */
+	if (!OidIsValid(relfilenode))
+		elog(ERROR, "relation with OID %u does not have a valid relfilenode",
+			 classForm->oid);
+
+	/* Prepare a rel info element; the caller adds it to the list. */
+	relinfo = (CreateDBRelInfo *) palloc(sizeof(CreateDBRelInfo));
+	if (OidIsValid(classForm->reltablespace))
+		relinfo->rnode.spcNode = classForm->reltablespace;
+	else
+		relinfo->rnode.spcNode = tbid;
+
+	relinfo->rnode.dbNode = dbid;
+	relinfo->rnode.relNode = relfilenode;
+	relinfo->reloid = classForm->oid;
+
+	/* Temporary relations were rejected above. */
+	Assert(classForm->relpersistence != RELPERSISTENCE_TEMP);
+	relinfo->permanent =
+		(classForm->relpersistence == RELPERSISTENCE_PERMANENT);
+
+	return relinfo;
+}
+
+/*
+ * Create database directory and write out the PG_VERSION file in the database
+ * path.  If isRedo is true, it's okay for the database directory to exist
+ * already.
+ */
+static void
+CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
+{
+	int			fd;
+	int			nbytes;
+	char		versionfile[MAXPGPATH];
+	char		buf[16];
+
+	/*
+	 * Prepare version data before starting a critical section.
+	 *
+	 * Note that we don't have to copy this from the source database; there's
+	 * only one legal value.
+	 */
+	snprintf(buf, sizeof(buf), "%s\n", PG_MAJORVERSION);
+	nbytes = strlen(PG_MAJORVERSION) + 1;
+
+	/* If we are not in WAL replay then write the WAL. */
+	if (!isRedo)
+	{
+		xl_dbase_create_wal_log_rec xlrec;
+		XLogRecPtr	lsn;
+
+		START_CRIT_SECTION();
+
+		xlrec.db_id = dbid;
+		xlrec.tablespace_id = tsid;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec),
+						 sizeof(xl_dbase_create_wal_log_rec));
+
+		lsn = XLogInsert(RM_DBASE_ID, XLOG_DBASE_CREATE_WAL_LOG);
+
+		/* As always, WAL must hit the disk before the data update does. */
+		XLogFlush(lsn);
+	}
+
+	/* Create database directory. */
+	if (MakePGDirectory(dbpath) < 0)
+	{
+		/* Failure other than already exists or not in WAL replay? */
+		if (errno != EEXIST || !isRedo)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create directory \"%s\": %m", dbpath)));
+	}
+
+	/*
+	 * Create PG_VERSION file in the database path.  If the file already
+	 * exists and we are in WAL replay then try again to open it in write
+	 * mode.
+	 */
+	snprintf(versionfile, sizeof(versionfile), "%s/%s", dbpath, "PG_VERSION");
+
+	fd = OpenTransientFile(versionfile, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	if (fd < 0 && errno == EEXIST && isRedo)
+		fd = OpenTransientFile(versionfile, O_WRONLY | O_TRUNC | PG_BINARY);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m", versionfile)));
+
+	/* Write PG_MAJORVERSION in the PG_VERSION file. */
+	pgstat_report_wait_start(WAIT_EVENT_VERSION_FILE_WRITE);
+	errno = 0;
+	if ((int) write(fd, buf, nbytes) != nbytes)
+	{
+		/* If write didn't set errno, assume problem is no disk space. */
+		if (errno == 0)
+			errno = ENOSPC;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", versionfile)));
+	}
+	pgstat_report_wait_end();
+
+	/* Close the version file. */
+	CloseTransientFile(fd);
+
+	/* Critical section done. */
+	if (!isRedo)
+		END_CRIT_SECTION();
+}
+
+/*
+ * Create a new database using the FILE_COPY strategy.
+ *
+ * Copy each tablespace at the filesystem level, and log a single WAL record
+ * for each tablespace copied.  This requires a checkpoint before and after the
+ * copy, which may be expensive, but it does greatly reduce WAL generation
+ * if the copied database is large.
+ */
+static void
+CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dst_dboid, Oid src_tsid,
+							Oid dst_tsid)
+{
+	TableScanDesc scan;
+	Relation	rel;
+	HeapTuple	tuple;
+
+	/*
+	 * Force a checkpoint before starting the copy. This will force all dirty
+	 * buffers, including those of unlogged tables, out to disk, to ensure
+	 * source database is up-to-date on disk for the copy.
+	 * FlushDatabaseBuffers() would suffice for that, but we also want to
+	 * process any pending unlink requests. Otherwise, if a checkpoint
+	 * happened while we're copying files, a file might be deleted just when
+	 * we're about to copy it, causing the lstat() call in copydir() to fail
+	 * with ENOENT.
+	 */
+	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
+					  CHECKPOINT_WAIT | CHECKPOINT_FLUSH_ALL);
+
+	/*
+	 * Iterate through all tablespaces of the template database, and copy each
+	 * one to the new database.
+	 */
+	rel = table_open(TableSpaceRelationId, AccessShareLock);
+	scan = table_beginscan_catalog(rel, 0, NULL);
+	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	{
+		Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
+		Oid			srctablespace = spaceform->oid;
+		Oid			dsttablespace;
+		char	   *srcpath;
+		char	   *dstpath;
+		struct stat st;
+
+		/* No need to copy global tablespace */
+		if (srctablespace == GLOBALTABLESPACE_OID)
+			continue;
+
+		srcpath = GetDatabasePath(src_dboid, srctablespace);
+
+		if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
+			directory_is_empty(srcpath))
+		{
+			/* Assume we can ignore it */
+			pfree(srcpath);
+			continue;
+		}
+
+		if (srctablespace == src_tsid)
+			dsttablespace = dst_tsid;
+		else
+			dsttablespace = srctablespace;
+
+		dstpath = GetDatabasePath(dst_dboid, dsttablespace);
+
+		/*
+		 * Copy this subdirectory to the new location
+		 *
+		 * We don't need to copy subdirectories
+		 */
+		copydir(srcpath, dstpath, false);
+
+		/* Record the filesystem change in XLOG */
+		{
+			xl_dbase_create_file_copy_rec xlrec;
+
+			xlrec.db_id = dst_dboid;
+			xlrec.tablespace_id = dsttablespace;
+			xlrec.src_db_id = src_dboid;
+			xlrec.src_tablespace_id = srctablespace;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec,
+							 sizeof(xl_dbase_create_file_copy_rec));
+
+			(void) XLogInsert(RM_DBASE_ID,
+							  XLOG_DBASE_CREATE_FILE_COPY | XLR_SPECIAL_REL_UPDATE);
+		}
+	}
+	table_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	/*
+	 * We force a checkpoint before committing.  This effectively means that
+	 * committed XLOG_DBASE_CREATE_FILE_COPY operations will never need to be
+	 * replayed (at least not in ordinary crash recovery; we still have to
+	 * make the XLOG entry for the benefit of PITR operations). This avoids
+	 * two nasty scenarios:
+	 *
+	 * #1: When PITR is off, we don't XLOG the contents of newly created
+	 * indexes; therefore the drop-and-recreate-whole-directory behavior of
+	 * DBASE_CREATE replay would lose such indexes.
+	 *
+	 * #2: Since we have to recopy the source database during DBASE_CREATE
+	 * replay, we run the risk of copying changes in it that were committed
+	 * after the original CREATE DATABASE command but before the system crash
+	 * that led to the replay.  This is at least unexpected and at worst could
+	 * lead to inconsistencies, eg duplicate table names.
+	 *
+	 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
+	 *
+	 * In PITR replay, the first of these isn't an issue, and the second is
+	 * only a risk if the CREATE DATABASE and subsequent template database
+	 * change both occur while a base backup is being taken. There doesn't
+	 * seem to be much we can do about that except document it as a
+	 * limitation.
+	 *
+	 * See CreateDatabaseUsingWalLog() for a less cheesy CREATE DATABASE
+	 * strategy that avoids these problems.
+	 */
+	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+}
 
 /*
  * CREATE DATABASE
@@ -101,8 +663,6 @@ static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
 Oid
 createdb(ParseState *pstate, const CreatedbStmt *stmt)
 {
-	TableScanDesc scan;
-	Relation	rel;
 	Oid			src_dboid;
 	Oid			src_owner;
 	int			src_encoding = -1;
@@ -137,6 +697,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	DefElem    *dallowconnections = NULL;
 	DefElem    *dconnlimit = NULL;
 	DefElem    *dcollversion = NULL;
+	DefElem    *dstrategy = NULL;
 	char	   *dbname = stmt->dbname;
 	char	   *dbowner = NULL;
 	const char *dbtemplate = NULL;
@@ -152,6 +713,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	char	   *dbcollversion = NULL;
 	int			notherbackends;
 	int			npreparedxacts;
+	CreateDBStrategy dbstrategy = CREATEDB_WAL_LOG;
 	createdb_failure_params fparms;
 
 	/* Extract options from the statement node tree */
@@ -269,6 +831,12 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 						(errcode(ERRCODE_INVALID_PARAMETER_VALUE)),
 						errmsg("OIDs less than %u are reserved for system objects", FirstNormalObjectId));
 		}
+		else if (strcmp(defel->defname, "strategy") == 0)
+		{
+			if (dstrategy)
+				errorConflictingDefElem(defel, pstate);
+			dstrategy = defel;
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -413,6 +981,23 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 							dbtemplate)));
 	}
 
+	/* Validate the database creation strategy. */
+	if (dstrategy && dstrategy->arg)
+	{
+		char	   *strategy;
+
+		strategy = defGetString(dstrategy);
+		if (strcmp(strategy, "wal_log") == 0)
+			dbstrategy = CREATEDB_WAL_LOG;
+		else if (strcmp(strategy, "file_copy") == 0)
+			dbstrategy = CREATEDB_FILE_COPY;
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("invalid create database strategy \"%s\"", strategy),
+					 errhint("Valid strategies are \"wal_log\" and \"file_copy\".")));
+	}
+
 	/* If encoding or locales are defaulted, use source's setting */
 	if (encoding < 0)
 		encoding = src_encoding;
@@ -753,17 +1338,16 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	InvokeObjectPostCreateHook(DatabaseRelationId, dboid, 0);
 
 	/*
-	 * Force a checkpoint before starting the copy. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, if a checkpoint
-	 * happened while we're copying files, a file might be deleted just when
-	 * we're about to copy it, causing the lstat() call in copydir() to fail
-	 * with ENOENT.
+	 * Acquire a lock on the target database, even though it is new and no
+	 * one else should be able to access it.  With the WAL_LOG strategy we
+	 * access the relation pages through shared buffers, and as a general
+	 * principle we should hold the database lock and the relation lock
+	 * before accessing any shared buffers.  The individual relation-level
+	 * locks are acquired in CreateDatabaseUsingWalLog() when reading pages
+	 * from shared buffers.
 	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
+	if (dbstrategy == CREATEDB_WAL_LOG)
+		LockSharedObject(DatabaseRelationId, dboid, 0, AccessShareLock);
 
 	/*
 	 * Once we start copying subdirectories, we need to be able to clean 'em
@@ -774,101 +1358,24 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	 */
 	fparms.src_dboid = src_dboid;
 	fparms.dest_dboid = dboid;
+	fparms.strategy = dbstrategy;
+
 	PG_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
 		/*
-		 * Iterate through all tablespaces of the template database, and copy
-		 * each one to the new database.
-		 */
-		rel = table_open(TableSpaceRelationId, AccessShareLock);
-		scan = table_beginscan_catalog(rel, 0, NULL);
-		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
-		{
-			Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
-			Oid			srctablespace = spaceform->oid;
-			Oid			dsttablespace;
-			char	   *srcpath;
-			char	   *dstpath;
-			struct stat st;
-
-			/* No need to copy global tablespace */
-			if (srctablespace == GLOBALTABLESPACE_OID)
-				continue;
-
-			srcpath = GetDatabasePath(src_dboid, srctablespace);
-
-			if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
-				directory_is_empty(srcpath))
-			{
-				/* Assume we can ignore it */
-				pfree(srcpath);
-				continue;
-			}
-
-			if (srctablespace == src_deftablespace)
-				dsttablespace = dst_deftablespace;
-			else
-				dsttablespace = srctablespace;
-
-			dstpath = GetDatabasePath(dboid, dsttablespace);
-
-			/*
-			 * Copy this subdirectory to the new location
-			 *
-			 * We don't need to copy subdirectories
-			 */
-			copydir(srcpath, dstpath, false);
-
-			/* Record the filesystem change in XLOG */
-			{
-				xl_dbase_create_rec xlrec;
-
-				xlrec.db_id = dboid;
-				xlrec.tablespace_id = dsttablespace;
-				xlrec.src_db_id = src_dboid;
-				xlrec.src_tablespace_id = srctablespace;
-
-				XLogBeginInsert();
-				XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-				(void) XLogInsert(RM_DBASE_ID,
-								  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-			}
-		}
-		table_endscan(scan);
-		table_close(rel, AccessShareLock);
-
-		/*
-		 * We force a checkpoint before committing.  This effectively means
-		 * that committed XLOG_DBASE_CREATE operations will never need to be
-		 * replayed (at least not in ordinary crash recovery; we still have to
-		 * make the XLOG entry for the benefit of PITR operations). This
-		 * avoids two nasty scenarios:
-		 *
-		 * #1: When PITR is off, we don't XLOG the contents of newly created
-		 * indexes; therefore the drop-and-recreate-whole-directory behavior
-		 * of DBASE_CREATE replay would lose such indexes.
-		 *
-		 * #2: Since we have to recopy the source database during DBASE_CREATE
-		 * replay, we run the risk of copying changes in it that were
-		 * committed after the original CREATE DATABASE command but before the
-		 * system crash that led to the replay.  This is at least unexpected
-		 * and at worst could lead to inconsistencies, eg duplicate table
-		 * names.
-		 *
-		 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
-		 *
-		 * In PITR replay, the first of these isn't an issue, and the second
-		 * is only a risk if the CREATE DATABASE and subsequent template
-		 * database change both occur while a base backup is being taken.
-		 * There doesn't seem to be much we can do about that except document
-		 * it as a limitation.
-		 *
-		 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
-		 * we can avoid this.
+		 * If the user has asked to create a database with WAL_LOG strategy
+		 * then call CreateDatabaseUsingWalLog, which will copy the database
+		 * at the block level and it will WAL log each copied block.
+		 * Otherwise, call CreateDatabaseUsingFileCopy that will copy the
+		 * database file by file.
 		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+		if (dbstrategy == CREATEDB_WAL_LOG)
+			CreateDatabaseUsingWalLog(src_dboid, dboid, src_deftablespace,
+									  dst_deftablespace);
+		else
+			CreateDatabaseUsingFileCopy(src_dboid, dboid, src_deftablespace,
+										dst_deftablespace);
 
 		/*
 		 * Close pg_database, but keep lock till commit.
@@ -955,6 +1462,25 @@ createdb_failure_callback(int code, Datum arg)
 	createdb_failure_params *fparms = (createdb_failure_params *) DatumGetPointer(arg);
 
 	/*
+	 * If we were copying the database at the block level, drop any pages of
+	 * the destination database that are in the shared buffer cache, and tell
+	 * the checkpointer to forget any pending fsync and unlink requests for
+	 * files in the database.  The reasoning is the same as explained in
+	 * dropdb().  Unlike dropdb(), we don't need to call
+	 * pgstat_drop_database(), because the database has not been created yet
+	 * and so there should not be any stats for it.
+	 */
+	if (fparms->strategy == CREATEDB_WAL_LOG)
+	{
+		DropDatabaseBuffers(fparms->dest_dboid);
+		ForgetDatabaseSyncRequests(fparms->dest_dboid);
+
+		/* Release lock on the target database. */
+		UnlockSharedObject(DatabaseRelationId, fparms->dest_dboid, 0,
+						   AccessShareLock);
+	}
+
+	/*
 	 * Release lock on source database before doing recursive remove. This is
 	 * not essential but it seems desirable to release the lock as soon as
 	 * possible.
@@ -1478,7 +2004,7 @@ movedb(const char *dbname, const char *tblspcname)
 		 * Record the filesystem change in XLOG
 		 */
 		{
-			xl_dbase_create_rec xlrec;
+			xl_dbase_create_file_copy_rec xlrec;
 
 			xlrec.db_id = db_id;
 			xlrec.tablespace_id = dst_tblspcoid;
@@ -1486,10 +2012,11 @@ movedb(const char *dbname, const char *tblspcname)
 			xlrec.src_tablespace_id = src_tblspcoid;
 
 			XLogBeginInsert();
-			XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
+			XLogRegisterData((char *) &xlrec,
+							 sizeof(xl_dbase_create_file_copy_rec));
 
 			(void) XLogInsert(RM_DBASE_ID,
-							  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
+							  XLOG_DBASE_CREATE_FILE_COPY | XLR_SPECIAL_REL_UPDATE);
 		}
 
 		/*
@@ -1525,9 +2052,10 @@ movedb(const char *dbname, const char *tblspcname)
 
 		/*
 		 * Force another checkpoint here.  As in CREATE DATABASE, this is to
-		 * ensure that we don't have to replay a committed XLOG_DBASE_CREATE
-		 * operation, which would cause us to lose any unlogged operations
-		 * done in the new DB tablespace before the next checkpoint.
+		 * ensure that we don't have to replay a committed
+		 * XLOG_DBASE_CREATE_FILE_COPY operation, which would cause us to lose
+		 * any unlogged operations done in the new DB tablespace before the
+		 * next checkpoint.
 		 */
 		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
 
@@ -2478,9 +3006,10 @@ dbase_redo(XLogReaderState *record)
 	/* Backup blocks are not used in dbase records */
 	Assert(!XLogRecHasAnyBlockRefs(record));
 
-	if (info == XLOG_DBASE_CREATE)
+	if (info == XLOG_DBASE_CREATE_FILE_COPY)
 	{
-		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
+		xl_dbase_create_file_copy_rec *xlrec =
+		(xl_dbase_create_file_copy_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
 		struct stat st;
@@ -2515,6 +3044,18 @@ dbase_redo(XLogReaderState *record)
 		 */
 		copydir(src_path, dst_path, false);
 	}
+	else if (info == XLOG_DBASE_CREATE_WAL_LOG)
+	{
+		xl_dbase_create_wal_log_rec *xlrec =
+		(xl_dbase_create_wal_log_rec *) XLogRecGetData(record);
+		char	   *dbpath;
+
+		dbpath = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+
+		/* Create the database directory with the version file. */
+		CreateDirAndVersionFile(dbpath, xlrec->db_id, xlrec->tablespace_id,
+								true);
+	}
 	else if (info == XLOG_DBASE_DROP)
 	{
 		xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) XLogRecGetData(record);
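
The XLOG_DBASE_CREATE_WAL_LOG redo branch above calls CreateDirAndVersionFile() with isRedo = true, so replay must tolerate a directory and PG_VERSION file left behind by an earlier attempt. A minimal standalone sketch of that idempotent-create pattern follows, using plain POSIX calls instead of MakePGDirectory() and OpenTransientFile(). The function name, buffer sizes, and return convention are assumptions made for illustration; the sketch writes no WAL and has no critical section.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for CreateDirAndVersionFile(): create dirpath and
 * write "<version>\n" into dirpath/PG_VERSION.  When is_redo is true, an
 * existing directory or file is tolerated and the file is rewritten.
 * Returns 0 on success, -1 on failure with errno set.
 */
static int
sketch_create_dir_and_version(const char *dirpath, const char *version,
							  bool is_redo)
{
	char		path[1024];
	char		buf[32];
	int			len;
	int			pathlen;
	int			fd;

	assert(dirpath != NULL && version != NULL);

	/* Prepare the file contents up front, as the patch does. */
	len = snprintf(buf, sizeof(buf), "%s\n", version);
	if (len < 0 || (size_t) len >= sizeof(buf))
	{
		errno = EINVAL;
		return -1;
	}

	/* An existing directory is an error unless we are replaying. */
	if (mkdir(dirpath, 0700) < 0 && (errno != EEXIST || !is_redo))
		return -1;

	pathlen = snprintf(path, sizeof(path), "%s/PG_VERSION", dirpath);
	if (pathlen < 0 || (size_t) pathlen >= sizeof(path))
	{
		errno = ENAMETOOLONG;
		return -1;
	}

	/* Try an exclusive create first; on replay, fall back to truncating. */
	fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (fd < 0 && errno == EEXIST && is_redo)
		fd = open(path, O_WRONLY | O_TRUNC);
	if (fd < 0)
		return -1;

	errno = 0;
	if (write(fd, buf, (size_t) len) != (ssize_t) len)
	{
		/* If write() did not set errno, assume the disk is full. */
		int			save_errno = (errno != 0) ? errno : ENOSPC;

		close(fd);
		errno = save_errno;
		return -1;
	}

	return close(fd);
}
```

Calling the function twice with is_redo = false fails the second time with EEXIST, while a call with is_redo = true succeeds and rewrites the file, which is the behavior replay relies on.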
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c6..771a064 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -38,6 +38,7 @@
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
 #include "executor/instrument.h"
 #include "lib/binaryheap.h"
 #include "miscadmin.h"
@@ -486,6 +487,9 @@ static void FindAndDropRelFileNodeBuffers(RelFileNode rnode,
 										  ForkNumber forkNum,
 										  BlockNumber nForkBlock,
 										  BlockNumber firstDelBlock);
+static void RelationCopyStorageUsingBuffer(Relation src, Relation dst,
+										   ForkNumber forkNum,
+										   bool permanent);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int	rnode_comparator(const void *p1, const void *p2);
@@ -772,23 +776,23 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
  * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
  *		a relcache entry for the relation.
  *
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay.  If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
+ * Pass permanent = true for a RELPERSISTENCE_PERMANENT relation, and
+ * permanent = false for a RELPERSISTENCE_UNLOGGED relation. This function
+ * cannot be used for temporary relations (and making that work might be
+ * difficult, unless we only want to read temporary relations for our own
+ * BackendId).
  */
 Buffer
 ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
-						  BufferAccessStrategy strategy)
+						  BufferAccessStrategy strategy, bool permanent)
 {
 	bool		hit;
 
 	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
 
-	Assert(InRecovery);
-
-	return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
+	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -3677,6 +3681,158 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
 }
 
 /* ---------------------------------------------------------------------
+ *		RelationCopyStorageUsingBuffer
+ *
+ *		Copy fork's data using bufmgr.  Same as RelationCopyStorage but instead
+ *		of using smgrread and smgrextend this will copy using bufmgr APIs.
+ *
+ *		Refer to the comments atop CreateAndCopyRelationData() for details
+ *		about the 'permanent' parameter.
+ * --------------------------------------------------------------------
+ */
+static void
+RelationCopyStorageUsingBuffer(Relation src, Relation dst, ForkNumber forkNum,
+							   bool permanent)
+{
+	Buffer		srcBuf;
+	Buffer		dstBuf;
+	Page		srcPage;
+	Page		dstPage;
+	bool		use_wal;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	BufferAccessStrategy bstrategy_src;
+	BufferAccessStrategy bstrategy_dst;
+
+	/*
+	 * In general, we want to write WAL whenever wal_level > 'minimal', but
+	 * we can skip it when copying any fork of an unlogged relation other
+	 * than the init fork.
+	 */
+	use_wal = XLogIsNeeded() && (permanent || forkNum == INIT_FORKNUM);
+
+	/* Get number of blocks in the source relation. */
+	nblocks = smgrnblocks(RelationGetSmgr(src), forkNum);
+
+	/* Nothing to copy; just return. */
+	if (nblocks == 0)
+		return;
+
+	/* This is a bulk operation, so use buffer access strategies. */
+	bstrategy_src = GetAccessStrategy(BAS_BULKREAD);
+	bstrategy_dst = GetAccessStrategy(BAS_BULKWRITE);
+
+	/* Iterate over each block of the source relation file. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		/* Read block from source relation. */
+		srcBuf = ReadBufferWithoutRelcache(src->rd_node, forkNum, blkno,
+										   RBM_NORMAL, bstrategy_src,
+										   permanent);
+		srcPage = BufferGetPage(srcBuf);
+
+		/*
+		 * Copy new and empty pages too: the destination is extended with
+		 * P_NEW, so skipping a page would renumber every later block.
+		 */
+
+		/* Use P_NEW to extend the destination relation. */
+		dstBuf = ReadBufferWithoutRelcache(dst->rd_node, forkNum, P_NEW,
+										   RBM_NORMAL, bstrategy_dst,
+										   permanent);
+		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
+
+		START_CRIT_SECTION();
+
+		/* Copy page data from the source to the destination. */
+		dstPage = BufferGetPage(dstBuf);
+		memcpy(dstPage, srcPage, BLCKSZ);
+		MarkBufferDirty(dstBuf);
+
+		/* WAL-log the copied page. */
+		if (use_wal)
+			log_newpage_buffer(dstBuf, true);
+
+		END_CRIT_SECTION();
+
+		UnlockReleaseBuffer(dstBuf);
+		ReleaseBuffer(srcBuf);
+	}
+}
+
+/* ---------------------------------------------------------------------
+ *		CreateAndCopyRelationData
+ *
+ *		Create destination relation storage and copy all forks from the
+ *		source relation to the destination.
+ *
+ *		Pass permanent as true for permanent relations and false for
+ *		unlogged relations.  Currently this API is not supported for
+ *		temporary relations.
+ * --------------------------------------------------------------------
+ */
+void
+CreateAndCopyRelationData(RelFileNode src_rnode, RelFileNode dst_rnode,
+						  bool permanent)
+{
+	Relation		src_rel;
+	Relation		dst_rel;
+	char			relpersistence;
+
+	/* Set the relpersistence. */
+	relpersistence = permanent ?
+		RELPERSISTENCE_PERMANENT : RELPERSISTENCE_UNLOGGED;
+
+	/*
+	 * Prepare fake relcache entries for the source and the destination.  It
+	 * is safe to use fake relcache entries here because we are only going to
+	 * access the fields related to physical storage.  We use them only
+	 * because it isn't safe to hold on to smgr pointers; for details, see the
+	 * comments atop RelationGetSmgr.
+	 */
+	src_rel = CreateFakeRelcacheEntry(src_rnode);
+	dst_rel = CreateFakeRelcacheEntry(dst_rnode);
+
+	/*
+	 * Create and copy all forks of the relation.
+	 *
+	 * NOTE: any conflict in relfilenode value will be caught in
+	 * RelationCreateStorage().
+	 */
+	RelationCreateStorage(dst_rnode, relpersistence);
+
+	/* copy main fork. */
+	RelationCopyStorageUsingBuffer(src_rel, dst_rel, MAIN_FORKNUM, permanent);
+
+	/* copy those extra forks that exist */
+	for (ForkNumber forkNum = MAIN_FORKNUM + 1;
+		 forkNum <= MAX_FORKNUM; forkNum++)
+	{
+		if (smgrexists(RelationGetSmgr(src_rel), forkNum))
+		{
+			smgrcreate(RelationGetSmgr(dst_rel), forkNum, false);
+
+			/*
+			 * WAL log creation if the relation is persistent, or this is the
+			 * init fork of an unlogged relation.
+			 */
+			if (permanent || forkNum == INIT_FORKNUM)
+				log_smgrcreate(&dst_rnode, forkNum);
+
+			/* Copy a fork's data, block by block. */
+			RelationCopyStorageUsingBuffer(src_rel, dst_rel, forkNum,
+										   permanent);
+		}
+	}
+
+	/* Release fake relcache entries. */
+	FreeFakeRelcacheEntry(src_rel);
+	FreeFakeRelcacheEntry(dst_rel);
+}
+
+/* ---------------------------------------------------------------------
  *		FlushDatabaseBuffers
  *
  *		This function writes all dirty pages of a database out to disk
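
The copy loop in RelationCopyStorageUsingBuffer() can be modelled in a few lines without the buffer manager. The sketch below is an in-memory simplification under stated assumptions: SketchRel, SKETCH_BLCKSZ, and both functions are hypothetical names, "WAL logging" is reduced to a counter, and there are no pins, locks, or critical sections. It copies every source block in order, which keeps destination block numbers equal to source block numbers when the destination grows by appending, and it applies the same rule as the patch for deciding whether a page is logged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_BLCKSZ		32	/* hypothetical; not PostgreSQL's BLCKSZ */
#define SKETCH_MAXBLOCKS	8	/* hypothetical capacity of one fork */

typedef enum
{
	SKETCH_MAIN_FORK,
	SKETCH_INIT_FORK
} SketchFork;

typedef struct
{
	size_t		nblocks;
	unsigned char blocks[SKETCH_MAXBLOCKS][SKETCH_BLCKSZ];
} SketchRel;

/*
 * Same shape as the patch's rule: log when WAL is needed at all and the
 * relation is permanent, or when this is the init fork of an unlogged one.
 */
static bool
sketch_use_wal(bool wal_needed, bool permanent, SketchFork fork)
{
	return wal_needed && (permanent || fork == SKETCH_INIT_FORK);
}

/*
 * Copy all blocks of src into an empty dst by appending, and return how
 * many pages would have been WAL-logged.
 */
static size_t
sketch_copy_fork(const SketchRel *src, SketchRel *dst,
				 bool wal_needed, bool permanent, SketchFork fork)
{
	bool		use_wal = sketch_use_wal(wal_needed, permanent, fork);
	size_t		nlogged = 0;

	assert(src->nblocks <= SKETCH_MAXBLOCKS && dst->nblocks == 0);

	for (size_t blkno = 0; blkno < src->nblocks; blkno++)
	{
		/* Append a block to the destination, in the spirit of P_NEW. */
		size_t		dstblk = dst->nblocks++;

		memcpy(dst->blocks[dstblk], src->blocks[blkno], SKETCH_BLCKSZ);

		/* Stand-in for log_newpage_buffer(): just count logged pages. */
		if (use_wal)
			nlogged++;
	}

	return nlogged;
}
```

With a three-block permanent relation, the sketch logs three pages. The main fork of an unlogged relation logs none, and its init fork is logged in full.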
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 5ae52dd..1543da6 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -176,6 +176,34 @@ ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode)
 }
 
 /*
+ *		LockRelationId
+ *
+ * Lock, given a LockRelId.  Same as LockRelationOid, but takes a LockRelId
+ * as input.
+ */
+void
+LockRelationId(LockRelId *relid, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+	LOCALLOCK  *locallock;
+	LockAcquireResult res;
+
+	SET_LOCKTAG_RELATION(tag, relid->dbId, relid->relId);
+
+	res = LockAcquireExtended(&tag, lockmode, false, false, true, &locallock);
+
+	/*
+	 * Now that we have the lock, check for invalidation messages; see notes
+	 * in LockRelationOid.
+	 */
+	if (res != LOCKACQUIRE_ALREADY_CLEAR)
+	{
+		AcceptInvalidationMessages();
+		MarkLockClear(locallock);
+	}
+}
+
+/*
  *		UnlockRelationId
  *
  * Unlock, given a LockRelId.  This is preferred over UnlockRelationOid
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index ff46a0e..1c8aba4 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -705,6 +705,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_TWOPHASE_FILE_WRITE:
 			event_name = "TwophaseFileWrite";
 			break;
+		case WAIT_EVENT_VERSION_FILE_WRITE:
+			event_name = "VersionFileWrite";
+			break;
 		case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ:
 			event_name = "WALSenderTimelineHistoryRead";
 			break;
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 4d0718f..dee3387 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -252,6 +252,63 @@ RelationMapFilenodeToOid(Oid filenode, bool shared)
 }
 
 /*
+ * RelationMapOidToFilenodeForDatabase
+ *
+ * Like RelationMapOidToFilenode, but reads the mapping from the indicated
+ * path instead of using the one for the current database.
+ */
+Oid
+RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId)
+{
+	RelMapFile	map;
+	int			i;
+
+	/* Read the relmap file from the source database. */
+	read_relmap_file(&map, dbpath, false, ERROR);
+
+	/* Iterate over the relmap entries to find the input relation OID. */
+	for (i = 0; i < map.num_mappings; i++)
+	{
+		if (relationId == map.mappings[i].mapoid)
+			return map.mappings[i].mapfilenode;
+	}
+
+	return InvalidOid;
+}
+
+/*
+ * RelationMapCopy
+ *
+ * Copy relmapfile from source db path to the destination db path and WAL log
+ * the operation. This is intended for use in creating a new relmap file
+ * for a database that doesn't have one yet, not for replacing an existing
+ * relmap file.
+ */
+void
+RelationMapCopy(Oid dbid, Oid tsid, char *srcdbpath, char *dstdbpath)
+{
+	RelMapFile map;
+
+	/*
+	 * Read the relmap file from the source database.
+	 */
+	read_relmap_file(&map, srcdbpath, false, ERROR);
+
+	/*
+	 * Write the same data into the destination database's relmap file.
+	 *
+	 * No sinval is needed because no one can be connected to the destination
+	 * database yet. For the same reason, there is no need to acquire
+	 * RelationMappingLock.
+	 *
+	 * There's no point in trying to preserve files here. The new database
+	 * isn't usable yet anyway, and won't ever be if we can't install a
+	 * relmap file.
+	 */
+	write_relmap_file(&map, true, false, false, dbid, tsid, dstdbpath);
+}
+
+/*
  * RelationMapUpdateMap
  *
  * Install a new relfilenode mapping for the specified relation.
@@ -1031,6 +1088,13 @@ relmap_redo(XLogReaderState *record)
 		 *
 		 * There shouldn't be anyone else updating relmaps during WAL replay,
 		 * but grab the lock to interlock against load_relmap_file().
+		 *
+		 * Note that we use the same WAL record for updating the relmap of
+		 * an existing database as we do for creating a new database. In
+		 * the latter case, taking the relmap lock and sending sinval messages
+		 * is unnecessary, but harmless. If we wanted to avoid it, we could
+		 * add a flag to the WAL record to indicate which operation is being
+		 * performed.
 		 */
 		LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
 		write_relmap_file(&newmap, false, true, false,
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 7cfa169..bd1ec42 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -370,7 +370,7 @@ extractPageInfo(XLogReaderState *record)
 
 	/* Is this a special record type that I recognize? */
 
-	if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_CREATE)
+	if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_CREATE_FILE_COPY)
 	{
 		/*
 		 * New databases can be safely ignored. It won't be present in the
@@ -382,6 +382,13 @@ extractPageInfo(XLogReaderState *record)
 		 * overwriting the database created in the target system.
 		 */
 	}
+	else if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_CREATE_WAL_LOG)
+	{
+		/*
+		 * New databases can be safely ignored. It won't be present in the
+		 * source system, so it will be deleted.
+		 */
+	}
 	else if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_DROP)
 	{
 		/*
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 5c06459..baabf98 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2776,13 +2776,15 @@ psql_completion(const char *text, int start, int end)
 	/* CREATE DATABASE */
 	else if (Matches("CREATE", "DATABASE", MatchAny))
 		COMPLETE_WITH("OWNER", "TEMPLATE", "ENCODING", "TABLESPACE",
-					  "IS_TEMPLATE",
+					  "IS_TEMPLATE", "STRATEGY",
 					  "ALLOW_CONNECTIONS", "CONNECTION LIMIT",
 					  "LC_COLLATE", "LC_CTYPE", "LOCALE", "OID",
 					  "LOCALE_PROVIDER", "ICU_LOCALE");
 
 	else if (Matches("CREATE", "DATABASE", MatchAny, "TEMPLATE"))
 		COMPLETE_WITH_QUERY(Query_for_list_of_template_databases);
+	else if (Matches("CREATE", "DATABASE", MatchAny, "STRATEGY"))
+		COMPLETE_WITH("WAL_LOG", "FILE_COPY");
 
 	/* CREATE DOMAIN */
 	else if (Matches("CREATE", "DOMAIN", MatchAny))
diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c
index 6f612ab..0bffa2f 100644
--- a/src/bin/scripts/createdb.c
+++ b/src/bin/scripts/createdb.c
@@ -34,6 +34,7 @@ main(int argc, char *argv[])
 		{"tablespace", required_argument, NULL, 'D'},
 		{"template", required_argument, NULL, 'T'},
 		{"encoding", required_argument, NULL, 'E'},
+		{"strategy", required_argument, NULL, 'S'},
 		{"lc-collate", required_argument, NULL, 1},
 		{"lc-ctype", required_argument, NULL, 2},
 		{"locale", required_argument, NULL, 'l'},
@@ -60,6 +61,7 @@ main(int argc, char *argv[])
 	char	   *tablespace = NULL;
 	char	   *template = NULL;
 	char	   *encoding = NULL;
+	char	   *strategy = NULL;
 	char	   *lc_collate = NULL;
 	char	   *lc_ctype = NULL;
 	char	   *locale = NULL;
@@ -77,7 +79,7 @@ main(int argc, char *argv[])
 
 	handle_help_version_opts(argc, argv, "createdb", help);
 
-	while ((c = getopt_long(argc, argv, "h:p:U:wWeO:D:T:E:l:", long_options, &optindex)) != -1)
+	while ((c = getopt_long(argc, argv, "h:p:U:wWeO:D:T:E:l:S:", long_options, &optindex)) != -1)
 	{
 		switch (c)
 		{
@@ -111,6 +113,9 @@ main(int argc, char *argv[])
 			case 'E':
 				encoding = pg_strdup(optarg);
 				break;
+			case 'S':
+				strategy = pg_strdup(optarg);
+				break;
 			case 1:
 				lc_collate = pg_strdup(optarg);
 				break;
@@ -215,6 +220,8 @@ main(int argc, char *argv[])
 		appendPQExpBufferStr(&sql, " ENCODING ");
 		appendStringLiteralConn(&sql, encoding, conn);
 	}
+	if (strategy)
+		appendPQExpBuffer(&sql, " STRATEGY %s", fmtId(strategy));
 	if (template)
 		appendPQExpBuffer(&sql, " TEMPLATE %s", fmtId(template));
 	if (lc_collate)
@@ -294,6 +301,7 @@ help(const char *progname)
 	printf(_("      --locale-provider={libc|icu}\n"
 			 "                               locale provider for the database's default collation\n"));
 	printf(_("  -O, --owner=OWNER            database user to own the new database\n"));
+	printf(_("  -S, --strategy=STRATEGY      database creation strategy wal_log or file_copy\n"));
 	printf(_("  -T, --template=TEMPLATE      template database to copy\n"));
 	printf(_("  -V, --version                output version information, then exit\n"));
 	printf(_("  -?, --help                   show this help, then exit\n"));
diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index 35deec9..44d3c6d 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -104,4 +104,24 @@ $node->command_checks_all(
 	],
 	'createdb with incorrect --lc-ctype');
 
+$node->command_checks_all(
+	[ 'createdb', '--strategy', "foo", 'foobar2' ],
+	1,
+	[qr/^$/],
+	[
+		qr/^createdb: error: database creation failed: ERROR:  invalid create database strategy|^createdb: error: database creation failed: ERROR:  invalid create database strategy foo/s
+	],
+	'createdb with incorrect --strategy');
+
+# Check database creation strategy
+$node->issues_sql_like(
+	[ 'createdb', '-T', 'foobar2', 'foobar4', '-S', 'wal_log'],
+	qr/statement: CREATE DATABASE foobar4 STRATEGY wal_log TEMPLATE foobar2/,
+	'create database with WAL_LOG strategy');
+
+$node->issues_sql_like(
+	[ 'createdb', '-T', 'foobar2', 'foobar5', '-S', 'file_copy'],
+	qr/statement: CREATE DATABASE foobar5 STRATEGY file_copy TEMPLATE foobar2/,
+	'create database with FILE_COPY strategy');
+
 done_testing();
diff --git a/src/include/commands/dbcommands_xlog.h b/src/include/commands/dbcommands_xlog.h
index 593a857..0ee2452 100644
--- a/src/include/commands/dbcommands_xlog.h
+++ b/src/include/commands/dbcommands_xlog.h
@@ -18,17 +18,32 @@
 #include "lib/stringinfo.h"
 
 /* record types */
-#define XLOG_DBASE_CREATE		0x00
-#define XLOG_DBASE_DROP			0x10
+#define XLOG_DBASE_CREATE_FILE_COPY		0x00
+#define XLOG_DBASE_CREATE_WAL_LOG		0x10
+#define XLOG_DBASE_DROP					0x20
 
-typedef struct xl_dbase_create_rec
+/*
+ * Single WAL record for an entire CREATE DATABASE operation. This is used
+ * by the FILE_COPY strategy.
+ */
+typedef struct xl_dbase_create_file_copy_rec
 {
-	/* Records copying of a single subdirectory incl. contents */
 	Oid			db_id;
 	Oid			tablespace_id;
 	Oid			src_db_id;
 	Oid			src_tablespace_id;
-} xl_dbase_create_rec;
+} xl_dbase_create_file_copy_rec;
+
+/*
+ * WAL record for the beginning of a CREATE DATABASE operation, when the
+ * WAL_LOG strategy is used. Each individual block will be logged separately
+ * afterward.
+ */
+typedef struct xl_dbase_create_wal_log_rec
+{
+	Oid			db_id;
+	Oid			tablespace_id;
+} xl_dbase_create_wal_log_rec;
 
 typedef struct xl_dbase_drop_rec
 {
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index dd01841..a6b657f 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -184,7 +184,8 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BufferAccessStrategy strategy);
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 										ForkNumber forkNum, BlockNumber blockNum,
-										ReadBufferMode mode, BufferAccessStrategy strategy);
+										ReadBufferMode mode, BufferAccessStrategy strategy,
+										bool permanent);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
@@ -203,6 +204,9 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
 extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels);
+extern void CreateAndCopyRelationData(RelFileNode src_rnode,
+									  RelFileNode dst_rnode,
+									  bool permanent);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
 								   int nforks, BlockNumber *firstDelBlock);
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 49edbcc..be1d2c9 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -38,6 +38,7 @@ extern void RelationInitLockInfo(Relation relation);
 
 /* Lock a relation */
 extern void LockRelationOid(Oid relid, LOCKMODE lockmode);
+extern void LockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern bool ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode);
 extern void UnlockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern void UnlockRelationOid(Oid relid, LOCKMODE lockmode);
diff --git a/src/include/utils/relmapper.h b/src/include/utils/relmapper.h
index 9fbb5a7..f10353e 100644
--- a/src/include/utils/relmapper.h
+++ b/src/include/utils/relmapper.h
@@ -38,7 +38,9 @@ typedef struct xl_relmap_update
 extern Oid	RelationMapOidToFilenode(Oid relationId, bool shared);
 
 extern Oid	RelationMapFilenodeToOid(Oid relationId, bool shared);
-
+extern Oid RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId);
+extern void RelationMapCopy(Oid dbid, Oid tsid, char *srcdbpath,
+							char *dstdbpath);
 extern void RelationMapUpdateMap(Oid relationId, Oid fileNode, bool shared,
 								 bool immediate);
 
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 1c39ce0..d870c59 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -218,6 +218,7 @@ typedef enum
 	WAIT_EVENT_TWOPHASE_FILE_READ,
 	WAIT_EVENT_TWOPHASE_FILE_SYNC,
 	WAIT_EVENT_TWOPHASE_FILE_WRITE,
+	WAIT_EVENT_VERSION_FILE_WRITE,
 	WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ,
 	WAIT_EVENT_WAL_BOOTSTRAP_SYNC,
 	WAIT_EVENT_WAL_BOOTSTRAP_WRITE,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 93d5190..0747205 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -460,6 +460,8 @@ CoverPos
 CreateAmStmt
 CreateCastStmt
 CreateConversionStmt
+CreateDBRelInfo
+CreateDBStrategy
 CreateDomainStmt
 CreateEnumStmt
 CreateEventTrigStmt
@@ -3701,7 +3703,8 @@ xl_btree_update
 xl_btree_vacuum
 xl_clog_truncate
 xl_commit_ts_truncate
-xl_dbase_create_rec
+xl_dbase_create_file_copy_rec
+xl_dbase_create_wal_log_rec
 xl_dbase_drop_rec
 xl_end_of_recovery
 xl_hash_add_ovfl_page
-- 
1.8.3.1

#182

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Robert Haas (#180)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Mar 23, 2022 at 2:14 AM Robert Haas <robertmhaas@gmail.com> wrote:

So I talked to Andres and Thomas about this and they told me that I
was right to worry about this problem. Over on the thread about "wrong
fds used for refilenodes after pg_upgrade relfilenode changes
Reply-To:" there is a plan to use ProcSignalBarrier to make smgr
objects disappear, and ProcSignalBarrier can be processed at any
CHECK_FOR_INTERRUPTS(), so then we'd have a problem here. Commit
f10f0ae420ee62400876ab34dca2c09c20dcd030 established a policy that you
should always re-fetch the smgr object instead of reusing one you've
already got, and even before that it was known to be unsafe to keep
them around for any period of time, because anything that opened a
relation, including a syscache lookup, could potentially accept
invalidations. So most of our code is already hardened against the
possibility of smgr objects disappearing. I have a feeling there may
be some that isn't, but it would be good if this patch didn't
introduce more such code at the same time that patch is trying to
introduce more ways to get rid of smgr objects. It was suggested to me
that what this patch ought to be doing is calling
CreateFakeRelcacheEntry() and then using RelationGetSmgr(fakerel)
every time we need the SmgrRelation, without ever keeping it around
for any amount of code. That way, if the smgr relation gets closed out
from under us at a CHECK_FOR_INTERRUPTS(), we'll just recreate it at
the next RelationGetSmgr() call.

Okay, I have changed this in my latest version of the patch.
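
For illustration, the re-fetch pattern suggested above can be modeled
outside PostgreSQL. This is only a toy sketch, not code from the patch:
ToyRel, toy_rel_get_smgr() and toy_process_interrupts() are hypothetical
stand-ins for a fake relcache entry, RelationGetSmgr(), and a
CHECK_FOR_INTERRUPTS() that closes smgr objects. The point is that the
copy loop never holds the handle across an interrupt check.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for an SMgrRelation: it only counts block writes. */
typedef struct ToySmgr
{
	int			nwrites;
} ToySmgr;

/* Hypothetical stand-in for a fake relcache entry that caches its smgr. */
typedef struct ToyRel
{
	ToySmgr    *smgr;			/* NULL when closed or never opened */
	int			nopens;			/* how many times the smgr was (re)opened */
	int			total_writes;	/* writes that reached "storage" */
} ToyRel;

/* Modeled on RelationGetSmgr(): reopen the handle if it has gone away. */
static ToySmgr *
toy_rel_get_smgr(ToyRel *rel)
{
	if (rel->smgr == NULL)
	{
		rel->smgr = calloc(1, sizeof(ToySmgr));
		assert(rel->smgr != NULL);
		rel->nopens++;
	}
	return rel->smgr;
}

/* Models an interrupt that closes the smgr object out from under us. */
static void
toy_process_interrupts(ToyRel *rel)
{
	free(rel->smgr);
	rel->smgr = NULL;
}

/*
 * Copy nblocks blocks, re-fetching the handle for every block instead of
 * keeping a pointer across the simulated interrupt check.
 */
static void
toy_copy_blocks(ToyRel *rel, int nblocks, int interrupt_every)
{
	for (int blk = 0; blk < nblocks; blk++)
	{
		if (interrupt_every > 0 && blk > 0 && blk % interrupt_every == 0)
			toy_process_interrupts(rel);

		/* Always go through the accessor; never cache its result. */
		toy_rel_get_smgr(rel)->nwrites++;
		rel->total_writes++;
	}
}
```

In this model, copying 10 blocks with an interrupt every 4 blocks
reopens the handle three times, and every write still lands because no
stale pointer is ever dereferenced.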

Andres also noted that he thinks the patch performs redundant cleanup,
because of the fact that it uses RelationCreateStorage. That will
arrange to remove files on abort, but createdb() also has its own
mechanism for that. It doesn't seem like a thing to do twice in two
different ways.

Okay this is an interesting point. So one option is that in case of
failure while using the wal log strategy we do not remove the database
directory, because an abort transaction will take care of removing the
relation file. But then in failure case we will leave the orphaned
database directory with version file and the relmap file. Another
option is to do the redundant cleanup as we are doing now. Any other
options?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#183

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#182)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Mar 23, 2022 at 4:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Okay this is an interesting point. So one option is that in case of
failure while using the wal log strategy we do not remove the database
directory, because an abort transaction will take care of removing the
relation file. But then in failure case we will leave the orphaned
database directory with version file and the relmap file. Another
option is to do the redundant cleanup as we are doing now. Any other
options?

I think our overriding goal should be to get everything using one
mechanism. It doesn't look straightforward to get everything to go
through the PendingRelDelete mechanism, because as you say, it can't
handle non-relation files or directories. However, what if we opt out
of that mechanism? We could do that either by not using
RelationCreateStorage() in the first place and directly calling
smgrcreate(), or by using RelationPreserveStorage() afterwards to yank
the file back out of the list.
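
The two ways of opting out can be compared in a small self-contained
model. This is a sketch under stated assumptions, not PostgreSQL code:
ToyStorage and the toy_* functions are hypothetical, and only imitate
the roles of smgrcreate(), RelationCreateStorage(),
RelationPreserveStorage(), the abort-time pending-delete processing,
and createdb()'s own directory cleanup.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_FILES 8

/* Toy storage: which files exist, and which are scheduled for abort cleanup. */
typedef struct ToyStorage
{
	bool		exists[TOY_MAX_FILES];
	bool		pending_delete[TOY_MAX_FILES];
} ToyStorage;

/* Plays the role of smgrcreate(): create the file, register nothing. */
static void
toy_create_plain(ToyStorage *st, int file)
{
	assert(file >= 0 && file < TOY_MAX_FILES);
	st->exists[file] = true;
}

/* Plays the role of RelationCreateStorage(): create and register cleanup. */
static void
toy_create_registered(ToyStorage *st, int file)
{
	toy_create_plain(st, file);
	st->pending_delete[file] = true;
}

/* Plays the role of RelationPreserveStorage(): take the file off the list. */
static void
toy_preserve(ToyStorage *st, int file)
{
	assert(file >= 0 && file < TOY_MAX_FILES);
	st->pending_delete[file] = false;
}

/* Per-relation abort cleanup: unlink listed files, return how many. */
static int
toy_abort_pending_deletes(ToyStorage *st)
{
	int			removed = 0;

	for (int i = 0; i < TOY_MAX_FILES; i++)
	{
		if (st->pending_delete[i] && st->exists[i])
		{
			st->exists[i] = false;
			removed++;
		}
		st->pending_delete[i] = false;
	}
	return removed;
}

/* Directory-level cleanup: remove whatever is left, return how many. */
static int
toy_remove_database_dir(ToyStorage *st)
{
	int			removed = 0;

	for (int i = 0; i < TOY_MAX_FILES; i++)
	{
		if (st->exists[i])
		{
			st->exists[i] = false;
			removed++;
		}
	}
	return removed;
}
```

With plain creation, or with registration followed by preservation, the
per-relation list stays empty and the directory-level cleanup is the
only mechanism that removes anything, which is the single-mechanism
outcome argued for above.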

--
Robert Haas
EDB: http://www.enterprisedb.com

#184

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Robert Haas (#183)

1 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Mar 23, 2022 at 5:54 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 23, 2022 at 4:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Okay this is an interesting point. So one option is that in case of
failure while using the wal log strategy we do not remove the database
directory, because an abort transaction will take care of removing the
relation file. But then in failure case we will leave the orphaned
database directory with version file and the relmap file. Another
option is to do the redundant cleanup as we are doing now. Any other
options?

I think our overriding goal should be to get everything using one
mechanism. It doesn't look straightforward to get everything to go
through the PendingRelDelete mechanism, because as you say, it can't
handle non-relation files or directories. However, what if we opt out
of that mechanism? We could do that either by not using
RelationCreateStorage() in the first place and directly calling
smgrcreate(), or by using RelationPreserveStorage() afterwards to yank
the file back out of the list.

I think calling smgrcreate() directly is better than registering the
file first and then unregistering it. I have made that change in the
attached patch. With this change, we could also fold the creation of
MAIN_FORKNUM into the loop below that creates the other forks [1], at
the cost of one extra condition. However, I think the current code is
more in sync with other code that does similar things, so I have not
merged it into the loop. Please let me know if you think otherwise.

[1]
+    /*
+     * Create and copy all forks of the relation.  We are not using
+     * RelationCreateStorage() as it is registering the cleanup for the
+     * underlying relation storage on the transaction abort.  But during create
+     * database failure, we have a separate cleanup mechanism for the whole
+     * database directory. Therefore, we don't need to register cleanup for
+     * each individual relation storage.
+     */
+    smgrcreate(RelationGetSmgr(dst_rel), MAIN_FORKNUM, false);
+    if (permanent)
+        log_smgrcreate(&dst_rnode, MAIN_FORKNUM);
+
+    /* copy main fork. */
+    RelationCopyStorageUsingBuffer(src_rel, dst_rel, MAIN_FORKNUM, permanent);
+
+    /* copy those extra forks that exist */
+    for (ForkNumber forkNum = MAIN_FORKNUM + 1;
+         forkNum <= MAX_FORKNUM; forkNum++)
+    {
+        if (smgrexists(RelationGetSmgr(src_rel), forkNum))
+        {
+            smgrcreate(RelationGetSmgr(dst_rel), forkNum, false);
+
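
For comparison, here is a toy model of the two loop shapes. It is only a
sketch: the toy_* names are hypothetical, the fork numbers mirror
PostgreSQL's ForkNumber values (main fork 0, maximum fork number 3), and
the "extra condition" shown is an assumption about what the merged loop
would need, namely creating the main fork unconditionally while the
other forks are created only if they exist in the source.

```c
#include <assert.h>
#include <stdbool.h>

/* Fork numbering mirrors PostgreSQL's ForkNumber enum values. */
enum
{
	TOY_MAIN_FORKNUM = 0,
	TOY_MAX_FORKNUM = 3
};

#define TOY_NFORKS (TOY_MAX_FORKNUM + 1)

/* Shape of the quoted code: main fork first, then a loop over the rest. */
static int
toy_create_forks_separate(const bool src_exists[TOY_NFORKS],
						  bool dst_created[TOY_NFORKS])
{
	int			ncreated = 0;

	for (int fork = 0; fork < TOY_NFORKS; fork++)
		dst_created[fork] = false;

	/* The main fork is always created. */
	dst_created[TOY_MAIN_FORKNUM] = true;
	ncreated++;

	/* Create only those extra forks that exist in the source. */
	for (int fork = TOY_MAIN_FORKNUM + 1; fork <= TOY_MAX_FORKNUM; fork++)
	{
		if (src_exists[fork])
		{
			dst_created[fork] = true;
			ncreated++;
		}
	}
	return ncreated;
}

/* Merged shape: one loop, with an extra condition for the main fork. */
static int
toy_create_forks_merged(const bool src_exists[TOY_NFORKS],
						bool dst_created[TOY_NFORKS])
{
	int			ncreated = 0;

	for (int fork = TOY_MAIN_FORKNUM; fork <= TOY_MAX_FORKNUM; fork++)
	{
		dst_created[fork] = false;

		/* Assumed extra condition: never skip the main fork. */
		if (fork == TOY_MAIN_FORKNUM || src_exists[fork])
		{
			dst_created[fork] = true;
			ncreated++;
		}
	}
	return ncreated;
}
```

Both shapes create the same set of forks in this model; the difference
is only whether the main fork is handled as a special case before the
loop or by a condition inside it.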

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v3-0001-Add-new-block-by-block-strategy-for-CREATE-DATABA.patch (text/x-patch; charset=US-ASCII)

From fe6d9d5a2e1d0791749768a92a08dcb5dd4ca0ce Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 22 Mar 2022 11:22:26 -0400
Subject: [PATCH v3] Add new block-by-block strategy for CREATE DATABASE.

Because this strategy logs changes on a block-by-block basis, it
avoids the need to checkpoint before and after the operation.
However, because it logs each changed block individually, it might
generate a lot of extra write-ahead logging if the template database
is large. Therefore, the older strategy remains available via a new
STRATEGY parameter to CREATE DATABASE, and a corresponding --strategy
option to createdb.

Somewhat controversially, this patch assembles the list of relations
to be copied to the new database by reading the pg_class relation of
the template database. Cross-database access like this isn't normally
possible, but it can be made to work here because there can't be any
connections to the database being copied, nor can it contain any
in-doubt transactions. Even so, we have to use lower-level interfaces
than normal, since the table scan and relcache interfaces will not
work for a database to which we're not connected. The advantage of
this approach is that we do not need to rely on the filesystem to
determine what ought to be copied, but instead on PostgreSQL's own
knowledge of the database structure. This avoids, for example,
copying stray files that happen to be located in the source database
directory.

Dilip Kumar, with a fairly large number of cosmetic changes by me.
---
 contrib/bloom/blinsert.c                 |   2 +-
 doc/src/sgml/ref/create_database.sgml    |  22 +
 doc/src/sgml/ref/createdb.sgml           |  11 +
 src/backend/access/heap/heapam_handler.c |   2 +-
 src/backend/access/nbtree/nbtree.c       |   2 +-
 src/backend/access/rmgrdesc/dbasedesc.c  |  20 +-
 src/backend/access/transam/xlogutils.c   |   6 +-
 src/backend/commands/dbcommands.c        | 761 ++++++++++++++++++++++++++-----
 src/backend/storage/buffer/bufmgr.c      | 171 ++++++-
 src/backend/storage/lmgr/lmgr.c          |  28 ++
 src/backend/utils/activity/wait_event.c  |   3 +
 src/backend/utils/cache/relmapper.c      |  64 +++
 src/bin/pg_rewind/parsexlog.c            |   9 +-
 src/bin/psql/tab-complete.c              |   4 +-
 src/bin/scripts/createdb.c               |  10 +-
 src/bin/scripts/t/020_createdb.pl        |  20 +
 src/include/commands/dbcommands_xlog.h   |  25 +-
 src/include/storage/bufmgr.h             |   6 +-
 src/include/storage/lmgr.h               |   1 +
 src/include/utils/relmapper.h            |   4 +-
 src/include/utils/wait_event.h           |   1 +
 src/tools/pgindent/typedefs.list         |   5 +-
 22 files changed, 1038 insertions(+), 139 deletions(-)

diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index c94cf34..82378db 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -173,7 +173,7 @@ blbuildempty(Relation index)
 	 * Write the page and log it.  It might seem that an immediate sync would
 	 * be sufficient to guarantee that the file exists on disk, but recovery
 	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
+	 * XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
 	PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index 5ae785a..255ad3a 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -25,6 +25,7 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
     [ [ WITH ] [ OWNER [=] <replaceable class="parameter">user_name</replaceable> ]
            [ TEMPLATE [=] <replaceable class="parameter">template</replaceable> ]
            [ ENCODING [=] <replaceable class="parameter">encoding</replaceable> ]
+           [ STRATEGY [=] <replaceable class="parameter">strategy</replaceable> ]
            [ LOCALE [=] <replaceable class="parameter">locale</replaceable> ]
            [ LC_COLLATE [=] <replaceable class="parameter">lc_collate</replaceable> ]
            [ LC_CTYPE [=] <replaceable class="parameter">lc_ctype</replaceable> ]
@@ -118,6 +119,27 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
        </para>
       </listitem>
      </varlistentry>
+     <varlistentry id="create-database-strategy" xreflabel="CREATE DATABASE STRATEGY">
+      <term><replaceable class="parameter">strategy</replaceable></term>
+      <listitem>
+       <para>
+        Strategy to be used in creating the new database.  If
+        the <literal>WAL_LOG</literal> strategy is used, the database will be
+        copied block by block and each block will be separately written
+        to the write-ahead log. This is the most efficient strategy in
+        cases where the template database is small, and therefore it is the
+        default. The older <literal>FILE_COPY</literal> strategy is also
+        available. This strategy writes a small record to the write-ahead log
+        for each tablespace used by the target database. Each such record
+        represents copying an entire directory to a new location at the
+        filesystem level. While this does reduce the write-ahead
+        log volume substantially, especially if the template database is large,
+        it also forces the system to perform a checkpoint both before and
+        after the creation of the new database. In some situations, this may
+        have a noticeable negative impact on overall system performance.
+       </para>
+      </listitem>
+     </varlistentry>
      <varlistentry>
       <term><replaceable class="parameter">locale</replaceable></term>
       <listitem>
diff --git a/doc/src/sgml/ref/createdb.sgml b/doc/src/sgml/ref/createdb.sgml
index be42e50..671cd362 100644
--- a/doc/src/sgml/ref/createdb.sgml
+++ b/doc/src/sgml/ref/createdb.sgml
@@ -178,6 +178,17 @@ PostgreSQL documentation
      </varlistentry>
 
      <varlistentry>
+      <term><option>-S <replaceable class="parameter">strategy</replaceable></option></term>
+      <term><option>--strategy=<replaceable class="parameter">strategy</replaceable></option></term>
+      <listitem>
+       <para>
+        Specifies the database creation strategy.  See
+        <xref linkend="create-database-strategy" /> for more details.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><option>-T <replaceable class="parameter">template</replaceable></option></term>
       <term><option>--template=<replaceable class="parameter">template</replaceable></option></term>
       <listitem>
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 39ef8a0..2b70ca0 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -601,7 +601,7 @@ heapam_relation_set_new_filenode(Relation rel,
 	 * even if the page has been logged, because the write did not go through
 	 * shared_buffers and therefore a concurrent checkpoint may have moved the
 	 * redo pointer past our xlog record.  Recovery may as well remove it
-	 * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
+	 * while replaying, for example, XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE
 	 * record. Therefore, logging is necessary even if wal_level=minimal.
 	 */
 	if (persistence == RELPERSISTENCE_UNLOGGED)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c9b4964..dacf3f7 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -161,7 +161,7 @@ btbuildempty(Relation index)
 	 * Write the page and log it.  It might seem that an immediate sync would
 	 * be sufficient to guarantee that the file exists on disk, but recovery
 	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
+	 * XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
 	PageSetChecksumInplace(metapage, BTREE_METAPAGE);
diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index 03af3fd..523d0b3 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -24,14 +24,23 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 	char	   *rec = XLogRecGetData(record);
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
 
-	if (info == XLOG_DBASE_CREATE)
+	if (info == XLOG_DBASE_CREATE_FILE_COPY)
 	{
-		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;
+		xl_dbase_create_file_copy_rec *xlrec =
+		(xl_dbase_create_file_copy_rec *) rec;
 
 		appendStringInfo(buf, "copy dir %u/%u to %u/%u",
 						 xlrec->src_tablespace_id, xlrec->src_db_id,
 						 xlrec->tablespace_id, xlrec->db_id);
 	}
+	else if (info == XLOG_DBASE_CREATE_WAL_LOG)
+	{
+		xl_dbase_create_wal_log_rec *xlrec =
+		(xl_dbase_create_wal_log_rec *) rec;
+
+		appendStringInfo(buf, "create dir %u/%u",
+						 xlrec->tablespace_id, xlrec->db_id);
+	}
 	else if (info == XLOG_DBASE_DROP)
 	{
 		xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) rec;
@@ -51,8 +60,11 @@ dbase_identify(uint8 info)
 
 	switch (info & ~XLR_INFO_MASK)
 	{
-		case XLOG_DBASE_CREATE:
-			id = "CREATE";
+		case XLOG_DBASE_CREATE_FILE_COPY:
+			id = "CREATE_FILE_COPY";
+			break;
+		case XLOG_DBASE_CREATE_WAL_LOG:
+			id = "CREATE_WAL_LOG";
 			break;
 		case XLOG_DBASE_DROP:
 			id = "DROP";
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 511f2f1..a4dedc5 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -484,7 +484,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	{
 		/* page exists in file */
 		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-										   mode, NULL);
+										   mode, NULL, true);
 	}
 	else
 	{
@@ -509,7 +509,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				ReleaseBuffer(buffer);
 			}
 			buffer = ReadBufferWithoutRelcache(rnode, forknum,
-											   P_NEW, mode, NULL);
+											   P_NEW, mode, NULL, true);
 		}
 		while (BufferGetBlockNumber(buffer) < blkno);
 		/* Handle the corner case that P_NEW returns non-consecutive pages */
@@ -519,7 +519,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			ReleaseBuffer(buffer);
 			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-											   mode, NULL);
+											   mode, NULL, true);
 		}
 	}
 
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 623e5ec..02a096c 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -63,13 +63,31 @@
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/pg_locale.h"
+#include "utils/relmapper.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
+/*
+ * Create database strategy.
+ *
+ * CREATEDB_WAL_LOG will copy the database at the block level and WAL log each
+ * copied block.
+ *
+ * CREATEDB_FILE_COPY will simply perform a file system level copy of the
+ * database and log a single record for each tablespace copied. To make this
+ * safe, it also triggers checkpoints before and after the operation.
+ */
+typedef enum CreateDBStrategy
+{
+	CREATEDB_WAL_LOG,
+	CREATEDB_FILE_COPY
+} CreateDBStrategy;
+
 typedef struct
 {
 	Oid			src_dboid;		/* source (template) DB */
 	Oid			dest_dboid;		/* DB we are trying to create */
+	CreateDBStrategy strategy;	/* create db strategy */
 } createdb_failure_params;
 
 typedef struct
@@ -78,6 +96,17 @@ typedef struct
 	Oid			dest_tsoid;		/* tablespace we are trying to move to */
 } movedb_failure_params;
 
+/*
+ * Information about a relation to be copied when creating a database.
+ */
+typedef struct CreateDBRelInfo
+{
+	RelFileNode rnode;			/* physical relation identifier */
+	Oid			reloid;			/* relation oid */
+	bool		permanent;		/* relation is permanent or unlogged */
+} CreateDBRelInfo;
+
+
 /* non-export function prototypes */
 static void createdb_failure_callback(int code, Datum arg);
 static void movedb(const char *dbname, const char *tblspcname);
@@ -93,7 +122,540 @@ static bool have_createdb_privilege(void);
 static void remove_dbtablespaces(Oid db_id);
 static bool check_db_file_conflict(Oid db_id);
 static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
+static void CreateDatabaseUsingWalLog(Oid src_dboid, Oid dboid, Oid src_tsid,
+									  Oid dst_tsid);
+static List *ScanSourceDatabasePgClass(Oid srctbid, Oid srcdbid, char *srcpath);
+static List *ScanSourceDatabasePgClassPage(Page page, Buffer buf, Oid tbid,
+										   Oid dbid, char *srcpath,
+										   List *rnodelist, Snapshot snapshot);
+static CreateDBRelInfo *ScanSourceDatabasePgClassTuple(HeapTupleData *tuple,
+													   Oid tbid, Oid dbid,
+													   char *srcpath);
+static void CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid,
+									bool isRedo);
+static void CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dboid, Oid src_tsid,
+										Oid dst_tsid);
+
+/*
+ * Create a new database using the WAL_LOG strategy.
+ *
+ * Each copied block is separately written to the write-ahead log.
+ */
+static void
+CreateDatabaseUsingWalLog(Oid src_dboid, Oid dst_dboid,
+						  Oid src_tsid, Oid dst_tsid)
+{
+	char	   *srcpath;
+	char	   *dstpath;
+	List	   *rnodelist = NIL;
+	ListCell   *cell;
+	LockRelId	srcrelid;
+	LockRelId	dstrelid;
+	RelFileNode srcrnode;
+	RelFileNode dstrnode;
+	CreateDBRelInfo *relinfo;
+
+	/* Get source and destination database paths. */
+	srcpath = GetDatabasePath(src_dboid, src_tsid);
+	dstpath = GetDatabasePath(dst_dboid, dst_tsid);
+
+	/* Create database directory and write PG_VERSION file. */
+	CreateDirAndVersionFile(dstpath, dst_dboid, dst_tsid, false);
+
+	/* Copy relmap file from source database to the destination database. */
+	RelationMapCopy(dst_dboid, dst_tsid, srcpath, dstpath);
+
+	/* Get list of relfilenodes to copy from the source database. */
+	rnodelist = ScanSourceDatabasePgClass(src_tsid, src_dboid, srcpath);
+	Assert(rnodelist != NIL);
+
+	/*
+	 * The database IDs are the same for every relation, so set them before
+	 * entering the loop.
+	 */
+	srcrelid.dbId = src_dboid;
+	dstrelid.dbId = dst_dboid;
+
+	/* Loop over our list of relfilenodes and copy each one. */
+	foreach(cell, rnodelist)
+	{
+		relinfo = lfirst(cell);
+		srcrnode = relinfo->rnode;
+
+		/*
+		 * If the relation is in the source db's default tablespace then we
+		 * need to create it in the destination db's default tablespace.
+		 * Otherwise, we need to create it in the same tablespace as in the
+		 * source database.
+		 */
+		if (srcrnode.spcNode == src_tsid)
+			dstrnode.spcNode = dst_tsid;
+		else
+			dstrnode.spcNode = srcrnode.spcNode;
 
+		dstrnode.dbNode = dst_dboid;
+		dstrnode.relNode = srcrnode.relNode;
+
+		/* Acquire locks on source and target relations before copying. */
+		dstrelid.relId = srcrelid.relId = relinfo->reloid;
+		LockRelationId(&srcrelid, AccessShareLock);
+		LockRelationId(&dstrelid, AccessShareLock);
+
+		/* Copy relation storage from source to the destination. */
+		CreateAndCopyRelationData(srcrnode, dstrnode, relinfo->permanent);
+
+		/* Release the locks. */
+		UnlockRelationId(&srcrelid, AccessShareLock);
+		UnlockRelationId(&dstrelid, AccessShareLock);
+	}
+
+	list_free_deep(rnodelist);
+}
+
+/*
+ * Scan the pg_class table in the source database to identify the relations
+ * that need to be copied to the destination database.
+ *
+ * This is an exception to the usual rule that cross-database access is
+ * not possible. We can make it work here because we know that there are no
+ * connections to the source database and (since there can't be prepared
+ * transactions touching that database) no in-doubt tuples either. This
+ * means that we don't need to worry about pruning removing anything from
+ * under us, and we don't need to be too picky about our snapshot either.
+ * As long as it sees all previously-committed XIDs as committed and all
+ * aborted XIDs as aborted, we should be fine: nothing else is possible
+ * here.
+ *
+ * We can't rely on the relcache for anything here, because that only knows
+ * about the database to which we are connected, and can't handle access to
+ * other databases. That also means we can't rely on the heap scan
+ * infrastructure, which would be a bad idea anyway since it might try
+ * to do things like HOT pruning which we definitely can't do safely in
+ * a database to which we're not even connected.
+ */
+static List *
+ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
+{
+	RelFileNode rnode;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	Buffer		buf;
+	Oid			relfilenode;
+	Page		page;
+	List	   *rnodelist = NIL;
+	LockRelId	relid;
+	Relation	rel;
+	Snapshot	snapshot;
+	BufferAccessStrategy bstrategy;
+
+	/* Get pg_class relfilenode. */
+	relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+													  RelationRelationId);
+
+	/*
+	 * The system elsewhere assumes that we only read data for a relation
+	 * into shared_buffers while holding some sort of a lock on a relation,
+	 * so lock the source database's pg_class before we do anything else.
+	 */
+	relid.dbId = dbid;
+	relid.relId = RelationRelationId;
+	LockRelationId(&relid, AccessShareLock);
+
+	/* Prepare a RelFileNode for the pg_class relation. */
+	rnode.spcNode = tbid;
+	rnode.dbNode = dbid;
+	rnode.relNode = relfilenode;
+
+	/*
+	 * Create a fake relcache entry for the pg_class relation and get the
+	 * number of blocks.  Refer to the comments in CreateAndCopyRelationData()
+	 * for the rationale behind using the fake relcache entry.
+	 */
+	rel = CreateFakeRelcacheEntry(rnode);
+	nblocks = smgrnblocks(RelationGetSmgr(rel), MAIN_FORKNUM);
+	FreeFakeRelcacheEntry(rel);
+
+	/* Use a buffer access strategy since this is a bulk read operation. */
+	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	/*
+	 * As explained in the function header comments, we need a snapshot that
+	 * will see all committed transactions as committed, and our transaction
+	 * snapshot - or the active snapshot - might not be new enough for that,
+	 * but the return value of GetLatestSnapshot() should work fine.
+	 */
+	snapshot = GetLatestSnapshot();
+
+	/* Process the relation block by block. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
+										RBM_NORMAL, bstrategy, false);
+
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buf);
+		if (PageIsNew(page) || PageIsEmpty(page))
+		{
+			UnlockReleaseBuffer(buf);
+			continue;
+		}
+
+		/* Append relevant pg_class tuples for current page to rnodelist. */
+		rnodelist = ScanSourceDatabasePgClassPage(page, buf, tbid, dbid,
+												  srcpath, rnodelist,
+												  snapshot);
+
+		UnlockReleaseBuffer(buf);
+	}
+
+	/* Release relation lock. */
+	UnlockRelationId(&relid, AccessShareLock);
+
+	return rnodelist;
+}
+
+/*
+ * Scan one page of the source database's pg_class relation and add relevant
+ * entries to rnodelist. The return value is the updated list.
+ */
+static List *
+ScanSourceDatabasePgClassPage(Page page, Buffer buf, Oid tbid, Oid dbid,
+							  char *srcpath, List *rnodelist,
+							  Snapshot snapshot)
+{
+	BlockNumber		blkno = BufferGetBlockNumber(buf);
+	OffsetNumber	offnum;
+	OffsetNumber	maxoff;
+	HeapTupleData	tuple;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Loop over offsets. */
+	for (offnum = FirstOffsetNumber;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid;
+
+		itemid = PageGetItemId(page, offnum);
+
+		/* Nothing to do if slot is empty or already dead. */
+		if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid) ||
+			ItemIdIsRedirected(itemid))
+			continue;
+
+		Assert(ItemIdIsNormal(itemid));
+		ItemPointerSet(&(tuple.t_self), blkno, offnum);
+
+		/* Initialize a HeapTupleData structure. */
+		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+		tuple.t_len = ItemIdGetLength(itemid);
+		tuple.t_tableOid = RelationRelationId;
+
+		/* Skip tuples that are not visible to this snapshot. */
+		if (HeapTupleSatisfiesVisibility(&tuple, snapshot, buf))
+		{
+			CreateDBRelInfo *relinfo;
+
+			/*
+			 * ScanSourceDatabasePgClassTuple is in charge of constructing
+			 * a CreateDBRelInfo object for this tuple, but can also decide
+			 * that this tuple isn't something we need to copy. If we do need
+			 * to copy the relation, add it to the list.
+			 */
+			relinfo = ScanSourceDatabasePgClassTuple(&tuple, tbid, dbid,
+													 srcpath);
+			if (relinfo != NULL)
+				rnodelist = lappend(rnodelist, relinfo);
+		}
+	}
+
+	return rnodelist;
+}
+
+/*
+ * Decide whether a certain pg_class tuple represents something that
+ * needs to be copied from the source database to the destination database,
+ * and if so, construct a CreateDBRelInfo for it.
+ *
+ * Visibility checks are handled by the caller, so our job here is just
+ * to assess the data stored in the tuple.
+ */
+static CreateDBRelInfo *
+ScanSourceDatabasePgClassTuple(HeapTupleData *tuple, Oid tbid, Oid dbid,
+							   char *srcpath)
+{
+	CreateDBRelInfo	   *relinfo;
+	Form_pg_class		classForm;
+	Oid					relfilenode = InvalidOid;
+
+	classForm = (Form_pg_class) GETSTRUCT(tuple);
+
+	/*
+	 * Return NULL if this object does not need to be copied.
+	 *
+	 * Shared objects don't need to be copied, because they are shared.
+	 * Objects without storage can't be copied, because there's nothing to
+	 * copy. Temporary relations don't need to be copied either, because
+	 * they are inaccessible outside of the session that created them,
+	 * which must be gone already, and couldn't connect to a different database
+	 * if it still existed. autovacuum will eventually remove the pg_class
+	 * entries as well.
+	 */
+	if (classForm->reltablespace == GLOBALTABLESPACE_OID ||
+		!RELKIND_HAS_STORAGE(classForm->relkind) ||
+		classForm->relpersistence == RELPERSISTENCE_TEMP)
+		return NULL;
+
+	/*
+	 * If relfilenode is valid then directly use it.  Otherwise, consult the
+	 * relmap.
+	 */
+	if (OidIsValid(classForm->relfilenode))
+		relfilenode = classForm->relfilenode;
+	else
+		relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+														  classForm->oid);
+
+	/* We must have a valid relfilenode oid. */
+	if (!OidIsValid(relfilenode))
+		elog(ERROR, "relation with OID %u does not have a valid relfilenode",
+			 classForm->oid);
+
+	/* Prepare a rel info element; the caller adds it to the list. */
+	relinfo = (CreateDBRelInfo *) palloc(sizeof(CreateDBRelInfo));
+	if (OidIsValid(classForm->reltablespace))
+		relinfo->rnode.spcNode = classForm->reltablespace;
+	else
+		relinfo->rnode.spcNode = tbid;
+
+	relinfo->rnode.dbNode = dbid;
+	relinfo->rnode.relNode = relfilenode;
+	relinfo->reloid = classForm->oid;
+
+	/* Temporary relations were rejected above. */
+	Assert(classForm->relpersistence != RELPERSISTENCE_TEMP);
+	relinfo->permanent =
+		(classForm->relpersistence == RELPERSISTENCE_PERMANENT);
+
+	return relinfo;
+}
+
+/*
+ * Create database directory and write out the PG_VERSION file in the database
+ * path.  If isRedo is true, it's okay for the database directory to exist
+ * already.
+ */
+static void
+CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
+{
+	int			fd;
+	int			nbytes;
+	char		versionfile[MAXPGPATH];
+	char		buf[16];
+
+	/*
+	 * Prepare version data before starting a critical section.
+	 *
+	 * Note that we don't have to copy this from the source database; there's
+	 * only one legal value.
+	 */
+	snprintf(buf, sizeof(buf), "%s\n", PG_MAJORVERSION);
+	nbytes = strlen(PG_MAJORVERSION) + 1;
+
+	/* If we are not in WAL replay then write the WAL. */
+	if (!isRedo)
+	{
+		xl_dbase_create_wal_log_rec xlrec;
+		XLogRecPtr	lsn;
+
+		START_CRIT_SECTION();
+
+		xlrec.db_id = dbid;
+		xlrec.tablespace_id = tsid;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec),
+						 sizeof(xl_dbase_create_wal_log_rec));
+
+		lsn = XLogInsert(RM_DBASE_ID, XLOG_DBASE_CREATE_WAL_LOG);
+
+		/* As always, WAL must hit the disk before the data update does. */
+		XLogFlush(lsn);
+	}
+
+	/* Create database directory. */
+	if (MakePGDirectory(dbpath) < 0)
+	{
+		/* Failure other than already exists or not in WAL replay? */
+		if (errno != EEXIST || !isRedo)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create directory \"%s\": %m", dbpath)));
+	}
+
+	/*
+	 * Create PG_VERSION file in the database path.  If the file already
+	 * exists and we are in WAL replay then try again to open it in write
+	 * mode.
+	 */
+	snprintf(versionfile, sizeof(versionfile), "%s/%s", dbpath, "PG_VERSION");
+
+	fd = OpenTransientFile(versionfile, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	if (fd < 0 && errno == EEXIST && isRedo)
+		fd = OpenTransientFile(versionfile, O_WRONLY | O_TRUNC | PG_BINARY);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m", versionfile)));
+
+	/* Write PG_MAJORVERSION in the PG_VERSION file. */
+	pgstat_report_wait_start(WAIT_EVENT_VERSION_FILE_WRITE);
+	errno = 0;
+	if ((int) write(fd, buf, nbytes) != nbytes)
+	{
+		/* If write didn't set errno, assume problem is no disk space. */
+		if (errno == 0)
+			errno = ENOSPC;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", versionfile)));
+	}
+	pgstat_report_wait_end();
+
+	/* Close the version file. */
+	CloseTransientFile(fd);
+
+	/* Critical section done. */
+	if (!isRedo)
+		END_CRIT_SECTION();
+}
+
+/*
+ * Create a new database using the FILE_COPY strategy.
+ *
+ * Copy each tablespace at the filesystem level, and log a single WAL record
+ * for each tablespace copied.  This requires a checkpoint before and after the
+ * copy, which may be expensive, but it does greatly reduce WAL generation
+ * if the copied database is large.
+ */
+static void
+CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dst_dboid, Oid src_tsid,
+							Oid dst_tsid)
+{
+	TableScanDesc scan;
+	Relation	rel;
+	HeapTuple	tuple;
+
+	/*
+	 * Force a checkpoint before starting the copy. This will force all dirty
+	 * buffers, including those of unlogged tables, out to disk, to ensure
+	 * source database is up-to-date on disk for the copy.
+	 * FlushDatabaseBuffers() would suffice for that, but we also want to
+	 * process any pending unlink requests. Otherwise, if a checkpoint
+	 * happened while we're copying files, a file might be deleted just when
+	 * we're about to copy it, causing the lstat() call in copydir() to fail
+	 * with ENOENT.
+	 */
+	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
+					  CHECKPOINT_WAIT | CHECKPOINT_FLUSH_ALL);
+
+	/*
+	 * Iterate through all tablespaces of the template database, and copy each
+	 * one to the new database.
+	 */
+	rel = table_open(TableSpaceRelationId, AccessShareLock);
+	scan = table_beginscan_catalog(rel, 0, NULL);
+	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	{
+		Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
+		Oid			srctablespace = spaceform->oid;
+		Oid			dsttablespace;
+		char	   *srcpath;
+		char	   *dstpath;
+		struct stat st;
+
+		/* No need to copy global tablespace */
+		if (srctablespace == GLOBALTABLESPACE_OID)
+			continue;
+
+		srcpath = GetDatabasePath(src_dboid, srctablespace);
+
+		if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
+			directory_is_empty(srcpath))
+		{
+			/* Assume we can ignore it */
+			pfree(srcpath);
+			continue;
+		}
+
+		if (srctablespace == src_tsid)
+			dsttablespace = dst_tsid;
+		else
+			dsttablespace = srctablespace;
+
+		dstpath = GetDatabasePath(dst_dboid, dsttablespace);
+
+		/*
+		 * Copy this subdirectory to the new location
+		 *
+		 * We don't need to copy subdirectories
+		 */
+		copydir(srcpath, dstpath, false);
+
+		/* Record the filesystem change in XLOG */
+		{
+			xl_dbase_create_file_copy_rec xlrec;
+
+			xlrec.db_id = dst_dboid;
+			xlrec.tablespace_id = dsttablespace;
+			xlrec.src_db_id = src_dboid;
+			xlrec.src_tablespace_id = srctablespace;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec,
+							 sizeof(xl_dbase_create_file_copy_rec));
+
+			(void) XLogInsert(RM_DBASE_ID,
+							  XLOG_DBASE_CREATE_FILE_COPY | XLR_SPECIAL_REL_UPDATE);
+		}
+	}
+	table_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	/*
+	 * We force a checkpoint before committing.  This effectively means that
+	 * committed XLOG_DBASE_CREATE_FILE_COPY operations will never need to be
+	 * replayed (at least not in ordinary crash recovery; we still have to
+	 * make the XLOG entry for the benefit of PITR operations). This avoids
+	 * two nasty scenarios:
+	 *
+	 * #1: When PITR is off, we don't XLOG the contents of newly created
+	 * indexes; therefore the drop-and-recreate-whole-directory behavior of
+	 * DBASE_CREATE replay would lose such indexes.
+	 *
+	 * #2: Since we have to recopy the source database during DBASE_CREATE
+	 * replay, we run the risk of copying changes in it that were committed
+	 * after the original CREATE DATABASE command but before the system crash
+	 * that led to the replay.  This is at least unexpected and at worst could
+	 * lead to inconsistencies, eg duplicate table names.
+	 *
+	 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
+	 *
+	 * In PITR replay, the first of these isn't an issue, and the second is
+	 * only a risk if the CREATE DATABASE and subsequent template database
+	 * change both occur while a base backup is being taken. There doesn't
+	 * seem to be much we can do about that except document it as a
+	 * limitation.
+	 *
+	 * See CreateDatabaseUsingWalLog() for a less cheesy CREATE DATABASE
+	 * strategy that avoids these problems.
+	 */
+	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+}
 
 /*
  * CREATE DATABASE
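Both strategy functions above apply the same tablespace rule: anything stored in the source database's default tablespace moves to the destination's default tablespace, and everything else stays in the tablespace it already uses. A minimal standalone sketch of that rule follows. The Oid typedef is a stand-in and map_dst_tablespace() is a hypothetical helper name for illustration only; neither is part of the patch.

```c
typedef unsigned int Oid;		/* stand-in for PostgreSQL's Oid type */

/*
 * Hypothetical helper, not part of the patch: choose the tablespace in
 * which a copied relation (or directory) should be created.
 *
 * rel_spc          tablespace the object uses in the source database
 * src_default_spc  default tablespace of the source database
 * dst_default_spc  default tablespace of the destination database
 */
static Oid
map_dst_tablespace(Oid rel_spc, Oid src_default_spc, Oid dst_default_spc)
{
	/* Objects in the source default tablespace follow the new default. */
	if (rel_spc == src_default_spc)
		return dst_default_spc;

	/* Everything else stays in the tablespace it already uses. */
	return rel_spc;
}
```

The same comparison appears once per relation in CreateDatabaseUsingWalLog() and once per tablespace directory in CreateDatabaseUsingFileCopy().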
@@ -101,8 +663,6 @@ static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
 Oid
 createdb(ParseState *pstate, const CreatedbStmt *stmt)
 {
-	TableScanDesc scan;
-	Relation	rel;
 	Oid			src_dboid;
 	Oid			src_owner;
 	int			src_encoding = -1;
@@ -137,6 +697,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	DefElem    *dallowconnections = NULL;
 	DefElem    *dconnlimit = NULL;
 	DefElem    *dcollversion = NULL;
+	DefElem    *dstrategy = NULL;
 	char	   *dbname = stmt->dbname;
 	char	   *dbowner = NULL;
 	const char *dbtemplate = NULL;
@@ -152,6 +713,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	char	   *dbcollversion = NULL;
 	int			notherbackends;
 	int			npreparedxacts;
+	CreateDBStrategy dbstrategy = CREATEDB_WAL_LOG;
 	createdb_failure_params fparms;
 
 	/* Extract options from the statement node tree */
@@ -269,6 +831,12 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 						(errcode(ERRCODE_INVALID_PARAMETER_VALUE)),
 						errmsg("OIDs less than %u are reserved for system objects", FirstNormalObjectId));
 		}
+		else if (strcmp(defel->defname, "strategy") == 0)
+		{
+			if (dstrategy)
+				errorConflictingDefElem(defel, pstate);
+			dstrategy = defel;
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -413,6 +981,23 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 							dbtemplate)));
 	}
 
+	/* Validate the database creation strategy. */
+	if (dstrategy && dstrategy->arg)
+	{
+		char	   *strategy;
+
+		strategy = defGetString(dstrategy);
+		if (strcmp(strategy, "wal_log") == 0)
+			dbstrategy = CREATEDB_WAL_LOG;
+		else if (strcmp(strategy, "file_copy") == 0)
+			dbstrategy = CREATEDB_FILE_COPY;
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("invalid create database strategy \"%s\"", strategy),
+					 errhint("Valid strategies are \"wal_log\" and \"file_copy\".")));
+	}
+
 	/* If encoding or locales are defaulted, use source's setting */
 	if (encoding < 0)
 		encoding = src_encoding;
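The strategy validation in the hunk above is a case-sensitive string match that falls back to WAL_LOG when no option is given. A standalone sketch of that logic, assuming nothing beyond the C library, could look like this. SketchStrategy and parse_strategy() are hypothetical names used only for illustration; the patch itself uses CreateDBStrategy and reports an error through ereport() rather than returning a sentinel value.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the patch's CreateDBStrategy enum. */
typedef enum
{
	SKETCH_WAL_LOG,
	SKETCH_FILE_COPY,
	SKETCH_INVALID				/* the patch raises an ERROR here instead */
} SketchStrategy;

/*
 * Hypothetical helper: map a strategy name to an enum value.  A NULL name
 * models the case where the STRATEGY option was not supplied at all.
 */
static SketchStrategy
parse_strategy(const char *name)
{
	if (name == NULL)
		return SKETCH_WAL_LOG;	/* default when the option is omitted */
	if (strcmp(name, "wal_log") == 0)
		return SKETCH_WAL_LOG;
	if (strcmp(name, "file_copy") == 0)
		return SKETCH_FILE_COPY;
	return SKETCH_INVALID;		/* includes differently-cased spellings */
}
```

Because the comparison uses strcmp(), a spelling such as "WAL_LOG" is rejected in this sketch unless the caller lower-cases the value first.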
@@ -753,17 +1338,16 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	InvokeObjectPostCreateHook(DatabaseRelationId, dboid, 0);
 
 	/*
-	 * Force a checkpoint before starting the copy. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, if a checkpoint
-	 * happened while we're copying files, a file might be deleted just when
-	 * we're about to copy it, causing the lstat() call in copydir() to fail
-	 * with ENOENT.
+	 * Acquire a lock on the target database, even though it is new and no one
+	 * else should be able to access it.  With the WAL_LOG strategy we are
+	 * going to access relation pages through shared buffers, and as a general
+	 * principle we should hold the database lock and the relation lock before
+	 * touching any shared buffers.  The individual relation-level locks are
+	 * acquired in CreateDatabaseUsingWalLog() when the pages are read into
+	 * shared buffers.
 	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
+	if (dbstrategy == CREATEDB_WAL_LOG)
+		LockSharedObject(DatabaseRelationId, dboid, 0, AccessShareLock);
 
 	/*
 	 * Once we start copying subdirectories, we need to be able to clean 'em
@@ -774,101 +1358,24 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	 */
 	fparms.src_dboid = src_dboid;
 	fparms.dest_dboid = dboid;
+	fparms.strategy = dbstrategy;
+
 	PG_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
 		/*
-		 * Iterate through all tablespaces of the template database, and copy
-		 * each one to the new database.
-		 */
-		rel = table_open(TableSpaceRelationId, AccessShareLock);
-		scan = table_beginscan_catalog(rel, 0, NULL);
-		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
-		{
-			Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
-			Oid			srctablespace = spaceform->oid;
-			Oid			dsttablespace;
-			char	   *srcpath;
-			char	   *dstpath;
-			struct stat st;
-
-			/* No need to copy global tablespace */
-			if (srctablespace == GLOBALTABLESPACE_OID)
-				continue;
-
-			srcpath = GetDatabasePath(src_dboid, srctablespace);
-
-			if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
-				directory_is_empty(srcpath))
-			{
-				/* Assume we can ignore it */
-				pfree(srcpath);
-				continue;
-			}
-
-			if (srctablespace == src_deftablespace)
-				dsttablespace = dst_deftablespace;
-			else
-				dsttablespace = srctablespace;
-
-			dstpath = GetDatabasePath(dboid, dsttablespace);
-
-			/*
-			 * Copy this subdirectory to the new location
-			 *
-			 * We don't need to copy subdirectories
-			 */
-			copydir(srcpath, dstpath, false);
-
-			/* Record the filesystem change in XLOG */
-			{
-				xl_dbase_create_rec xlrec;
-
-				xlrec.db_id = dboid;
-				xlrec.tablespace_id = dsttablespace;
-				xlrec.src_db_id = src_dboid;
-				xlrec.src_tablespace_id = srctablespace;
-
-				XLogBeginInsert();
-				XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-				(void) XLogInsert(RM_DBASE_ID,
-								  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-			}
-		}
-		table_endscan(scan);
-		table_close(rel, AccessShareLock);
-
-		/*
-		 * We force a checkpoint before committing.  This effectively means
-		 * that committed XLOG_DBASE_CREATE operations will never need to be
-		 * replayed (at least not in ordinary crash recovery; we still have to
-		 * make the XLOG entry for the benefit of PITR operations). This
-		 * avoids two nasty scenarios:
-		 *
-		 * #1: When PITR is off, we don't XLOG the contents of newly created
-		 * indexes; therefore the drop-and-recreate-whole-directory behavior
-		 * of DBASE_CREATE replay would lose such indexes.
-		 *
-		 * #2: Since we have to recopy the source database during DBASE_CREATE
-		 * replay, we run the risk of copying changes in it that were
-		 * committed after the original CREATE DATABASE command but before the
-		 * system crash that led to the replay.  This is at least unexpected
-		 * and at worst could lead to inconsistencies, eg duplicate table
-		 * names.
-		 *
-		 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
-		 *
-		 * In PITR replay, the first of these isn't an issue, and the second
-		 * is only a risk if the CREATE DATABASE and subsequent template
-		 * database change both occur while a base backup is being taken.
-		 * There doesn't seem to be much we can do about that except document
-		 * it as a limitation.
-		 *
-		 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
-		 * we can avoid this.
+		 * With the WAL_LOG strategy (the default), call
+		 * CreateDatabaseUsingWalLog(), which copies the database block by
+		 * block and WAL-logs each copied block.  Otherwise, call
+		 * CreateDatabaseUsingFileCopy(), which copies the database file by
+		 * file.
 		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+		if (dbstrategy == CREATEDB_WAL_LOG)
+			CreateDatabaseUsingWalLog(src_dboid, dboid, src_deftablespace,
+									  dst_deftablespace);
+		else
+			CreateDatabaseUsingFileCopy(src_dboid, dboid, src_deftablespace,
+										dst_deftablespace);
 
 		/*
 		 * Close pg_database, but keep lock till commit.
@@ -955,6 +1462,25 @@ createdb_failure_callback(int code, Datum arg)
 	createdb_failure_params *fparms = (createdb_failure_params *) DatumGetPointer(arg);
 
 	/*
+	 * If we were copying the database at the block level, drop the pages of
+	 * the destination database that are in the shared buffer cache, and tell
+	 * the checkpointer to forget any pending fsync and unlink requests for
+	 * files in the database.  The reasoning is the same as explained in
+	 * dropdb().  Unlike dropdb(), we don't need to call
+	 * pgstat_drop_database(), because the database has not been created yet,
+	 * so there should not be any stats for it.
+	 */
+	if (fparms->strategy == CREATEDB_WAL_LOG)
+	{
+		DropDatabaseBuffers(fparms->dest_dboid);
+		ForgetDatabaseSyncRequests(fparms->dest_dboid);
+
+		/* Release lock on the target database. */
+		UnlockSharedObject(DatabaseRelationId, fparms->dest_dboid, 0,
+						   AccessShareLock);
+	}
+
+	/*
 	 * Release lock on source database before doing recursive remove. This is
 	 * not essential but it seems desirable to release the lock as soon as
 	 * possible.
@@ -1478,7 +2004,7 @@ movedb(const char *dbname, const char *tblspcname)
 		 * Record the filesystem change in XLOG
 		 */
 		{
-			xl_dbase_create_rec xlrec;
+			xl_dbase_create_file_copy_rec xlrec;
 
 			xlrec.db_id = db_id;
 			xlrec.tablespace_id = dst_tblspcoid;
@@ -1486,10 +2012,11 @@ movedb(const char *dbname, const char *tblspcname)
 			xlrec.src_tablespace_id = src_tblspcoid;
 
 			XLogBeginInsert();
-			XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
+			XLogRegisterData((char *) &xlrec,
+							 sizeof(xl_dbase_create_file_copy_rec));
 
 			(void) XLogInsert(RM_DBASE_ID,
-							  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
+							  XLOG_DBASE_CREATE_FILE_COPY | XLR_SPECIAL_REL_UPDATE);
 		}
 
 		/*
@@ -1525,9 +2052,10 @@ movedb(const char *dbname, const char *tblspcname)
 
 		/*
 		 * Force another checkpoint here.  As in CREATE DATABASE, this is to
-		 * ensure that we don't have to replay a committed XLOG_DBASE_CREATE
-		 * operation, which would cause us to lose any unlogged operations
-		 * done in the new DB tablespace before the next checkpoint.
+		 * ensure that we don't have to replay a committed
+		 * XLOG_DBASE_CREATE_FILE_COPY operation, which would cause us to lose
+		 * any unlogged operations done in the new DB tablespace before the
+		 * next checkpoint.
 		 */
 		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
 
@@ -2478,9 +3006,10 @@ dbase_redo(XLogReaderState *record)
 	/* Backup blocks are not used in dbase records */
 	Assert(!XLogRecHasAnyBlockRefs(record));
 
-	if (info == XLOG_DBASE_CREATE)
+	if (info == XLOG_DBASE_CREATE_FILE_COPY)
 	{
-		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
+		xl_dbase_create_file_copy_rec *xlrec =
+		(xl_dbase_create_file_copy_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
 		struct stat st;
@@ -2515,6 +3044,18 @@ dbase_redo(XLogReaderState *record)
 		 */
 		copydir(src_path, dst_path, false);
 	}
+	else if (info == XLOG_DBASE_CREATE_WAL_LOG)
+	{
+		xl_dbase_create_wal_log_rec *xlrec =
+		(xl_dbase_create_wal_log_rec *) XLogRecGetData(record);
+		char	   *dbpath;
+
+		dbpath = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+
+		/* Create the database directory with the version file. */
+		CreateDirAndVersionFile(dbpath, xlrec->db_id, xlrec->tablespace_id,
+								true);
+	}
 	else if (info == XLOG_DBASE_DROP)
 	{
 		xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) XLogRecGetData(record);
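The redo branch above reuses CreateDirAndVersionFile() with isRedo = true, so replay has to tolerate a directory and PG_VERSION file left behind by an earlier, partially completed attempt. The sketch below shows only that idempotency pattern using plain POSIX calls. It is an illustration under stated assumptions, not the patch's code: create_dir_and_version_file() is a hypothetical name, the file modes are arbitrary, and the real function uses MakePGDirectory(), OpenTransientFile() and ereport() and also emits the WAL record.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical sketch: create dbpath and write "<version>\n" into
 * dbpath/PG_VERSION.  Returns 0 on success, -1 on failure with errno set.
 * When is_redo is true, an existing directory or file is not an error.
 */
static int
create_dir_and_version_file(const char *dbpath, const char *version,
							bool is_redo)
{
	char		path[1024];
	char		buf[32];
	int			nbytes;
	int			fd;

	/* Format the file contents up front, with bounds checking. */
	nbytes = snprintf(buf, sizeof(buf), "%s\n", version);
	if (nbytes < 0 || (size_t) nbytes >= sizeof(buf))
	{
		errno = EINVAL;
		return -1;
	}

	/* An existing directory is acceptable only during replay. */
	if (mkdir(dbpath, 0700) < 0 && (errno != EEXIST || !is_redo))
		return -1;

	if (snprintf(path, sizeof(path), "%s/PG_VERSION", dbpath) >=
		(int) sizeof(path))
	{
		errno = ENAMETOOLONG;
		return -1;
	}

	/* Create exclusively; during replay, fall back to overwriting. */
	fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (fd < 0 && errno == EEXIST && is_redo)
		fd = open(path, O_WRONLY | O_TRUNC);
	if (fd < 0)
		return -1;

	errno = 0;
	if (write(fd, buf, (size_t) nbytes) != (ssize_t) nbytes)
	{
		/* If write didn't set errno, assume the disk is full. */
		int			save_errno = (errno != 0) ? errno : ENOSPC;

		close(fd);
		errno = save_errno;
		return -1;
	}

	return close(fd);
}
```

Calling the sketch twice with is_redo = false fails the second time with EEXIST, while a call with is_redo = true succeeds and rewrites the file, which is the behavior the replay path relies on.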
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c6..15b0ee3 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -38,6 +38,7 @@
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
 #include "executor/instrument.h"
 #include "lib/binaryheap.h"
 #include "miscadmin.h"
@@ -486,6 +487,9 @@ static void FindAndDropRelFileNodeBuffers(RelFileNode rnode,
 										  ForkNumber forkNum,
 										  BlockNumber nForkBlock,
 										  BlockNumber firstDelBlock);
+static void RelationCopyStorageUsingBuffer(Relation src, Relation dst,
+										   ForkNumber forkNum,
+										   bool permanent);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int	rnode_comparator(const void *p1, const void *p2);
@@ -772,23 +776,23 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
  * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
  *		a relcache entry for the relation.
  *
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay.  If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
+ * Pass permanent = true for a RELPERSISTENCE_PERMANENT relation, and
+ * permanent = false for a RELPERSISTENCE_UNLOGGED relation. This function
+ * cannot be used for temporary relations (and making that work might be
+ * difficult, unless we only want to read temporary relations for our own
+ * BackendId).
  */
 Buffer
 ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
-						  BufferAccessStrategy strategy)
+						  BufferAccessStrategy strategy, bool permanent)
 {
 	bool		hit;
 
 	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
 
-	Assert(InRecovery);
-
-	return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
+	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -3677,6 +3681,157 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
 }
 
 /* ---------------------------------------------------------------------
+ *		RelationCopyStorageUsingBuffer
+ *
+ *		Copy a fork's data using bufmgr.  Same as RelationCopyStorage, but
+ *		instead of using smgrread and smgrextend this copies using bufmgr APIs.
+ *
+ *		Refer to the comments atop CreateAndCopyRelationData() for details
+ *		about the 'permanent' parameter.
+ * --------------------------------------------------------------------
+ */
+static void
+RelationCopyStorageUsingBuffer(Relation src, Relation dst, ForkNumber forkNum,
+							   bool permanent)
+{
+	Buffer		srcBuf;
+	Buffer		dstBuf;
+	Page		srcPage;
+	Page		dstPage;
+	bool		use_wal;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	BufferAccessStrategy bstrategy_src;
+	BufferAccessStrategy bstrategy_dst;
+
+	/*
+	 * In general, we want to write WAL whenever wal_level > 'minimal', but
+	 * we can skip it when copying any fork of an unlogged relation other
+	 * than the init fork.
+	 */
+	use_wal = XLogIsNeeded() && (permanent || forkNum == INIT_FORKNUM);
+
+	/* Get number of blocks in the source relation. */
+	nblocks = smgrnblocks(RelationGetSmgr(src), forkNum);
+
+	/* Nothing to copy; just return. */
+	if (nblocks == 0)
+		return;
+
+	/* This is a bulk operation, so use buffer access strategies. */
+	bstrategy_src = GetAccessStrategy(BAS_BULKREAD);
+	bstrategy_dst = GetAccessStrategy(BAS_BULKWRITE);
+
+	/* Iterate over each block of the source relation file. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		/* Read block from source relation. */
+		srcBuf = ReadBufferWithoutRelcache(src->rd_node, forkNum, blkno,
+										   RBM_NORMAL, bstrategy_src,
+										   permanent);
+		srcPage = BufferGetPage(srcBuf);
+		if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
+		{
+			ReleaseBuffer(srcBuf);
+			continue;
+		}
+
+		/* Use P_NEW to extend the destination relation. */
+		dstBuf = ReadBufferWithoutRelcache(dst->rd_node, forkNum, P_NEW,
+										   RBM_NORMAL, bstrategy_dst,
+										   permanent);
+		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
+
+		START_CRIT_SECTION();
+
+		/* Copy page data from the source to the destination. */
+		dstPage = BufferGetPage(dstBuf);
+		memcpy(dstPage, srcPage, BLCKSZ);
+		MarkBufferDirty(dstBuf);
+
+		/* WAL-log the copied page. */
+		if (use_wal)
+			log_newpage_buffer(dstBuf, true);
+
+		END_CRIT_SECTION();
+
+		UnlockReleaseBuffer(dstBuf);
+		ReleaseBuffer(srcBuf);
+	}
+}
+
+/* ---------------------------------------------------------------------
+ *		CreateAndCopyRelationData
+ *
+ *		Create destination relation storage and copy all forks from the
+ *		source relation to the destination.
+ *
+ *		Pass permanent as true for permanent relations and false for
+ *		unlogged relations.  Currently this API is not supported for
+ *		temporary relations.
+ * --------------------------------------------------------------------
+ */
+void
+CreateAndCopyRelationData(RelFileNode src_rnode, RelFileNode dst_rnode,
+						  bool permanent)
+{
+	Relation		src_rel;
+	Relation		dst_rel;
+
+	/*
+	 * Prepare fake relcache entries for the source and the destination.  It
+	 * is safe to use the fake relcache here because we are only going to
+	 * access the fields related to the physical storage.  We are using the
+	 * fake relcache entry only because it isn't safe to hold the smgr
+	 * pointers; for more details, refer to the comments atop RelationGetSmgr.
+	 */
+	src_rel = CreateFakeRelcacheEntry(src_rnode);
+	dst_rel = CreateFakeRelcacheEntry(dst_rnode);
+
+	/*
+	 * Create and copy all forks of the relation.  We are not using
+	 * RelationCreateStorage() as it is registering the cleanup for the
+	 * underlying relation storage on the transaction abort.  But during create
+	 * database failure, we have a separate cleanup mechanism for the whole
+	 * database directory. Therefore, we don't need to register cleanup for
+	 * each individual relation storage.
+	 */
+	smgrcreate(RelationGetSmgr(dst_rel), MAIN_FORKNUM, false);
+	if (permanent)
+		log_smgrcreate(&dst_rnode, MAIN_FORKNUM);
+
+	/* copy main fork. */
+	RelationCopyStorageUsingBuffer(src_rel, dst_rel, MAIN_FORKNUM, permanent);
+
+	/* copy those extra forks that exist */
+	for (ForkNumber forkNum = MAIN_FORKNUM + 1;
+		 forkNum <= MAX_FORKNUM; forkNum++)
+	{
+		if (smgrexists(RelationGetSmgr(src_rel), forkNum))
+		{
+			smgrcreate(RelationGetSmgr(dst_rel), forkNum, false);
+
+			/*
+			 * WAL log creation if the relation is persistent, or this is the
+			 * init fork of an unlogged relation.
+			 */
+			if (permanent || forkNum == INIT_FORKNUM)
+				log_smgrcreate(&dst_rnode, forkNum);
+
+			/* Copy a fork's data, block by block. */
+			RelationCopyStorageUsingBuffer(src_rel, dst_rel, forkNum,
+										   permanent);
+		}
+	}
+
+	/* Release fake relcache entries. */
+	FreeFakeRelcacheEntry(src_rel);
+	FreeFakeRelcacheEntry(dst_rel);
+}
+
+/* ---------------------------------------------------------------------
  *		FlushDatabaseBuffers
  *
  *		This function writes all dirty pages of a database out to disk
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 5ae52dd..1543da6 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -176,6 +176,34 @@ ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode)
 }
 
 /*
+ *		LockRelationId
+ *
+ * Lock, given a LockRelId.  Same as LockRelationOid but takes a LockRelId as
+ * input.
+ */
+void
+LockRelationId(LockRelId *relid, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+	LOCALLOCK  *locallock;
+	LockAcquireResult res;
+
+	SET_LOCKTAG_RELATION(tag, relid->dbId, relid->relId);
+
+	res = LockAcquireExtended(&tag, lockmode, false, false, true, &locallock);
+
+	/*
+	 * Now that we have the lock, check for invalidation messages; see notes
+	 * in LockRelationOid.
+	 */
+	if (res != LOCKACQUIRE_ALREADY_CLEAR)
+	{
+		AcceptInvalidationMessages();
+		MarkLockClear(locallock);
+	}
+}
+
+/*
  *		UnlockRelationId
  *
  * Unlock, given a LockRelId.  This is preferred over UnlockRelationOid
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index ff46a0e..1c8aba4 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -705,6 +705,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_TWOPHASE_FILE_WRITE:
 			event_name = "TwophaseFileWrite";
 			break;
+		case WAIT_EVENT_VERSION_FILE_WRITE:
+			event_name = "VersionFileWrite";
+			break;
 		case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ:
 			event_name = "WALSenderTimelineHistoryRead";
 			break;
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 4d0718f..dee3387 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -252,6 +252,63 @@ RelationMapFilenodeToOid(Oid filenode, bool shared)
 }
 
 /*
+ * RelationMapOidToFilenodeForDatabase
+ *
+ * Like RelationMapOidToFilenode, but reads the mapping from the indicated
+ * path instead of using the one for the current database.
+ */
+Oid
+RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId)
+{
+	RelMapFile	map;
+	int			i;
+
+	/* Read the relmap file from the source database. */
+	read_relmap_file(&map, dbpath, false, ERROR);
+
+	/* Iterate over the relmap entries to find the input relation OID. */
+	for (i = 0; i < map.num_mappings; i++)
+	{
+		if (relationId == map.mappings[i].mapoid)
+			return map.mappings[i].mapfilenode;
+	}
+
+	return InvalidOid;
+}
+
+/*
+ * RelationMapCopy
+ *
+ * Copy relmapfile from source db path to the destination db path and WAL log
+ * the operation. This is intended for use in creating a new relmap file
+ * for a database that doesn't have one yet, not for replacing an existing
+ * relmap file.
+ */
+void
+RelationMapCopy(Oid dbid, Oid tsid, char *srcdbpath, char *dstdbpath)
+{
+	RelMapFile map;
+
+	/*
+	 * Read the relmap file from the source database.
+	 */
+	read_relmap_file(&map, srcdbpath, false, ERROR);
+
+	/*
+	 * Write the same data into the destination database's relmap file.
+	 *
+	 * No sinval is needed because no one can be connected to the destination
+	 * database yet. For the same reason, there is no need to acquire
+	 * RelationMappingLock.
+	 *
+	 * There's no point in trying to preserve files here. The new database
+	 * isn't usable yet anyway, and won't ever be if we can't install a
+	 * relmap file.
+	 */
+	write_relmap_file(&map, true, false, false, dbid, tsid, dstdbpath);
+}
+
+/*
  * RelationMapUpdateMap
  *
  * Install a new relfilenode mapping for the specified relation.
@@ -1031,6 +1088,13 @@ relmap_redo(XLogReaderState *record)
 		 *
 		 * There shouldn't be anyone else updating relmaps during WAL replay,
 		 * but grab the lock to interlock against load_relmap_file().
+		 *
+		 * Note that we use the same WAL record for updating the relmap of
+		 * an existing database as we do for creating a new database. In
+		 * the latter case, taking the relmap lock and sending sinval messages
+		 * is unnecessary, but harmless. If we wanted to avoid it, we could
+		 * add a flag to the WAL record to indicate which operation is being
+		 * performed.
 		 */
 		LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
 		write_relmap_file(&newmap, false, true, false,
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 7cfa169..bd1ec42 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -370,7 +370,7 @@ extractPageInfo(XLogReaderState *record)
 
 	/* Is this a special record type that I recognize? */
 
-	if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_CREATE)
+	if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_CREATE_FILE_COPY)
 	{
 		/*
 		 * New databases can be safely ignored. It won't be present in the
@@ -382,6 +382,13 @@ extractPageInfo(XLogReaderState *record)
 		 * overwriting the database created in the target system.
 		 */
 	}
+	else if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_CREATE_WAL_LOG)
+	{
+		/*
+		 * New databases can be safely ignored. It won't be present in the
+		 * source system, so it will be deleted.
+		 */
+	}
 	else if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_DROP)
 	{
 		/*
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 5c06459..baabf98 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2776,13 +2776,15 @@ psql_completion(const char *text, int start, int end)
 	/* CREATE DATABASE */
 	else if (Matches("CREATE", "DATABASE", MatchAny))
 		COMPLETE_WITH("OWNER", "TEMPLATE", "ENCODING", "TABLESPACE",
-					  "IS_TEMPLATE",
+					  "IS_TEMPLATE", "STRATEGY",
 					  "ALLOW_CONNECTIONS", "CONNECTION LIMIT",
 					  "LC_COLLATE", "LC_CTYPE", "LOCALE", "OID",
 					  "LOCALE_PROVIDER", "ICU_LOCALE");
 
 	else if (Matches("CREATE", "DATABASE", MatchAny, "TEMPLATE"))
 		COMPLETE_WITH_QUERY(Query_for_list_of_template_databases);
+	else if (Matches("CREATE", "DATABASE", MatchAny, "STRATEGY"))
+		COMPLETE_WITH("WAL_LOG", "FILE_COPY");
 
 	/* CREATE DOMAIN */
 	else if (Matches("CREATE", "DOMAIN", MatchAny))
diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c
index 6f612ab..0bffa2f 100644
--- a/src/bin/scripts/createdb.c
+++ b/src/bin/scripts/createdb.c
@@ -34,6 +34,7 @@ main(int argc, char *argv[])
 		{"tablespace", required_argument, NULL, 'D'},
 		{"template", required_argument, NULL, 'T'},
 		{"encoding", required_argument, NULL, 'E'},
+		{"strategy", required_argument, NULL, 'S'},
 		{"lc-collate", required_argument, NULL, 1},
 		{"lc-ctype", required_argument, NULL, 2},
 		{"locale", required_argument, NULL, 'l'},
@@ -60,6 +61,7 @@ main(int argc, char *argv[])
 	char	   *tablespace = NULL;
 	char	   *template = NULL;
 	char	   *encoding = NULL;
+	char	   *strategy = NULL;
 	char	   *lc_collate = NULL;
 	char	   *lc_ctype = NULL;
 	char	   *locale = NULL;
@@ -77,7 +79,7 @@ main(int argc, char *argv[])
 
 	handle_help_version_opts(argc, argv, "createdb", help);
 
-	while ((c = getopt_long(argc, argv, "h:p:U:wWeO:D:T:E:l:", long_options, &optindex)) != -1)
+	while ((c = getopt_long(argc, argv, "h:p:U:wWeO:D:T:E:l:S:", long_options, &optindex)) != -1)
 	{
 		switch (c)
 		{
@@ -111,6 +113,9 @@ main(int argc, char *argv[])
 			case 'E':
 				encoding = pg_strdup(optarg);
 				break;
+			case 'S':
+				strategy = pg_strdup(optarg);
+				break;
 			case 1:
 				lc_collate = pg_strdup(optarg);
 				break;
@@ -215,6 +220,8 @@ main(int argc, char *argv[])
 		appendPQExpBufferStr(&sql, " ENCODING ");
 		appendStringLiteralConn(&sql, encoding, conn);
 	}
+	if (strategy)
+		appendPQExpBuffer(&sql, " STRATEGY %s", fmtId(strategy));
 	if (template)
 		appendPQExpBuffer(&sql, " TEMPLATE %s", fmtId(template));
 	if (lc_collate)
@@ -294,6 +301,7 @@ help(const char *progname)
 	printf(_("      --locale-provider={libc|icu}\n"
 			 "                               locale provider for the database's default collation\n"));
 	printf(_("  -O, --owner=OWNER            database user to own the new database\n"));
+	printf(_("  -S, --strategy=STRATEGY      database creation strategy wal_log or file_copy\n"));
 	printf(_("  -T, --template=TEMPLATE      template database to copy\n"));
 	printf(_("  -V, --version                output version information, then exit\n"));
 	printf(_("  -?, --help                   show this help, then exit\n"));
diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index 35deec9..44d3c6d 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -104,4 +104,24 @@ $node->command_checks_all(
 	],
 	'createdb with incorrect --lc-ctype');
 
+$node->command_checks_all(
+	[ 'createdb', '--strategy', "foo", 'foobar2' ],
+	1,
+	[qr/^$/],
+	[
+		qr/^createdb: error: database creation failed: ERROR:  invalid create database strategy|^createdb: error: database creation failed: ERROR:  invalid create database strategy foo/s
+	],
+	'createdb with incorrect --strategy');
+
+# Check database creation strategy
+$node->issues_sql_like(
+	[ 'createdb', '-T', 'foobar2', 'foobar4', '-S', 'wal_log'],
+	qr/statement: CREATE DATABASE foobar4 STRATEGY wal_log TEMPLATE foobar2/,
+	'create database with WAL_LOG strategy');
+
+$node->issues_sql_like(
+	[ 'createdb', '-T', 'foobar2', 'foobar5', '-S', 'file_copy'],
+	qr/statement: CREATE DATABASE foobar5 STRATEGY file_copy TEMPLATE foobar2/,
+	'create database with FILE_COPY strategy');
+
 done_testing();
diff --git a/src/include/commands/dbcommands_xlog.h b/src/include/commands/dbcommands_xlog.h
index 593a857..0ee2452 100644
--- a/src/include/commands/dbcommands_xlog.h
+++ b/src/include/commands/dbcommands_xlog.h
@@ -18,17 +18,32 @@
 #include "lib/stringinfo.h"
 
 /* record types */
-#define XLOG_DBASE_CREATE		0x00
-#define XLOG_DBASE_DROP			0x10
+#define XLOG_DBASE_CREATE_FILE_COPY		0x00
+#define XLOG_DBASE_CREATE_WAL_LOG		0x10
+#define XLOG_DBASE_DROP					0x20
 
-typedef struct xl_dbase_create_rec
+/*
+ * Single WAL record for an entire CREATE DATABASE operation. This is used
+ * by the FILE_COPY strategy.
+ */
+typedef struct xl_dbase_create_file_copy_rec
 {
-	/* Records copying of a single subdirectory incl. contents */
 	Oid			db_id;
 	Oid			tablespace_id;
 	Oid			src_db_id;
 	Oid			src_tablespace_id;
-} xl_dbase_create_rec;
+} xl_dbase_create_file_copy_rec;
+
+/*
+ * WAL record for the beginning of a CREATE DATABASE operation, when the
+ * WAL_LOG strategy is used. Each individual block will be logged separately
+ * afterward.
+ */
+typedef struct xl_dbase_create_wal_log_rec
+{
+	Oid			db_id;
+	Oid			tablespace_id;
+} xl_dbase_create_wal_log_rec;
 
 typedef struct xl_dbase_drop_rec
 {
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index dd01841..a6b657f 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -184,7 +184,8 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BufferAccessStrategy strategy);
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 										ForkNumber forkNum, BlockNumber blockNum,
-										ReadBufferMode mode, BufferAccessStrategy strategy);
+										ReadBufferMode mode, BufferAccessStrategy strategy,
+										bool permanent);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
@@ -203,6 +204,9 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
 extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels);
+extern void CreateAndCopyRelationData(RelFileNode src_rnode,
+									  RelFileNode dst_rnode,
+									  bool permanent);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
 								   int nforks, BlockNumber *firstDelBlock);
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 49edbcc..be1d2c9 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -38,6 +38,7 @@ extern void RelationInitLockInfo(Relation relation);
 
 /* Lock a relation */
 extern void LockRelationOid(Oid relid, LOCKMODE lockmode);
+extern void LockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern bool ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode);
 extern void UnlockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern void UnlockRelationOid(Oid relid, LOCKMODE lockmode);
diff --git a/src/include/utils/relmapper.h b/src/include/utils/relmapper.h
index 9fbb5a7..f10353e 100644
--- a/src/include/utils/relmapper.h
+++ b/src/include/utils/relmapper.h
@@ -38,7 +38,9 @@ typedef struct xl_relmap_update
 extern Oid	RelationMapOidToFilenode(Oid relationId, bool shared);
 
 extern Oid	RelationMapFilenodeToOid(Oid relationId, bool shared);
-
+extern Oid RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId);
+extern void RelationMapCopy(Oid dbid, Oid tsid, char *srcdbpath,
+							char *dstdbpath);
 extern void RelationMapUpdateMap(Oid relationId, Oid fileNode, bool shared,
 								 bool immediate);
 
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 1c39ce0..d870c59 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -218,6 +218,7 @@ typedef enum
 	WAIT_EVENT_TWOPHASE_FILE_READ,
 	WAIT_EVENT_TWOPHASE_FILE_SYNC,
 	WAIT_EVENT_TWOPHASE_FILE_WRITE,
+	WAIT_EVENT_VERSION_FILE_WRITE,
 	WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ,
 	WAIT_EVENT_WAL_BOOTSTRAP_SYNC,
 	WAIT_EVENT_WAL_BOOTSTRAP_WRITE,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 93d5190..0747205 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -460,6 +460,8 @@ CoverPos
 CreateAmStmt
 CreateCastStmt
 CreateConversionStmt
+CreateDBRelInfo
+CreateDBStrategy
 CreateDomainStmt
 CreateEnumStmt
 CreateEventTrigStmt
@@ -3701,7 +3703,8 @@ xl_btree_update
 xl_btree_vacuum
 xl_clog_truncate
 xl_commit_ts_truncate
-xl_dbase_create_rec
+xl_dbase_create_file_copy_rec
+xl_dbase_create_wal_log_rec
 xl_dbase_drop_rec
 xl_end_of_recovery
 xl_hash_add_ovfl_page
-- 
1.8.3.1

#185

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#184)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Mar 23, 2022 at 9:19 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I think directly using smgrcreate() is a better idea instead of first
registering and then unregistering it. I have made that change in
the attached patch. After this change now we can merge creating the
MAIN_FORKNUM also in the loop below where we are creating other
fork[1] with one extra condition but I think current code is in more
sync with the other code where we are doing the similar things so I
have not merged it in the loop. Please let me know if you think
otherwise.

Generally I think our practice is that we do the main fork
unconditionally (because it should always be there) and the others
only if they exist. I suggest that you make this consistent with that,
but you could do it like if (forkNum != MAIN_FORKNUM &&
!smgrexists(...)) continue if that seems nicer.
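
A minimal sketch of that loop shape, as a standalone model rather than
patch code: fork_exists() and copy_all_forks() are hypothetical stand-ins
(the real code would call smgrexists(), smgrcreate(), log_smgrcreate() and
the block copy routine), and the bitmasks exist only so the behavior can be
checked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Fork numbering mirrors PostgreSQL's ForkNumber enum. */
typedef enum
{
	MAIN_FORKNUM = 0,
	FSM_FORKNUM,
	VISIBILITYMAP_FORKNUM,
	INIT_FORKNUM
} ForkNumber;

#define MAX_FORKNUM INIT_FORKNUM

/* Hypothetical stand-in for smgrexists(): which source forks exist. */
static bool
fork_exists(const bool *src_forks, ForkNumber forkNum)
{
	return src_forks[forkNum];
}

/*
 * Walk every fork in a single loop.  The main fork is always processed;
 * any other fork is skipped when the source does not have it.  Returns a
 * bitmask of the forks that would be created and copied, and sets *logged
 * to a bitmask of the forks whose creation would be WAL-logged.
 */
static unsigned
copy_all_forks(const bool *src_forks, bool permanent, unsigned *logged)
{
	unsigned	copied = 0;

	assert(src_forks != NULL && logged != NULL);
	*logged = 0;

	for (int forkNum = MAIN_FORKNUM; forkNum <= MAX_FORKNUM; forkNum++)
	{
		if (forkNum != MAIN_FORKNUM &&
			!fork_exists(src_forks, (ForkNumber) forkNum))
			continue;

		/* The real code would create the fork and copy its blocks here. */
		copied |= 1u << forkNum;

		/* Log creation for permanent relations, or for the init fork. */
		if (permanent || forkNum == INIT_FORKNUM)
			*logged |= 1u << forkNum;
	}

	return copied;
}
```

For the main fork the logging condition reduces to "permanent", so the
single loop should behave the same as creating the main fork separately
and then looping over the rest.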

Do you think that this version handles pending syncs correctly? I
think perhaps that is overlooked.

--
Robert Haas
EDB: http://www.enterprisedb.com

#186

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Robert Haas (#185)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Mar 23, 2022 at 7:03 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 23, 2022 at 9:19 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I think directly using smgrcreate() is a better idea instead of first
registering and then unregistering it. I have made that change in
the attached patch. After this change now we can merge creating the
MAIN_FORKNUM also in the loop below where we are creating other
fork[1] with one extra condition but I think current code is in more
sync with the other code where we are doing the similar things so I
have not merged it in the loop. Please let me know if you think
otherwise.

Generally I think our practice is that we do the main fork
unconditionally (because it should always be there) and the others
only if they exist. I suggest that you make this consistent with that,
but you could do it like if (forkNum != MAIN_FORKNUM &&
!smgrexists(...)) continue if that seems nicer.

Maybe we can do that.

Do you think that this version handles pending syncs correctly? I
think perhaps that is overlooked.

Yeah, I missed that. So the options are: either we take the other
approach and call RelationPreserveStorage() after
RelationCreateStorage(), or we expose the AddPendingSync() function
from the storage layer and then conditionally use it. I think if we
are planning to expose this API then we had better rename it to
RelationAddPendingSync(). Honestly, I do not have any specific
preference here. I can try both approaches and send both, if neither
you nor anyone else has a preference?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#187

Andres Freund

andres@anarazel.de

almost 4 years ago

In reply to: Dilip Kumar (#184)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hi,

On 2022-03-23 18:49:11 +0530, Dilip Kumar wrote:

I think directly using smgrcreate() is a better idea instead of first
registering and then unregistering it. I have made that change in
the attached patch. After this change now we can merge creating the
MAIN_FORKNUM also in the loop below where we are creating other
fork[1] with one extra condition but I think current code is in more
sync with the other code where we are doing the similar things so I
have not merged it in the loop. Please let me know if you think
otherwise.

FWIW, this fails tests: https://cirrus-ci.com/build/4929662173315072
https://cirrus-ci.com/task/6651773434724352?logs=test_bin#L121
https://cirrus-ci.com/task/6088823481303040?logs=test_world#L2377

Greetings,

Andres Freund

#188

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Andres Freund (#187)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Mar 23, 2022 at 9:13 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2022-03-23 18:49:11 +0530, Dilip Kumar wrote:

I think directly using smgrcreate() is a better idea instead of first
registering and then unregistering it. I have made that change in
the attached patch. After this change now we can merge creating the
MAIN_FORKNUM also in the loop below where we are creating other
fork[1] with one extra condition but I think current code is in more
sync with the other code where we are doing the similar things so I
have not merged it in the loop. Please let me know if you think
otherwise.

FWIW, this fails tests: https://cirrus-ci.com/build/4929662173315072
https://cirrus-ci.com/task/6651773434724352?logs=test_bin#L121
https://cirrus-ci.com/task/6088823481303040?logs=test_world#L2377

Strange to see that these changes are causing a failure in the
file_copy strategy[1], because we made changes only related to the
wal_log strategy. However, I will look into this. Thanks.

[1]: Failed test 'createdb -T foobar2 foobar5 -S file_copy exit code 0'

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#189

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#186)

1 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Mar 23, 2022 at 9:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Mar 23, 2022 at 7:03 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 23, 2022 at 9:19 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I think directly using smgrcreate() is a better idea instead of first
registering and then unregistering it. I have made that change in
the attached patch. After this change now we can merge creating the
MAIN_FORKNUM also in the loop below where we are creating other
fork[1] with one extra condition but I think current code is in more
sync with the other code where we are doing the similar things so I
have not merged it in the loop. Please let me know if you think
otherwise.

Generally I think our practice is that we do the main fork
unconditionally (because it should always be there) and the others
only if they exist. I suggest that you make this consistent with that,
but you could do it like if (forkNum != MAIN_FORKNUM &&
!smgrexists(...)) continue if that seems nicer.

Maybe we can do that.

Do you think that this version handles pending syncs correctly? I
think perhaps that is overlooked.

Yeah, I missed that. So the options are: either we take the other
approach and call RelationPreserveStorage() after
RelationCreateStorage(),

Here is the patch with this approach. I am not sending both patches
in the same mail, because I think cfbot might hit conflicts while
applying them, so I will send the other approach in a separate mail.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v4-0001-Add-new-block-by-block-strategy-for-CREATE-DATABA.patch (text/x-patch; charset=US-ASCII)

From 2091047bf391424528c685059268cc8b1212a2dc Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 22 Mar 2022 11:22:26 -0400
Subject: [PATCH v4] Add new block-by-block strategy for CREATE DATABASE.

Because this strategy logs changes on a block-by-block basis, it
avoids the need to checkpoint before and after the operation.
However, because it logs each changed block individually, it might
generate a lot of extra write-ahead logging if the template database
is large. Therefore, the older strategy remains available via a new
STRATEGY parameter to CREATE DATABASE, and a corresponding --strategy
option to createdb.

Somewhat controversially, this patch assembles the list of relations
to be copied to the new database by reading the pg_class relation of
the template database. Cross-database access like this isn't normally
possible, but it can be made to work here because there can't be any
connections to the database being copied, nor can it contain any
in-doubt transactions. Even so, we have to use lower-level interfaces
than normal, since the table scan and relcache interfaces will not
work for a database to which we're not connected. The advantage of
this approach is that we do not need to rely on the filesystem to
determine what ought to be copied, but instead on PostgreSQL's own
knowledge of the database structure. This avoids, for example,
copying stray files that happen to be located in the source database
directory.

Dilip Kumar, with a fairly large number of cosmetic changes by me.
---
 contrib/bloom/blinsert.c                 |   2 +-
 doc/src/sgml/ref/create_database.sgml    |  22 +
 doc/src/sgml/ref/createdb.sgml           |  11 +
 src/backend/access/heap/heapam_handler.c |   2 +-
 src/backend/access/nbtree/nbtree.c       |   2 +-
 src/backend/access/rmgrdesc/dbasedesc.c  |  20 +-
 src/backend/access/transam/xlogutils.c   |   6 +-
 src/backend/commands/dbcommands.c        | 761 ++++++++++++++++++++++++++-----
 src/backend/storage/buffer/bufmgr.c      | 181 +++++++-
 src/backend/storage/lmgr/lmgr.c          |  28 ++
 src/backend/utils/activity/wait_event.c  |   3 +
 src/backend/utils/cache/relmapper.c      |  64 +++
 src/bin/pg_rewind/parsexlog.c            |   9 +-
 src/bin/psql/tab-complete.c              |   4 +-
 src/bin/scripts/createdb.c               |  10 +-
 src/bin/scripts/t/020_createdb.pl        |  20 +
 src/include/commands/dbcommands_xlog.h   |  25 +-
 src/include/storage/bufmgr.h             |   6 +-
 src/include/storage/lmgr.h               |   1 +
 src/include/utils/relmapper.h            |   4 +-
 src/include/utils/wait_event.h           |   1 +
 src/tools/pgindent/typedefs.list         |   5 +-
 22 files changed, 1048 insertions(+), 139 deletions(-)

diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index c94cf34..82378db 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -173,7 +173,7 @@ blbuildempty(Relation index)
 	 * Write the page and log it.  It might seem that an immediate sync would
 	 * be sufficient to guarantee that the file exists on disk, but recovery
 	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
+	 * XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
 	PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index 5ae785a..255ad3a 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -25,6 +25,7 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
     [ [ WITH ] [ OWNER [=] <replaceable class="parameter">user_name</replaceable> ]
            [ TEMPLATE [=] <replaceable class="parameter">template</replaceable> ]
            [ ENCODING [=] <replaceable class="parameter">encoding</replaceable> ]
+           [ STRATEGY [=] <replaceable class="parameter">strategy</replaceable> ]
            [ LOCALE [=] <replaceable class="parameter">locale</replaceable> ]
            [ LC_COLLATE [=] <replaceable class="parameter">lc_collate</replaceable> ]
            [ LC_CTYPE [=] <replaceable class="parameter">lc_ctype</replaceable> ]
@@ -118,6 +119,27 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
        </para>
       </listitem>
      </varlistentry>
+     <varlistentry id="create-database-strategy" xreflabel="CREATE DATABASE STRATEGY">
+      <term><replaceable class="parameter">strategy</replaceable></term>
+      <listitem>
+       <para>
+        Strategy to be used in creating the new database.  If
+        the <literal>WAL_LOG</literal> strategy is used, the database will be
+        copied block by block and each block will be separately written
+        to the write-ahead log. This is the most efficient strategy in
+        cases where the template database is small, and therefore it is the
+        default. The older <literal>FILE_COPY</literal> strategy is also
+        available. This strategy writes a small record to the write-ahead log
+        for each tablespace used by the target database. Each such record
+        represents copying an entire directory to a new location at the
+        filesystem level. While this does reduce the write-ahead
+        log volume substantially, especially if the template database is large,
+        it also forces the system to perform a checkpoint both before and
+        after the creation of the new database. In some situations, this may
+        have a noticeable negative impact on overall system performance.
+       </para>
+      </listitem>
+     </varlistentry>
      <varlistentry>
       <term><replaceable class="parameter">locale</replaceable></term>
       <listitem>
diff --git a/doc/src/sgml/ref/createdb.sgml b/doc/src/sgml/ref/createdb.sgml
index be42e50..671cd362 100644
--- a/doc/src/sgml/ref/createdb.sgml
+++ b/doc/src/sgml/ref/createdb.sgml
@@ -178,6 +178,17 @@ PostgreSQL documentation
      </varlistentry>
 
      <varlistentry>
+      <term><option>-S <replaceable class="parameter">strategy</replaceable></option></term>
+      <term><option>--strategy=<replaceable class="parameter">strategy</replaceable></option></term>
+      <listitem>
+       <para>
+        Specifies the database creation strategy.  See
+        <xref linkend="create-database-strategy" /> for more details.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><option>-T <replaceable class="parameter">template</replaceable></option></term>
       <term><option>--template=<replaceable class="parameter">template</replaceable></option></term>
       <listitem>
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 39ef8a0..2b70ca0 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -601,7 +601,7 @@ heapam_relation_set_new_filenode(Relation rel,
 	 * even if the page has been logged, because the write did not go through
 	 * shared_buffers and therefore a concurrent checkpoint may have moved the
 	 * redo pointer past our xlog record.  Recovery may as well remove it
-	 * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
+	 * while replaying, for example, XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE
 	 * record. Therefore, logging is necessary even if wal_level=minimal.
 	 */
 	if (persistence == RELPERSISTENCE_UNLOGGED)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c9b4964..dacf3f7 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -161,7 +161,7 @@ btbuildempty(Relation index)
 	 * Write the page and log it.  It might seem that an immediate sync would
 	 * be sufficient to guarantee that the file exists on disk, but recovery
 	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
+	 * XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
 	PageSetChecksumInplace(metapage, BTREE_METAPAGE);
diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index 03af3fd..523d0b3 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -24,14 +24,23 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 	char	   *rec = XLogRecGetData(record);
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
 
-	if (info == XLOG_DBASE_CREATE)
+	if (info == XLOG_DBASE_CREATE_FILE_COPY)
 	{
-		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;
+		xl_dbase_create_file_copy_rec *xlrec =
+		(xl_dbase_create_file_copy_rec *) rec;
 
 		appendStringInfo(buf, "copy dir %u/%u to %u/%u",
 						 xlrec->src_tablespace_id, xlrec->src_db_id,
 						 xlrec->tablespace_id, xlrec->db_id);
 	}
+	else if (info == XLOG_DBASE_CREATE_WAL_LOG)
+	{
+		xl_dbase_create_wal_log_rec *xlrec =
+		(xl_dbase_create_wal_log_rec *) rec;
+
+		appendStringInfo(buf, "create dir %u/%u",
+						 xlrec->tablespace_id, xlrec->db_id);
+	}
 	else if (info == XLOG_DBASE_DROP)
 	{
 		xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) rec;
@@ -51,8 +60,11 @@ dbase_identify(uint8 info)
 
 	switch (info & ~XLR_INFO_MASK)
 	{
-		case XLOG_DBASE_CREATE:
-			id = "CREATE";
+		case XLOG_DBASE_CREATE_FILE_COPY:
+			id = "CREATE_FILE_COPY";
+			break;
+		case XLOG_DBASE_CREATE_WAL_LOG:
+			id = "CREATE_WAL_LOG";
 			break;
 		case XLOG_DBASE_DROP:
 			id = "DROP";
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 511f2f1..a4dedc5 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -484,7 +484,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	{
 		/* page exists in file */
 		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-										   mode, NULL);
+										   mode, NULL, true);
 	}
 	else
 	{
@@ -509,7 +509,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				ReleaseBuffer(buffer);
 			}
 			buffer = ReadBufferWithoutRelcache(rnode, forknum,
-											   P_NEW, mode, NULL);
+											   P_NEW, mode, NULL, true);
 		}
 		while (BufferGetBlockNumber(buffer) < blkno);
 		/* Handle the corner case that P_NEW returns non-consecutive pages */
@@ -519,7 +519,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			ReleaseBuffer(buffer);
 			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-											   mode, NULL);
+											   mode, NULL, true);
 		}
 	}
 
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 623e5ec..02a096c 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -63,13 +63,31 @@
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/pg_locale.h"
+#include "utils/relmapper.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
+/*
+ * Create database strategy.
+ *
+ * CREATEDB_WAL_LOG will copy the database at the block level and WAL log each
+ * copied block.
+ *
+ * CREATEDB_FILE_COPY will simply perform a file system level copy of the
+ * database and log a single record for each tablespace copied. To make this
+ * safe, it also triggers checkpoints before and after the operation.
+ */
+typedef enum CreateDBStrategy
+{
+	CREATEDB_WAL_LOG,
+	CREATEDB_FILE_COPY
+} CreateDBStrategy;
+
 typedef struct
 {
 	Oid			src_dboid;		/* source (template) DB */
 	Oid			dest_dboid;		/* DB we are trying to create */
+	CreateDBStrategy strategy;	/* create db strategy */
 } createdb_failure_params;
 
 typedef struct
@@ -78,6 +96,17 @@ typedef struct
 	Oid			dest_tsoid;		/* tablespace we are trying to move to */
 } movedb_failure_params;
 
+/*
+ * Information about a relation to be copied when creating a database.
+ */
+typedef struct CreateDBRelInfo
+{
+	RelFileNode rnode;			/* physical relation identifier */
+	Oid			reloid;			/* relation oid */
+	bool		permanent;		/* relation is permanent or unlogged */
+} CreateDBRelInfo;
+
+
 /* non-export function prototypes */
 static void createdb_failure_callback(int code, Datum arg);
 static void movedb(const char *dbname, const char *tblspcname);
@@ -93,7 +122,540 @@ static bool have_createdb_privilege(void);
 static void remove_dbtablespaces(Oid db_id);
 static bool check_db_file_conflict(Oid db_id);
 static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
+static void CreateDatabaseUsingWalLog(Oid src_dboid, Oid dboid, Oid src_tsid,
+									  Oid dst_tsid);
+static List *ScanSourceDatabasePgClass(Oid srctbid, Oid srcdbid, char *srcpath);
+static List *ScanSourceDatabasePgClassPage(Page page, Buffer buf, Oid tbid,
+										   Oid dbid, char *srcpath,
+										   List *rnodelist, Snapshot snapshot);
+static CreateDBRelInfo *ScanSourceDatabasePgClassTuple(HeapTupleData *tuple,
+													   Oid tbid, Oid dbid,
+													   char *srcpath);
+static void CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid,
+									bool isRedo);
+static void CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dboid, Oid src_tsid,
+										Oid dst_tsid);
+
+/*
+ * Create a new database using the WAL_LOG strategy.
+ *
+ * Each copied block is separately written to the write-ahead log.
+ */
+static void
+CreateDatabaseUsingWalLog(Oid src_dboid, Oid dst_dboid,
+						  Oid src_tsid, Oid dst_tsid)
+{
+	char	   *srcpath;
+	char	   *dstpath;
+	List	   *rnodelist = NULL;
+	ListCell   *cell;
+	LockRelId	srcrelid;
+	LockRelId	dstrelid;
+	RelFileNode srcrnode;
+	RelFileNode dstrnode;
+	CreateDBRelInfo *relinfo;
+
+	/* Get source and destination database paths. */
+	srcpath = GetDatabasePath(src_dboid, src_tsid);
+	dstpath = GetDatabasePath(dst_dboid, dst_tsid);
+
+	/* Create database directory and write PG_VERSION file. */
+	CreateDirAndVersionFile(dstpath, dst_dboid, dst_tsid, false);
+
+	/* Copy relmap file from source database to the destination database. */
+	RelationMapCopy(dst_dboid, dst_tsid, srcpath, dstpath);
+
+	/* Get list of relfilenodes to copy from the source database. */
+	rnodelist = ScanSourceDatabasePgClass(src_tsid, src_dboid, srcpath);
+	Assert(rnodelist != NIL);
+
+	/*
+	 * The database IDs are the same for every relation, so set them once
+	 * before entering the loop.
+	 */
+	srcrelid.dbId = src_dboid;
+	dstrelid.dbId = dst_dboid;
+
+	/* Loop over our list of relfilenodes and copy each one. */
+	foreach(cell, rnodelist)
+	{
+		relinfo = lfirst(cell);
+		srcrnode = relinfo->rnode;
+
+		/*
+		 * If the relation is from the source db's default tablespace then we
+		 * need to create it in the destination db's default tablespace.
+		 * Otherwise, we need to create it in the same tablespace as it is in
+		 * the source database.
+		 */
+		if (srcrnode.spcNode == src_tsid)
+			dstrnode.spcNode = dst_tsid;
+		else
+			dstrnode.spcNode = srcrnode.spcNode;
 
+		dstrnode.dbNode = dst_dboid;
+		dstrnode.relNode = srcrnode.relNode;
+
+		/* Acquire locks on source and target relations before copying. */
+		dstrelid.relId = srcrelid.relId = relinfo->reloid;
+		LockRelationId(&srcrelid, AccessShareLock);
+		LockRelationId(&dstrelid, AccessShareLock);
+
+		/* Copy relation storage from source to the destination. */
+		CreateAndCopyRelationData(srcrnode, dstrnode, relinfo->permanent);
+
+		/* Release the locks. */
+		UnlockRelationId(&srcrelid, AccessShareLock);
+		UnlockRelationId(&dstrelid, AccessShareLock);
+	}
+
+	list_free_deep(rnodelist);
+}
+
+/*
+ * Scan the pg_class table in the source database to identify the relations
+ * that need to be copied to the destination database.
+ *
+ * This is an exception to the usual rule that cross-database access is
+ * not possible. We can make it work here because we know that there are no
+ * connections to the source database and (since there can't be prepared
+ * transactions touching that database) no in-doubt tuples either. This
+ * means that we don't need to worry about pruning removing anything from
+ * under us, and we don't need to be too picky about our snapshot either.
+ * As long as it sees all previously-committed XIDs as committed and all
+ * aborted XIDs as aborted, we should be fine: nothing else is possible
+ * here.
+ *
+ * We can't rely on the relcache for anything here, because that only knows
+ * about the database to which we are connected, and can't handle access to
+ * other databases. That also means we can't rely on the heap scan
+ * infrastructure, which would be a bad idea anyway since it might try
+ * to do things like HOT pruning which we definitely can't do safely in
+ * a database to which we're not even connected.
+ */
+static List *
+ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
+{
+	RelFileNode rnode;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	Buffer		buf;
+	Oid			relfilenode;
+	Page		page;
+	List	   *rnodelist = NIL;
+	LockRelId	relid;
+	Relation	rel;
+	Snapshot	snapshot;
+	BufferAccessStrategy bstrategy;
+
+	/* Get pg_class relfilenode. */
+	relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+													  RelationRelationId);
+
+	/*
+	 * The system elsewhere assumes that we only read data for a relation
+	 * into shared_buffers while holding some sort of a lock on a relation,
+	 * so lock the source database's pg_class before we do anything else.
+	 */
+	relid.dbId = dbid;
+	relid.relId = RelationRelationId;
+	LockRelationId(&relid, AccessShareLock);
+
+	/* Prepare a RelFileNode for the pg_class relation. */
+	rnode.spcNode = tbid;
+	rnode.dbNode = dbid;
+	rnode.relNode = relfilenode;
+
+	/*
+	 * Create a fake relcache entry for the pg_class relation and get the
+	 * number of blocks.  Refer to the comments in CreateAndCopyRelationData()
+	 * for the rationale behind using the fake relcache entry.
+	 */
+	rel = CreateFakeRelcacheEntry(rnode);
+	nblocks = smgrnblocks(RelationGetSmgr(rel), MAIN_FORKNUM);
+	FreeFakeRelcacheEntry(rel);
+
+	/* Use a buffer access strategy since this is a bulk read operation. */
+	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	/*
+	 * As explained in the function header comments, we need a snapshot that
+	 * will see all committed transactions as committed, and our transaction
+	 * snapshot - or the active snapshot - might not be new enough for that,
+	 * but the return value of GetLatestSnapshot() should work fine.
+	 */
+	snapshot = GetLatestSnapshot();
+
+	/* Process the relation block by block. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
+										RBM_NORMAL, bstrategy, false);
+
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buf);
+		if (PageIsNew(page) || PageIsEmpty(page))
+		{
+			UnlockReleaseBuffer(buf);
+			continue;
+		}
+
+		/* Append relevant pg_class tuples for current page to rnodelist. */
+		rnodelist = ScanSourceDatabasePgClassPage(page, buf, tbid, dbid,
+												  srcpath, rnodelist,
+												  snapshot);
+
+		UnlockReleaseBuffer(buf);
+	}
+
+	/* Release relation lock. */
+	UnlockRelationId(&relid, AccessShareLock);
+
+	return rnodelist;
+}
+
+/*
+ * Scan one page of the source database's pg_class relation and add relevant
+ * entries to rnodelist. The return value is the updated list.
+ */
+static List *
+ScanSourceDatabasePgClassPage(Page page, Buffer buf, Oid tbid, Oid dbid,
+							  char *srcpath, List *rnodelist,
+							  Snapshot snapshot)
+{
+	BlockNumber		blkno = BufferGetBlockNumber(buf);
+	OffsetNumber	offnum;
+	OffsetNumber	maxoff;
+	HeapTupleData	tuple;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Loop over offsets. */
+	for (offnum = FirstOffsetNumber;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid;
+
+		itemid = PageGetItemId(page, offnum);
+
+		/* Nothing to do if slot is empty, dead, or redirected. */
+		if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid) ||
+			ItemIdIsRedirected(itemid))
+			continue;
+
+		Assert(ItemIdIsNormal(itemid));
+		ItemPointerSet(&(tuple.t_self), blkno, offnum);
+
+		/* Initialize a HeapTupleData structure. */
+		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+		tuple.t_len = ItemIdGetLength(itemid);
+		tuple.t_tableOid = RelationRelationId;
+
+		/* Skip tuples that are not visible to this snapshot. */
+		if (HeapTupleSatisfiesVisibility(&tuple, snapshot, buf))
+		{
+			CreateDBRelInfo *relinfo;
+
+			/*
+			 * ScanSourceDatabasePgClassTuple is in charge of constructing
+			 * a CreateDBRelInfo object for this tuple, but can also decide
+			 * that this tuple isn't something we need to copy. If we do need
+			 * to copy the relation, add it to the list.
+			 */
+			relinfo = ScanSourceDatabasePgClassTuple(&tuple, tbid, dbid,
+													 srcpath);
+			if (relinfo != NULL)
+				rnodelist = lappend(rnodelist, relinfo);
+		}
+	}
+
+	return rnodelist;
+}
+
+/*
+ * Decide whether a certain pg_class tuple represents something that
+ * needs to be copied from the source database to the destination database,
+ * and if so, construct a CreateDBRelInfo for it.
+ *
+ * Visibility checks are handled by the caller, so our job here is just
+ * to assess the data stored in the tuple.
+ */
+static CreateDBRelInfo *
+ScanSourceDatabasePgClassTuple(HeapTupleData *tuple, Oid tbid, Oid dbid,
+							   char *srcpath)
+{
+	CreateDBRelInfo	   *relinfo;
+	Form_pg_class		classForm;
+	Oid					relfilenode = InvalidOid;
+
+	classForm = (Form_pg_class) GETSTRUCT(tuple);
+
+	/*
+	 * Return NULL if this object does not need to be copied.
+	 *
+	 * Shared objects don't need to be copied, because they are shared.
+	 * Objects without storage can't be copied, because there's nothing to
+	 * copy. Temporary relations don't need to be copied either, because
+	 * they are inaccessible outside of the session that created them,
+	 * which must be gone already, and couldn't connect to a different database
+	 * if it still existed. autovacuum will eventually remove the pg_class
+	 * entries as well.
+	 */
+	if (classForm->reltablespace == GLOBALTABLESPACE_OID ||
+		!RELKIND_HAS_STORAGE(classForm->relkind) ||
+		classForm->relpersistence == RELPERSISTENCE_TEMP)
+		return NULL;
+
+	/*
+	 * If relfilenode is valid then directly use it.  Otherwise, consult the
+	 * relmap.
+	 */
+	if (OidIsValid(classForm->relfilenode))
+		relfilenode = classForm->relfilenode;
+	else
+		relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+														  classForm->oid);
+
+	/* We must have a valid relfilenode oid. */
+	if (!OidIsValid(relfilenode))
+		elog(ERROR, "relation with OID %u does not have a valid relfilenode",
+			 classForm->oid);
+
+	/* Prepare a rel info element; the caller adds it to the list. */
+	relinfo = (CreateDBRelInfo *) palloc(sizeof(CreateDBRelInfo));
+	if (OidIsValid(classForm->reltablespace))
+		relinfo->rnode.spcNode = classForm->reltablespace;
+	else
+		relinfo->rnode.spcNode = tbid;
+
+	relinfo->rnode.dbNode = dbid;
+	relinfo->rnode.relNode = relfilenode;
+	relinfo->reloid = classForm->oid;
+
+	/* Temporary relations were rejected above. */
+	Assert(classForm->relpersistence != RELPERSISTENCE_TEMP);
+	relinfo->permanent =
+		(classForm->relpersistence == RELPERSISTENCE_PERMANENT);
+
+	return relinfo;
+}
+
+/*
+ * Create database directory and write out the PG_VERSION file in the database
+ * path.  If isRedo is true, it's okay for the database directory to exist
+ * already.
+ */
+static void
+CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
+{
+	int			fd;
+	int			nbytes;
+	char		versionfile[MAXPGPATH];
+	char		buf[16];
+
+	/*
+	 * Prepare version data before starting a critical section.
+	 *
+	 * Note that we don't have to copy this from the source database; there's
+	 * only one legal value.
+	 */
+	sprintf(buf, "%s\n", PG_MAJORVERSION);
+	nbytes = strlen(PG_MAJORVERSION) + 1;
+
+	/* If we are not in WAL replay then write the WAL. */
+	if (!isRedo)
+	{
+		xl_dbase_create_wal_log_rec xlrec;
+		XLogRecPtr	lsn;
+
+		START_CRIT_SECTION();
+
+		xlrec.db_id = dbid;
+		xlrec.tablespace_id = tsid;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec),
+						 sizeof(xl_dbase_create_wal_log_rec));
+
+		lsn = XLogInsert(RM_DBASE_ID, XLOG_DBASE_CREATE_WAL_LOG);
+
+		/* As always, WAL must hit the disk before the data update does. */
+		XLogFlush(lsn);
+	}
+
+	/* Create database directory. */
+	if (MakePGDirectory(dbpath) < 0)
+	{
+		/* Failure other than already exists or not in WAL replay? */
+		if (errno != EEXIST || !isRedo)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create directory \"%s\": %m", dbpath)));
+	}
+
+	/*
+	 * Create PG_VERSION file in the database path.  If the file already
+	 * exists and we are in WAL replay then try again to open it in write
+	 * mode.
+	 */
+	snprintf(versionfile, sizeof(versionfile), "%s/%s", dbpath, "PG_VERSION");
+
+	fd = OpenTransientFile(versionfile, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	if (fd < 0 && errno == EEXIST && isRedo)
+		fd = OpenTransientFile(versionfile, O_WRONLY | O_TRUNC | PG_BINARY);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m", versionfile)));
+
+	/* Write PG_MAJORVERSION in the PG_VERSION file. */
+	pgstat_report_wait_start(WAIT_EVENT_VERSION_FILE_WRITE);
+	errno = 0;
+	if ((int) write(fd, buf, nbytes) != nbytes)
+	{
+		/* If write didn't set errno, assume problem is no disk space. */
+		if (errno == 0)
+			errno = ENOSPC;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", versionfile)));
+	}
+	pgstat_report_wait_end();
+
+	/* Close the version file. */
+	CloseTransientFile(fd);
+
+	/* Critical section done. */
+	if (!isRedo)
+		END_CRIT_SECTION();
+}
+
+/*
+ * Create a new database using the FILE_COPY strategy.
+ *
+ * Copy each tablespace at the filesystem level, and log a single WAL record
+ * for each tablespace copied.  This requires a checkpoint before and after the
+ * copy, which may be expensive, but it does greatly reduce WAL generation
+ * if the copied database is large.
+ */
+static void
+CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dst_dboid, Oid src_tsid,
+							Oid dst_tsid)
+{
+	TableScanDesc scan;
+	Relation	rel;
+	HeapTuple	tuple;
+
+	/*
+	 * Force a checkpoint before starting the copy. This will force all dirty
+	 * buffers, including those of unlogged tables, out to disk, to ensure
+	 * source database is up-to-date on disk for the copy.
+	 * FlushDatabaseBuffers() would suffice for that, but we also want to
+	 * process any pending unlink requests. Otherwise, if a checkpoint
+	 * happened while we're copying files, a file might be deleted just when
+	 * we're about to copy it, causing the lstat() call in copydir() to fail
+	 * with ENOENT.
+	 */
+	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
+					  CHECKPOINT_WAIT | CHECKPOINT_FLUSH_ALL);
+
+	/*
+	 * Iterate through all tablespaces of the template database, and copy each
+	 * one to the new database.
+	 */
+	rel = table_open(TableSpaceRelationId, AccessShareLock);
+	scan = table_beginscan_catalog(rel, 0, NULL);
+	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	{
+		Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
+		Oid			srctablespace = spaceform->oid;
+		Oid			dsttablespace;
+		char	   *srcpath;
+		char	   *dstpath;
+		struct stat st;
+
+		/* No need to copy global tablespace */
+		if (srctablespace == GLOBALTABLESPACE_OID)
+			continue;
+
+		srcpath = GetDatabasePath(src_dboid, srctablespace);
+
+		if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
+			directory_is_empty(srcpath))
+		{
+			/* Assume we can ignore it */
+			pfree(srcpath);
+			continue;
+		}
+
+		if (srctablespace == src_tsid)
+			dsttablespace = dst_tsid;
+		else
+			dsttablespace = srctablespace;
+
+		dstpath = GetDatabasePath(dst_dboid, dsttablespace);
+
+		/*
+		 * Copy this subdirectory to the new location
+		 *
+		 * We don't need to copy subdirectories
+		 */
+		copydir(srcpath, dstpath, false);
+
+		/* Record the filesystem change in XLOG */
+		{
+			xl_dbase_create_file_copy_rec xlrec;
+
+			xlrec.db_id = dst_dboid;
+			xlrec.tablespace_id = dsttablespace;
+			xlrec.src_db_id = src_dboid;
+			xlrec.src_tablespace_id = srctablespace;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec,
+							 sizeof(xl_dbase_create_file_copy_rec));
+
+			(void) XLogInsert(RM_DBASE_ID,
+							  XLOG_DBASE_CREATE_FILE_COPY | XLR_SPECIAL_REL_UPDATE);
+		}
+	}
+	table_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	/*
+	 * We force a checkpoint before committing.  This effectively means that
+	 * committed XLOG_DBASE_CREATE_FILE_COPY operations will never need to be
+	 * replayed (at least not in ordinary crash recovery; we still have to
+	 * make the XLOG entry for the benefit of PITR operations). This avoids
+	 * two nasty scenarios:
+	 *
+	 * #1: When PITR is off, we don't XLOG the contents of newly created
+	 * indexes; therefore the drop-and-recreate-whole-directory behavior of
+	 * DBASE_CREATE replay would lose such indexes.
+	 *
+	 * #2: Since we have to recopy the source database during DBASE_CREATE
+	 * replay, we run the risk of copying changes in it that were committed
+	 * after the original CREATE DATABASE command but before the system crash
+	 * that led to the replay.  This is at least unexpected and at worst could
+	 * lead to inconsistencies, eg duplicate table names.
+	 *
+	 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
+	 *
+	 * In PITR replay, the first of these isn't an issue, and the second is
+	 * only a risk if the CREATE DATABASE and subsequent template database
+	 * change both occur while a base backup is being taken. There doesn't
+	 * seem to be much we can do about that except document it as a
+	 * limitation.
+	 *
+	 * See CreateDatabaseUsingWalLog() for a less cheesy CREATE DATABASE
+	 * strategy that avoids these problems.
+	 */
+	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+}
 
 /*
  * CREATE DATABASE
@@ -101,8 +663,6 @@ static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
 Oid
 createdb(ParseState *pstate, const CreatedbStmt *stmt)
 {
-	TableScanDesc scan;
-	Relation	rel;
 	Oid			src_dboid;
 	Oid			src_owner;
 	int			src_encoding = -1;
@@ -137,6 +697,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	DefElem    *dallowconnections = NULL;
 	DefElem    *dconnlimit = NULL;
 	DefElem    *dcollversion = NULL;
+	DefElem    *dstrategy = NULL;
 	char	   *dbname = stmt->dbname;
 	char	   *dbowner = NULL;
 	const char *dbtemplate = NULL;
@@ -152,6 +713,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	char	   *dbcollversion = NULL;
 	int			notherbackends;
 	int			npreparedxacts;
+	CreateDBStrategy dbstrategy = CREATEDB_WAL_LOG;
 	createdb_failure_params fparms;
 
 	/* Extract options from the statement node tree */
@@ -269,6 +831,12 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 						(errcode(ERRCODE_INVALID_PARAMETER_VALUE)),
 						errmsg("OIDs less than %u are reserved for system objects", FirstNormalObjectId));
 		}
+		else if (strcmp(defel->defname, "strategy") == 0)
+		{
+			if (dstrategy)
+				errorConflictingDefElem(defel, pstate);
+			dstrategy = defel;
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -413,6 +981,23 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 							dbtemplate)));
 	}
 
+	/* Validate the database creation strategy. */
+	if (dstrategy && dstrategy->arg)
+	{
+		char	   *strategy;
+
+		strategy = defGetString(dstrategy);
+		if (strcmp(strategy, "wal_log") == 0)
+			dbstrategy = CREATEDB_WAL_LOG;
+		else if (strcmp(strategy, "file_copy") == 0)
+			dbstrategy = CREATEDB_FILE_COPY;
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("invalid create database strategy \"%s\"", strategy),
+					 errhint("Valid strategies are \"wal_log\" and \"file_copy\".")));
+	}
+
 	/* If encoding or locales are defaulted, use source's setting */
 	if (encoding < 0)
 		encoding = src_encoding;
@@ -753,17 +1338,16 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	InvokeObjectPostCreateHook(DatabaseRelationId, dboid, 0);
 
 	/*
-	 * Force a checkpoint before starting the copy. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, if a checkpoint
-	 * happened while we're copying files, a file might be deleted just when
-	 * we're about to copy it, causing the lstat() call in copydir() to fail
-	 * with ENOENT.
+	 * Acquire a lock on the target database.  This is a new database, so no
+	 * one else should be able to access it, but with the WAL_LOG strategy we
+	 * are going to access its relation pages through shared buffers.
+	 * Therefore, as a general principle, we should acquire the database lock
+	 * and the relation lock before accessing any shared buffers.  The
+	 * individual relation-level locks are acquired in
+	 * CreateDatabaseUsingWalLog() just before each relation is copied.
 	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
+	if (dbstrategy == CREATEDB_WAL_LOG)
+		LockSharedObject(DatabaseRelationId, dboid, 0, AccessShareLock);
 
 	/*
 	 * Once we start copying subdirectories, we need to be able to clean 'em
@@ -774,101 +1358,24 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	 */
 	fparms.src_dboid = src_dboid;
 	fparms.dest_dboid = dboid;
+	fparms.strategy = dbstrategy;
+
 	PG_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
 		/*
-		 * Iterate through all tablespaces of the template database, and copy
-		 * each one to the new database.
-		 */
-		rel = table_open(TableSpaceRelationId, AccessShareLock);
-		scan = table_beginscan_catalog(rel, 0, NULL);
-		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
-		{
-			Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
-			Oid			srctablespace = spaceform->oid;
-			Oid			dsttablespace;
-			char	   *srcpath;
-			char	   *dstpath;
-			struct stat st;
-
-			/* No need to copy global tablespace */
-			if (srctablespace == GLOBALTABLESPACE_OID)
-				continue;
-
-			srcpath = GetDatabasePath(src_dboid, srctablespace);
-
-			if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
-				directory_is_empty(srcpath))
-			{
-				/* Assume we can ignore it */
-				pfree(srcpath);
-				continue;
-			}
-
-			if (srctablespace == src_deftablespace)
-				dsttablespace = dst_deftablespace;
-			else
-				dsttablespace = srctablespace;
-
-			dstpath = GetDatabasePath(dboid, dsttablespace);
-
-			/*
-			 * Copy this subdirectory to the new location
-			 *
-			 * We don't need to copy subdirectories
-			 */
-			copydir(srcpath, dstpath, false);
-
-			/* Record the filesystem change in XLOG */
-			{
-				xl_dbase_create_rec xlrec;
-
-				xlrec.db_id = dboid;
-				xlrec.tablespace_id = dsttablespace;
-				xlrec.src_db_id = src_dboid;
-				xlrec.src_tablespace_id = srctablespace;
-
-				XLogBeginInsert();
-				XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-				(void) XLogInsert(RM_DBASE_ID,
-								  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-			}
-		}
-		table_endscan(scan);
-		table_close(rel, AccessShareLock);
-
-		/*
-		 * We force a checkpoint before committing.  This effectively means
-		 * that committed XLOG_DBASE_CREATE operations will never need to be
-		 * replayed (at least not in ordinary crash recovery; we still have to
-		 * make the XLOG entry for the benefit of PITR operations). This
-		 * avoids two nasty scenarios:
-		 *
-		 * #1: When PITR is off, we don't XLOG the contents of newly created
-		 * indexes; therefore the drop-and-recreate-whole-directory behavior
-		 * of DBASE_CREATE replay would lose such indexes.
-		 *
-		 * #2: Since we have to recopy the source database during DBASE_CREATE
-		 * replay, we run the risk of copying changes in it that were
-		 * committed after the original CREATE DATABASE command but before the
-		 * system crash that led to the replay.  This is at least unexpected
-		 * and at worst could lead to inconsistencies, eg duplicate table
-		 * names.
-		 *
-		 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
-		 *
-		 * In PITR replay, the first of these isn't an issue, and the second
-		 * is only a risk if the CREATE DATABASE and subsequent template
-		 * database change both occur while a base backup is being taken.
-		 * There doesn't seem to be much we can do about that except document
-		 * it as a limitation.
-		 *
-		 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
-		 * we can avoid this.
+		 * If the WAL_LOG strategy is in effect (it is the default), call
+		 * CreateDatabaseUsingWalLog(), which copies the database at the
+		 * block level and WAL-logs each copied block.  Otherwise, call
+		 * CreateDatabaseUsingFileCopy(), which copies the database file by
+		 * file.
 		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+		if (dbstrategy == CREATEDB_WAL_LOG)
+			CreateDatabaseUsingWalLog(src_dboid, dboid, src_deftablespace,
+									  dst_deftablespace);
+		else
+			CreateDatabaseUsingFileCopy(src_dboid, dboid, src_deftablespace,
+										dst_deftablespace);
 
 		/*
 		 * Close pg_database, but keep lock till commit.
@@ -955,6 +1462,25 @@ createdb_failure_callback(int code, Datum arg)
 	createdb_failure_params *fparms = (createdb_failure_params *) DatumGetPointer(arg);
 
 	/*
+	 * If we were copying the database at the block level, drop the pages of
+	 * the destination database that are in the shared buffer cache, and tell
+	 * the checkpointer to forget any pending fsync and unlink requests for
+	 * files in the database.  The reasoning is the same as explained in
+	 * dropdb().  Unlike dropdb(), we don't need to call
+	 * pgstat_drop_database(), because the database has not been fully
+	 * created yet, so there should not be any stats for it.
+	 */
+	if (fparms->strategy == CREATEDB_WAL_LOG)
+	{
+		DropDatabaseBuffers(fparms->dest_dboid);
+		ForgetDatabaseSyncRequests(fparms->dest_dboid);
+
+		/* Release lock on the target database. */
+		UnlockSharedObject(DatabaseRelationId, fparms->dest_dboid, 0,
+						   AccessShareLock);
+	}
+
+	/*
 	 * Release lock on source database before doing recursive remove. This is
 	 * not essential but it seems desirable to release the lock as soon as
 	 * possible.
@@ -1478,7 +2004,7 @@ movedb(const char *dbname, const char *tblspcname)
 		 * Record the filesystem change in XLOG
 		 */
 		{
-			xl_dbase_create_rec xlrec;
+			xl_dbase_create_file_copy_rec xlrec;
 
 			xlrec.db_id = db_id;
 			xlrec.tablespace_id = dst_tblspcoid;
@@ -1486,10 +2012,11 @@ movedb(const char *dbname, const char *tblspcname)
 			xlrec.src_tablespace_id = src_tblspcoid;
 
 			XLogBeginInsert();
-			XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
+			XLogRegisterData((char *) &xlrec,
+							 sizeof(xl_dbase_create_file_copy_rec));
 
 			(void) XLogInsert(RM_DBASE_ID,
-							  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
+							  XLOG_DBASE_CREATE_FILE_COPY | XLR_SPECIAL_REL_UPDATE);
 		}
 
 		/*
@@ -1525,9 +2052,10 @@ movedb(const char *dbname, const char *tblspcname)
 
 		/*
 		 * Force another checkpoint here.  As in CREATE DATABASE, this is to
-		 * ensure that we don't have to replay a committed XLOG_DBASE_CREATE
-		 * operation, which would cause us to lose any unlogged operations
-		 * done in the new DB tablespace before the next checkpoint.
+		 * ensure that we don't have to replay a committed
+		 * XLOG_DBASE_CREATE_FILE_COPY operation, which would cause us to lose
+		 * any unlogged operations done in the new DB tablespace before the
+		 * next checkpoint.
 		 */
 		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
 
@@ -2478,9 +3006,10 @@ dbase_redo(XLogReaderState *record)
 	/* Backup blocks are not used in dbase records */
 	Assert(!XLogRecHasAnyBlockRefs(record));
 
-	if (info == XLOG_DBASE_CREATE)
+	if (info == XLOG_DBASE_CREATE_FILE_COPY)
 	{
-		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
+		xl_dbase_create_file_copy_rec *xlrec =
+		(xl_dbase_create_file_copy_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
 		struct stat st;
@@ -2515,6 +3044,18 @@ dbase_redo(XLogReaderState *record)
 		 */
 		copydir(src_path, dst_path, false);
 	}
+	else if (info == XLOG_DBASE_CREATE_WAL_LOG)
+	{
+		xl_dbase_create_wal_log_rec *xlrec =
+		(xl_dbase_create_wal_log_rec *) XLogRecGetData(record);
+		char	   *dbpath;
+
+		dbpath = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+
+		/* Create the database directory with the version file. */
+		CreateDirAndVersionFile(dbpath, xlrec->db_id, xlrec->tablespace_id,
+								true);
+	}
 	else if (info == XLOG_DBASE_DROP)
 	{
 		xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) XLogRecGetData(record);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c6..a41fef7 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -38,6 +38,7 @@
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
 #include "executor/instrument.h"
 #include "lib/binaryheap.h"
 #include "miscadmin.h"
@@ -486,6 +487,9 @@ static void FindAndDropRelFileNodeBuffers(RelFileNode rnode,
 										  ForkNumber forkNum,
 										  BlockNumber nForkBlock,
 										  BlockNumber firstDelBlock);
+static void RelationCopyStorageUsingBuffer(Relation src, Relation dst,
+										   ForkNumber forkNum,
+										   bool permanent);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int	rnode_comparator(const void *p1, const void *p2);
@@ -772,23 +776,23 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
  * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
  *		a relcache entry for the relation.
  *
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay.  If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
+ * Pass permanent = true for a RELPERSISTENCE_PERMANENT relation, and
+ * permanent = false for a RELPERSISTENCE_UNLOGGED relation. This function
+ * cannot be used for temporary relations (and making that work might be
+ * difficult, unless we only want to read temporary relations for our own
+ * BackendId).
  */
 Buffer
 ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
-						  BufferAccessStrategy strategy)
+						  BufferAccessStrategy strategy, bool permanent)
 {
 	bool		hit;
 
 	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
 
-	Assert(InRecovery);
-
-	return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
+	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -3677,6 +3681,167 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
 }
 
 /* ---------------------------------------------------------------------
+ *		RelationCopyStorageUsingBuffer
+ *
+ *		Copy fork's data using bufmgr.  Same as RelationCopyStorage but instead
+ *		of using smgrread and smgrextend this will copy using bufmgr APIs.
+ *
+ *		Refer to the comments atop CreateAndCopyRelationData() for details
+ *		about the 'permanent' parameter.
+ * --------------------------------------------------------------------
+ */
+static void
+RelationCopyStorageUsingBuffer(Relation src, Relation dst, ForkNumber forkNum,
+							   bool permanent)
+{
+	Buffer		srcBuf;
+	Buffer		dstBuf;
+	Page		srcPage;
+	Page		dstPage;
+	bool		use_wal;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	BufferAccessStrategy bstrategy_src;
+	BufferAccessStrategy bstrategy_dst;
+
+	/*
+	 * In general, we want to write WAL whenever wal_level > 'minimal', but
+	 * we can skip it when copying any fork of an unlogged relation other
+	 * than the init fork.
+	 */
+	use_wal = XLogIsNeeded() && (permanent || forkNum == INIT_FORKNUM);
+
+	/* Get number of blocks in the source relation. */
+	nblocks = smgrnblocks(RelationGetSmgr(src), forkNum);
+
+	/* Nothing to copy; just return. */
+	if (nblocks == 0)
+		return;
+
+	/* This is a bulk operation, so use buffer access strategies. */
+	bstrategy_src = GetAccessStrategy(BAS_BULKREAD);
+	bstrategy_dst = GetAccessStrategy(BAS_BULKWRITE);
+
+	/* Iterate over each block of the source relation file. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		/* Read block from source relation. */
+		srcBuf = ReadBufferWithoutRelcache(src->rd_node, forkNum, blkno,
+										   RBM_NORMAL, bstrategy_src,
+										   permanent);
+		srcPage = BufferGetPage(srcBuf);
+		if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
+		{
+			ReleaseBuffer(srcBuf);
+			continue;
+		}
+
+		/* Use P_NEW to extend the destination relation. */
+		dstBuf = ReadBufferWithoutRelcache(dst->rd_node, forkNum, P_NEW,
+										   RBM_NORMAL, bstrategy_dst,
+										   permanent);
+		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
+
+		START_CRIT_SECTION();
+
+		/* Copy page data from the source to the destination. */
+		dstPage = BufferGetPage(dstBuf);
+		memcpy(dstPage, srcPage, BLCKSZ);
+		MarkBufferDirty(dstBuf);
+
+		/* WAL-log the copied page. */
+		if (use_wal)
+			log_newpage_buffer(dstBuf, true);
+
+		END_CRIT_SECTION();
+
+		UnlockReleaseBuffer(dstBuf);
+		ReleaseBuffer(srcBuf);
+	}
+}
+
+/* ---------------------------------------------------------------------
+ *		CreateAndCopyRelationData
+ *
+ *		Create destination relation storage and copy all forks from the
+ *		source relation to the destination.
+ *
+ *		Pass permanent as true for permanent relations and false for
+ *		unlogged relations.  Currently this API is not supported for
+ *		temporary relations.
+ * --------------------------------------------------------------------
+ */
+void
+CreateAndCopyRelationData(RelFileNode src_rnode, RelFileNode dst_rnode,
+						  bool permanent)
+{
+	Relation		src_rel;
+	Relation		dst_rel;
+	char			relpersistence;
+
+	/* Set the relpersistence. */
+	relpersistence = permanent ?
+		RELPERSISTENCE_PERMANENT : RELPERSISTENCE_UNLOGGED;
+
+	/*
+	 * Prepare fake relcache entries for the source and the destination.  It
+	 * is safe to use fake relcache entries here because we are only going
+	 * to access the fields related to physical storage.  We use fake
+	 * relcache entries only because it isn't safe to hold smgr pointers;
+	 * for details, refer to the comments atop RelationGetSmgr.
+	 */
+	src_rel = CreateFakeRelcacheEntry(src_rnode);
+	dst_rel = CreateFakeRelcacheEntry(dst_rnode);
+
+	/*
+	 * Create and copy all forks of the relation.
+	 *
+	 * NOTE: any conflict in relfilenode value will be caught in
+	 * RelationCreateStorage().
+	 */
+	RelationCreateStorage(dst_rnode, relpersistence);
+
+	/*
+	 * Remove the pending-delete entry registered by RelationCreateStorage
+	 * for processing at transaction abort.  Failures during CREATE DATABASE
+	 * are handled by a separate cleanup mechanism that removes the whole
+	 * database directory, so we don't need cleanup for each individual
+	 * relation.
+	 */
+	RelationPreserveStorage(dst_rnode, false);
+
+	/* copy main fork. */
+	RelationCopyStorageUsingBuffer(src_rel, dst_rel, MAIN_FORKNUM, permanent);
+
+	/* copy those extra forks that exist */
+	for (ForkNumber forkNum = MAIN_FORKNUM + 1;
+		 forkNum <= MAX_FORKNUM; forkNum++)
+	{
+		if (smgrexists(RelationGetSmgr(src_rel), forkNum))
+		{
+			smgrcreate(RelationGetSmgr(dst_rel), forkNum, false);
+
+			/*
+			 * WAL log creation if the relation is persistent, or this is the
+			 * init fork of an unlogged relation.
+			 */
+			if (permanent || forkNum == INIT_FORKNUM)
+				log_smgrcreate(&dst_rnode, forkNum);
+
+			/* Copy a fork's data, block by block. */
+			RelationCopyStorageUsingBuffer(src_rel, dst_rel, forkNum,
+										   permanent);
+		}
+	}
+
+	/* Release fake relcache entries. */
+	FreeFakeRelcacheEntry(src_rel);
+	FreeFakeRelcacheEntry(dst_rel);
+}
+
+/* ---------------------------------------------------------------------
  *		FlushDatabaseBuffers
  *
  *		This function writes all dirty pages of a database out to disk
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 5ae52dd..1543da6 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -176,6 +176,34 @@ ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode)
 }
 
 /*
+ *		LockRelationId
+ *
+ * Lock, given a LockRelId.  Same as LockRelationOid, but takes a LockRelId
+ * as input.
+ */
+void
+LockRelationId(LockRelId *relid, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+	LOCALLOCK  *locallock;
+	LockAcquireResult res;
+
+	SET_LOCKTAG_RELATION(tag, relid->dbId, relid->relId);
+
+	res = LockAcquireExtended(&tag, lockmode, false, false, true, &locallock);
+
+	/*
+	 * Now that we have the lock, check for invalidation messages; see notes
+	 * in LockRelationOid.
+	 */
+	if (res != LOCKACQUIRE_ALREADY_CLEAR)
+	{
+		AcceptInvalidationMessages();
+		MarkLockClear(locallock);
+	}
+}
+
+/*
  *		UnlockRelationId
  *
  * Unlock, given a LockRelId.  This is preferred over UnlockRelationOid
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index ff46a0e..1c8aba4 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -705,6 +705,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_TWOPHASE_FILE_WRITE:
 			event_name = "TwophaseFileWrite";
 			break;
+		case WAIT_EVENT_VERSION_FILE_WRITE:
+			event_name = "VersionFileWrite";
+			break;
 		case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ:
 			event_name = "WALSenderTimelineHistoryRead";
 			break;
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 4d0718f..dee3387 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -252,6 +252,63 @@ RelationMapFilenodeToOid(Oid filenode, bool shared)
 }
 
 /*
+ * RelationMapOidToFilenodeForDatabase
+ *
+ * Like RelationMapOidToFilenode, but reads the mapping from the indicated
+ * path instead of using the one for the current database.
+ */
+Oid
+RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId)
+{
+	RelMapFile	map;
+	int			i;
+
+	/* Read the relmap file from the source database. */
+	read_relmap_file(&map, dbpath, false, ERROR);
+
+	/* Iterate over the relmap entries to find the input relation OID. */
+	for (i = 0; i < map.num_mappings; i++)
+	{
+		if (relationId == map.mappings[i].mapoid)
+			return map.mappings[i].mapfilenode;
+	}
+
+	return InvalidOid;
+}
+
+/*
+ * RelationMapCopy
+ *
+ * Copy relmapfile from source db path to the destination db path and WAL log
+ * the operation. This is intended for use in creating a new relmap file
+ * for a database that doesn't have one yet, not for replacing an existing
+ * relmap file.
+ */
+void
+RelationMapCopy(Oid dbid, Oid tsid, char *srcdbpath, char *dstdbpath)
+{
+	RelMapFile map;
+
+	/*
+	 * Read the relmap file from the source database.
+	 */
+	read_relmap_file(&map, srcdbpath, false, ERROR);
+
+	/*
+	 * Write the same data into the destination database's relmap file.
+	 *
+	 * No sinval is needed because no one can be connected to the destination
+	 * database yet. For the same reason, there is no need to acquire
+	 * RelationMappingLock.
+	 *
+	 * There's no point in trying to preserve files here. The new database
+	 * isn't usable yet anyway, and won't ever be if we can't install a
+	 * relmap file.
+	 */
+	write_relmap_file(&map, true, false, false, dbid, tsid, dstdbpath);
+}
+
+/*
  * RelationMapUpdateMap
  *
  * Install a new relfilenode mapping for the specified relation.
@@ -1031,6 +1088,13 @@ relmap_redo(XLogReaderState *record)
 		 *
 		 * There shouldn't be anyone else updating relmaps during WAL replay,
 		 * but grab the lock to interlock against load_relmap_file().
+		 *
+		 * Note that we use the same WAL record for updating the relmap of
+		 * an existing database as we do for creating a new database. In
+		 * the latter case, taking the relmap lock and sending sinval messages
+		 * is unnecessary, but harmless. If we wanted to avoid it, we could
+		 * add a flag to the WAL record to indicate which operation is being
+		 * performed.
 		 */
 		LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
 		write_relmap_file(&newmap, false, true, false,
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 7cfa169..bd1ec42 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -370,7 +370,7 @@ extractPageInfo(XLogReaderState *record)
 
 	/* Is this a special record type that I recognize? */
 
-	if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_CREATE)
+	if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_CREATE_FILE_COPY)
 	{
 		/*
 		 * New databases can be safely ignored. It won't be present in the
@@ -382,6 +382,13 @@ extractPageInfo(XLogReaderState *record)
 		 * overwriting the database created in the target system.
 		 */
 	}
+	else if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_CREATE_WAL_LOG)
+	{
+		/*
+		 * New databases can be safely ignored. It won't be present in the
+		 * source system, so it will be deleted.
+		 */
+	}
 	else if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_DROP)
 	{
 		/*
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 5c06459..baabf98 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2776,13 +2776,15 @@ psql_completion(const char *text, int start, int end)
 	/* CREATE DATABASE */
 	else if (Matches("CREATE", "DATABASE", MatchAny))
 		COMPLETE_WITH("OWNER", "TEMPLATE", "ENCODING", "TABLESPACE",
-					  "IS_TEMPLATE",
+					  "IS_TEMPLATE", "STRATEGY",
 					  "ALLOW_CONNECTIONS", "CONNECTION LIMIT",
 					  "LC_COLLATE", "LC_CTYPE", "LOCALE", "OID",
 					  "LOCALE_PROVIDER", "ICU_LOCALE");
 
 	else if (Matches("CREATE", "DATABASE", MatchAny, "TEMPLATE"))
 		COMPLETE_WITH_QUERY(Query_for_list_of_template_databases);
+	else if (Matches("CREATE", "DATABASE", MatchAny, "STRATEGY"))
+		COMPLETE_WITH("WAL_LOG", "FILE_COPY");
 
 	/* CREATE DOMAIN */
 	else if (Matches("CREATE", "DOMAIN", MatchAny))
diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c
index 6f612ab..0bffa2f 100644
--- a/src/bin/scripts/createdb.c
+++ b/src/bin/scripts/createdb.c
@@ -34,6 +34,7 @@ main(int argc, char *argv[])
 		{"tablespace", required_argument, NULL, 'D'},
 		{"template", required_argument, NULL, 'T'},
 		{"encoding", required_argument, NULL, 'E'},
+		{"strategy", required_argument, NULL, 'S'},
 		{"lc-collate", required_argument, NULL, 1},
 		{"lc-ctype", required_argument, NULL, 2},
 		{"locale", required_argument, NULL, 'l'},
@@ -60,6 +61,7 @@ main(int argc, char *argv[])
 	char	   *tablespace = NULL;
 	char	   *template = NULL;
 	char	   *encoding = NULL;
+	char	   *strategy = NULL;
 	char	   *lc_collate = NULL;
 	char	   *lc_ctype = NULL;
 	char	   *locale = NULL;
@@ -77,7 +79,7 @@ main(int argc, char *argv[])
 
 	handle_help_version_opts(argc, argv, "createdb", help);
 
-	while ((c = getopt_long(argc, argv, "h:p:U:wWeO:D:T:E:l:", long_options, &optindex)) != -1)
+	while ((c = getopt_long(argc, argv, "h:p:U:wWeO:D:T:E:l:S:", long_options, &optindex)) != -1)
 	{
 		switch (c)
 		{
@@ -111,6 +113,9 @@ main(int argc, char *argv[])
 			case 'E':
 				encoding = pg_strdup(optarg);
 				break;
+			case 'S':
+				strategy = pg_strdup(optarg);
+				break;
 			case 1:
 				lc_collate = pg_strdup(optarg);
 				break;
@@ -215,6 +220,8 @@ main(int argc, char *argv[])
 		appendPQExpBufferStr(&sql, " ENCODING ");
 		appendStringLiteralConn(&sql, encoding, conn);
 	}
+	if (strategy)
+		appendPQExpBuffer(&sql, " STRATEGY %s", fmtId(strategy));
 	if (template)
 		appendPQExpBuffer(&sql, " TEMPLATE %s", fmtId(template));
 	if (lc_collate)
@@ -294,6 +301,7 @@ help(const char *progname)
 	printf(_("      --locale-provider={libc|icu}\n"
 			 "                               locale provider for the database's default collation\n"));
 	printf(_("  -O, --owner=OWNER            database user to own the new database\n"));
+	printf(_("  -S, --strategy=STRATEGY      database creation strategy wal_log or file_copy\n"));
 	printf(_("  -T, --template=TEMPLATE      template database to copy\n"));
 	printf(_("  -V, --version                output version information, then exit\n"));
 	printf(_("  -?, --help                   show this help, then exit\n"));
diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index 35deec9..44d3c6d 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -104,4 +104,24 @@ $node->command_checks_all(
 	],
 	'createdb with incorrect --lc-ctype');
 
+$node->command_checks_all(
+	[ 'createdb', '--strategy', "foo", 'foobar2' ],
+	1,
+	[qr/^$/],
+	[
+		qr/^createdb: error: database creation failed: ERROR:  invalid create database strategy|^createdb: error: database creation failed: ERROR:  invalid create database strategy foo/s
+	],
+	'createdb with incorrect --strategy');
+
+# Check database creation strategy
+$node->issues_sql_like(
+	[ 'createdb', '-T', 'foobar2', 'foobar4', '-S', 'wal_log'],
+	qr/statement: CREATE DATABASE foobar4 STRATEGY wal_log TEMPLATE foobar2/,
+	'create database with WAL_LOG strategy');
+
+$node->issues_sql_like(
+	[ 'createdb', '-T', 'foobar2', 'foobar5', '-S', 'file_copy'],
+	qr/statement: CREATE DATABASE foobar5 STRATEGY file_copy TEMPLATE foobar2/,
+	'create database with FILE_COPY strategy');
+
 done_testing();
diff --git a/src/include/commands/dbcommands_xlog.h b/src/include/commands/dbcommands_xlog.h
index 593a857..0ee2452 100644
--- a/src/include/commands/dbcommands_xlog.h
+++ b/src/include/commands/dbcommands_xlog.h
@@ -18,17 +18,32 @@
 #include "lib/stringinfo.h"
 
 /* record types */
-#define XLOG_DBASE_CREATE		0x00
-#define XLOG_DBASE_DROP			0x10
+#define XLOG_DBASE_CREATE_FILE_COPY		0x00
+#define XLOG_DBASE_CREATE_WAL_LOG		0x10
+#define XLOG_DBASE_DROP					0x20
 
-typedef struct xl_dbase_create_rec
+/*
+ * Single WAL record for an entire CREATE DATABASE operation. This is used
+ * by the FILE_COPY strategy.
+ */
+typedef struct xl_dbase_create_file_copy_rec
 {
-	/* Records copying of a single subdirectory incl. contents */
 	Oid			db_id;
 	Oid			tablespace_id;
 	Oid			src_db_id;
 	Oid			src_tablespace_id;
-} xl_dbase_create_rec;
+} xl_dbase_create_file_copy_rec;
+
+/*
+ * WAL record for the beginning of a CREATE DATABASE operation, when the
+ * WAL_LOG strategy is used. Each individual block will be logged separately
+ * afterward.
+ */
+typedef struct xl_dbase_create_wal_log_rec
+{
+	Oid			db_id;
+	Oid			tablespace_id;
+} xl_dbase_create_wal_log_rec;
 
 typedef struct xl_dbase_drop_rec
 {
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index dd01841..a6b657f 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -184,7 +184,8 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BufferAccessStrategy strategy);
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 										ForkNumber forkNum, BlockNumber blockNum,
-										ReadBufferMode mode, BufferAccessStrategy strategy);
+										ReadBufferMode mode, BufferAccessStrategy strategy,
+										bool permanent);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
@@ -203,6 +204,9 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
 extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels);
+extern void CreateAndCopyRelationData(RelFileNode src_rnode,
+									  RelFileNode dst_rnode,
+									  bool permanent);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
 								   int nforks, BlockNumber *firstDelBlock);
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 49edbcc..be1d2c9 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -38,6 +38,7 @@ extern void RelationInitLockInfo(Relation relation);
 
 /* Lock a relation */
 extern void LockRelationOid(Oid relid, LOCKMODE lockmode);
+extern void LockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern bool ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode);
 extern void UnlockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern void UnlockRelationOid(Oid relid, LOCKMODE lockmode);
diff --git a/src/include/utils/relmapper.h b/src/include/utils/relmapper.h
index 9fbb5a7..f10353e 100644
--- a/src/include/utils/relmapper.h
+++ b/src/include/utils/relmapper.h
@@ -38,7 +38,9 @@ typedef struct xl_relmap_update
 extern Oid	RelationMapOidToFilenode(Oid relationId, bool shared);
 
 extern Oid	RelationMapFilenodeToOid(Oid relationId, bool shared);
-
+extern Oid RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId);
+extern void RelationMapCopy(Oid dbid, Oid tsid, char *srcdbpath,
+							char *dstdbpath);
 extern void RelationMapUpdateMap(Oid relationId, Oid fileNode, bool shared,
 								 bool immediate);
 
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 1c39ce0..d870c59 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -218,6 +218,7 @@ typedef enum
 	WAIT_EVENT_TWOPHASE_FILE_READ,
 	WAIT_EVENT_TWOPHASE_FILE_SYNC,
 	WAIT_EVENT_TWOPHASE_FILE_WRITE,
+	WAIT_EVENT_VERSION_FILE_WRITE,
 	WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ,
 	WAIT_EVENT_WAL_BOOTSTRAP_SYNC,
 	WAIT_EVENT_WAL_BOOTSTRAP_WRITE,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 93d5190..0747205 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -460,6 +460,8 @@ CoverPos
 CreateAmStmt
 CreateCastStmt
 CreateConversionStmt
+CreateDBRelInfo
+CreateDBStrategy
 CreateDomainStmt
 CreateEnumStmt
 CreateEventTrigStmt
@@ -3701,7 +3703,8 @@ xl_btree_update
 xl_btree_vacuum
 xl_clog_truncate
 xl_commit_ts_truncate
-xl_dbase_create_rec
+xl_dbase_create_file_copy_rec
+xl_dbase_create_wal_log_rec
 xl_dbase_drop_rec
 xl_end_of_recovery
 xl_hash_add_ovfl_page
-- 
1.8.3.1

#190

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#188)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Mar 23, 2022 at 9:25 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Mar 23, 2022 at 9:13 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2022-03-23 18:49:11 +0530, Dilip Kumar wrote:

I think directly using smgrcreate() is a better idea than first
registering and then unregistering it. I have made that change in
the attached patch. With this change we could also merge the creation
of MAIN_FORKNUM into the loop below, where we create the other
forks[1], at the cost of one extra condition. However, I think the
current code is more in sync with other code that does similar things,
so I have not merged it into the loop. Please let me know if you think
otherwise.

FWIW, this fails tests: https://cirrus-ci.com/build/4929662173315072
https://cirrus-ci.com/task/6651773434724352?logs=test_bin#L121
https://cirrus-ci.com/task/6088823481303040?logs=test_world#L2377

It is strange that these changes cause a failure in the file_copy
strategy[1], because we only changed code related to the wal_log
strategy. I will look into this. Thanks.
[1]
Failed test 'createdb -T foobar2 foobar5 -S file_copy exit code 0'

I could not see any reason for it to fail, and I could not reproduce
it either. Is it possible to access the server log for this cfbot
failure?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#191

Andres Freund

andres@anarazel.de

almost 4 years ago

In reply to: Dilip Kumar (#190)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hi,

On 2022-03-23 22:29:40 +0530, Dilip Kumar wrote:

I could not see any reason for it to fail, and I could not reproduce
it either. Is it possible to access the server log for this cfbot
failure?

Yes, near the top, below the cpu / memory graphs, there's a file
navigator. Should have all files ending with *.log or starting with
regress_log_*.

Greetings,

Andres Freund

#192

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Andres Freund (#191)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Mar 23, 2022 at 10:37 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2022-03-23 22:29:40 +0530, Dilip Kumar wrote:

I could not see any reason for it to fail, and I could not reproduce
it either. Is it possible to access the server log for this cfbot
failure?

Yes, near the top, below the cpu / memory graphs, there's a file
navigator. Should have all files ending with *.log or starting with
regress_log_*.

Okay, I think I have found the reason for this failure. As the logs
below show, the second statement fails with 'database "foobar5" already
exists', because one of the earlier test cases conditionally creates a
database with the same name. So the fix is to use a different name.

2022-03-23 13:53:54.554 UTC [32647][client backend]
[020_createdb.pl][3/12:0] LOG: statement: CREATE DATABASE foobar5
TEMPLATE template0 LOCALE_PROVIDER icu ICU_LOCALE 'en';
......
2022-03-23 13:53:55.374 UTC [32717][client backend]
[020_createdb.pl][3/46:0] LOG: statement: CREATE DATABASE foobar5
STRATEGY file_copy TEMPLATE foobar2;
2022-03-23 13:53:55.390 UTC [32717][client backend]
[020_createdb.pl][3/46:0] ERROR: database "foobar5" already exists

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#193

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#192)

1 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Mar 23, 2022 at 10:50 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Mar 23, 2022 at 10:37 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2022-03-23 22:29:40 +0530, Dilip Kumar wrote:

I could not see any reason for it to fail, and I could not reproduce
it either. Is it possible to access the server log for this cfbot
failure?

Yes, near the top, below the cpu / memory graphs, there's a file
navigator. Should have all files ending with *.log or starting with
regress_log_*.

Okay, I think I have found the reason for this failure. As the logs
below show, the second statement fails with 'database "foobar5" already
exists', because one of the earlier test cases conditionally creates a
database with the same name. So the fix is to use a different name.

In the latest version I have fixed this issue by using a
non-conflicting name: when the server was compiled with ICU support,
foobar5 was already in use, which caused the failure. Apart from this,
I have fixed the duplicate-cleanup problem by passing an extra
parameter to RelationCreateStorage that decides whether to register the
storage for on-abort deletion, and I have added comments explaining it.
IMHO this looks like the cleanest way to do it. Please check the patch
and let me know your thoughts.
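
To make the idea concrete, here is a minimal, self-contained sketch of a
"create storage" routine that takes a flag deciding whether to record an
on-abort delete entry. It is only a toy model, not the patch's code: the
names SketchPendingDeletes, sketch_create_storage and register_delete are
hypothetical, and the actual file creation is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_PENDING 8

/* Hypothetical list of storage to unlink if the transaction aborts. */
typedef struct SketchPendingDeletes
{
	unsigned	relnodes[SKETCH_MAX_PENDING];
	size_t		count;
} SketchPendingDeletes;

/*
 * Toy model of "create storage, optionally registering it for on-abort
 * delete".  Returns true if a pending-delete entry was recorded.
 */
static bool
sketch_create_storage(SketchPendingDeletes *pending, unsigned relnode,
					  bool register_delete)
{
	assert(pending != NULL);

	/* Creating the file itself is omitted in this sketch. */

	if (!register_delete)
		return false;		/* caller has its own cleanup mechanism */

	assert(pending->count < SKETCH_MAX_PENDING);
	pending->relnodes[pending->count++] = relnode;
	return true;
}
```

A caller that cleans up the whole database directory on failure would pass
false, so no per-relation entry is ever registered and nothing has to be
unregistered afterwards.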

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v5-0001-Add-new-block-by-block-strategy-for-CREATE-DATABA.patch (text/x-patch; charset=US-ASCII)

From d9821c93b8d5b4a5707943d23f7beae6826627f0 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 22 Mar 2022 11:22:26 -0400
Subject: [PATCH v5] Add new block-by-block strategy for CREATE DATABASE.

Because this strategy logs changes on a block-by-block basis, it
avoids the need to checkpoint before and after the operation.
However, because it logs each changed block individually, it might
generate a lot of extra write-ahead logging if the template database
is large. Therefore, the older strategy remains available via a new
STRATEGY parameter to CREATE DATABASE, and a corresponding --strategy
option to createdb.
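
As a rough illustration of which copied blocks end up in the WAL, the
earlier version of the patch in this thread computes
use_wal = XLogIsNeeded() && (permanent || forkNum == INIT_FORKNUM).
The sketch below restates that predicate as a standalone function. The
SketchFork enum and sketch_use_wal are hypothetical stand-ins, and
wal_needed models the result of XLogIsNeeded().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for PostgreSQL's fork numbers. */
typedef enum
{
	SKETCH_MAIN_FORK,
	SKETCH_FSM_FORK,
	SKETCH_VM_FORK,
	SKETCH_INIT_FORK
} SketchFork;

/*
 * Returns true if a copied block should be WAL-logged.  Blocks of
 * permanent relations are always logged when WAL is needed; for unlogged
 * relations only the init fork is.
 */
static bool
sketch_use_wal(bool wal_needed, bool permanent, SketchFork forknum)
{
	assert(forknum <= SKETCH_INIT_FORK);

	return wal_needed && (permanent || forknum == SKETCH_INIT_FORK);
}
```

Under this rule the main fork of an unlogged relation is copied without
being logged, while its init fork is still logged.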

Somewhat controversially, this patch assembles the list of relations
to be copied to the new database by reading the pg_class relation of
the template database. Cross-database access like this isn't normally
possible, but it can be made to work here because there can't be any
connections to the database being copied, nor can it contain any
in-doubt transactions. Even so, we have to use lower-level interfaces
than normal, since the table scan and relcache interfaces will not
work for a database to which we're not connected. The advantage of
this approach is that we do not need to rely on the filesystem to
determine what ought to be copied, but instead on PostgreSQL's own
knowledge of the database structure. This avoids, for example,
copying stray files that happen to be located in the source database
directory.
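
The first step of that catalog-driven approach is finding the
relfilenode of pg_class in the template database's relation map, which
the patch does in RelationMapOidToFilenodeForDatabase() by scanning the
map entries. A minimal sketch of that lookup follows. The Sketch* types
are simplified, hypothetical stand-ins for the real relmapper
structures, and reading the map file from disk is left out.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int SketchOid;

#define SKETCH_INVALID_OID ((SketchOid) 0)

/* Simplified stand-ins for one relmap entry and the relmap file. */
typedef struct SketchMapping
{
	SketchOid	mapoid;			/* OID of a mapped catalog */
	SketchOid	mapfilenode;	/* its current relfilenode */
} SketchMapping;

typedef struct SketchRelMap
{
	int			num_mappings;
	SketchMapping mappings[4];
} SketchRelMap;

/* Linear search of the map; returns SKETCH_INVALID_OID if not found. */
static SketchOid
sketch_map_oid_to_filenode(const SketchRelMap *map, SketchOid relid)
{
	assert(map != NULL);

	for (int i = 0; i < map->num_mappings; i++)
	{
		if (map->mappings[i].mapoid == relid)
			return map->mappings[i].mapfilenode;
	}

	return SKETCH_INVALID_OID;
}
```

A linear scan is sufficient here because the map holds only a small,
fixed set of mapped catalogs.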

Dilip Kumar, with a fairly large number of cosmetic changes by me.
---
 contrib/bloom/blinsert.c                 |   2 +-
 doc/src/sgml/ref/create_database.sgml    |  22 +
 doc/src/sgml/ref/createdb.sgml           |  11 +
 src/backend/access/heap/heapam_handler.c |   6 +-
 src/backend/access/nbtree/nbtree.c       |   2 +-
 src/backend/access/rmgrdesc/dbasedesc.c  |  20 +-
 src/backend/access/transam/xlogutils.c   |   6 +-
 src/backend/catalog/heap.c               |   2 +-
 src/backend/catalog/storage.c            |  34 +-
 src/backend/commands/dbcommands.c        | 761 ++++++++++++++++++++++++++-----
 src/backend/commands/tablecmds.c         |   2 +-
 src/backend/storage/buffer/bufmgr.c      | 172 ++++++-
 src/backend/storage/lmgr/lmgr.c          |  28 ++
 src/backend/utils/activity/wait_event.c  |   3 +
 src/backend/utils/cache/relcache.c       |   2 +-
 src/backend/utils/cache/relmapper.c      |  64 +++
 src/bin/pg_rewind/parsexlog.c            |   9 +-
 src/bin/psql/tab-complete.c              |   4 +-
 src/bin/scripts/createdb.c               |  10 +-
 src/bin/scripts/t/020_createdb.pl        |  20 +
 src/include/catalog/storage.h            |   4 +-
 src/include/commands/dbcommands_xlog.h   |  25 +-
 src/include/storage/bufmgr.h             |   6 +-
 src/include/storage/lmgr.h               |   1 +
 src/include/utils/relmapper.h            |   4 +-
 src/include/utils/wait_event.h           |   1 +
 src/tools/pgindent/typedefs.list         |   5 +-
 27 files changed, 1069 insertions(+), 157 deletions(-)

diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index c94cf34..82378db 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -173,7 +173,7 @@ blbuildempty(Relation index)
 	 * Write the page and log it.  It might seem that an immediate sync would
 	 * be sufficient to guarantee that the file exists on disk, but recovery
 	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
+	 * XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
 	PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index 5ae785a..255ad3a 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -25,6 +25,7 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
     [ [ WITH ] [ OWNER [=] <replaceable class="parameter">user_name</replaceable> ]
            [ TEMPLATE [=] <replaceable class="parameter">template</replaceable> ]
            [ ENCODING [=] <replaceable class="parameter">encoding</replaceable> ]
+           [ STRATEGY [=] <replaceable class="parameter">strategy</replaceable> ]
            [ LOCALE [=] <replaceable class="parameter">locale</replaceable> ]
            [ LC_COLLATE [=] <replaceable class="parameter">lc_collate</replaceable> ]
            [ LC_CTYPE [=] <replaceable class="parameter">lc_ctype</replaceable> ]
@@ -118,6 +119,27 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
        </para>
       </listitem>
      </varlistentry>
+     <varlistentry id="create-database-strategy" xreflabel="CREATE DATABASE STRATEGY">
+      <term><replaceable class="parameter">strategy</replaceable></term>
+      <listitem>
+       <para>
+        Strategy to be used in creating the new database.  If
+        the <literal>WAL_LOG</literal> strategy is used, the database will be
+        copied block by block and each block will be separately written
+        to the write-ahead log. This is the most efficient strategy in
+        cases where the template database is small, and therefore it is the
+        default. The older <literal>FILE_COPY</literal> strategy is also
+        available. This strategy writes a small record to the write-ahead log
+        for each tablespace used by the target database. Each such record
+        represents copying an entire directory to a new location at the
+        filesystem level. While this does reduce the write-ahead
+        log volume substantially, especially if the template database is large,
+        it also forces the system to perform a checkpoint both before and
+        after the creation of the new database. In some situations, this may
+        have a noticeable negative impact on overall system performance.
+       </para>
+      </listitem>
+     </varlistentry>
      <varlistentry>
       <term><replaceable class="parameter">locale</replaceable></term>
       <listitem>
diff --git a/doc/src/sgml/ref/createdb.sgml b/doc/src/sgml/ref/createdb.sgml
index be42e50..671cd362 100644
--- a/doc/src/sgml/ref/createdb.sgml
+++ b/doc/src/sgml/ref/createdb.sgml
@@ -178,6 +178,17 @@ PostgreSQL documentation
      </varlistentry>
 
      <varlistentry>
+      <term><option>-S <replaceable class="parameter">strategy</replaceable></option></term>
+      <term><option>--strategy=<replaceable class="parameter">strategy</replaceable></option></term>
+      <listitem>
+       <para>
+        Specifies the database creation strategy.  See
+        <xref linkend="create-database-strategy" /> for more details.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><option>-T <replaceable class="parameter">template</replaceable></option></term>
       <term><option>--template=<replaceable class="parameter">template</replaceable></option></term>
       <listitem>
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 39ef8a0..dee264e 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -593,7 +593,7 @@ heapam_relation_set_new_filenode(Relation rel,
 	 */
 	*minmulti = GetOldestMultiXactId();
 
-	srel = RelationCreateStorage(*newrnode, persistence);
+	srel = RelationCreateStorage(*newrnode, persistence, true);
 
 	/*
 	 * If required, set up an init fork for an unlogged table so that it can
@@ -601,7 +601,7 @@ heapam_relation_set_new_filenode(Relation rel,
 	 * even if the page has been logged, because the write did not go through
 	 * shared_buffers and therefore a concurrent checkpoint may have moved the
 	 * redo pointer past our xlog record.  Recovery may as well remove it
-	 * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
+	 * while replaying, for example, XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE
 	 * record. Therefore, logging is necessary even if wal_level=minimal.
 	 */
 	if (persistence == RELPERSISTENCE_UNLOGGED)
@@ -645,7 +645,7 @@ heapam_relation_copy_data(Relation rel, const RelFileNode *newrnode)
 	 * NOTE: any conflict in relfilenode value will be caught in
 	 * RelationCreateStorage().
 	 */
-	RelationCreateStorage(*newrnode, rel->rd_rel->relpersistence);
+	RelationCreateStorage(*newrnode, rel->rd_rel->relpersistence, true);
 
 	/* copy main fork */
 	RelationCopyStorage(RelationGetSmgr(rel), dstrel, MAIN_FORKNUM,
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c9b4964..dacf3f7 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -161,7 +161,7 @@ btbuildempty(Relation index)
 	 * Write the page and log it.  It might seem that an immediate sync would
 	 * be sufficient to guarantee that the file exists on disk, but recovery
 	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
+	 * XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
 	PageSetChecksumInplace(metapage, BTREE_METAPAGE);
diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index 03af3fd..523d0b3 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -24,14 +24,23 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 	char	   *rec = XLogRecGetData(record);
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
 
-	if (info == XLOG_DBASE_CREATE)
+	if (info == XLOG_DBASE_CREATE_FILE_COPY)
 	{
-		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;
+		xl_dbase_create_file_copy_rec *xlrec =
+		(xl_dbase_create_file_copy_rec *) rec;
 
 		appendStringInfo(buf, "copy dir %u/%u to %u/%u",
 						 xlrec->src_tablespace_id, xlrec->src_db_id,
 						 xlrec->tablespace_id, xlrec->db_id);
 	}
+	else if (info == XLOG_DBASE_CREATE_WAL_LOG)
+	{
+		xl_dbase_create_wal_log_rec *xlrec =
+		(xl_dbase_create_wal_log_rec *) rec;
+
+		appendStringInfo(buf, "create dir %u/%u",
+						 xlrec->tablespace_id, xlrec->db_id);
+	}
 	else if (info == XLOG_DBASE_DROP)
 	{
 		xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) rec;
@@ -51,8 +60,11 @@ dbase_identify(uint8 info)
 
 	switch (info & ~XLR_INFO_MASK)
 	{
-		case XLOG_DBASE_CREATE:
-			id = "CREATE";
+		case XLOG_DBASE_CREATE_FILE_COPY:
+			id = "CREATE_FILE_COPY";
+			break;
+		case XLOG_DBASE_CREATE_WAL_LOG:
+			id = "CREATE_WAL_LOG";
 			break;
 		case XLOG_DBASE_DROP:
 			id = "DROP";
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 511f2f1..a4dedc5 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -484,7 +484,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	{
 		/* page exists in file */
 		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-										   mode, NULL);
+										   mode, NULL, true);
 	}
 	else
 	{
@@ -509,7 +509,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				ReleaseBuffer(buffer);
 			}
 			buffer = ReadBufferWithoutRelcache(rnode, forknum,
-											   P_NEW, mode, NULL);
+											   P_NEW, mode, NULL, true);
 		}
 		while (BufferGetBlockNumber(buffer) < blkno);
 		/* Handle the corner case that P_NEW returns non-consecutive pages */
@@ -519,7 +519,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			ReleaseBuffer(buffer);
 			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-											   mode, NULL);
+											   mode, NULL, true);
 		}
 	}
 
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 696fd59..6eb78a9 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -387,7 +387,7 @@ heap_create(const char *relname,
 											relpersistence,
 											relfrozenxid, relminmxid);
 		else if (RELKIND_HAS_STORAGE(rel->rd_rel->relkind))
-			RelationCreateStorage(rel->rd_node, relpersistence);
+			RelationCreateStorage(rel->rd_node, relpersistence, true);
 		else
 			Assert(false);
 	}
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 9b80755..74580dd 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -112,12 +112,14 @@ AddPendingSync(const RelFileNode *rnode)
  * modules that need them.
  *
  * This function is transactional. The creation is WAL-logged, and if the
- * transaction aborts later on, the storage will be destroyed.
+ * transaction aborts later on, the storage will be destroyed.  A caller that
+ * has its own cleanup mechanism and does not want the storage registered for
+ * deletion on abort should pass 'register_delete' as false.
  */
 SMgrRelation
-RelationCreateStorage(RelFileNode rnode, char relpersistence)
+RelationCreateStorage(RelFileNode rnode, char relpersistence,
+					  bool register_delete)
 {
-	PendingRelDelete *pending;
 	SMgrRelation srel;
 	BackendId	backend;
 	bool		needs_wal;
@@ -149,15 +151,23 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
 	if (needs_wal)
 		log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM);
 
-	/* Add the relation to the list of stuff to delete at abort */
-	pending = (PendingRelDelete *)
-		MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
-	pending->relnode = rnode;
-	pending->backend = backend;
-	pending->atCommit = false;	/* delete if abort */
-	pending->nestLevel = GetCurrentTransactionNestLevel();
-	pending->next = pendingDeletes;
-	pendingDeletes = pending;
+	/*
+	 * Add the relation to the list of stuff to delete at abort, if we are
+	 * asked to do so.
+	 */
+	if (register_delete)
+	{
+		PendingRelDelete *pending;
+
+		pending = (PendingRelDelete *)
+			MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+		pending->relnode = rnode;
+		pending->backend = backend;
+		pending->atCommit = false;	/* delete if abort */
+		pending->nestLevel = GetCurrentTransactionNestLevel();
+		pending->next = pendingDeletes;
+		pendingDeletes = pending;
+	}
 
 	if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
 	{
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 623e5ec..02a096c 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -63,13 +63,31 @@
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/pg_locale.h"
+#include "utils/relmapper.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
+/*
+ * Create database strategy.
+ *
+ * CREATEDB_WAL_LOG will copy the database at the block level and WAL log each
+ * copied block.
+ *
+ * CREATEDB_FILE_COPY will simply perform a file system level copy of the
+ * database and log a single record for each tablespace copied. To make this
+ * safe, it also triggers checkpoints before and after the operation.
+ */
+typedef enum CreateDBStrategy
+{
+	CREATEDB_WAL_LOG,
+	CREATEDB_FILE_COPY
+} CreateDBStrategy;
+
 typedef struct
 {
 	Oid			src_dboid;		/* source (template) DB */
 	Oid			dest_dboid;		/* DB we are trying to create */
+	CreateDBStrategy strategy;	/* create db strategy */
 } createdb_failure_params;
 
 typedef struct
@@ -78,6 +96,17 @@ typedef struct
 	Oid			dest_tsoid;		/* tablespace we are trying to move to */
 } movedb_failure_params;
 
+/*
+ * Information about a relation to be copied when creating a database.
+ */
+typedef struct CreateDBRelInfo
+{
+	RelFileNode rnode;			/* physical relation identifier */
+	Oid			reloid;			/* relation oid */
+	bool		permanent;		/* true if permanent, false if unlogged */
+} CreateDBRelInfo;
+
+
 /* non-export function prototypes */
 static void createdb_failure_callback(int code, Datum arg);
 static void movedb(const char *dbname, const char *tblspcname);
@@ -93,7 +122,540 @@ static bool have_createdb_privilege(void);
 static void remove_dbtablespaces(Oid db_id);
 static bool check_db_file_conflict(Oid db_id);
 static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
+static void CreateDatabaseUsingWalLog(Oid src_dboid, Oid dboid, Oid src_tsid,
+									  Oid dst_tsid);
+static List *ScanSourceDatabasePgClass(Oid srctbid, Oid srcdbid, char *srcpath);
+static List *ScanSourceDatabasePgClassPage(Page page, Buffer buf, Oid tbid,
+										   Oid dbid, char *srcpath,
+										   List *rnodelist, Snapshot snapshot);
+static CreateDBRelInfo *ScanSourceDatabasePgClassTuple(HeapTupleData *tuple,
+													   Oid tbid, Oid dbid,
+													   char *srcpath);
+static void CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid,
+									bool isRedo);
+static void CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dboid, Oid src_tsid,
+										Oid dst_tsid);
+
+/*
+ * Create a new database using the WAL_LOG strategy.
+ *
+ * Each copied block is separately written to the write-ahead log.
+ */
+static void
+CreateDatabaseUsingWalLog(Oid src_dboid, Oid dst_dboid,
+						  Oid src_tsid, Oid dst_tsid)
+{
+	char	   *srcpath;
+	char	   *dstpath;
+	List	   *rnodelist = NIL;
+	ListCell   *cell;
+	LockRelId	srcrelid;
+	LockRelId	dstrelid;
+	RelFileNode srcrnode;
+	RelFileNode dstrnode;
+	CreateDBRelInfo *relinfo;
+
+	/* Get source and destination database paths. */
+	srcpath = GetDatabasePath(src_dboid, src_tsid);
+	dstpath = GetDatabasePath(dst_dboid, dst_tsid);
+
+	/* Create database directory and write PG_VERSION file. */
+	CreateDirAndVersionFile(dstpath, dst_dboid, dst_tsid, false);
+
+	/* Copy relmap file from source database to the destination database. */
+	RelationMapCopy(dst_dboid, dst_tsid, srcpath, dstpath);
+
+	/* Get list of relfilenodes to copy from the source database. */
+	rnodelist = ScanSourceDatabasePgClass(src_tsid, src_dboid, srcpath);
+	Assert(rnodelist != NIL);
+
+	/*
+	 * The database IDs are the same for every relation, so set them before
+	 * entering the loop.
+	 */
+	srcrelid.dbId = src_dboid;
+	dstrelid.dbId = dst_dboid;
+
+	/* Loop over our list of relfilenodes and copy each one. */
+	foreach(cell, rnodelist)
+	{
+		relinfo = lfirst(cell);
+		srcrnode = relinfo->rnode;
+
+		/*
+		 * If the relation is in the source db's default tablespace, then we
+		 * need to create it in the destination db's default tablespace.
+		 * Otherwise, we need to create it in the same tablespace it occupies
+		 * in the source database.
+		 */
+		if (srcrnode.spcNode == src_tsid)
+			dstrnode.spcNode = dst_tsid;
+		else
+			dstrnode.spcNode = srcrnode.spcNode;
 
+		dstrnode.dbNode = dst_dboid;
+		dstrnode.relNode = srcrnode.relNode;
+
+		/* Acquire locks on source and target relations before copying. */
+		dstrelid.relId = srcrelid.relId = relinfo->reloid;
+		LockRelationId(&srcrelid, AccessShareLock);
+		LockRelationId(&dstrelid, AccessShareLock);
+
+		/* Copy relation storage from source to the destination. */
+		CreateAndCopyRelationData(srcrnode, dstrnode, relinfo->permanent);
+
+		/* Release the locks. */
+		UnlockRelationId(&srcrelid, AccessShareLock);
+		UnlockRelationId(&dstrelid, AccessShareLock);
+	}
+
+	list_free_deep(rnodelist);
+}
+
+/*
+ * Scan the pg_class table in the source database to identify the relations
+ * that need to be copied to the destination database.
+ *
+ * This is an exception to the usual rule that cross-database access is
+ * not possible. We can make it work here because we know that there are no
+ * connections to the source database and (since there can't be prepared
+ * transactions touching that database) no in-doubt tuples either. This
+ * means that we don't need to worry about pruning removing anything from
+ * under us, and we don't need to be too picky about our snapshot either.
+ * As long as it sees all previously-committed XIDs as committed and all
+ * aborted XIDs as aborted, we should be fine: nothing else is possible
+ * here.
+ *
+ * We can't rely on the relcache for anything here, because that only knows
+ * about the database to which we are connected, and can't handle access to
+ * other databases. That also means we can't rely on the heap scan
+ * infrastructure, which would be a bad idea anyway since it might try
+ * to do things like HOT pruning which we definitely can't do safely in
+ * a database to which we're not even connected.
+ */
+static List *
+ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
+{
+	RelFileNode rnode;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	Buffer		buf;
+	Oid			relfilenode;
+	Page		page;
+	List	   *rnodelist = NIL;
+	LockRelId	relid;
+	Relation	rel;
+	Snapshot	snapshot;
+	BufferAccessStrategy bstrategy;
+
+	/* Get pg_class relfilenode. */
+	relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+													  RelationRelationId);
+
+	/*
+	 * The system elsewhere assumes that we only read data for a relation
+	 * into shared_buffers while holding some sort of a lock on a relation,
+	 * so lock the source database's pg_class before we do anything else.
+	 */
+	relid.dbId = dbid;
+	relid.relId = RelationRelationId;
+	LockRelationId(&relid, AccessShareLock);
+
+	/* Prepare a RelFileNode for the pg_class relation. */
+	rnode.spcNode = tbid;
+	rnode.dbNode = dbid;
+	rnode.relNode = relfilenode;
+
+	/*
+	 * Create a fake relcache entry for the pg_class relation and get the
+	 * number of blocks.  Refer to the comments in CreateAndCopyRelationData()
+	 * for the rationale behind using the fake relcache entry.
+	 */
+	rel = CreateFakeRelcacheEntry(rnode);
+	nblocks = smgrnblocks(RelationGetSmgr(rel), MAIN_FORKNUM);
+	FreeFakeRelcacheEntry(rel);
+
+	/* Use a buffer access strategy since this is a bulk read operation. */
+	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	/*
+	 * As explained in the function header comments, we need a snapshot that
+	 * will see all committed transactions as committed, and our transaction
+	 * snapshot - or the active snapshot - might not be new enough for that,
+	 * but the return value of GetLatestSnapshot() should work fine.
+	 */
+	snapshot = GetLatestSnapshot();
+
+	/* Process the relation block by block. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
+										RBM_NORMAL, bstrategy, false);
+
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buf);
+		if (PageIsNew(page) || PageIsEmpty(page))
+		{
+			UnlockReleaseBuffer(buf);
+			continue;
+		}
+
+		/* Append relevant pg_class tuples for current page to rnodelist. */
+		rnodelist = ScanSourceDatabasePgClassPage(page, buf, tbid, dbid,
+												  srcpath, rnodelist,
+												  snapshot);
+
+		UnlockReleaseBuffer(buf);
+	}
+
+	/* Release relation lock. */
+	UnlockRelationId(&relid, AccessShareLock);
+
+	return rnodelist;
+}
+
+/*
+ * Scan one page of the source database's pg_class relation and add relevant
+ * entries to rnodelist. The return value is the updated list.
+ */
+static List *
+ScanSourceDatabasePgClassPage(Page page, Buffer buf, Oid tbid, Oid dbid,
+							  char *srcpath, List *rnodelist,
+							  Snapshot snapshot)
+{
+	BlockNumber		blkno = BufferGetBlockNumber(buf);
+	OffsetNumber	offnum;
+	OffsetNumber	maxoff;
+	HeapTupleData	tuple;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Loop over offsets. */
+	for (offnum = FirstOffsetNumber;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid;
+
+		itemid = PageGetItemId(page, offnum);
+
+		/* Nothing to do if slot is empty or already dead. */
+		if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid) ||
+			ItemIdIsRedirected(itemid))
+			continue;
+
+		Assert(ItemIdIsNormal(itemid));
+		ItemPointerSet(&(tuple.t_self), blkno, offnum);
+
+		/* Initialize a HeapTupleData structure. */
+		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+		tuple.t_len = ItemIdGetLength(itemid);
+		tuple.t_tableOid = RelationRelationId;
+
+		/* Skip tuples that are not visible to this snapshot. */
+		if (HeapTupleSatisfiesVisibility(&tuple, snapshot, buf))
+		{
+			CreateDBRelInfo *relinfo;
+
+			/*
+			 * ScanSourceDatabasePgClassTuple is in charge of constructing
+			 * a CreateDBRelInfo object for this tuple, but can also decide
+			 * that this tuple isn't something we need to copy. If we do need
+			 * to copy the relation, add it to the list.
+			 */
+			relinfo = ScanSourceDatabasePgClassTuple(&tuple, tbid, dbid,
+													 srcpath);
+			if (relinfo != NULL)
+				rnodelist = lappend(rnodelist, relinfo);
+		}
+	}
+
+	return rnodelist;
+}
+
+/*
+ * Decide whether a certain pg_class tuple represents something that
+ * needs to be copied from the source database to the destination database,
+ * and if so, construct a CreateDBRelInfo for it.
+ *
+ * Visibility checks are handled by the caller, so our job here is just
+ * to assess the data stored in the tuple.
+ */
+static CreateDBRelInfo *
+ScanSourceDatabasePgClassTuple(HeapTupleData *tuple, Oid tbid, Oid dbid,
+							   char *srcpath)
+{
+	CreateDBRelInfo	   *relinfo;
+	Form_pg_class		classForm;
+	Oid					relfilenode = InvalidOid;
+
+	classForm = (Form_pg_class) GETSTRUCT(tuple);
+
+	/*
+	 * Return NULL if this object does not need to be copied.
+	 *
+	 * Shared objects don't need to be copied, because they are shared.
+	 * Objects without storage can't be copied, because there's nothing to
+	 * copy. Temporary relations don't need to be copied either, because
+	 * they are inaccessible outside of the session that created them,
+	 * which must be gone already, and couldn't connect to a different database
+	 * if it still existed. autovacuum will eventually remove the pg_class
+	 * entries as well.
+	 */
+	if (classForm->reltablespace == GLOBALTABLESPACE_OID ||
+		!RELKIND_HAS_STORAGE(classForm->relkind) ||
+		classForm->relpersistence == RELPERSISTENCE_TEMP)
+		return NULL;
+
+	/*
+	 * If relfilenode is valid then directly use it.  Otherwise, consult the
+	 * relmap.
+	 */
+	if (OidIsValid(classForm->relfilenode))
+		relfilenode = classForm->relfilenode;
+	else
+		relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+														  classForm->oid);
+
+	/* We must have a valid relfilenode oid. */
+	if (!OidIsValid(relfilenode))
+		elog(ERROR, "relation with OID %u does not have a valid relfilenode",
+			 classForm->oid);
+
+	/* Prepare a rel info element to return to the caller. */
+	relinfo = (CreateDBRelInfo *) palloc(sizeof(CreateDBRelInfo));
+	if (OidIsValid(classForm->reltablespace))
+		relinfo->rnode.spcNode = classForm->reltablespace;
+	else
+		relinfo->rnode.spcNode = tbid;
+
+	relinfo->rnode.dbNode = dbid;
+	relinfo->rnode.relNode = relfilenode;
+	relinfo->reloid = classForm->oid;
+
+	/* Temporary relations were rejected above. */
+	Assert(classForm->relpersistence != RELPERSISTENCE_TEMP);
+	relinfo->permanent =
+		(classForm->relpersistence == RELPERSISTENCE_PERMANENT) ? true : false;
+
+	return relinfo;
+}
+
+/*
+ * Create database directory and write out the PG_VERSION file in the database
+ * path.  If isRedo is true, it's okay for the database directory to exist
+ * already.
+ */
+static void
+CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
+{
+	int			fd;
+	int			nbytes;
+	char		versionfile[MAXPGPATH];
+	char		buf[16];
+
+	/*
+	 * Prepare version data before starting a critical section.
+	 *
+	 * Note that we don't have to copy this from the source database; there's
+	 * only one legal value.
+	 */
+	sprintf(buf, "%s\n", PG_MAJORVERSION);
+	nbytes = strlen(PG_MAJORVERSION) + 1;
+
+	/* If we are not in WAL replay then write the WAL. */
+	if (!isRedo)
+	{
+		xl_dbase_create_wal_log_rec xlrec;
+		XLogRecPtr	lsn;
+
+		START_CRIT_SECTION();
+
+		xlrec.db_id = dbid;
+		xlrec.tablespace_id = tsid;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec),
+						 sizeof(xl_dbase_create_wal_log_rec));
+
+		lsn = XLogInsert(RM_DBASE_ID, XLOG_DBASE_CREATE_WAL_LOG);
+
+		/* As always, WAL must hit the disk before the data update does. */
+		XLogFlush(lsn);
+	}
+
+	/* Create database directory. */
+	if (MakePGDirectory(dbpath) < 0)
+	{
+		/* Failure other than already exists or not in WAL replay? */
+		if (errno != EEXIST || !isRedo)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create directory \"%s\": %m", dbpath)));
+	}
+
+	/*
+	 * Create PG_VERSION file in the database path.  If the file already
+	 * exists and we are in WAL replay then try again to open it in write
+	 * mode.
+	 */
+	snprintf(versionfile, sizeof(versionfile), "%s/%s", dbpath, "PG_VERSION");
+
+	fd = OpenTransientFile(versionfile, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	if (fd < 0 && errno == EEXIST && isRedo)
+		fd = OpenTransientFile(versionfile, O_WRONLY | O_TRUNC | PG_BINARY);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m", versionfile)));
+
+	/* Write PG_MAJORVERSION in the PG_VERSION file. */
+	pgstat_report_wait_start(WAIT_EVENT_VERSION_FILE_WRITE);
+	errno = 0;
+	if ((int) write(fd, buf, nbytes) != nbytes)
+	{
+		/* If write didn't set errno, assume problem is no disk space. */
+		if (errno == 0)
+			errno = ENOSPC;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", versionfile)));
+	}
+	pgstat_report_wait_end();
+
+	/* Close the version file. */
+	CloseTransientFile(fd);
+
+	/* Critical section done. */
+	if (!isRedo)
+		END_CRIT_SECTION();
+}
+
+/*
+ * Create a new database using the FILE_COPY strategy.
+ *
+ * Copy each tablespace at the filesystem level, and log a single WAL record
+ * for each tablespace copied.  This requires a checkpoint before and after the
+ * copy, which may be expensive, but it does greatly reduce WAL generation
+ * if the copied database is large.
+ */
+static void
+CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dst_dboid, Oid src_tsid,
+							Oid dst_tsid)
+{
+	TableScanDesc scan;
+	Relation	rel;
+	HeapTuple	tuple;
+
+	/*
+	 * Force a checkpoint before starting the copy. This will force all dirty
+	 * buffers, including those of unlogged tables, out to disk, to ensure
+	 * source database is up-to-date on disk for the copy.
+	 * FlushDatabaseBuffers() would suffice for that, but we also want to
+	 * process any pending unlink requests. Otherwise, if a checkpoint
+	 * happened while we're copying files, a file might be deleted just when
+	 * we're about to copy it, causing the lstat() call in copydir() to fail
+	 * with ENOENT.
+	 */
+	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
+					  CHECKPOINT_WAIT | CHECKPOINT_FLUSH_ALL);
+
+	/*
+	 * Iterate through all tablespaces of the template database, and copy each
+	 * one to the new database.
+	 */
+	rel = table_open(TableSpaceRelationId, AccessShareLock);
+	scan = table_beginscan_catalog(rel, 0, NULL);
+	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	{
+		Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
+		Oid			srctablespace = spaceform->oid;
+		Oid			dsttablespace;
+		char	   *srcpath;
+		char	   *dstpath;
+		struct stat st;
+
+		/* No need to copy global tablespace */
+		if (srctablespace == GLOBALTABLESPACE_OID)
+			continue;
+
+		srcpath = GetDatabasePath(src_dboid, srctablespace);
+
+		if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
+			directory_is_empty(srcpath))
+		{
+			/* Assume we can ignore it */
+			pfree(srcpath);
+			continue;
+		}
+
+		if (srctablespace == src_tsid)
+			dsttablespace = dst_tsid;
+		else
+			dsttablespace = srctablespace;
+
+		dstpath = GetDatabasePath(dst_dboid, dsttablespace);
+
+		/*
+		 * Copy this subdirectory to the new location
+		 *
+		 * We don't need to copy subdirectories
+		 */
+		copydir(srcpath, dstpath, false);
+
+		/* Record the filesystem change in XLOG */
+		{
+			xl_dbase_create_file_copy_rec xlrec;
+
+			xlrec.db_id = dst_dboid;
+			xlrec.tablespace_id = dsttablespace;
+			xlrec.src_db_id = src_dboid;
+			xlrec.src_tablespace_id = srctablespace;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec,
+							 sizeof(xl_dbase_create_file_copy_rec));
+
+			(void) XLogInsert(RM_DBASE_ID,
+							  XLOG_DBASE_CREATE_FILE_COPY | XLR_SPECIAL_REL_UPDATE);
+		}
+	}
+	table_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	/*
+	 * We force a checkpoint before committing.  This effectively means that
+	 * committed XLOG_DBASE_CREATE_FILE_COPY operations will never need to be
+	 * replayed (at least not in ordinary crash recovery; we still have to
+	 * make the XLOG entry for the benefit of PITR operations). This avoids
+	 * two nasty scenarios:
+	 *
+	 * #1: When PITR is off, we don't XLOG the contents of newly created
+	 * indexes; therefore the drop-and-recreate-whole-directory behavior of
+	 * DBASE_CREATE replay would lose such indexes.
+	 *
+	 * #2: Since we have to recopy the source database during DBASE_CREATE
+	 * replay, we run the risk of copying changes in it that were committed
+	 * after the original CREATE DATABASE command but before the system crash
+	 * that led to the replay.  This is at least unexpected and at worst could
+	 * lead to inconsistencies, eg duplicate table names.
+	 *
+	 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
+	 *
+	 * In PITR replay, the first of these isn't an issue, and the second is
+	 * only a risk if the CREATE DATABASE and subsequent template database
+	 * change both occur while a base backup is being taken. There doesn't
+	 * seem to be much we can do about that except document it as a
+	 * limitation.
+	 *
+	 * See CreateDatabaseUsingWalLog() for a less cheesy CREATE DATABASE
+	 * strategy that avoids these problems.
+	 */
+	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+}
 
 /*
  * CREATE DATABASE
@@ -101,8 +663,6 @@ static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
 Oid
 createdb(ParseState *pstate, const CreatedbStmt *stmt)
 {
-	TableScanDesc scan;
-	Relation	rel;
 	Oid			src_dboid;
 	Oid			src_owner;
 	int			src_encoding = -1;
@@ -137,6 +697,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	DefElem    *dallowconnections = NULL;
 	DefElem    *dconnlimit = NULL;
 	DefElem    *dcollversion = NULL;
+	DefElem    *dstrategy = NULL;
 	char	   *dbname = stmt->dbname;
 	char	   *dbowner = NULL;
 	const char *dbtemplate = NULL;
@@ -152,6 +713,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	char	   *dbcollversion = NULL;
 	int			notherbackends;
 	int			npreparedxacts;
+	CreateDBStrategy dbstrategy = CREATEDB_WAL_LOG;
 	createdb_failure_params fparms;
 
 	/* Extract options from the statement node tree */
@@ -269,6 +831,12 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 						(errcode(ERRCODE_INVALID_PARAMETER_VALUE)),
 						errmsg("OIDs less than %u are reserved for system objects", FirstNormalObjectId));
 		}
+		else if (strcmp(defel->defname, "strategy") == 0)
+		{
+			if (dstrategy)
+				errorConflictingDefElem(defel, pstate);
+			dstrategy = defel;
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -413,6 +981,23 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 							dbtemplate)));
 	}
 
+	/* Validate the database creation strategy. */
+	if (dstrategy && dstrategy->arg)
+	{
+		char	   *strategy;
+
+		strategy = defGetString(dstrategy);
+		if (strcmp(strategy, "wal_log") == 0)
+			dbstrategy = CREATEDB_WAL_LOG;
+		else if (strcmp(strategy, "file_copy") == 0)
+			dbstrategy = CREATEDB_FILE_COPY;
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("invalid create database strategy %s", strategy),
+					 errhint("Valid strategies are \"wal_log\" and \"file_copy\".")));
+	}
+
 	/* If encoding or locales are defaulted, use source's setting */
 	if (encoding < 0)
 		encoding = src_encoding;
@@ -753,17 +1338,16 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	InvokeObjectPostCreateHook(DatabaseRelationId, dboid, 0);
 
 	/*
-	 * Force a checkpoint before starting the copy. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, if a checkpoint
-	 * happened while we're copying files, a file might be deleted just when
-	 * we're about to copy it, causing the lstat() call in copydir() to fail
-	 * with ENOENT.
+	 * Acquire a lock on the target database.  This is a new database, so no
+	 * one else should be able to access it, but with the WAL_LOG strategy we
+	 * are going to access its relation pages through shared buffers.  As a
+	 * general principle, we should hold the database lock and the relation
+	 * lock before accessing any shared buffers.  The individual relation
+	 * locks are acquired in CreateDatabaseUsingWalLog() when reading pages
+	 * from shared buffers.
 	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
+	if (dbstrategy == CREATEDB_WAL_LOG)
+		LockSharedObject(DatabaseRelationId, dboid, 0, AccessShareLock);
 
 	/*
 	 * Once we start copying subdirectories, we need to be able to clean 'em
@@ -774,101 +1358,24 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	 */
 	fparms.src_dboid = src_dboid;
 	fparms.dest_dboid = dboid;
+	fparms.strategy = dbstrategy;
+
 	PG_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
 		/*
-		 * Iterate through all tablespaces of the template database, and copy
-		 * each one to the new database.
-		 */
-		rel = table_open(TableSpaceRelationId, AccessShareLock);
-		scan = table_beginscan_catalog(rel, 0, NULL);
-		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
-		{
-			Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
-			Oid			srctablespace = spaceform->oid;
-			Oid			dsttablespace;
-			char	   *srcpath;
-			char	   *dstpath;
-			struct stat st;
-
-			/* No need to copy global tablespace */
-			if (srctablespace == GLOBALTABLESPACE_OID)
-				continue;
-
-			srcpath = GetDatabasePath(src_dboid, srctablespace);
-
-			if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
-				directory_is_empty(srcpath))
-			{
-				/* Assume we can ignore it */
-				pfree(srcpath);
-				continue;
-			}
-
-			if (srctablespace == src_deftablespace)
-				dsttablespace = dst_deftablespace;
-			else
-				dsttablespace = srctablespace;
-
-			dstpath = GetDatabasePath(dboid, dsttablespace);
-
-			/*
-			 * Copy this subdirectory to the new location
-			 *
-			 * We don't need to copy subdirectories
-			 */
-			copydir(srcpath, dstpath, false);
-
-			/* Record the filesystem change in XLOG */
-			{
-				xl_dbase_create_rec xlrec;
-
-				xlrec.db_id = dboid;
-				xlrec.tablespace_id = dsttablespace;
-				xlrec.src_db_id = src_dboid;
-				xlrec.src_tablespace_id = srctablespace;
-
-				XLogBeginInsert();
-				XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-				(void) XLogInsert(RM_DBASE_ID,
-								  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-			}
-		}
-		table_endscan(scan);
-		table_close(rel, AccessShareLock);
-
-		/*
-		 * We force a checkpoint before committing.  This effectively means
-		 * that committed XLOG_DBASE_CREATE operations will never need to be
-		 * replayed (at least not in ordinary crash recovery; we still have to
-		 * make the XLOG entry for the benefit of PITR operations). This
-		 * avoids two nasty scenarios:
-		 *
-		 * #1: When PITR is off, we don't XLOG the contents of newly created
-		 * indexes; therefore the drop-and-recreate-whole-directory behavior
-		 * of DBASE_CREATE replay would lose such indexes.
-		 *
-		 * #2: Since we have to recopy the source database during DBASE_CREATE
-		 * replay, we run the risk of copying changes in it that were
-		 * committed after the original CREATE DATABASE command but before the
-		 * system crash that led to the replay.  This is at least unexpected
-		 * and at worst could lead to inconsistencies, eg duplicate table
-		 * names.
-		 *
-		 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
-		 *
-		 * In PITR replay, the first of these isn't an issue, and the second
-		 * is only a risk if the CREATE DATABASE and subsequent template
-		 * database change both occur while a base backup is being taken.
-		 * There doesn't seem to be much we can do about that except document
-		 * it as a limitation.
-		 *
-		 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
-		 * we can avoid this.
+		 * If the user has asked to create a database with the WAL_LOG
+		 * strategy, call CreateDatabaseUsingWalLog(), which copies the
+		 * database at the block level and WAL-logs each copied block.
+		 * Otherwise, call CreateDatabaseUsingFileCopy(), which copies the
+		 * database file by file.
 		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+		if (dbstrategy == CREATEDB_WAL_LOG)
+			CreateDatabaseUsingWalLog(src_dboid, dboid, src_deftablespace,
+									  dst_deftablespace);
+		else
+			CreateDatabaseUsingFileCopy(src_dboid, dboid, src_deftablespace,
+										dst_deftablespace);
 
 		/*
 		 * Close pg_database, but keep lock till commit.
@@ -955,6 +1462,25 @@ createdb_failure_callback(int code, Datum arg)
 	createdb_failure_params *fparms = (createdb_failure_params *) DatumGetPointer(arg);
 
 	/*
+	 * If we were copying the database at the block level, drop the pages of
+	 * the destination database that are in the shared buffer cache, and tell
+	 * the checkpointer to forget any pending fsync and unlink requests for
+	 * files in the database.  The reasoning is the same as explained in
+	 * dropdb().  Unlike dropdb(), we don't need to call
+	 * pgstat_drop_database(), because the database has not been created yet
+	 * and so there should not be any stats for it.
+	 */
+	if (fparms->strategy == CREATEDB_WAL_LOG)
+	{
+		DropDatabaseBuffers(fparms->dest_dboid);
+		ForgetDatabaseSyncRequests(fparms->dest_dboid);
+
+		/* Release lock on the target database. */
+		UnlockSharedObject(DatabaseRelationId, fparms->dest_dboid, 0,
+						   AccessShareLock);
+	}
+
+	/*
 	 * Release lock on source database before doing recursive remove. This is
 	 * not essential but it seems desirable to release the lock as soon as
 	 * possible.
@@ -1478,7 +2004,7 @@ movedb(const char *dbname, const char *tblspcname)
 		 * Record the filesystem change in XLOG
 		 */
 		{
-			xl_dbase_create_rec xlrec;
+			xl_dbase_create_file_copy_rec xlrec;
 
 			xlrec.db_id = db_id;
 			xlrec.tablespace_id = dst_tblspcoid;
@@ -1486,10 +2012,11 @@ movedb(const char *dbname, const char *tblspcname)
 			xlrec.src_tablespace_id = src_tblspcoid;
 
 			XLogBeginInsert();
-			XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
+			XLogRegisterData((char *) &xlrec,
+							 sizeof(xl_dbase_create_file_copy_rec));
 
 			(void) XLogInsert(RM_DBASE_ID,
-							  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
+							  XLOG_DBASE_CREATE_FILE_COPY | XLR_SPECIAL_REL_UPDATE);
 		}
 
 		/*
@@ -1525,9 +2052,10 @@ movedb(const char *dbname, const char *tblspcname)
 
 		/*
 		 * Force another checkpoint here.  As in CREATE DATABASE, this is to
-		 * ensure that we don't have to replay a committed XLOG_DBASE_CREATE
-		 * operation, which would cause us to lose any unlogged operations
-		 * done in the new DB tablespace before the next checkpoint.
+		 * ensure that we don't have to replay a committed
+		 * XLOG_DBASE_CREATE_FILE_COPY operation, which would cause us to lose
+		 * any unlogged operations done in the new DB tablespace before the
+		 * next checkpoint.
 		 */
 		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
 
@@ -2478,9 +3006,10 @@ dbase_redo(XLogReaderState *record)
 	/* Backup blocks are not used in dbase records */
 	Assert(!XLogRecHasAnyBlockRefs(record));
 
-	if (info == XLOG_DBASE_CREATE)
+	if (info == XLOG_DBASE_CREATE_FILE_COPY)
 	{
-		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
+		xl_dbase_create_file_copy_rec *xlrec =
+		(xl_dbase_create_file_copy_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
 		struct stat st;
@@ -2515,6 +3044,18 @@ dbase_redo(XLogReaderState *record)
 		 */
 		copydir(src_path, dst_path, false);
 	}
+	else if (info == XLOG_DBASE_CREATE_WAL_LOG)
+	{
+		xl_dbase_create_wal_log_rec *xlrec =
+		(xl_dbase_create_wal_log_rec *) XLogRecGetData(record);
+		char	   *dbpath;
+
+		dbpath = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+
+		/* Create the database directory with the version file. */
+		CreateDirAndVersionFile(dbpath, xlrec->db_id, xlrec->tablespace_id,
+								true);
+	}
 	else if (info == XLOG_DBASE_DROP)
 	{
 		xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) XLogRecGetData(record);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 80faae9..fb022ba 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14625,7 +14625,7 @@ index_copy_data(Relation rel, RelFileNode newrnode)
 	 * NOTE: any conflict in relfilenode value will be caught in
 	 * RelationCreateStorage().
 	 */
-	RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
+	RelationCreateStorage(newrnode, rel->rd_rel->relpersistence, true);
 
 	/* copy main fork */
 	RelationCopyStorage(RelationGetSmgr(rel), dstrel, MAIN_FORKNUM,
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c6..6bb9393 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -38,6 +38,7 @@
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
 #include "executor/instrument.h"
 #include "lib/binaryheap.h"
 #include "miscadmin.h"
@@ -486,6 +487,9 @@ static void FindAndDropRelFileNodeBuffers(RelFileNode rnode,
 										  ForkNumber forkNum,
 										  BlockNumber nForkBlock,
 										  BlockNumber firstDelBlock);
+static void RelationCopyStorageUsingBuffer(Relation src, Relation dst,
+										   ForkNumber forkNum,
+										   bool permanent);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int	rnode_comparator(const void *p1, const void *p2);
@@ -772,23 +776,23 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
  * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
  *		a relcache entry for the relation.
  *
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay.  If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
+ * Pass permanent = true for a RELPERSISTENCE_PERMANENT relation, and
+ * permanent = false for a RELPERSISTENCE_UNLOGGED relation. This function
+ * cannot be used for temporary relations (and making that work might be
+ * difficult, unless we only want to read temporary relations for our own
+ * BackendId).
  */
 Buffer
 ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
-						  BufferAccessStrategy strategy)
+						  BufferAccessStrategy strategy, bool permanent)
 {
 	bool		hit;
 
 	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
 
-	Assert(InRecovery);
-
-	return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
+	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -3677,6 +3681,158 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
 }
 
 /* ---------------------------------------------------------------------
+ *		RelationCopyStorageUsingBuffer
+ *
+ *		Copy fork's data using bufmgr.  Same as RelationCopyStorage but instead
+ *		of using smgrread and smgrextend this will copy using bufmgr APIs.
+ *
+ *		Refer to the comments atop CreateAndCopyRelationData() for details
+ *		about the 'permanent' parameter.
+ * --------------------------------------------------------------------
+ */
+static void
+RelationCopyStorageUsingBuffer(Relation src, Relation dst, ForkNumber forkNum,
+							   bool permanent)
+{
+	Buffer		srcBuf;
+	Buffer		dstBuf;
+	Page		srcPage;
+	Page		dstPage;
+	bool		use_wal;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	BufferAccessStrategy bstrategy_src;
+	BufferAccessStrategy bstrategy_dst;
+
+	/*
+	 * In general, we want to write WAL whenever wal_level > 'minimal', but
+	 * we can skip it when copying any fork of an unlogged relation other
+	 * than the init fork.
+	 */
+	use_wal = XLogIsNeeded() && (permanent || forkNum == INIT_FORKNUM);
+
+	/* Get number of blocks in the source relation. */
+	nblocks = smgrnblocks(RelationGetSmgr(src), forkNum);
+
+	/* Nothing to copy; just return. */
+	if (nblocks == 0)
+		return;
+
+	/* This is a bulk operation, so use buffer access strategies. */
+	bstrategy_src = GetAccessStrategy(BAS_BULKREAD);
+	bstrategy_dst = GetAccessStrategy(BAS_BULKWRITE);
+
+	/* Iterate over each block of the source relation file. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		/* Read block from source relation. */
+		srcBuf = ReadBufferWithoutRelcache(src->rd_node, forkNum, blkno,
+										   RBM_NORMAL, bstrategy_src,
+										   permanent);
+		srcPage = BufferGetPage(srcBuf);
+		if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
+		{
+			ReleaseBuffer(srcBuf);
+			continue;
+		}
+
+		/* Use P_NEW to extend the destination relation. */
+		dstBuf = ReadBufferWithoutRelcache(dst->rd_node, forkNum, P_NEW,
+										   RBM_NORMAL, bstrategy_dst,
+										   permanent);
+		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
+
+		START_CRIT_SECTION();
+
+		/* Copy page data from the source to the destination. */
+		dstPage = BufferGetPage(dstBuf);
+		memcpy(dstPage, srcPage, BLCKSZ);
+		MarkBufferDirty(dstBuf);
+
+		/* WAL-log the copied page. */
+		if (use_wal)
+			log_newpage_buffer(dstBuf, true);
+
+		END_CRIT_SECTION();
+
+		UnlockReleaseBuffer(dstBuf);
+		ReleaseBuffer(srcBuf);
+	}
+}
+
+/* ---------------------------------------------------------------------
+ *		CreateAndCopyRelationData
+ *
+ *		Create destination relation storage and copy all forks from the
+ *		source relation to the destination.
+ *
+ *		Pass permanent as true for permanent relations and false for
+ *		unlogged relations.  Currently this API is not supported for
+ *		temporary relations.
+ * --------------------------------------------------------------------
+ */
+void
+CreateAndCopyRelationData(RelFileNode src_rnode, RelFileNode dst_rnode,
+						  bool permanent)
+{
+	Relation		src_rel;
+	Relation		dst_rel;
+	char			relpersistence;
+
+	/* Set the relpersistence. */
+	relpersistence = permanent ?
+		RELPERSISTENCE_PERMANENT : RELPERSISTENCE_UNLOGGED;
+
+	/*
+	 * Prepare fake relcache entries for the source and the destination.  It
+	 * is safe to use fake relcache entries here because we are only going to
+	 * access the fields related to the physical storage.  We are using fake
+	 * relcache entries only because it isn't safe to hold the smgr pointers;
+	 * for more details, refer to the comments atop RelationGetSmgr.
+	 */
+	src_rel = CreateFakeRelcacheEntry(src_rnode);
+	dst_rel = CreateFakeRelcacheEntry(dst_rnode);
+
+	/*
+	 * Create and copy all forks of the relation.  During CREATE DATABASE we
+	 * have a separate cleanup mechanism that deletes the complete database
+	 * directory.  Therefore, each individual relation doesn't need to be
+	 * registered for cleanup.
+	 */
+	RelationCreateStorage(dst_rnode, relpersistence, false);
+
+	/* copy main fork. */
+	RelationCopyStorageUsingBuffer(src_rel, dst_rel, MAIN_FORKNUM, permanent);
+
+	/* copy those extra forks that exist */
+	for (ForkNumber forkNum = MAIN_FORKNUM + 1;
+		 forkNum <= MAX_FORKNUM; forkNum++)
+	{
+		if (smgrexists(RelationGetSmgr(src_rel), forkNum))
+		{
+			smgrcreate(RelationGetSmgr(dst_rel), forkNum, false);
+
+			/*
+			 * WAL log creation if the relation is persistent, or this is the
+			 * init fork of an unlogged relation.
+			 */
+			if (permanent || forkNum == INIT_FORKNUM)
+				log_smgrcreate(&dst_rnode, forkNum);
+
+			/* Copy a fork's data, block by block. */
+			RelationCopyStorageUsingBuffer(src_rel, dst_rel, forkNum,
+										   permanent);
+		}
+	}
+
+	/* Release fake relcache entries. */
+	FreeFakeRelcacheEntry(src_rel);
+	FreeFakeRelcacheEntry(dst_rel);
+}
+
+/* ---------------------------------------------------------------------
  *		FlushDatabaseBuffers
  *
  *		This function writes all dirty pages of a database out to disk
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 5ae52dd..1543da6 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -176,6 +176,34 @@ ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode)
 }
 
 /*
+ *		LockRelationId
+ *
+ * Lock, given a LockRelId.  Same as LockRelationOid but takes a LockRelId as
+ * input.
+ */
+void
+LockRelationId(LockRelId *relid, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+	LOCALLOCK  *locallock;
+	LockAcquireResult res;
+
+	SET_LOCKTAG_RELATION(tag, relid->dbId, relid->relId);
+
+	res = LockAcquireExtended(&tag, lockmode, false, false, true, &locallock);
+
+	/*
+	 * Now that we have the lock, check for invalidation messages; see notes
+	 * in LockRelationOid.
+	 */
+	if (res != LOCKACQUIRE_ALREADY_CLEAR)
+	{
+		AcceptInvalidationMessages();
+		MarkLockClear(locallock);
+	}
+}
+
+/*
  *		UnlockRelationId
  *
  * Unlock, given a LockRelId.  This is preferred over UnlockRelationOid
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index ff46a0e..1c8aba4 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -705,6 +705,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_TWOPHASE_FILE_WRITE:
 			event_name = "TwophaseFileWrite";
 			break;
+		case WAIT_EVENT_VERSION_FILE_WRITE:
+			event_name = "VersionFileWrite";
+			break;
 		case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ:
 			event_name = "WALSenderTimelineHistoryRead";
 			break;
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 3d05297..c08d560 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -3745,7 +3745,7 @@ RelationSetNewRelfilenode(Relation relation, char persistence)
 		/* handle these directly, at least for now */
 		SMgrRelation srel;
 
-		srel = RelationCreateStorage(newrnode, persistence);
+		srel = RelationCreateStorage(newrnode, persistence, true);
 		smgrclose(srel);
 	}
 	else
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 4d0718f..dee3387 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -252,6 +252,63 @@ RelationMapFilenodeToOid(Oid filenode, bool shared)
 }
 
 /*
+ * RelationMapOidToFilenodeForDatabase
+ *
+ * Like RelationMapOidToFilenode, but reads the mapping from the indicated
+ * path instead of using the one for the current database.
+ */
+Oid
+RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId)
+{
+	RelMapFile	map;
+	int			i;
+
+	/* Read the relmap file from the source database. */
+	read_relmap_file(&map, dbpath, false, ERROR);
+
+	/* Iterate over the relmap entries to find the input relation OID. */
+	for (i = 0; i < map.num_mappings; i++)
+	{
+		if (relationId == map.mappings[i].mapoid)
+			return map.mappings[i].mapfilenode;
+	}
+
+	return InvalidOid;
+}
+
+/*
+ * RelationMapCopy
+ *
+ * Copy relmapfile from source db path to the destination db path and WAL log
+ * the operation. This is intended for use in creating a new relmap file
+ * for a database that doesn't have one yet, not for replacing an existing
+ * relmap file.
+ */
+void
+RelationMapCopy(Oid dbid, Oid tsid, char *srcdbpath, char *dstdbpath)
+{
+	RelMapFile map;
+
+	/*
+	 * Read the relmap file from the source database.
+	 */
+	read_relmap_file(&map, srcdbpath, false, ERROR);
+
+	/*
+	 * Write the same data into the destination database's relmap file.
+	 *
+	 * No sinval is needed because no one can be connected to the destination
+	 * database yet. For the same reason, there is no need to acquire
+	 * RelationMappingLock.
+	 *
+	 * There's no point in trying to preserve files here. The new database
+	 * isn't usable yet anyway, and won't ever be if we can't install a
+	 * relmap file.
+	 */
+	write_relmap_file(&map, true, false, false, dbid, tsid, dstdbpath);
+}
+
+/*
  * RelationMapUpdateMap
  *
  * Install a new relfilenode mapping for the specified relation.
@@ -1031,6 +1088,13 @@ relmap_redo(XLogReaderState *record)
 		 *
 		 * There shouldn't be anyone else updating relmaps during WAL replay,
 		 * but grab the lock to interlock against load_relmap_file().
+		 *
+		 * Note that we use the same WAL record for updating the relmap of
+		 * an existing database as we do for creating a new database. In
+		 * the latter case, taking the relmap lock and sending sinval messages
+		 * is unnecessary, but harmless. If we wanted to avoid it, we could
+		 * add a flag to the WAL record to indicate which operation is being
+		 * performed.
 		 */
 		LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
 		write_relmap_file(&newmap, false, true, false,
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 3ed2a2e..49966e7 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -372,7 +372,7 @@ extractPageInfo(XLogReaderState *record)
 
 	/* Is this a special record type that I recognize? */
 
-	if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_CREATE)
+	if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_CREATE_FILE_COPY)
 	{
 		/*
 		 * New databases can be safely ignored. It won't be present in the
@@ -384,6 +384,13 @@ extractPageInfo(XLogReaderState *record)
 		 * overwriting the database created in the target system.
 		 */
 	}
+	else if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_CREATE_WAL_LOG)
+	{
+		/*
+		 * New databases can be safely ignored. It won't be present in the
+		 * source system, so it will be deleted.
+		 */
+	}
 	else if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_DROP)
 	{
 		/*
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 5c06459..baabf98 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2776,13 +2776,15 @@ psql_completion(const char *text, int start, int end)
 	/* CREATE DATABASE */
 	else if (Matches("CREATE", "DATABASE", MatchAny))
 		COMPLETE_WITH("OWNER", "TEMPLATE", "ENCODING", "TABLESPACE",
-					  "IS_TEMPLATE",
+					  "IS_TEMPLATE", "STRATEGY",
 					  "ALLOW_CONNECTIONS", "CONNECTION LIMIT",
 					  "LC_COLLATE", "LC_CTYPE", "LOCALE", "OID",
 					  "LOCALE_PROVIDER", "ICU_LOCALE");
 
 	else if (Matches("CREATE", "DATABASE", MatchAny, "TEMPLATE"))
 		COMPLETE_WITH_QUERY(Query_for_list_of_template_databases);
+	else if (Matches("CREATE", "DATABASE", MatchAny, "STRATEGY"))
+		COMPLETE_WITH("WAL_LOG", "FILE_COPY");
 
 	/* CREATE DOMAIN */
 	else if (Matches("CREATE", "DOMAIN", MatchAny))
diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c
index 6f612ab..0bffa2f 100644
--- a/src/bin/scripts/createdb.c
+++ b/src/bin/scripts/createdb.c
@@ -34,6 +34,7 @@ main(int argc, char *argv[])
 		{"tablespace", required_argument, NULL, 'D'},
 		{"template", required_argument, NULL, 'T'},
 		{"encoding", required_argument, NULL, 'E'},
+		{"strategy", required_argument, NULL, 'S'},
 		{"lc-collate", required_argument, NULL, 1},
 		{"lc-ctype", required_argument, NULL, 2},
 		{"locale", required_argument, NULL, 'l'},
@@ -60,6 +61,7 @@ main(int argc, char *argv[])
 	char	   *tablespace = NULL;
 	char	   *template = NULL;
 	char	   *encoding = NULL;
+	char	   *strategy = NULL;
 	char	   *lc_collate = NULL;
 	char	   *lc_ctype = NULL;
 	char	   *locale = NULL;
@@ -77,7 +79,7 @@ main(int argc, char *argv[])
 
 	handle_help_version_opts(argc, argv, "createdb", help);
 
-	while ((c = getopt_long(argc, argv, "h:p:U:wWeO:D:T:E:l:", long_options, &optindex)) != -1)
+	while ((c = getopt_long(argc, argv, "h:p:U:wWeO:D:T:E:l:S:", long_options, &optindex)) != -1)
 	{
 		switch (c)
 		{
@@ -111,6 +113,9 @@ main(int argc, char *argv[])
 			case 'E':
 				encoding = pg_strdup(optarg);
 				break;
+			case 'S':
+				strategy = pg_strdup(optarg);
+				break;
 			case 1:
 				lc_collate = pg_strdup(optarg);
 				break;
@@ -215,6 +220,8 @@ main(int argc, char *argv[])
 		appendPQExpBufferStr(&sql, " ENCODING ");
 		appendStringLiteralConn(&sql, encoding, conn);
 	}
+	if (strategy)
+		appendPQExpBuffer(&sql, " STRATEGY %s", fmtId(strategy));
 	if (template)
 		appendPQExpBuffer(&sql, " TEMPLATE %s", fmtId(template));
 	if (lc_collate)
@@ -294,6 +301,7 @@ help(const char *progname)
 	printf(_("      --locale-provider={libc|icu}\n"
 			 "                               locale provider for the database's default collation\n"));
 	printf(_("  -O, --owner=OWNER            database user to own the new database\n"));
+	printf(_("  -S, --strategy=STRATEGY      database creation strategy (wal_log or file_copy)\n"));
 	printf(_("  -T, --template=TEMPLATE      template database to copy\n"));
 	printf(_("  -V, --version                output version information, then exit\n"));
 	printf(_("  -?, --help                   show this help, then exit\n"));
diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index 35deec9..14d3a95 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -104,4 +104,24 @@ $node->command_checks_all(
 	],
 	'createdb with incorrect --lc-ctype');
 
+$node->command_checks_all(
+	[ 'createdb', '--strategy', "foo", 'foobar2' ],
+	1,
+	[qr/^$/],
+	[
+		qr/^createdb: error: database creation failed: ERROR:  invalid create database strategy|^createdb: error: database creation failed: ERROR:  invalid create database strategy foo/s
+	],
+	'createdb with incorrect --strategy');
+
+# Check database creation strategy
+$node->issues_sql_like(
+	[ 'createdb', '-T', 'foobar2', 'foobar6', '-S', 'wal_log'],
+	qr/statement: CREATE DATABASE foobar6 STRATEGY wal_log TEMPLATE foobar2/,
+	'create database with WAL_LOG strategy');
+
+$node->issues_sql_like(
+	[ 'createdb', '-T', 'foobar2', 'foobar7', '-S', 'file_copy'],
+	qr/statement: CREATE DATABASE foobar7 STRATEGY file_copy TEMPLATE foobar2/,
+	'create database with FILE_COPY strategy');
+
 done_testing();
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 9ffc741..844a023 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -22,7 +22,9 @@
 /* GUC variables */
 extern int	wal_skip_threshold;
 
-extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
+extern SMgrRelation RelationCreateStorage(RelFileNode rnode,
+										  char relpersistence,
+										  bool register_delete);
 extern void RelationDropStorage(Relation rel);
 extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
 extern void RelationPreTruncate(Relation rel);
diff --git a/src/include/commands/dbcommands_xlog.h b/src/include/commands/dbcommands_xlog.h
index 593a857..0ee2452 100644
--- a/src/include/commands/dbcommands_xlog.h
+++ b/src/include/commands/dbcommands_xlog.h
@@ -18,17 +18,32 @@
 #include "lib/stringinfo.h"
 
 /* record types */
-#define XLOG_DBASE_CREATE		0x00
-#define XLOG_DBASE_DROP			0x10
+#define XLOG_DBASE_CREATE_FILE_COPY		0x00
+#define XLOG_DBASE_CREATE_WAL_LOG		0x10
+#define XLOG_DBASE_DROP					0x20
 
-typedef struct xl_dbase_create_rec
+/*
+ * Single WAL record for an entire CREATE DATABASE operation. This is used
+ * by the FILE_COPY strategy.
+ */
+typedef struct xl_dbase_create_file_copy_rec
 {
-	/* Records copying of a single subdirectory incl. contents */
 	Oid			db_id;
 	Oid			tablespace_id;
 	Oid			src_db_id;
 	Oid			src_tablespace_id;
-} xl_dbase_create_rec;
+} xl_dbase_create_file_copy_rec;
+
+/*
+ * WAL record for the beginning of a CREATE DATABASE operation, when the
+ * WAL_LOG strategy is used. Each individual block will be logged separately
+ * afterward.
+ */
+typedef struct xl_dbase_create_wal_log_rec
+{
+	Oid			db_id;
+	Oid			tablespace_id;
+} xl_dbase_create_wal_log_rec;
 
 typedef struct xl_dbase_drop_rec
 {
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index dd01841..a6b657f 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -184,7 +184,8 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BufferAccessStrategy strategy);
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 										ForkNumber forkNum, BlockNumber blockNum,
-										ReadBufferMode mode, BufferAccessStrategy strategy);
+										ReadBufferMode mode, BufferAccessStrategy strategy,
+										bool permanent);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
@@ -203,6 +204,9 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
 extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels);
+extern void CreateAndCopyRelationData(RelFileNode src_rnode,
+									  RelFileNode dst_rnode,
+									  bool permanent);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
 								   int nforks, BlockNumber *firstDelBlock);
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 49edbcc..be1d2c9 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -38,6 +38,7 @@ extern void RelationInitLockInfo(Relation relation);
 
 /* Lock a relation */
 extern void LockRelationOid(Oid relid, LOCKMODE lockmode);
+extern void LockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern bool ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode);
 extern void UnlockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern void UnlockRelationOid(Oid relid, LOCKMODE lockmode);
diff --git a/src/include/utils/relmapper.h b/src/include/utils/relmapper.h
index 9fbb5a7..f10353e 100644
--- a/src/include/utils/relmapper.h
+++ b/src/include/utils/relmapper.h
@@ -38,7 +38,9 @@ typedef struct xl_relmap_update
 extern Oid	RelationMapOidToFilenode(Oid relationId, bool shared);
 
 extern Oid	RelationMapFilenodeToOid(Oid relationId, bool shared);
-
+extern Oid RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId);
+extern void RelationMapCopy(Oid dbid, Oid tsid, char *srcdbpath,
+							char *dstdbpath);
 extern void RelationMapUpdateMap(Oid relationId, Oid fileNode, bool shared,
 								 bool immediate);
 
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 1c39ce0..d870c59 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -218,6 +218,7 @@ typedef enum
 	WAIT_EVENT_TWOPHASE_FILE_READ,
 	WAIT_EVENT_TWOPHASE_FILE_SYNC,
 	WAIT_EVENT_TWOPHASE_FILE_WRITE,
+	WAIT_EVENT_VERSION_FILE_WRITE,
 	WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ,
 	WAIT_EVENT_WAL_BOOTSTRAP_SYNC,
 	WAIT_EVENT_WAL_BOOTSTRAP_WRITE,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 4968803..4c9f927 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -461,6 +461,8 @@ CoverPos
 CreateAmStmt
 CreateCastStmt
 CreateConversionStmt
+CreateDBRelInfo
+CreateDBStrategy
 CreateDomainStmt
 CreateEnumStmt
 CreateEventTrigStmt
@@ -3701,7 +3703,8 @@ xl_btree_update
 xl_btree_vacuum
 xl_clog_truncate
 xl_commit_ts_truncate
-xl_dbase_create_rec
+xl_dbase_create_file_copy_rec
+xl_dbase_create_wal_log_rec
 xl_dbase_drop_rec
 xl_end_of_recovery
 xl_hash_add_ovfl_page
-- 
1.8.3.1

#194

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#193)

1 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Mar 24, 2022 at 1:29 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

> In the latest version I have fixed this issue by using a
> non-conflicting name, because when compiled with ICU, foobar5 was
> already in use and we were seeing a failure. Apart from this, I have
> fixed the duplicate cleanup problem by passing an extra parameter to
> RelationCreateStorage, which decides whether to register for on-abort
> delete or not, and added comments explaining this. IMHO this looks
> like the cleanest way to do it; please check the patch and let me know
> your thoughts.

I think that might be an OK way to do it. I think if we were starting
from scratch we'd probably want to come up with some better system,
but that's true of a lot of things.

I went over your version and changed some comments. I also added
documentation for the new wait event. Here's a new version.
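
For readers skimming the thread, the extra RelationCreateStorage parameter discussed above can be pictured with a small standalone sketch. This is not the PostgreSQL code: PendingDelete, create_storage, and abort_cleanup are hypothetical stand-ins, and the only point shown is that storage is queued for on-abort removal only when the caller asks for it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a pending-delete list entry. */
typedef struct PendingDelete
{
	unsigned int relnode;		/* storage to remove if we abort */
	struct PendingDelete *next;
} PendingDelete;

static PendingDelete *pending_deletes = NULL;

/*
 * Pretend to create storage for relnode.  The storage is queued for
 * removal at abort only when register_delete is true; a caller that
 * cleans up in some other way passes false.
 */
static void
create_storage(unsigned int relnode, bool register_delete)
{
	if (register_delete)
	{
		PendingDelete *p = malloc(sizeof(PendingDelete));

		if (p == NULL)
			abort();
		p->relnode = relnode;
		p->next = pending_deletes;
		pending_deletes = p;
	}
}

/* Simulate an abort: drop every queued entry and return how many there were. */
static int
abort_cleanup(void)
{
	int			removed = 0;

	while (pending_deletes != NULL)
	{
		PendingDelete *next = pending_deletes->next;

		free(pending_deletes);
		pending_deletes = next;
		removed++;
	}
	return removed;
}
```

In this sketch, a caller that passes false is expected to have its own cleanup path, which is how the quoted message describes avoiding duplicate cleanup.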

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

v6-0001-Add-new-block-by-block-strategy-for-CREATE-DATABA.patch (application/octet-stream)

From 25520f52de555c057edfd0e480c28d797b4b4af7 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 24 Mar 2022 11:55:46 -0400
Subject: [PATCH v6] Add new block-by-block strategy for CREATE DATABASE.

Because this strategy logs changes on a block-by-block basis, it
avoids the need to checkpoint before and after the operation.
However, because it logs each changed block individually, it might
generate a lot of extra write-ahead logging if the template database
is large. Therefore, the older strategy remains available via a new
STRATEGY parameter to CREATE DATABASE, and a corresponding --strategy
option to createdb.
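
As a rough illustration of the STRATEGY parameter described above, the following standalone sketch maps a strategy name to an enum value. It is not taken from the patch: SketchStrategy, equals_ignore_case, and parse_strategy are hypothetical names, case-insensitive matching is an assumption, and the lowercase spellings are assumed from the WAL_LOG and FILE_COPY names in the documentation. WAL_LOG is treated as the default because the documentation change in this patch says so.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the two strategies named in the commit message. */
typedef enum SketchStrategy
{
	SKETCH_WAL_LOG,				/* copy block by block, WAL-log each block */
	SKETCH_FILE_COPY			/* filesystem copy plus checkpoints */
} SketchStrategy;

/* Compare two strings, ignoring ASCII case. */
static bool
equals_ignore_case(const char *a, const char *b)
{
	while (*a != '\0' && *b != '\0')
	{
		if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
			return false;
		a++;
		b++;
	}
	return *a == '\0' && *b == '\0';
}

/*
 * Translate a strategy name into an enum value.  A NULL name selects the
 * default (WAL_LOG).  Returns false for an unrecognized name.
 */
static bool
parse_strategy(const char *name, SketchStrategy *result)
{
	if (name == NULL || equals_ignore_case(name, "wal_log"))
	{
		*result = SKETCH_WAL_LOG;
		return true;
	}
	if (equals_ignore_case(name, "file_copy"))
	{
		*result = SKETCH_FILE_COPY;
		return true;
	}
	return false;
}
```

A caller would report an error when parse_strategy returns false rather than silently falling back to the default.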

Somewhat controversially, this patch assembles the list of relations
to be copied to the new database by reading the pg_class relation of
the template database. Cross-database access like this isn't normally
possible, but it can be made to work here because there can't be any
connections to the database being copied, nor can it contain any
in-doubt transactions. Even so, we have to use lower-level interfaces
than normal, since the table scan and relcache interfaces will not
work for a database to which we're not connected. The advantage of
this approach is that we do not need to rely on the filesystem to
determine what ought to be copied, but instead on PostgreSQL's own
knowledge of the database structure. This avoids, for example,
copying stray files that happen to be located in the source database
directory.
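
The selection of relations described above can be sketched as a predicate over a simplified pg_class row, plus the tablespace mapping applied to each selected relation. This is a standalone approximation of the logic in ScanSourceDatabasePgClassTuple and CreateDatabaseUsingWalLog below, not the patch code itself: SketchClassRow, relkind_has_storage, needs_copy, and dest_tablespace are hypothetical names, and the relkind letters listed are an assumption about which relation kinds have storage.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_GLOBAL_TABLESPACE 1664u	/* OID of the pg_global tablespace */

/* Simplified stand-in for the pg_class columns the filter looks at. */
typedef struct SketchClassRow
{
	unsigned int reltablespace; /* 0 means the database default tablespace */
	char		relkind;		/* 'r' table, 'i' index, 'v' view, ... */
	char		relpersistence; /* 'p' permanent, 'u' unlogged, 't' temp */
} SketchClassRow;

/* Assumed set of relation kinds that have on-disk storage. */
static bool
relkind_has_storage(char relkind)
{
	return relkind == 'r' || relkind == 'i' || relkind == 'S' ||
		relkind == 't' || relkind == 'm';
}

/* Skip shared relations, relations without storage, and temp relations. */
static bool
needs_copy(const SketchClassRow *row)
{
	if (row->reltablespace == SKETCH_GLOBAL_TABLESPACE)
		return false;
	if (!relkind_has_storage(row->relkind))
		return false;
	if (row->relpersistence == 't')
		return false;
	return true;
}

/*
 * Relations in the source database's default tablespace move to the
 * destination database's default tablespace; all others stay put.
 */
static unsigned int
dest_tablespace(unsigned int src_spc, unsigned int src_default,
				unsigned int dst_default)
{
	return (src_spc == src_default) ? dst_default : src_spc;
}
```

Because the decision depends only on catalog contents, a stray file in the source directory never enters the list, which is the advantage the paragraph above points out.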

Dilip Kumar, with a fairly large number of cosmetic changes by me.
---
 contrib/bloom/blinsert.c                 |   2 +-
 doc/src/sgml/monitoring.sgml             |   4 +
 doc/src/sgml/ref/create_database.sgml    |  22 +
 doc/src/sgml/ref/createdb.sgml           |  11 +
 src/backend/access/heap/heapam_handler.c |   6 +-
 src/backend/access/nbtree/nbtree.c       |   2 +-
 src/backend/access/rmgrdesc/dbasedesc.c  |  20 +-
 src/backend/access/transam/xlogutils.c   |   6 +-
 src/backend/catalog/heap.c               |   2 +-
 src/backend/catalog/storage.c            |  34 +-
 src/backend/commands/dbcommands.c        | 769 +++++++++++++++++++----
 src/backend/commands/tablecmds.c         |   2 +-
 src/backend/storage/buffer/bufmgr.c      | 172 ++++-
 src/backend/storage/lmgr/lmgr.c          |  28 +
 src/backend/utils/activity/wait_event.c  |   3 +
 src/backend/utils/cache/relcache.c       |   2 +-
 src/backend/utils/cache/relmapper.c      |  64 ++
 src/bin/pg_rewind/parsexlog.c            |   9 +-
 src/bin/psql/tab-complete.c              |   4 +-
 src/bin/scripts/createdb.c               |  10 +-
 src/bin/scripts/t/020_createdb.pl        |  20 +
 src/include/catalog/storage.h            |   4 +-
 src/include/commands/dbcommands_xlog.h   |  25 +-
 src/include/storage/bufmgr.h             |   6 +-
 src/include/storage/lmgr.h               |   1 +
 src/include/utils/relmapper.h            |   4 +-
 src/include/utils/wait_event.h           |   1 +
 src/tools/pgindent/typedefs.list         |   5 +-
 28 files changed, 1081 insertions(+), 157 deletions(-)

diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index c94cf34e69..82378db441 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -173,7 +173,7 @@ blbuildempty(Relation index)
 	 * Write the page and log it.  It might seem that an immediate sync would
 	 * be sufficient to guarantee that the file exists on disk, but recovery
 	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
+	 * XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
 	PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 35b2923c5e..562f59f82a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1502,6 +1502,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>TwophaseFileWrite</literal></entry>
       <entry>Waiting for a write of a two phase state file.</entry>
      </row>
+     <row>
+      <entry><literal>VersionFileWrite</literal></entry>
+      <entry>Waiting for the version file to be written while creating a database.</entry>
+     </row>
      <row>
       <entry><literal>WALBootstrapSync</literal></entry>
       <entry>Waiting for WAL to reach durable storage during
diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index 5ae785ab95..255ad3a1ce 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -25,6 +25,7 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
     [ [ WITH ] [ OWNER [=] <replaceable class="parameter">user_name</replaceable> ]
            [ TEMPLATE [=] <replaceable class="parameter">template</replaceable> ]
            [ ENCODING [=] <replaceable class="parameter">encoding</replaceable> ]
+           [ STRATEGY [=] <replaceable class="parameter">strategy</replaceable> ]
            [ LOCALE [=] <replaceable class="parameter">locale</replaceable> ]
            [ LC_COLLATE [=] <replaceable class="parameter">lc_collate</replaceable> ]
            [ LC_CTYPE [=] <replaceable class="parameter">lc_ctype</replaceable> ]
@@ -118,6 +119,27 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
        </para>
       </listitem>
      </varlistentry>
+     <varlistentry id="create-database-strategy" xreflabel="CREATE DATABASE STRATEGY">
+      <term><replaceable class="parameter">strategy</replaceable></term>
+      <listitem>
+       <para>
+        Strategy to be used in creating the new database.  If
+        the <literal>WAL_LOG</literal> strategy is used, the database will be
+        copied block by block and each block will be separately written
+        to the write-ahead log. This is the most efficient strategy in
+        cases where the template database is small, and therefore it is the
+        default. The older <literal>FILE_COPY</literal> strategy is also
+        available. This strategy writes a small record to the write-ahead log
+        for each tablespace used by the target database. Each such record
+        represents copying an entire directory to a new location at the
+        filesystem level. While this does reduce the write-ahead
+        log volume substantially, especially if the template database is large,
+        it also forces the system to perform a checkpoint both before and
+        after the creation of the new database. In some situations, this may
+        have a noticeable negative impact on overall system performance.
+       </para>
+      </listitem>
+     </varlistentry>
      <varlistentry>
       <term><replaceable class="parameter">locale</replaceable></term>
       <listitem>
diff --git a/doc/src/sgml/ref/createdb.sgml b/doc/src/sgml/ref/createdb.sgml
index be42e502d6..671cd362d9 100644
--- a/doc/src/sgml/ref/createdb.sgml
+++ b/doc/src/sgml/ref/createdb.sgml
@@ -177,6 +177,17 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>-S <replaceable class="parameter">strategy</replaceable></option></term>
+      <term><option>--strategy=<replaceable class="parameter">strategy</replaceable></option></term>
+      <listitem>
+       <para>
+        Specifies the database creation strategy.  See
+        <xref linkend="create-database-strategy" /> for more details.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>-T <replaceable class="parameter">template</replaceable></option></term>
       <term><option>--template=<replaceable class="parameter">template</replaceable></option></term>
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 39ef8a0b77..dee264e859 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -593,7 +593,7 @@ heapam_relation_set_new_filenode(Relation rel,
 	 */
 	*minmulti = GetOldestMultiXactId();
 
-	srel = RelationCreateStorage(*newrnode, persistence);
+	srel = RelationCreateStorage(*newrnode, persistence, true);
 
 	/*
 	 * If required, set up an init fork for an unlogged table so that it can
@@ -601,7 +601,7 @@ heapam_relation_set_new_filenode(Relation rel,
 	 * even if the page has been logged, because the write did not go through
 	 * shared_buffers and therefore a concurrent checkpoint may have moved the
 	 * redo pointer past our xlog record.  Recovery may as well remove it
-	 * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
+	 * while replaying, for example, XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE
 	 * record. Therefore, logging is necessary even if wal_level=minimal.
 	 */
 	if (persistence == RELPERSISTENCE_UNLOGGED)
@@ -645,7 +645,7 @@ heapam_relation_copy_data(Relation rel, const RelFileNode *newrnode)
 	 * NOTE: any conflict in relfilenode value will be caught in
 	 * RelationCreateStorage().
 	 */
-	RelationCreateStorage(*newrnode, rel->rd_rel->relpersistence);
+	RelationCreateStorage(*newrnode, rel->rd_rel->relpersistence, true);
 
 	/* copy main fork */
 	RelationCopyStorage(RelationGetSmgr(rel), dstrel, MAIN_FORKNUM,
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c9b4964c1e..dacf3f7a58 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -161,7 +161,7 @@ btbuildempty(Relation index)
 	 * Write the page and log it.  It might seem that an immediate sync would
 	 * be sufficient to guarantee that the file exists on disk, but recovery
 	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
+	 * XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
 	PageSetChecksumInplace(metapage, BTREE_METAPAGE);
diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index 03af3fdbcf..523d0b3c1d 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -24,14 +24,23 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 	char	   *rec = XLogRecGetData(record);
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
 
-	if (info == XLOG_DBASE_CREATE)
+	if (info == XLOG_DBASE_CREATE_FILE_COPY)
 	{
-		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;
+		xl_dbase_create_file_copy_rec *xlrec =
+		(xl_dbase_create_file_copy_rec *) rec;
 
 		appendStringInfo(buf, "copy dir %u/%u to %u/%u",
 						 xlrec->src_tablespace_id, xlrec->src_db_id,
 						 xlrec->tablespace_id, xlrec->db_id);
 	}
+	else if (info == XLOG_DBASE_CREATE_WAL_LOG)
+	{
+		xl_dbase_create_wal_log_rec *xlrec =
+		(xl_dbase_create_wal_log_rec *) rec;
+
+		appendStringInfo(buf, "create dir %u/%u",
+						 xlrec->tablespace_id, xlrec->db_id);
+	}
 	else if (info == XLOG_DBASE_DROP)
 	{
 		xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) rec;
@@ -51,8 +60,11 @@ dbase_identify(uint8 info)
 
 	switch (info & ~XLR_INFO_MASK)
 	{
-		case XLOG_DBASE_CREATE:
-			id = "CREATE";
+		case XLOG_DBASE_CREATE_FILE_COPY:
+			id = "CREATE_FILE_COPY";
+			break;
+		case XLOG_DBASE_CREATE_WAL_LOG:
+			id = "CREATE_WAL_LOG";
 			break;
 		case XLOG_DBASE_DROP:
 			id = "DROP";
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 511f2f186f..a4dedc58b7 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -484,7 +484,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	{
 		/* page exists in file */
 		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-										   mode, NULL);
+										   mode, NULL, true);
 	}
 	else
 	{
@@ -509,7 +509,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				ReleaseBuffer(buffer);
 			}
 			buffer = ReadBufferWithoutRelcache(rnode, forknum,
-											   P_NEW, mode, NULL);
+											   P_NEW, mode, NULL, true);
 		}
 		while (BufferGetBlockNumber(buffer) < blkno);
 		/* Handle the corner case that P_NEW returns non-consecutive pages */
@@ -519,7 +519,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			ReleaseBuffer(buffer);
 			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-											   mode, NULL);
+											   mode, NULL, true);
 		}
 	}
 
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 7e99de88b3..32be51222e 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -403,7 +403,7 @@ heap_create(const char *relname,
 											relpersistence,
 											relfrozenxid, relminmxid);
 		else if (RELKIND_HAS_STORAGE(rel->rd_rel->relkind))
-			RelationCreateStorage(rel->rd_node, relpersistence);
+			RelationCreateStorage(rel->rd_node, relpersistence, true);
 		else
 			Assert(false);
 	}
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 9b8075536a..f9304d1ccd 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -112,12 +112,14 @@ AddPendingSync(const RelFileNode *rnode)
  * modules that need them.
  *
  * This function is transactional. The creation is WAL-logged, and if the
- * transaction aborts later on, the storage will be destroyed.
+ * transaction aborts later on, the storage will be destroyed.  A caller
+ * that does not want the storage to be destroyed in case of an abort may
+ * pass register_delete = false.
  */
 SMgrRelation
-RelationCreateStorage(RelFileNode rnode, char relpersistence)
+RelationCreateStorage(RelFileNode rnode, char relpersistence,
+					  bool register_delete)
 {
-	PendingRelDelete *pending;
 	SMgrRelation srel;
 	BackendId	backend;
 	bool		needs_wal;
@@ -149,15 +151,23 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
 	if (needs_wal)
 		log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM);
 
-	/* Add the relation to the list of stuff to delete at abort */
-	pending = (PendingRelDelete *)
-		MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
-	pending->relnode = rnode;
-	pending->backend = backend;
-	pending->atCommit = false;	/* delete if abort */
-	pending->nestLevel = GetCurrentTransactionNestLevel();
-	pending->next = pendingDeletes;
-	pendingDeletes = pending;
+	/*
+	 * Add the relation to the list of stuff to delete at abort, if we are
+	 * asked to do so.
+	 */
+	if (register_delete)
+	{
+		PendingRelDelete *pending;
+
+		pending = (PendingRelDelete *)
+			MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+		pending->relnode = rnode;
+		pending->backend = backend;
+		pending->atCommit = false;	/* delete if abort */
+		pending->nestLevel = GetCurrentTransactionNestLevel();
+		pending->next = pendingDeletes;
+		pendingDeletes = pending;
+	}
 
 	if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
 	{
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 623e5ec778..df16533901 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -63,13 +63,31 @@
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/pg_locale.h"
+#include "utils/relmapper.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
+/*
+ * Create database strategy.
+ *
+ * CREATEDB_WAL_LOG will copy the database at the block level and WAL log each
+ * copied block.
+ *
+ * CREATEDB_FILE_COPY will simply perform a file system level copy of the
+ * database and log a single record for each tablespace copied. To make this
+ * safe, it also triggers checkpoints before and after the operation.
+ */
+typedef enum CreateDBStrategy
+{
+	CREATEDB_WAL_LOG,
+	CREATEDB_FILE_COPY
+} CreateDBStrategy;
+
 typedef struct
 {
 	Oid			src_dboid;		/* source (template) DB */
 	Oid			dest_dboid;		/* DB we are trying to create */
+	CreateDBStrategy strategy;	/* create db strategy */
 } createdb_failure_params;
 
 typedef struct
@@ -78,6 +96,17 @@ typedef struct
 	Oid			dest_tsoid;		/* tablespace we are trying to move to */
 } movedb_failure_params;
 
+/*
+ * Information about a relation to be copied when creating a database.
+ */
+typedef struct CreateDBRelInfo
+{
+	RelFileNode rnode;			/* physical relation identifier */
+	Oid			reloid;			/* relation oid */
+	bool		permanent;		/* relation is permanent or unlogged */
+} CreateDBRelInfo;
+
+
 /* non-export function prototypes */
 static void createdb_failure_callback(int code, Datum arg);
 static void movedb(const char *dbname, const char *tblspcname);
@@ -93,7 +122,546 @@ static bool have_createdb_privilege(void);
 static void remove_dbtablespaces(Oid db_id);
 static bool check_db_file_conflict(Oid db_id);
 static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
+static void CreateDatabaseUsingWalLog(Oid src_dboid, Oid dboid, Oid src_tsid,
+									  Oid dst_tsid);
+static List *ScanSourceDatabasePgClass(Oid srctbid, Oid srcdbid, char *srcpath);
+static List *ScanSourceDatabasePgClassPage(Page page, Buffer buf, Oid tbid,
+										   Oid dbid, char *srcpath,
+										   List *rnodelist, Snapshot snapshot);
+static CreateDBRelInfo *ScanSourceDatabasePgClassTuple(HeapTupleData *tuple,
+													   Oid tbid, Oid dbid,
+													   char *srcpath);
+static void CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid,
+									bool isRedo);
+static void CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dboid, Oid src_tsid,
+										Oid dst_tsid);
+
+/*
+ * Create a new database using the WAL_LOG strategy.
+ *
+ * Each copied block is separately written to the write-ahead log.
+ */
+static void
+CreateDatabaseUsingWalLog(Oid src_dboid, Oid dst_dboid,
+						  Oid src_tsid, Oid dst_tsid)
+{
+	char	   *srcpath;
+	char	   *dstpath;
+	List	   *rnodelist = NULL;
+	ListCell   *cell;
+	LockRelId	srcrelid;
+	LockRelId	dstrelid;
+	RelFileNode srcrnode;
+	RelFileNode dstrnode;
+	CreateDBRelInfo *relinfo;
+
+	/* Get source and destination database paths. */
+	srcpath = GetDatabasePath(src_dboid, src_tsid);
+	dstpath = GetDatabasePath(dst_dboid, dst_tsid);
+
+	/* Create database directory and write PG_VERSION file. */
+	CreateDirAndVersionFile(dstpath, dst_dboid, dst_tsid, false);
+
+	/* Copy relmap file from source database to the destination database. */
+	RelationMapCopy(dst_dboid, dst_tsid, srcpath, dstpath);
+
+	/* Get list of relfilenodes to copy from the source database. */
+	rnodelist = ScanSourceDatabasePgClass(src_tsid, src_dboid, srcpath);
+	Assert(rnodelist != NIL);
+
+	/*
+	 * Database IDs will be the same for all relations so set them before
+	 * entering the loop.
+	 */
+	srcrelid.dbId = src_dboid;
+	dstrelid.dbId = dst_dboid;
+
+	/* Loop over our list of relfilenodes and copy each one. */
+	foreach(cell, rnodelist)
+	{
+		relinfo = lfirst(cell);
+		srcrnode = relinfo->rnode;
+
+		/*
+		 * If the relation is from the source db's default tablespace then we
+		 * need to create it in the destination db's default tablespace.
+		 * Otherwise, we need to create in the same tablespace as it is in the
+		 * source database.
+		 */
+		if (srcrnode.spcNode == src_tsid)
+			dstrnode.spcNode = dst_tsid;
+		else
+			dstrnode.spcNode = srcrnode.spcNode;
+
+		dstrnode.dbNode = dst_dboid;
+		dstrnode.relNode = srcrnode.relNode;
+
+		/*
+		 * Acquire locks on source and target relations before copying.
+		 *
+		 * We typically do not read relation data into shared_buffers without
+		 * holding a relation lock. It's unclear what could go wrong if we
+		 * skipped it in this case, because nobody can be modifying either
+		 * the source or destination database at this point, and we have locks
+		 * on both databases, too, but let's take the conservative route.
+		 */
+		dstrelid.relId = srcrelid.relId = relinfo->reloid;
+		LockRelationId(&srcrelid, AccessShareLock);
+		LockRelationId(&dstrelid, AccessShareLock);
+
+		/* Copy relation storage from source to the destination. */
+		CreateAndCopyRelationData(srcrnode, dstrnode, relinfo->permanent);
+
+		/* Release the relation locks. */
+		UnlockRelationId(&srcrelid, AccessShareLock);
+		UnlockRelationId(&dstrelid, AccessShareLock);
+	}
+
+	list_free_deep(rnodelist);
+}
+
+/*
+ * Scan the pg_class table in the source database to identify the relations
+ * that need to be copied to the destination database.
+ *
+ * This is an exception to the usual rule that cross-database access is
+ * not possible. We can make it work here because we know that there are no
+ * connections to the source database and (since there can't be prepared
+ * transactions touching that database) no in-doubt tuples either. This
+ * means that we don't need to worry about pruning removing anything from
+ * under us, and we don't need to be too picky about our snapshot either.
+ * As long as it sees all previously-committed XIDs as committed and all
+ * aborted XIDs as aborted, we should be fine: nothing else is possible
+ * here.
+ *
+ * We can't rely on the relcache for anything here, because that only knows
+ * about the database to which we are connected, and can't handle access to
+ * other databases. That also means we can't rely on the heap scan
+ * infrastructure, which would be a bad idea anyway since it might try
+ * to do things like HOT pruning which we definitely can't do safely in
+ * a database to which we're not even connected.
+ */
+static List *
+ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
+{
+	RelFileNode rnode;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	Buffer		buf;
+	Oid			relfilenode;
+	Page		page;
+	List	   *rnodelist = NIL;
+	LockRelId	relid;
+	Relation	rel;
+	Snapshot	snapshot;
+	BufferAccessStrategy bstrategy;
+
+	/* Get pg_class relfilenode. */
+	relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+													  RelationRelationId);
+
+	/* Don't read data into shared_buffers without holding a relation lock. */
+	relid.dbId = dbid;
+	relid.relId = RelationRelationId;
+	LockRelationId(&relid, AccessShareLock);
+
+	/* Prepare a RelFileNode for the pg_class relation. */
+	rnode.spcNode = tbid;
+	rnode.dbNode = dbid;
+	rnode.relNode = relfilenode;
+
+	/*
+	 * We can't use a real relcache entry for a relation in some other
+	 * database, but since we're only going to access the fields related
+	 * to physical storage, a fake one is good enough. If we didn't do this
+	 * and used the smgr layer directly, we would have to worry about
+	 * invalidations.
+	 */
+	rel = CreateFakeRelcacheEntry(rnode);
+	nblocks = smgrnblocks(RelationGetSmgr(rel), MAIN_FORKNUM);
+	FreeFakeRelcacheEntry(rel);
+
+	/* Use a buffer access strategy since this is a bulk read operation. */
+	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	/*
+	 * As explained in the function header comments, we need a snapshot that
+	 * will see all committed transactions as committed, and our transaction
+	 * snapshot - or the active snapshot - might not be new enough for that,
+	 * but the return value of GetLatestSnapshot() should work fine.
+	 */
+	snapshot = GetLatestSnapshot();
+
+	/* Process the relation block by block. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
+										RBM_NORMAL, bstrategy, false);
+
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buf);
+		if (PageIsNew(page) || PageIsEmpty(page))
+		{
+			UnlockReleaseBuffer(buf);
+			continue;
+		}
+
+		/* Append relevant pg_class tuples for current page to rnodelist. */
+		rnodelist = ScanSourceDatabasePgClassPage(page, buf, tbid, dbid,
+												  srcpath, rnodelist,
+												  snapshot);
+
+		UnlockReleaseBuffer(buf);
+	}
+
+	/* Release relation lock. */
+	UnlockRelationId(&relid, AccessShareLock);
+
+	return rnodelist;
+}
+
+/*
+ * Scan one page of the source database's pg_class relation and add relevant
+ * entries to rnodelist. The return value is the updated list.
+ */
+static List *
+ScanSourceDatabasePgClassPage(Page page, Buffer buf, Oid tbid, Oid dbid,
+							  char *srcpath, List *rnodelist,
+							  Snapshot snapshot)
+{
+	BlockNumber		blkno = BufferGetBlockNumber(buf);
+	OffsetNumber	offnum;
+	OffsetNumber	maxoff;
+	HeapTupleData	tuple;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Loop over offsets. */
+	for (offnum = FirstOffsetNumber;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid;
+
+		itemid = PageGetItemId(page, offnum);
+
+		/* Nothing to do if slot is empty or already dead. */
+		if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid) ||
+			ItemIdIsRedirected(itemid))
+			continue;
+
+		Assert(ItemIdIsNormal(itemid));
+		ItemPointerSet(&(tuple.t_self), blkno, offnum);
+
+		/* Initialize a HeapTupleData structure. */
+		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+		tuple.t_len = ItemIdGetLength(itemid);
+		tuple.t_tableOid = RelationRelationId;
+
+		/* Skip tuples that are not visible to this snapshot. */
+		if (HeapTupleSatisfiesVisibility(&tuple, snapshot, buf))
+		{
+			CreateDBRelInfo *relinfo;
+
+			/*
+			 * ScanSourceDatabasePgClassTuple is in charge of constructing
+			 * a CreateDBRelInfo object for this tuple, but can also decide
+			 * that this tuple isn't something we need to copy. If we do need
+			 * to copy the relation, add it to the list.
+			 */
+			relinfo = ScanSourceDatabasePgClassTuple(&tuple, tbid, dbid,
+													 srcpath);
+			if (relinfo != NULL)
+				rnodelist = lappend(rnodelist, relinfo);
+		}
+	}
 
+	return rnodelist;
+}
+
+/*
+ * Decide whether a certain pg_class tuple represents something that
+ * needs to be copied from the source database to the destination database,
+ * and if so, construct a CreateDBRelInfo for it.
+ *
+ * Visibility checks are handled by the caller, so our job here is just
+ * to assess the data stored in the tuple.
+ */
+static CreateDBRelInfo *
+ScanSourceDatabasePgClassTuple(HeapTupleData *tuple, Oid tbid, Oid dbid,
+							   char *srcpath)
+{
+	CreateDBRelInfo	   *relinfo;
+	Form_pg_class		classForm;
+	Oid					relfilenode = InvalidOid;
+
+	classForm = (Form_pg_class) GETSTRUCT(tuple);
+
+	/*
+	 * Return NULL if this object does not need to be copied.
+	 *
+	 * Shared objects don't need to be copied, because they are shared.
+	 * Objects without storage can't be copied, because there's nothing to
+	 * copy. Temporary relations don't need to be copied either, because
+	 * they are inaccessible outside of the session that created them,
+	 * which must be gone already, and couldn't connect to a different database
+	 * if it still existed. autovacuum will eventually remove the pg_class
+	 * entries as well.
+	 */
+	if (classForm->reltablespace == GLOBALTABLESPACE_OID ||
+		!RELKIND_HAS_STORAGE(classForm->relkind) ||
+		classForm->relpersistence == RELPERSISTENCE_TEMP)
+		return NULL;
+
+	/*
+	 * If relfilenode is valid then directly use it.  Otherwise, consult the
+	 * relmap.
+	 */
+	if (OidIsValid(classForm->relfilenode))
+		relfilenode = classForm->relfilenode;
+	else
+		relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+														  classForm->oid);
+
+	/* We must have a valid relfilenode oid. */
+	if (!OidIsValid(relfilenode))
+		elog(ERROR, "relation with OID %u does not have a valid relfilenode",
+			 classForm->oid);
+
+	/* Prepare a rel info element and add it to the list. */
+	relinfo = (CreateDBRelInfo *) palloc(sizeof(CreateDBRelInfo));
+	if (OidIsValid(classForm->reltablespace))
+		relinfo->rnode.spcNode = classForm->reltablespace;
+	else
+		relinfo->rnode.spcNode = tbid;
+
+	relinfo->rnode.dbNode = dbid;
+	relinfo->rnode.relNode = relfilenode;
+	relinfo->reloid = classForm->oid;
+
+	/* Temporary relations were rejected above. */
+	Assert(classForm->relpersistence != RELPERSISTENCE_TEMP);
+	relinfo->permanent =
+		(classForm->relpersistence == RELPERSISTENCE_PERMANENT) ? true : false;
+
+	return relinfo;
+}
+
+/*
+ * Create database directory and write out the PG_VERSION file in the database
+ * path.  If isRedo is true, it's okay for the database directory to exist
+ * already.
+ */
+static void
+CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
+{
+	int			fd;
+	int			nbytes;
+	char		versionfile[MAXPGPATH];
+	char		buf[16];
+
+	/*
+	 * Prepare version data before starting a critical section.
+	 *
+	 * Note that we don't have to copy this from the source database; there's
+	 * only one legal value.
+	 */
+	sprintf(buf, "%s\n", PG_MAJORVERSION);
+	nbytes = strlen(PG_MAJORVERSION) + 1;
+
+	/* If we are not in WAL replay then write the WAL. */
+	if (!isRedo)
+	{
+		xl_dbase_create_wal_log_rec xlrec;
+		XLogRecPtr	lsn;
+
+		START_CRIT_SECTION();
+
+		xlrec.db_id = dbid;
+		xlrec.tablespace_id = tsid;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec),
+						 sizeof(xl_dbase_create_wal_log_rec));
+
+		lsn = XLogInsert(RM_DBASE_ID, XLOG_DBASE_CREATE_WAL_LOG);
+
+		/* As always, WAL must hit the disk before the data update does. */
+		XLogFlush(lsn);
+	}
+
+	/* Create database directory. */
+	if (MakePGDirectory(dbpath) < 0)
+	{
+		/* Failure other than already exists or not in WAL replay? */
+		if (errno != EEXIST || !isRedo)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create directory \"%s\": %m", dbpath)));
+	}
+
+	/*
+	 * Create PG_VERSION file in the database path.  If the file already
+	 * exists and we are in WAL replay then try again to open it in write
+	 * mode.
+	 */
+	snprintf(versionfile, sizeof(versionfile), "%s/%s", dbpath, "PG_VERSION");
+
+	fd = OpenTransientFile(versionfile, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	if (fd < 0 && errno == EEXIST && isRedo)
+		fd = OpenTransientFile(versionfile, O_WRONLY | O_TRUNC | PG_BINARY);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m", versionfile)));
+
+	/* Write PG_MAJORVERSION in the PG_VERSION file. */
+	pgstat_report_wait_start(WAIT_EVENT_VERSION_FILE_WRITE);
+	errno = 0;
+	if ((int) write(fd, buf, nbytes) != nbytes)
+	{
+		/* If write didn't set errno, assume problem is no disk space. */
+		if (errno == 0)
+			errno = ENOSPC;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", versionfile)));
+	}
+	pgstat_report_wait_end();
+
+	/* Close the version file. */
+	CloseTransientFile(fd);
+
+	/* Critical section done. */
+	if (!isRedo)
+		END_CRIT_SECTION();
+}
+
+/*
+ * Create a new database using the FILE_COPY strategy.
+ *
+ * Copy each tablespace at the filesystem level, and log a single WAL record
+ * for each tablespace copied.  This requires a checkpoint before and after the
+ * copy, which may be expensive, but it does greatly reduce WAL generation
+ * if the copied database is large.
+ */
+static void
+CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dst_dboid, Oid src_tsid,
+							Oid dst_tsid)
+{
+	TableScanDesc scan;
+	Relation	rel;
+	HeapTuple	tuple;
+
+	/*
+	 * Force a checkpoint before starting the copy. This will force all dirty
+	 * buffers, including those of unlogged tables, out to disk, to ensure
+	 * source database is up-to-date on disk for the copy.
+	 * FlushDatabaseBuffers() would suffice for that, but we also want to
+	 * process any pending unlink requests. Otherwise, if a checkpoint
+	 * happened while we're copying files, a file might be deleted just when
+	 * we're about to copy it, causing the lstat() call in copydir() to fail
+	 * with ENOENT.
+	 */
+	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
+					  CHECKPOINT_WAIT | CHECKPOINT_FLUSH_ALL);
+
+	/*
+	 * Iterate through all tablespaces of the template database, and copy each
+	 * one to the new database.
+	 */
+	rel = table_open(TableSpaceRelationId, AccessShareLock);
+	scan = table_beginscan_catalog(rel, 0, NULL);
+	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	{
+		Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
+		Oid			srctablespace = spaceform->oid;
+		Oid			dsttablespace;
+		char	   *srcpath;
+		char	   *dstpath;
+		struct stat st;
+
+		/* No need to copy global tablespace */
+		if (srctablespace == GLOBALTABLESPACE_OID)
+			continue;
+
+		srcpath = GetDatabasePath(src_dboid, srctablespace);
+
+		if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
+			directory_is_empty(srcpath))
+		{
+			/* Assume we can ignore it */
+			pfree(srcpath);
+			continue;
+		}
+
+		if (srctablespace == src_tsid)
+			dsttablespace = dst_tsid;
+		else
+			dsttablespace = srctablespace;
+
+		dstpath = GetDatabasePath(dst_dboid, dsttablespace);
+
+		/*
+		 * Copy this subdirectory to the new location
+		 *
+		 * We don't need to copy subdirectories
+		 */
+		copydir(srcpath, dstpath, false);
+
+		/* Record the filesystem change in XLOG */
+		{
+			xl_dbase_create_file_copy_rec xlrec;
+
+			xlrec.db_id = dst_dboid;
+			xlrec.tablespace_id = dsttablespace;
+			xlrec.src_db_id = src_dboid;
+			xlrec.src_tablespace_id = srctablespace;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec,
+							 sizeof(xl_dbase_create_file_copy_rec));
+
+			(void) XLogInsert(RM_DBASE_ID,
+							  XLOG_DBASE_CREATE_FILE_COPY | XLR_SPECIAL_REL_UPDATE);
+		}
+	}
+	table_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	/*
+	 * We force a checkpoint before committing.  This effectively means that
+	 * committed XLOG_DBASE_CREATE_FILE_COPY operations will never need to be
+	 * replayed (at least not in ordinary crash recovery; we still have to
+	 * make the XLOG entry for the benefit of PITR operations). This avoids
+	 * two nasty scenarios:
+	 *
+	 * #1: When PITR is off, we don't XLOG the contents of newly created
+	 * indexes; therefore the drop-and-recreate-whole-directory behavior of
+	 * DBASE_CREATE replay would lose such indexes.
+	 *
+	 * #2: Since we have to recopy the source database during DBASE_CREATE
+	 * replay, we run the risk of copying changes in it that were committed
+	 * after the original CREATE DATABASE command but before the system crash
+	 * that led to the replay.  This is at least unexpected and at worst could
+	 * lead to inconsistencies, eg duplicate table names.
+	 *
+	 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
+	 *
+	 * In PITR replay, the first of these isn't an issue, and the second is
+	 * only a risk if the CREATE DATABASE and subsequent template database
+	 * change both occur while a base backup is being taken. There doesn't
+	 * seem to be much we can do about that except document it as a
+	 * limitation.
+	 *
+	 * See CreateDatabaseUsingWalLog() for a less cheesy CREATE DATABASE
+	 * strategy that avoids these problems.
+	 */
+	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+}
 
 /*
  * CREATE DATABASE
@@ -101,8 +669,6 @@ static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
 Oid
 createdb(ParseState *pstate, const CreatedbStmt *stmt)
 {
-	TableScanDesc scan;
-	Relation	rel;
 	Oid			src_dboid;
 	Oid			src_owner;
 	int			src_encoding = -1;
@@ -137,6 +703,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	DefElem    *dallowconnections = NULL;
 	DefElem    *dconnlimit = NULL;
 	DefElem    *dcollversion = NULL;
+	DefElem    *dstrategy = NULL;
 	char	   *dbname = stmt->dbname;
 	char	   *dbowner = NULL;
 	const char *dbtemplate = NULL;
@@ -152,6 +719,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	char	   *dbcollversion = NULL;
 	int			notherbackends;
 	int			npreparedxacts;
+	CreateDBStrategy dbstrategy = CREATEDB_WAL_LOG;
 	createdb_failure_params fparms;
 
 	/* Extract options from the statement node tree */
@@ -269,6 +837,12 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 						(errcode(ERRCODE_INVALID_PARAMETER_VALUE)),
 						errmsg("OIDs less than %u are reserved for system objects", FirstNormalObjectId));
 		}
+		else if (strcmp(defel->defname, "strategy") == 0)
+		{
+			if (dstrategy)
+				errorConflictingDefElem(defel, pstate);
+			dstrategy = defel;
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -413,6 +987,23 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 							dbtemplate)));
 	}
 
+	/* Validate the database creation strategy. */
+	if (dstrategy && dstrategy->arg)
+	{
+		char	   *strategy;
+
+		strategy = defGetString(dstrategy);
+		if (strcmp(strategy, "wal_log") == 0)
+			dbstrategy = CREATEDB_WAL_LOG;
+		else if (strcmp(strategy, "file_copy") == 0)
+			dbstrategy = CREATEDB_FILE_COPY;
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("invalid create database strategy %s", strategy),
+					 errhint("Valid strategies are \"wal_log\" and \"file_copy\".")));
+	}
+
 	/* If encoding or locales are defaulted, use source's setting */
 	if (encoding < 0)
 		encoding = src_encoding;
@@ -753,17 +1344,18 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	InvokeObjectPostCreateHook(DatabaseRelationId, dboid, 0);
 
 	/*
-	 * Force a checkpoint before starting the copy. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, if a checkpoint
-	 * happened while we're copying files, a file might be deleted just when
-	 * we're about to copy it, causing the lstat() call in copydir() to fail
-	 * with ENOENT.
+	 * If we're going to be reading data for the to-be-created database
+	 * into shared_buffers, take a lock on it. Nobody should know that this
+	 * database exists yet, but it's good to maintain the invariant that an
+	 * AccessExclusiveLock on the database is sufficient to drop all of its
+	 * buffers without worrying about more being read later.
+	 *
+	 * Note that we need to do this before entering the PG_ENSURE_ERROR_CLEANUP
+	 * block below, because createdb_failure_callback expects this lock to
+	 * be held already.
 	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
+	if (dbstrategy == CREATEDB_WAL_LOG)
+		LockSharedObject(DatabaseRelationId, dboid, 0, AccessShareLock);
 
 	/*
 	 * Once we start copying subdirectories, we need to be able to clean 'em
@@ -774,101 +1366,24 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	 */
 	fparms.src_dboid = src_dboid;
 	fparms.dest_dboid = dboid;
+	fparms.strategy = dbstrategy;
+
 	PG_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
 		/*
-		 * Iterate through all tablespaces of the template database, and copy
-		 * each one to the new database.
-		 */
-		rel = table_open(TableSpaceRelationId, AccessShareLock);
-		scan = table_beginscan_catalog(rel, 0, NULL);
-		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
-		{
-			Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
-			Oid			srctablespace = spaceform->oid;
-			Oid			dsttablespace;
-			char	   *srcpath;
-			char	   *dstpath;
-			struct stat st;
-
-			/* No need to copy global tablespace */
-			if (srctablespace == GLOBALTABLESPACE_OID)
-				continue;
-
-			srcpath = GetDatabasePath(src_dboid, srctablespace);
-
-			if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
-				directory_is_empty(srcpath))
-			{
-				/* Assume we can ignore it */
-				pfree(srcpath);
-				continue;
-			}
-
-			if (srctablespace == src_deftablespace)
-				dsttablespace = dst_deftablespace;
-			else
-				dsttablespace = srctablespace;
-
-			dstpath = GetDatabasePath(dboid, dsttablespace);
-
-			/*
-			 * Copy this subdirectory to the new location
-			 *
-			 * We don't need to copy subdirectories
-			 */
-			copydir(srcpath, dstpath, false);
-
-			/* Record the filesystem change in XLOG */
-			{
-				xl_dbase_create_rec xlrec;
-
-				xlrec.db_id = dboid;
-				xlrec.tablespace_id = dsttablespace;
-				xlrec.src_db_id = src_dboid;
-				xlrec.src_tablespace_id = srctablespace;
-
-				XLogBeginInsert();
-				XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-				(void) XLogInsert(RM_DBASE_ID,
-								  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-			}
-		}
-		table_endscan(scan);
-		table_close(rel, AccessShareLock);
-
-		/*
-		 * We force a checkpoint before committing.  This effectively means
-		 * that committed XLOG_DBASE_CREATE operations will never need to be
-		 * replayed (at least not in ordinary crash recovery; we still have to
-		 * make the XLOG entry for the benefit of PITR operations). This
-		 * avoids two nasty scenarios:
-		 *
-		 * #1: When PITR is off, we don't XLOG the contents of newly created
-		 * indexes; therefore the drop-and-recreate-whole-directory behavior
-		 * of DBASE_CREATE replay would lose such indexes.
-		 *
-		 * #2: Since we have to recopy the source database during DBASE_CREATE
-		 * replay, we run the risk of copying changes in it that were
-		 * committed after the original CREATE DATABASE command but before the
-		 * system crash that led to the replay.  This is at least unexpected
-		 * and at worst could lead to inconsistencies, eg duplicate table
-		 * names.
-		 *
-		 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
-		 *
-		 * In PITR replay, the first of these isn't an issue, and the second
-		 * is only a risk if the CREATE DATABASE and subsequent template
-		 * database change both occur while a base backup is being taken.
-		 * There doesn't seem to be much we can do about that except document
-		 * it as a limitation.
-		 *
-		 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
-		 * we can avoid this.
+		 * If the user has asked to create a database with the WAL_LOG
+		 * strategy, call CreateDatabaseUsingWalLog(), which copies the
+		 * database at the block level and WAL-logs each copied block.
+		 * Otherwise, call CreateDatabaseUsingFileCopy(), which copies the
+		 * database file by file.
 		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+		if (dbstrategy == CREATEDB_WAL_LOG)
+			CreateDatabaseUsingWalLog(src_dboid, dboid, src_deftablespace,
+									  dst_deftablespace);
+		else
+			CreateDatabaseUsingFileCopy(src_dboid, dboid, src_deftablespace,
+										dst_deftablespace);
 
 		/*
 		 * Close pg_database, but keep lock till commit.
@@ -954,6 +1469,25 @@ createdb_failure_callback(int code, Datum arg)
 {
 	createdb_failure_params *fparms = (createdb_failure_params *) DatumGetPointer(arg);
 
+	/*
+	 * If we were copying the database at the block level, drop the pages of
+	 * the destination database that are in the shared buffer cache, and tell
+	 * the checkpointer to forget any pending fsync and unlink requests for
+	 * files in the database.  The reasoning is the same as explained in
+	 * dropdb().  Unlike dropdb(), we don't need to call
+	 * pgstat_drop_database(), because the database has not been created yet
+	 * and so there should not be any stats for it.
+	 */
+	if (fparms->strategy == CREATEDB_WAL_LOG)
+	{
+		DropDatabaseBuffers(fparms->dest_dboid);
+		ForgetDatabaseSyncRequests(fparms->dest_dboid);
+
+		/* Release lock on the target database. */
+		UnlockSharedObject(DatabaseRelationId, fparms->dest_dboid, 0,
+						   AccessShareLock);
+	}
+
 	/*
 	 * Release lock on source database before doing recursive remove. This is
 	 * not essential but it seems desirable to release the lock as soon as
@@ -1478,7 +2012,7 @@ movedb(const char *dbname, const char *tblspcname)
 		 * Record the filesystem change in XLOG
 		 */
 		{
-			xl_dbase_create_rec xlrec;
+			xl_dbase_create_file_copy_rec xlrec;
 
 			xlrec.db_id = db_id;
 			xlrec.tablespace_id = dst_tblspcoid;
@@ -1486,10 +2020,11 @@ movedb(const char *dbname, const char *tblspcname)
 			xlrec.src_tablespace_id = src_tblspcoid;
 
 			XLogBeginInsert();
-			XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
+			XLogRegisterData((char *) &xlrec,
+							 sizeof(xl_dbase_create_file_copy_rec));
 
 			(void) XLogInsert(RM_DBASE_ID,
-							  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
+							  XLOG_DBASE_CREATE_FILE_COPY | XLR_SPECIAL_REL_UPDATE);
 		}
 
 		/*
@@ -1525,9 +2060,10 @@ movedb(const char *dbname, const char *tblspcname)
 
 		/*
 		 * Force another checkpoint here.  As in CREATE DATABASE, this is to
-		 * ensure that we don't have to replay a committed XLOG_DBASE_CREATE
-		 * operation, which would cause us to lose any unlogged operations
-		 * done in the new DB tablespace before the next checkpoint.
+		 * ensure that we don't have to replay a committed
+		 * XLOG_DBASE_CREATE_FILE_COPY operation, which would cause us to lose
+		 * any unlogged operations done in the new DB tablespace before the
+		 * next checkpoint.
 		 */
 		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
 
@@ -2478,9 +3014,10 @@ dbase_redo(XLogReaderState *record)
 	/* Backup blocks are not used in dbase records */
 	Assert(!XLogRecHasAnyBlockRefs(record));
 
-	if (info == XLOG_DBASE_CREATE)
+	if (info == XLOG_DBASE_CREATE_FILE_COPY)
 	{
-		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
+		xl_dbase_create_file_copy_rec *xlrec =
+		(xl_dbase_create_file_copy_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
 		struct stat st;
@@ -2515,6 +3052,18 @@ dbase_redo(XLogReaderState *record)
 		 */
 		copydir(src_path, dst_path, false);
 	}
+	else if (info == XLOG_DBASE_CREATE_WAL_LOG)
+	{
+		xl_dbase_create_wal_log_rec *xlrec =
+		(xl_dbase_create_wal_log_rec *) XLogRecGetData(record);
+		char	   *dbpath;
+
+		dbpath = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+
+		/* Create the database directory with the version file. */
+		CreateDirAndVersionFile(dbpath, xlrec->db_id, xlrec->tablespace_id,
+								true);
+	}
 	else if (info == XLOG_DBASE_DROP)
 	{
 		xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) XLogRecGetData(record);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index ab9a53b27c..b93888ca53 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14627,7 +14627,7 @@ index_copy_data(Relation rel, RelFileNode newrnode)
 	 * NOTE: any conflict in relfilenode value will be caught in
 	 * RelationCreateStorage().
 	 */
-	RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
+	RelationCreateStorage(newrnode, rel->rd_rel->relpersistence, true);
 
 	/* copy main fork */
 	RelationCopyStorage(RelationGetSmgr(rel), dstrel, MAIN_FORKNUM,
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c68f8..f5ef6ef7da 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -38,6 +38,7 @@
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
 #include "executor/instrument.h"
 #include "lib/binaryheap.h"
 #include "miscadmin.h"
@@ -486,6 +487,9 @@ static void FindAndDropRelFileNodeBuffers(RelFileNode rnode,
 										  ForkNumber forkNum,
 										  BlockNumber nForkBlock,
 										  BlockNumber firstDelBlock);
+static void RelationCopyStorageUsingBuffer(Relation src, Relation dst,
+										   ForkNumber forkNum,
+										   bool permanent);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int	rnode_comparator(const void *p1, const void *p2);
@@ -772,23 +776,23 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
  * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
  *		a relcache entry for the relation.
  *
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay.  If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
+ * Pass permanent = true for a RELPERSISTENCE_PERMANENT relation, and
+ * permanent = false for a RELPERSISTENCE_UNLOGGED relation. This function
+ * cannot be used for temporary relations (and making that work might be
+ * difficult, unless we only want to read temporary relations for our own
+ * BackendId).
  */
 Buffer
 ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
-						  BufferAccessStrategy strategy)
+						  BufferAccessStrategy strategy, bool permanent)
 {
 	bool		hit;
 
 	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
 
-	Assert(InRecovery);
-
-	return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
+	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -3676,6 +3680,158 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
 	pfree(srels);
 }
 
+/* ---------------------------------------------------------------------
+ *		RelationCopyStorageUsingBuffer
+ *
+ *		Copy fork's data using bufmgr.  Same as RelationCopyStorage but instead
+ *		of using smgrread and smgrextend this will copy using bufmgr APIs.
+ *
+ *		Refer to the comments atop CreateAndCopyRelationData() for details
+ *		about the 'permanent' parameter.
+ * --------------------------------------------------------------------
+ */
+static void
+RelationCopyStorageUsingBuffer(Relation src, Relation dst, ForkNumber forkNum,
+							   bool permanent)
+{
+	Buffer		srcBuf;
+	Buffer		dstBuf;
+	Page		srcPage;
+	Page		dstPage;
+	bool		use_wal;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	BufferAccessStrategy bstrategy_src;
+	BufferAccessStrategy bstrategy_dst;
+
+	/*
+	 * In general, we want to write WAL whenever wal_level > 'minimal', but
+	 * we can skip it when copying any fork of an unlogged relation other
+	 * than the init fork.
+	 */
+	use_wal = XLogIsNeeded() && (permanent || forkNum == INIT_FORKNUM);
+
+	/* Get number of blocks in the source relation. */
+	nblocks = smgrnblocks(RelationGetSmgr(src), forkNum);
+
+	/* Nothing to copy; just return. */
+	if (nblocks == 0)
+		return;
+
+	/* This is a bulk operation, so use buffer access strategies. */
+	bstrategy_src = GetAccessStrategy(BAS_BULKREAD);
+	bstrategy_dst = GetAccessStrategy(BAS_BULKWRITE);
+
+	/* Iterate over each block of the source relation file. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		/* Read block from source relation. */
+		srcBuf = ReadBufferWithoutRelcache(src->rd_node, forkNum, blkno,
+										   RBM_NORMAL, bstrategy_src,
+										   permanent);
+		srcPage = BufferGetPage(srcBuf);
+		if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
+		{
+			ReleaseBuffer(srcBuf);
+			continue;
+		}
+
+		/* Use P_NEW to extend the destination relation. */
+		dstBuf = ReadBufferWithoutRelcache(dst->rd_node, forkNum, P_NEW,
+										   RBM_NORMAL, bstrategy_dst,
+										   permanent);
+		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
+
+		START_CRIT_SECTION();
+
+		/* Copy page data from the source to the destination. */
+		dstPage = BufferGetPage(dstBuf);
+		memcpy(dstPage, srcPage, BLCKSZ);
+		MarkBufferDirty(dstBuf);
+
+		/* WAL-log the copied page. */
+		if (use_wal)
+			log_newpage_buffer(dstBuf, true);
+
+		END_CRIT_SECTION();
+
+		UnlockReleaseBuffer(dstBuf);
+		ReleaseBuffer(srcBuf);
+	}
+}
+
+/* ---------------------------------------------------------------------
+ *		CreateAndCopyRelationData
+ *
+ *		Create destination relation storage and copy all forks from the
+ *		source relation to the destination.
+ *
+ *		Pass permanent as true for permanent relations and false for
+ *		unlogged relations.  Currently this API is not supported for
+ *		temporary relations.
+ * --------------------------------------------------------------------
+ */
+void
+CreateAndCopyRelationData(RelFileNode src_rnode, RelFileNode dst_rnode,
+						  bool permanent)
+{
+	Relation		src_rel;
+	Relation		dst_rel;
+	char			relpersistence;
+
+	/* Set the relpersistence. */
+	relpersistence = permanent ?
+		RELPERSISTENCE_PERMANENT : RELPERSISTENCE_UNLOGGED;
+
+	/*
+	 * We can't use a real relcache entry for a relation in some other
+	 * database, but since we're only going to access the fields related
+	 * to physical storage, a fake one is good enough. If we didn't do this
+	 * and used the smgr layer directly, we would have to worry about
+	 * invalidations.
+	 */
+	src_rel = CreateFakeRelcacheEntry(src_rnode);
+	dst_rel = CreateFakeRelcacheEntry(dst_rnode);
+
+	/*
+	 * Create and copy all forks of the relation.  During create database we
+	 * have a separate cleanup mechanism which deletes complete database
+	 * directory.  Therefore, each individual relation doesn't need to be
+	 * registered for cleanup.
+	 */
+	RelationCreateStorage(dst_rnode, relpersistence, false);
+
+	/* copy main fork. */
+	RelationCopyStorageUsingBuffer(src_rel, dst_rel, MAIN_FORKNUM, permanent);
+
+	/* copy those extra forks that exist */
+	for (ForkNumber forkNum = MAIN_FORKNUM + 1;
+		 forkNum <= MAX_FORKNUM; forkNum++)
+	{
+		if (smgrexists(RelationGetSmgr(src_rel), forkNum))
+		{
+			smgrcreate(RelationGetSmgr(dst_rel), forkNum, false);
+
+			/*
+			 * WAL log creation if the relation is persistent, or this is the
+			 * init fork of an unlogged relation.
+			 */
+			if (permanent || forkNum == INIT_FORKNUM)
+				log_smgrcreate(&dst_rnode, forkNum);
+
+			/* Copy a fork's data, block by block. */
+			RelationCopyStorageUsingBuffer(src_rel, dst_rel, forkNum,
+										   permanent);
+		}
+	}
+
+	/* Release fake relcache entries. */
+	FreeFakeRelcacheEntry(src_rel);
+	FreeFakeRelcacheEntry(dst_rel);
+}
+
 /* ---------------------------------------------------------------------
  *		FlushDatabaseBuffers
  *
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 5ae52dd14d..1543da6162 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -175,6 +175,34 @@ ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode)
 	return true;
 }
 
+/*
+ *		LockRelationId
+ *
+ * Lock, given a LockRelId.  Same as LockRelationOid but takes a LockRelId as
+ * input.
+ */
+void
+LockRelationId(LockRelId *relid, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+	LOCALLOCK  *locallock;
+	LockAcquireResult res;
+
+	SET_LOCKTAG_RELATION(tag, relid->dbId, relid->relId);
+
+	res = LockAcquireExtended(&tag, lockmode, false, false, true, &locallock);
+
+	/*
+	 * Now that we have the lock, check for invalidation messages; see notes
+	 * in LockRelationOid.
+	 */
+	if (res != LOCKACQUIRE_ALREADY_CLEAR)
+	{
+		AcceptInvalidationMessages();
+		MarkLockClear(locallock);
+	}
+}
+
 /*
  *		UnlockRelationId
  *
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index ff46a0e3c7..1c8aba4925 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -705,6 +705,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_TWOPHASE_FILE_WRITE:
 			event_name = "TwophaseFileWrite";
 			break;
+		case WAIT_EVENT_VERSION_FILE_WRITE:
+			event_name = "VersionFileWrite";
+			break;
 		case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ:
 			event_name = "WALSenderTimelineHistoryRead";
 			break;
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index fccffce572..9ddf549562 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -3726,7 +3726,7 @@ RelationSetNewRelfilenode(Relation relation, char persistence)
 		/* handle these directly, at least for now */
 		SMgrRelation srel;
 
-		srel = RelationCreateStorage(newrnode, persistence);
+		srel = RelationCreateStorage(newrnode, persistence, true);
 		smgrclose(srel);
 	}
 	else
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 4d0718f001..dee3387d02 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -251,6 +251,63 @@ RelationMapFilenodeToOid(Oid filenode, bool shared)
 	return InvalidOid;
 }
 
+/*
+ * RelationMapOidToFilenodeForDatabase
+ *
+ * Like RelationMapOidToFilenode, but reads the mapping from the indicated
+ * path instead of using the one for the current database.
+ */
+Oid
+RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId)
+{
+	RelMapFile	map;
+	int			i;
+
+	/* Read the relmap file from the source database. */
+	read_relmap_file(&map, dbpath, false, ERROR);
+
+	/* Iterate over the relmap entries to find the input relation OID. */
+	for (i = 0; i < map.num_mappings; i++)
+	{
+		if (relationId == map.mappings[i].mapoid)
+			return map.mappings[i].mapfilenode;
+	}
+
+	return InvalidOid;
+}
+
+/*
+ * RelationMapCopy
+ *
+ * Copy relmapfile from source db path to the destination db path and WAL log
+ * the operation. This is intended for use in creating a new relmap file
+ * for a database that doesn't have one yet, not for replacing an existing
+ * relmap file.
+ */
+void
+RelationMapCopy(Oid dbid, Oid tsid, char *srcdbpath, char *dstdbpath)
+{
+	RelMapFile map;
+
+	/*
+	 * Read the relmap file from the source database.
+	 */
+	read_relmap_file(&map, srcdbpath, false, ERROR);
+
+	/*
+	 * Write the same data into the destination database's relmap file.
+	 *
+	 * No sinval is needed because no one can be connected to the destination
+	 * database yet. For the same reason, there is no need to acquire
+	 * RelationMappingLock.
+	 *
+	 * There's no point in trying to preserve files here. The new database
+	 * isn't usable yet anyway, and won't ever be if we can't install a
+	 * relmap file.
+	 */
+	write_relmap_file(&map, true, false, false, dbid, tsid, dstdbpath);
+}
+
 /*
  * RelationMapUpdateMap
  *
@@ -1031,6 +1088,13 @@ relmap_redo(XLogReaderState *record)
 		 *
 		 * There shouldn't be anyone else updating relmaps during WAL replay,
 		 * but grab the lock to interlock against load_relmap_file().
+		 *
+		 * Note that we use the same WAL record for updating the relmap of
+		 * an existing database as we do for creating a new database. In
+		 * the latter case, taking the relmap lock and sending sinval messages
+		 * is unnecessary, but harmless. If we wanted to avoid it, we could
+		 * add a flag to the WAL record to indicate which operation is being
+		 * performed.
 		 */
 		LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
 		write_relmap_file(&newmap, false, true, false,
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 7cfa169e9b..bd1ec42ac6 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -370,7 +370,7 @@ extractPageInfo(XLogReaderState *record)
 
 	/* Is this a special record type that I recognize? */
 
-	if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_CREATE)
+	if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_CREATE_FILE_COPY)
 	{
 		/*
 		 * New databases can be safely ignored. It won't be present in the
@@ -382,6 +382,13 @@ extractPageInfo(XLogReaderState *record)
 		 * overwriting the database created in the target system.
 		 */
 	}
+	else if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_CREATE_WAL_LOG)
+	{
+		/*
+		 * New databases can be safely ignored. It won't be present in the
+		 * source system, so it will be deleted.
+		 */
+	}
 	else if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_DROP)
 	{
 		/*
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 183abcc275..ee06b0f0a4 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2773,13 +2773,15 @@ psql_completion(const char *text, int start, int end)
 	/* CREATE DATABASE */
 	else if (Matches("CREATE", "DATABASE", MatchAny))
 		COMPLETE_WITH("OWNER", "TEMPLATE", "ENCODING", "TABLESPACE",
-					  "IS_TEMPLATE",
+					  "IS_TEMPLATE", "STRATEGY",
 					  "ALLOW_CONNECTIONS", "CONNECTION LIMIT",
 					  "LC_COLLATE", "LC_CTYPE", "LOCALE", "OID",
 					  "LOCALE_PROVIDER", "ICU_LOCALE");
 
 	else if (Matches("CREATE", "DATABASE", MatchAny, "TEMPLATE"))
 		COMPLETE_WITH_QUERY(Query_for_list_of_template_databases);
+	else if (Matches("CREATE", "DATABASE", MatchAny, "STRATEGY"))
+		COMPLETE_WITH("WAL_LOG", "FILE_COPY");
 
 	/* CREATE DOMAIN */
 	else if (Matches("CREATE", "DOMAIN", MatchAny))
diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c
index 6f612abf7c..0bffa2f3ee 100644
--- a/src/bin/scripts/createdb.c
+++ b/src/bin/scripts/createdb.c
@@ -34,6 +34,7 @@ main(int argc, char *argv[])
 		{"tablespace", required_argument, NULL, 'D'},
 		{"template", required_argument, NULL, 'T'},
 		{"encoding", required_argument, NULL, 'E'},
+		{"strategy", required_argument, NULL, 'S'},
 		{"lc-collate", required_argument, NULL, 1},
 		{"lc-ctype", required_argument, NULL, 2},
 		{"locale", required_argument, NULL, 'l'},
@@ -60,6 +61,7 @@ main(int argc, char *argv[])
 	char	   *tablespace = NULL;
 	char	   *template = NULL;
 	char	   *encoding = NULL;
+	char	   *strategy = NULL;
 	char	   *lc_collate = NULL;
 	char	   *lc_ctype = NULL;
 	char	   *locale = NULL;
@@ -77,7 +79,7 @@ main(int argc, char *argv[])
 
 	handle_help_version_opts(argc, argv, "createdb", help);
 
-	while ((c = getopt_long(argc, argv, "h:p:U:wWeO:D:T:E:l:", long_options, &optindex)) != -1)
+	while ((c = getopt_long(argc, argv, "h:p:U:wWeO:D:T:E:l:S:", long_options, &optindex)) != -1)
 	{
 		switch (c)
 		{
@@ -111,6 +113,9 @@ main(int argc, char *argv[])
 			case 'E':
 				encoding = pg_strdup(optarg);
 				break;
+			case 'S':
+				strategy = pg_strdup(optarg);
+				break;
 			case 1:
 				lc_collate = pg_strdup(optarg);
 				break;
@@ -215,6 +220,8 @@ main(int argc, char *argv[])
 		appendPQExpBufferStr(&sql, " ENCODING ");
 		appendStringLiteralConn(&sql, encoding, conn);
 	}
+	if (strategy)
+		appendPQExpBuffer(&sql, " STRATEGY %s", fmtId(strategy));
 	if (template)
 		appendPQExpBuffer(&sql, " TEMPLATE %s", fmtId(template));
 	if (lc_collate)
@@ -294,6 +301,7 @@ help(const char *progname)
 	printf(_("      --locale-provider={libc|icu}\n"
 			 "                               locale provider for the database's default collation\n"));
 	printf(_("  -O, --owner=OWNER            database user to own the new database\n"));
+	printf(_("  -S, --strategy=STRATEGY      database creation strategy wal_log or file_copy\n"));
 	printf(_("  -T, --template=TEMPLATE      template database to copy\n"));
 	printf(_("  -V, --version                output version information, then exit\n"));
 	printf(_("  -?, --help                   show this help, then exit\n"));
diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index 35deec9a92..14d3a9563d 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -104,4 +104,24 @@ $node->command_checks_all(
 	],
 	'createdb with incorrect --lc-ctype');
 
+$node->command_checks_all(
+	[ 'createdb', '--strategy', "foo", 'foobar2' ],
+	1,
+	[qr/^$/],
+	[
+		qr/^createdb: error: database creation failed: ERROR:  invalid create database strategy|^createdb: error: database creation failed: ERROR:  invalid create database strategy foo/s
+	],
+	'createdb with incorrect --strategy');
+
+# Check database creation strategy
+$node->issues_sql_like(
+	[ 'createdb', '-T', 'foobar2', 'foobar6', '-S', 'wal_log'],
+	qr/statement: CREATE DATABASE foobar6 STRATEGY wal_log TEMPLATE foobar2/,
+	'create database with WAL_LOG strategy');
+
+$node->issues_sql_like(
+	[ 'createdb', '-T', 'foobar2', 'foobar7', '-S', 'file_copy'],
+	qr/statement: CREATE DATABASE foobar7 STRATEGY file_copy TEMPLATE foobar2/,
+	'create database with FILE_COPY strategy');
+
 done_testing();
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 9ffc741913..844a023b2c 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -22,7 +22,9 @@
 /* GUC variables */
 extern int	wal_skip_threshold;
 
-extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
+extern SMgrRelation RelationCreateStorage(RelFileNode rnode,
+										  char relpersistence,
+										  bool register_delete);
 extern void RelationDropStorage(Relation rel);
 extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
 extern void RelationPreTruncate(Relation rel);
diff --git a/src/include/commands/dbcommands_xlog.h b/src/include/commands/dbcommands_xlog.h
index 593a8578a4..0ee2452feb 100644
--- a/src/include/commands/dbcommands_xlog.h
+++ b/src/include/commands/dbcommands_xlog.h
@@ -18,17 +18,32 @@
 #include "lib/stringinfo.h"
 
 /* record types */
-#define XLOG_DBASE_CREATE		0x00
-#define XLOG_DBASE_DROP			0x10
+#define XLOG_DBASE_CREATE_FILE_COPY		0x00
+#define XLOG_DBASE_CREATE_WAL_LOG		0x10
+#define XLOG_DBASE_DROP					0x20
 
-typedef struct xl_dbase_create_rec
+/*
+ * Single WAL record for an entire CREATE DATABASE operation. This is used
+ * by the FILE_COPY strategy.
+ */
+typedef struct xl_dbase_create_file_copy_rec
 {
-	/* Records copying of a single subdirectory incl. contents */
 	Oid			db_id;
 	Oid			tablespace_id;
 	Oid			src_db_id;
 	Oid			src_tablespace_id;
-} xl_dbase_create_rec;
+} xl_dbase_create_file_copy_rec;
+
+/*
+ * WAL record for the beginning of a CREATE DATABASE operation, when the
+ * WAL_LOG strategy is used. Each individual block will be logged separately
+ * afterward.
+ */
+typedef struct xl_dbase_create_wal_log_rec
+{
+	Oid			db_id;
+	Oid			tablespace_id;
+} xl_dbase_create_wal_log_rec;
 
 typedef struct xl_dbase_drop_rec
 {
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index dd01841c30..a6b657f0ba 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -184,7 +184,8 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BufferAccessStrategy strategy);
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 										ForkNumber forkNum, BlockNumber blockNum,
-										ReadBufferMode mode, BufferAccessStrategy strategy);
+										ReadBufferMode mode, BufferAccessStrategy strategy,
+										bool permanent);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
@@ -203,6 +204,9 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
 extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels);
+extern void CreateAndCopyRelationData(RelFileNode src_rnode,
+									  RelFileNode dst_rnode,
+									  bool permanent);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
 								   int nforks, BlockNumber *firstDelBlock);
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 49edbcc81b..be1d2c99a9 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -38,6 +38,7 @@ extern void RelationInitLockInfo(Relation relation);
 
 /* Lock a relation */
 extern void LockRelationOid(Oid relid, LOCKMODE lockmode);
+extern void LockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern bool ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode);
 extern void UnlockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern void UnlockRelationOid(Oid relid, LOCKMODE lockmode);
diff --git a/src/include/utils/relmapper.h b/src/include/utils/relmapper.h
index 9fbb5a7f9b..f10353e139 100644
--- a/src/include/utils/relmapper.h
+++ b/src/include/utils/relmapper.h
@@ -38,7 +38,9 @@ typedef struct xl_relmap_update
 extern Oid	RelationMapOidToFilenode(Oid relationId, bool shared);
 
 extern Oid	RelationMapFilenodeToOid(Oid relationId, bool shared);
-
+extern Oid RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId);
+extern void RelationMapCopy(Oid dbid, Oid tsid, char *srcdbpath,
+							char *dstdbpath);
 extern void RelationMapUpdateMap(Oid relationId, Oid fileNode, bool shared,
 								 bool immediate);
 
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 1c39ce031a..d870c59263 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -218,6 +218,7 @@ typedef enum
 	WAIT_EVENT_TWOPHASE_FILE_READ,
 	WAIT_EVENT_TWOPHASE_FILE_SYNC,
 	WAIT_EVENT_TWOPHASE_FILE_WRITE,
+	WAIT_EVENT_VERSION_FILE_WRITE,
 	WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ,
 	WAIT_EVENT_WAL_BOOTSTRAP_SYNC,
 	WAIT_EVENT_WAL_BOOTSTRAP_WRITE,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 93d5190508..07472055dd 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -460,6 +460,8 @@ CoverPos
 CreateAmStmt
 CreateCastStmt
 CreateConversionStmt
+CreateDBRelInfo
+CreateDBStrategy
 CreateDomainStmt
 CreateEnumStmt
 CreateEventTrigStmt
@@ -3701,7 +3703,8 @@ xl_btree_update
 xl_btree_vacuum
 xl_clog_truncate
 xl_commit_ts_truncate
-xl_dbase_create_rec
+xl_dbase_create_file_copy_rec
+xl_dbase_create_wal_log_rec
 xl_dbase_drop_rec
 xl_end_of_recovery
 xl_hash_add_ovfl_page
-- 
2.24.3 (Apple Git-128)

#195

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Robert Haas (#194)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Mar 24, 2022 at 9:29 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Mar 24, 2022 at 1:29 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

In the latest version I have fixed this issue by using a non-conflicting
name: when the server was compiled with ICU, foobar5 was already in use
and we were seeing a failure. Apart from this, I have fixed the duplicate
cleanup problem by passing an extra parameter to RelationCreateStorage,
which decides whether to register the storage for on-abort deletion, and
I have added comments explaining it. IMHO this looks like the cleanest
way to do it; please check the patch and let me know your thoughts.
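
The effect of the extra parameter can be pictured with a small standalone sketch. This is not PostgreSQL code: create_storage, PendingDelete, and pending_delete_count are hypothetical names, and the model only assumes that a caller passing false takes responsibility for cleaning up in some other way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for an entry in the on-abort delete list. */
typedef struct PendingDelete
{
	unsigned int relnode;		/* which relation file to remove */
	struct PendingDelete *next;
} PendingDelete;

static PendingDelete *pending_deletes = NULL;

/*
 * Toy storage creation. A real implementation would also create the file
 * and WAL-log it; here we only model the abort-cleanup registration.
 */
static void
create_storage(unsigned int relnode, bool register_delete)
{
	if (register_delete)
	{
		PendingDelete *p = malloc(sizeof(PendingDelete));

		assert(p != NULL);
		p->relnode = relnode;
		p->next = pending_deletes;	/* push onto the list head */
		pending_deletes = p;
	}
}

/* Number of files that would be removed one by one on abort. */
static int
pending_delete_count(void)
{
	int			n = 0;

	for (PendingDelete *p = pending_deletes; p != NULL; p = p->next)
		n++;
	return n;
}
```

In this toy model, calling create_storage(16384, true) followed by create_storage(16385, false) leaves only relation 16384 on the abort list, so the second file is never cleaned up twice by two different mechanisms.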

I think that might be an OK way to do it. I think if we were starting
from scratch we'd probably want to come up with some better system,
but that's true of a lot of things.

Right.

I went over your version and changed some comments. I also added
documentation for the new wait event. Here's a new version.

Thanks, I have gone through your changes in comments and docs and those LGTM.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#196

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#195)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Mar 24, 2022 at 12:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Thanks, I have gone through your changes in comments and docs and those LGTM.

It looks like this patch will need to be updated for Alvaro's commit
49d9cfc68bf4e0d32a948fe72d5a0ef7f464944e. The newly added test
029_replay_tsp_drops.pl fails with this patch applied. The standby log
shows:

2022-03-25 10:00:10.022 EDT [38209] LOG: entering standby mode
2022-03-25 10:00:10.024 EDT [38209] LOG: redo starts at 0/3000028
2022-03-25 10:00:10.062 EDT [38209] FATAL: could not create directory
"pg_tblspc/16385/PG_15_202203241/16390": No such file or directory
2022-03-25 10:00:10.062 EDT [38209] CONTEXT: WAL redo at 0/43EBD88
for Database/CREATE_WAL_LOG: create dir 16385/16390

On a quick look, I'm guessing that XLOG_DBASE_CREATE_WAL_LOG will need
to mirror some of the logic that was added to the replay code for the
existing strategy, but I haven't figured out the details.
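
The FATAL in the standby log is the ordinary result of a one-level mkdir() when an ancestor directory does not exist. A minimal POSIX sketch of that behaviour, and of the "create missing parents" alternative, follows. It is not PostgreSQL code: create_one_level and create_with_parents are hypothetical helpers written only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Create exactly one directory level; return 0 or the errno from mkdir(). */
static int
create_one_level(const char *path)
{
	assert(path != NULL);
	return (mkdir(path, 0700) == 0) ? 0 : errno;
}

/*
 * Create a directory and any missing ancestors, similar in spirit to
 * "mkdir -p". Returns 0 on success or an errno value. EEXIST is treated
 * as success without checking that the existing entry is a directory.
 */
static int
create_with_parents(const char *path)
{
	char		buf[1024];
	size_t		len;

	assert(path != NULL);
	len = strlen(path);
	if (len == 0)
		return EINVAL;
	if (len >= sizeof(buf))
		return ENAMETOOLONG;
	memcpy(buf, path, len + 1);

	/* Create each prefix that ends just before a '/' separator. */
	for (char *p = buf + 1; *p != '\0'; p++)
	{
		if (*p != '/')
			continue;
		*p = '\0';
		if (mkdir(buf, 0700) != 0 && errno != EEXIST)
			return errno;
		*p = '/';
	}

	/* Finally create the full path itself. */
	if (mkdir(buf, 0700) != 0 && errno != EEXIST)
		return errno;
	return 0;
}
```

With a parent that does not exist, create_one_level("missing_parent/16390") returns ENOENT, the same "No such file or directory" condition as in the log, whereas create_with_parents() on the same path would create both levels.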

--
Robert Haas
EDB: http://www.enterprisedb.com

#197

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Robert Haas (#196)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Fri, Mar 25, 2022 at 7:41 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Mar 24, 2022 at 12:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Thanks, I have gone through your changes in comments and docs and those LGTM.

It looks like this patch will need to be updated for Alvaro's commit
49d9cfc68bf4e0d32a948fe72d5a0ef7f464944e. The newly added test
029_replay_tsp_drops.pl fails with this patch applied. The standby log
shows:

2022-03-25 10:00:10.022 EDT [38209] LOG: entering standby mode
2022-03-25 10:00:10.024 EDT [38209] LOG: redo starts at 0/3000028
2022-03-25 10:00:10.062 EDT [38209] FATAL: could not create directory
"pg_tblspc/16385/PG_15_202203241/16390": No such file or directory
2022-03-25 10:00:10.062 EDT [38209] CONTEXT: WAL redo at 0/43EBD88
for Database/CREATE_WAL_LOG: create dir 16385/16390

On a quick look, I'm guessing that XLOG_DBASE_CREATE_WAL_LOG will need
to mirror some of the logic that was added to the replay code for the
existing strategy, but I haven't figured out the details.

Yeah, I think I've got it: for XLOG_DBASE_CREATE_WAL_LOG we will now have
to handle the missing parent directory case, as Alvaro did for
the XLOG_DBASE_CREATE(_FILE_COPY) case.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#198

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#197)

1 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Fri, Mar 25, 2022 at 8:16 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On a quick look, I'm guessing that XLOG_DBASE_CREATE_WAL_LOG will need
to mirror some of the logic that was added to the replay code for the
existing strategy, but I haven't figured out the details.

Yeah, I think I've got it: for XLOG_DBASE_CREATE_WAL_LOG we will now have
to handle the missing parent directory case, as Alvaro did for
the XLOG_DBASE_CREATE(_FILE_COPY) case.

I have updated the patch so that we now also skip the
XLOG_DBASE_CREATE_WAL_LOG record if the tablespace directory is missing.
But with our new wal_log method there will be follow-up WAL records as
well, such as XLOG_RELMAP_UPDATE, XLOG_SMGR_CREATE and XLOG_FPI.

I have put similar logic into the relmap_update WAL replay as well, but
we don't need it for smgr_create or FPI: mdcreate() takes care of
creating the missing directory via TablespaceCreateDbspace(), and an FPI
is only logged after we create the new smgr, at least in the case of
CREATE DATABASE.

Now, is it possible to get an FPI without a preceding smgr_create WAL
record in other cases? If it is, then that problem is orthogonal to this
patch, but in any case I could not find any such scenario.
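
The replay behaviour described above can be summarised as a small decision function. This is only a sketch of the reasoning in this mail, not the patch itself: RecKind, ReplayAction, and replay_action are hypothetical names, and the real redo routines are structured differently.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the records the WAL_LOG strategy emits. */
typedef enum
{
	REC_DBASE_CREATE_WAL_LOG,
	REC_RELMAP_UPDATE,
	REC_SMGR_CREATE,
	REC_FPI
} RecKind;

typedef enum
{
	ACT_APPLY,					/* replay normally */
	ACT_SKIP,					/* tablespace directory is gone: skip */
	ACT_CREATE_PARENTS			/* storage layer recreates missing dirs */
} ReplayAction;

static ReplayAction
replay_action(RecKind kind, bool tablespace_dir_exists)
{
	if (tablespace_dir_exists)
		return ACT_APPLY;

	switch (kind)
	{
		case REC_DBASE_CREATE_WAL_LOG:
		case REC_RELMAP_UPDATE:
			/* Explicit missing-directory handling in the redo code. */
			return ACT_SKIP;
		case REC_SMGR_CREATE:
			/* File creation itself recreates the missing directory. */
			return ACT_CREATE_PARENTS;
		case REC_FPI:

			/*
			 * Assumed not to arise for CREATE DATABASE: the smgr create
			 * record is replayed first, so the directory already exists.
			 */
			return ACT_APPLY;
	}

	assert(false);				/* unknown record kind */
	return ACT_APPLY;
}
```

Under these assumptions, only the first two record kinds need new code in their redo paths; the other two rely on ordering and on existing behaviour in the storage layer.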

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v7-0001-Add-new-block-by-block-strategy-for-CREATE-DATABA.patch (text/x-patch; charset=US-ASCII)

From e0133aa89ca5da8309d07e4236ded4af513d3905 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Sat, 26 Mar 2022 10:35:44 +0530
Subject: [PATCH v7] Add new block-by-block strategy for CREATE DATABASE.

Because this strategy logs changes on a block-by-block basis, it
avoids the need to checkpoint before and after the operation.
However, because it logs each changed block individually, it might
generate a lot of extra write-ahead logging if the template database
is large. Therefore, the older strategy remains available via a new
STRATEGY parameter to CREATE DATABASE, and a corresponding --strategy
option to createdb.

Somewhat controversially, this patch assembles the list of relations
to be copied to the new database by reading the pg_class relation of
the template database. Cross-database access like this isn't normally
possible, but it can be made to work here because there can't be any
connections to the database being copied, nor can it contain any
in-doubt transactions. Even so, we have to use lower-level interfaces
than normal, since the table scan and relcache interfaces will not
work for a database to which we're not connected. The advantage of
this approach is that we do not need to rely on the filesystem to
determine what ought to be copied, but instead on PostgreSQL's own
knowledge of the database structure. This avoids, for example,
copying stray files that happen to be located in the source database
directory.

Dilip Kumar, with a fairly large number of cosmetic changes by me.
---
 contrib/bloom/blinsert.c                 |   2 +-
 doc/src/sgml/monitoring.sgml             |   4 +
 doc/src/sgml/ref/create_database.sgml    |  22 +
 doc/src/sgml/ref/createdb.sgml           |  11 +
 src/backend/access/heap/heapam_handler.c |   6 +-
 src/backend/access/nbtree/nbtree.c       |   2 +-
 src/backend/access/rmgrdesc/dbasedesc.c  |  20 +-
 src/backend/access/transam/xlogutils.c   |   6 +-
 src/backend/catalog/heap.c               |   2 +-
 src/backend/catalog/storage.c            |  34 +-
 src/backend/commands/dbcommands.c        | 795 ++++++++++++++++++++++++++-----
 src/backend/commands/tablecmds.c         |   2 +-
 src/backend/storage/buffer/bufmgr.c      | 172 ++++++-
 src/backend/storage/lmgr/lmgr.c          |  28 ++
 src/backend/utils/activity/wait_event.c  |   3 +
 src/backend/utils/cache/relcache.c       |   2 +-
 src/backend/utils/cache/relmapper.c      |  97 ++++
 src/bin/pg_rewind/parsexlog.c            |   9 +-
 src/bin/psql/tab-complete.c              |   4 +-
 src/bin/scripts/createdb.c               |  10 +-
 src/bin/scripts/t/020_createdb.pl        |  20 +
 src/include/catalog/storage.h            |   4 +-
 src/include/commands/dbcommands_xlog.h   |  25 +-
 src/include/storage/bufmgr.h             |   6 +-
 src/include/storage/lmgr.h               |   1 +
 src/include/utils/relmapper.h            |   4 +-
 src/include/utils/wait_event.h           |   1 +
 src/tools/pgindent/typedefs.list         |   5 +-
 28 files changed, 1140 insertions(+), 157 deletions(-)

diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index c94cf34..82378db 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -173,7 +173,7 @@ blbuildempty(Relation index)
 	 * Write the page and log it.  It might seem that an immediate sync would
 	 * be sufficient to guarantee that the file exists on disk, but recovery
 	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
+	 * XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
 	PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 35b2923..562f59f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1503,6 +1503,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry>Waiting for a write of a two phase state file.</entry>
      </row>
      <row>
+      <entry><literal>VersionFileWrite</literal></entry>
+      <entry>Waiting for the version file to be written while creating a database.</entry>
+     </row>
+     <row>
       <entry><literal>WALBootstrapSync</literal></entry>
       <entry>Waiting for WAL to reach durable storage during
        bootstrapping.</entry>
diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index 5ae785a..255ad3a 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -25,6 +25,7 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
     [ [ WITH ] [ OWNER [=] <replaceable class="parameter">user_name</replaceable> ]
            [ TEMPLATE [=] <replaceable class="parameter">template</replaceable> ]
            [ ENCODING [=] <replaceable class="parameter">encoding</replaceable> ]
+           [ STRATEGY [=] <replaceable class="parameter">strategy</replaceable> ]
            [ LOCALE [=] <replaceable class="parameter">locale</replaceable> ]
            [ LC_COLLATE [=] <replaceable class="parameter">lc_collate</replaceable> ]
            [ LC_CTYPE [=] <replaceable class="parameter">lc_ctype</replaceable> ]
@@ -118,6 +119,27 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
        </para>
       </listitem>
      </varlistentry>
+     <varlistentry id="create-database-strategy" xreflabel="CREATE DATABASE STRATEGY">
+      <term><replaceable class="parameter">strategy</replaceable></term>
+      <listitem>
+       <para>
+        Strategy to be used in creating the new database.  If
+        the <literal>WAL_LOG</literal> strategy is used, the database will be
+        copied block by block and each block will be separately written
+        to the write-ahead log. This is the most efficient strategy in
+        cases where the template database is small, and therefore it is the
+        default. The older <literal>FILE_COPY</literal> strategy is also
+        available. This strategy writes a small record to the write-ahead log
+        for each tablespace used by the target database. Each such record
+        represents copying an entire directory to a new location at the
+        filesystem level. While this does reduce the write-ahead
+        log volume substantially, especially if the template database is large,
+        it also forces the system to perform a checkpoint both before and
+        after the creation of the new database. In some situations, this may
+        have a noticeable negative impact on overall system performance.
+       </para>
+      </listitem>
+     </varlistentry>
      <varlistentry>
       <term><replaceable class="parameter">locale</replaceable></term>
       <listitem>
diff --git a/doc/src/sgml/ref/createdb.sgml b/doc/src/sgml/ref/createdb.sgml
index be42e50..671cd362 100644
--- a/doc/src/sgml/ref/createdb.sgml
+++ b/doc/src/sgml/ref/createdb.sgml
@@ -178,6 +178,17 @@ PostgreSQL documentation
      </varlistentry>
 
      <varlistentry>
+      <term><option>-S <replaceable class="parameter">strategy</replaceable></option></term>
+      <term><option>--strategy=<replaceable class="parameter">strategy</replaceable></option></term>
+      <listitem>
+       <para>
+        Specifies the database creation strategy.  See
+        <xref linkend="create-database-strategy" /> for more details.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><option>-T <replaceable class="parameter">template</replaceable></option></term>
       <term><option>--template=<replaceable class="parameter">template</replaceable></option></term>
       <listitem>
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 39ef8a0..dee264e 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -593,7 +593,7 @@ heapam_relation_set_new_filenode(Relation rel,
 	 */
 	*minmulti = GetOldestMultiXactId();
 
-	srel = RelationCreateStorage(*newrnode, persistence);
+	srel = RelationCreateStorage(*newrnode, persistence, true);
 
 	/*
 	 * If required, set up an init fork for an unlogged table so that it can
@@ -601,7 +601,7 @@ heapam_relation_set_new_filenode(Relation rel,
 	 * even if the page has been logged, because the write did not go through
 	 * shared_buffers and therefore a concurrent checkpoint may have moved the
 	 * redo pointer past our xlog record.  Recovery may as well remove it
-	 * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
+	 * while replaying, for example, XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE
 	 * record. Therefore, logging is necessary even if wal_level=minimal.
 	 */
 	if (persistence == RELPERSISTENCE_UNLOGGED)
@@ -645,7 +645,7 @@ heapam_relation_copy_data(Relation rel, const RelFileNode *newrnode)
 	 * NOTE: any conflict in relfilenode value will be caught in
 	 * RelationCreateStorage().
 	 */
-	RelationCreateStorage(*newrnode, rel->rd_rel->relpersistence);
+	RelationCreateStorage(*newrnode, rel->rd_rel->relpersistence, true);
 
 	/* copy main fork */
 	RelationCopyStorage(RelationGetSmgr(rel), dstrel, MAIN_FORKNUM,
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c9b4964..dacf3f7 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -161,7 +161,7 @@ btbuildempty(Relation index)
 	 * Write the page and log it.  It might seem that an immediate sync would
 	 * be sufficient to guarantee that the file exists on disk, but recovery
 	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
+	 * XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
 	PageSetChecksumInplace(metapage, BTREE_METAPAGE);
diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index 03af3fd..523d0b3 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -24,14 +24,23 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 	char	   *rec = XLogRecGetData(record);
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
 
-	if (info == XLOG_DBASE_CREATE)
+	if (info == XLOG_DBASE_CREATE_FILE_COPY)
 	{
-		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;
+		xl_dbase_create_file_copy_rec *xlrec =
+		(xl_dbase_create_file_copy_rec *) rec;
 
 		appendStringInfo(buf, "copy dir %u/%u to %u/%u",
 						 xlrec->src_tablespace_id, xlrec->src_db_id,
 						 xlrec->tablespace_id, xlrec->db_id);
 	}
+	else if (info == XLOG_DBASE_CREATE_WAL_LOG)
+	{
+		xl_dbase_create_wal_log_rec *xlrec =
+		(xl_dbase_create_wal_log_rec *) rec;
+
+		appendStringInfo(buf, "create dir %u/%u",
+						 xlrec->tablespace_id, xlrec->db_id);
+	}
 	else if (info == XLOG_DBASE_DROP)
 	{
 		xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) rec;
@@ -51,8 +60,11 @@ dbase_identify(uint8 info)
 
 	switch (info & ~XLR_INFO_MASK)
 	{
-		case XLOG_DBASE_CREATE:
-			id = "CREATE";
+		case XLOG_DBASE_CREATE_FILE_COPY:
+			id = "CREATE_FILE_COPY";
+			break;
+		case XLOG_DBASE_CREATE_WAL_LOG:
+			id = "CREATE_WAL_LOG";
 			break;
 		case XLOG_DBASE_DROP:
 			id = "DROP";
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 8c1b821..cb745ab 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -641,7 +641,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	{
 		/* page exists in file */
 		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-										   mode, NULL);
+										   mode, NULL, true);
 	}
 	else
 	{
@@ -666,7 +666,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				ReleaseBuffer(buffer);
 			}
 			buffer = ReadBufferWithoutRelcache(rnode, forknum,
-											   P_NEW, mode, NULL);
+											   P_NEW, mode, NULL, true);
 		}
 		while (BufferGetBlockNumber(buffer) < blkno);
 		/* Handle the corner case that P_NEW returns non-consecutive pages */
@@ -676,7 +676,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			ReleaseBuffer(buffer);
 			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-											   mode, NULL);
+											   mode, NULL, true);
 		}
 	}
 
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 696fd59..6eb78a9 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -387,7 +387,7 @@ heap_create(const char *relname,
 											relpersistence,
 											relfrozenxid, relminmxid);
 		else if (RELKIND_HAS_STORAGE(rel->rd_rel->relkind))
-			RelationCreateStorage(rel->rd_node, relpersistence);
+			RelationCreateStorage(rel->rd_node, relpersistence, true);
 		else
 			Assert(false);
 	}
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index ce5568f..9898701 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -112,12 +112,14 @@ AddPendingSync(const RelFileNode *rnode)
  * modules that need them.
  *
  * This function is transactional. The creation is WAL-logged, and if the
- * transaction aborts later on, the storage will be destroyed.
+ * transaction aborts later on, the storage will be destroyed.  A caller
+ * that does not want the storage to be destroyed in case of an abort may
+ * pass register_delete = false.
  */
 SMgrRelation
-RelationCreateStorage(RelFileNode rnode, char relpersistence)
+RelationCreateStorage(RelFileNode rnode, char relpersistence,
+					  bool register_delete)
 {
-	PendingRelDelete *pending;
 	SMgrRelation srel;
 	BackendId	backend;
 	bool		needs_wal;
@@ -149,15 +151,23 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
 	if (needs_wal)
 		log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM);
 
-	/* Add the relation to the list of stuff to delete at abort */
-	pending = (PendingRelDelete *)
-		MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
-	pending->relnode = rnode;
-	pending->backend = backend;
-	pending->atCommit = false;	/* delete if abort */
-	pending->nestLevel = GetCurrentTransactionNestLevel();
-	pending->next = pendingDeletes;
-	pendingDeletes = pending;
+	/*
+	 * Add the relation to the list of stuff to delete at abort, if we are
+	 * asked to do so.
+	 */
+	if (register_delete)
+	{
+		PendingRelDelete *pending;
+
+		pending = (PendingRelDelete *)
+			MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+		pending->relnode = rnode;
+		pending->backend = backend;
+		pending->atCommit = false;	/* delete if abort */
+		pending->nestLevel = GetCurrentTransactionNestLevel();
+		pending->next = pendingDeletes;
+		pendingDeletes = pending;
+	}
 
 	if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
 	{
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 95771b0..75f011e 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -64,13 +64,31 @@
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/pg_locale.h"
+#include "utils/relmapper.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
+/*
+ * Create database strategy.
+ *
+ * CREATEDB_WAL_LOG will copy the database at the block level and WAL log each
+ * copied block.
+ *
+ * CREATEDB_FILE_COPY will simply perform a file system level copy of the
+ * database and log a single record for each tablespace copied. To make this
+ * safe, it also triggers checkpoints before and after the operation.
+ */
+typedef enum CreateDBStrategy
+{
+	CREATEDB_WAL_LOG,
+	CREATEDB_FILE_COPY
+} CreateDBStrategy;
+
 typedef struct
 {
 	Oid			src_dboid;		/* source (template) DB */
 	Oid			dest_dboid;		/* DB we are trying to create */
+	CreateDBStrategy strategy;	/* create db strategy */
 } createdb_failure_params;
 
 typedef struct
@@ -79,6 +97,17 @@ typedef struct
 	Oid			dest_tsoid;		/* tablespace we are trying to move to */
 } movedb_failure_params;
 
+/*
+ * Information about a relation to be copied when creating a database.
+ */
+typedef struct CreateDBRelInfo
+{
+	RelFileNode rnode;			/* physical relation identifier */
+	Oid			reloid;			/* relation oid */
+	bool		permanent;		/* relation is permanent or unlogged */
+} CreateDBRelInfo;
+
+
 /* non-export function prototypes */
 static void createdb_failure_callback(int code, Datum arg);
 static void movedb(const char *dbname, const char *tblspcname);
@@ -94,7 +123,546 @@ static bool have_createdb_privilege(void);
 static void remove_dbtablespaces(Oid db_id);
 static bool check_db_file_conflict(Oid db_id);
 static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
+static void CreateDatabaseUsingWalLog(Oid src_dboid, Oid dboid, Oid src_tsid,
+									  Oid dst_tsid);
+static List *ScanSourceDatabasePgClass(Oid srctbid, Oid srcdbid, char *srcpath);
+static List *ScanSourceDatabasePgClassPage(Page page, Buffer buf, Oid tbid,
+										   Oid dbid, char *srcpath,
+										   List *rnodelist, Snapshot snapshot);
+static CreateDBRelInfo *ScanSourceDatabasePgClassTuple(HeapTupleData *tuple,
+													   Oid tbid, Oid dbid,
+													   char *srcpath);
+static void CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid,
+									bool isRedo);
+static void CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dboid, Oid src_tsid,
+										Oid dst_tsid);
+
+/*
+ * Create a new database using the WAL_LOG strategy.
+ *
+ * Each copied block is separately written to the write-ahead log.
+ */
+static void
+CreateDatabaseUsingWalLog(Oid src_dboid, Oid dst_dboid,
+						  Oid src_tsid, Oid dst_tsid)
+{
+	char	   *srcpath;
+	char	   *dstpath;
+	List	   *rnodelist = NULL;
+	ListCell   *cell;
+	LockRelId	srcrelid;
+	LockRelId	dstrelid;
+	RelFileNode srcrnode;
+	RelFileNode dstrnode;
+	CreateDBRelInfo *relinfo;
+
+	/* Get source and destination database paths. */
+	srcpath = GetDatabasePath(src_dboid, src_tsid);
+	dstpath = GetDatabasePath(dst_dboid, dst_tsid);
+
+	/* Create database directory and write PG_VERSION file. */
+	CreateDirAndVersionFile(dstpath, dst_dboid, dst_tsid, false);
+
+	/* Copy relmap file from source database to the destination database. */
+	RelationMapCopy(dst_dboid, dst_tsid, srcpath, dstpath);
+
+	/* Get list of relfilenodes to copy from the source database. */
+	rnodelist = ScanSourceDatabasePgClass(src_tsid, src_dboid, srcpath);
+	Assert(rnodelist != NIL);
+
+	/*
+	 * Database IDs will be the same for all relations so set them before
+	 * entering the loop.
+	 */
+	srcrelid.dbId = src_dboid;
+	dstrelid.dbId = dst_dboid;
+
+	/* Loop over our list of relfilenodes and copy each one. */
+	foreach(cell, rnodelist)
+	{
+		relinfo = lfirst(cell);
+		srcrnode = relinfo->rnode;
+
+		/*
+		 * If the relation is from the source db's default tablespace then we
+		 * need to create it in the destination db's default tablespace.
+		 * Otherwise, we need to create in the same tablespace as it is in the
+		 * source database.
+		 */
+		if (srcrnode.spcNode == src_tsid)
+			dstrnode.spcNode = dst_tsid;
+		else
+			dstrnode.spcNode = srcrnode.spcNode;
+
+		dstrnode.dbNode = dst_dboid;
+		dstrnode.relNode = srcrnode.relNode;
+
+		/*
+		 * Acquire locks on source and target relations before copying.
+		 *
+		 * We typically do not read relation data into shared_buffers without
+		 * holding a relation lock. It's unclear what could go wrong if we
+		 * skipped it in this case, because nobody can be modifying either
+		 * the source or destination database at this point, and we have locks
+		 * on both databases, too, but let's take the conservative route.
+		 */
+		dstrelid.relId = srcrelid.relId = relinfo->reloid;
+		LockRelationId(&srcrelid, AccessShareLock);
+		LockRelationId(&dstrelid, AccessShareLock);
+
+		/* Copy relation storage from source to the destination. */
+		CreateAndCopyRelationData(srcrnode, dstrnode, relinfo->permanent);
 
+		/* Release the relation locks. */
+		UnlockRelationId(&srcrelid, AccessShareLock);
+		UnlockRelationId(&dstrelid, AccessShareLock);
+	}
+
+	list_free_deep(rnodelist);
+}
+
+/*
+ * Scan the pg_class table in the source database to identify the relations
+ * that need to be copied to the destination database.
+ *
+ * This is an exception to the usual rule that cross-database access is
+ * not possible. We can make it work here because we know that there are no
+ * connections to the source database and (since there can't be prepared
+ * transactions touching that database) no in-doubt tuples either. This
+ * means that we don't need to worry about pruning removing anything from
+ * under us, and we don't need to be too picky about our snapshot either.
+ * As long as it sees all previously-committed XIDs as committed and all
+ * aborted XIDs as aborted, we should be fine: nothing else is possible
+ * here.
+ *
+ * We can't rely on the relcache for anything here, because that only knows
+ * about the database to which we are connected, and can't handle access to
+ * other databases. That also means we can't rely on the heap scan
+ * infrastructure, which would be a bad idea anyway since it might try
+ * to do things like HOT pruning which we definitely can't do safely in
+ * a database to which we're not even connected.
+ */
+static List *
+ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
+{
+	RelFileNode rnode;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	Buffer		buf;
+	Oid			relfilenode;
+	Page		page;
+	List	   *rnodelist = NIL;
+	LockRelId	relid;
+	Relation	rel;
+	Snapshot	snapshot;
+	BufferAccessStrategy bstrategy;
+
+	/* Get pg_class relfilenode. */
+	relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+													  RelationRelationId);
+
+	/* Don't read data into shared_buffers without holding a relation lock. */
+	relid.dbId = dbid;
+	relid.relId = RelationRelationId;
+	LockRelationId(&relid, AccessShareLock);
+
+	/* Prepare a RelFileNode for the pg_class relation. */
+	rnode.spcNode = tbid;
+	rnode.dbNode = dbid;
+	rnode.relNode = relfilenode;
+
+	/*
+	 * We can't use a real relcache entry for a relation in some other
+	 * database, but since we're only going to access the fields related
+	 * to physical storage, a fake one is good enough. If we didn't do this
+	 * and used the smgr layer directly, we would have to worry about
+	 * invalidations.
+	 */
+	rel = CreateFakeRelcacheEntry(rnode);
+	nblocks = smgrnblocks(RelationGetSmgr(rel), MAIN_FORKNUM);
+	FreeFakeRelcacheEntry(rel);
+
+	/* Use a buffer access strategy since this is a bulk read operation. */
+	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	/*
+	 * As explained in the function header comments, we need a snapshot that
+	 * will see all committed transactions as committed, and our transaction
+	 * snapshot - or the active snapshot - might not be new enough for that,
+	 * but the return value of GetLatestSnapshot() should work fine.
+	 */
+	snapshot = GetLatestSnapshot();
+
+	/* Process the relation block by block. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
+										RBM_NORMAL, bstrategy, false);
+
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buf);
+		if (PageIsNew(page) || PageIsEmpty(page))
+		{
+			UnlockReleaseBuffer(buf);
+			continue;
+		}
+
+		/* Append relevant pg_class tuples for current page to rnodelist. */
+		rnodelist = ScanSourceDatabasePgClassPage(page, buf, tbid, dbid,
+												  srcpath, rnodelist,
+												  snapshot);
+
+		UnlockReleaseBuffer(buf);
+	}
+
+	/* Release relation lock. */
+	UnlockRelationId(&relid, AccessShareLock);
+
+	return rnodelist;
+}
+
+/*
+ * Scan one page of the source database's pg_class relation and add relevant
+ * entries to rnodelist. The return value is the updated list.
+ */
+static List *
+ScanSourceDatabasePgClassPage(Page page, Buffer buf, Oid tbid, Oid dbid,
+							  char *srcpath, List *rnodelist,
+							  Snapshot snapshot)
+{
+	BlockNumber		blkno = BufferGetBlockNumber(buf);
+	OffsetNumber	offnum;
+	OffsetNumber	maxoff;
+	HeapTupleData	tuple;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Loop over offsets. */
+	for (offnum = FirstOffsetNumber;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid;
+
+		itemid = PageGetItemId(page, offnum);
+
+		/* Nothing to do if slot is empty or already dead. */
+		if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid) ||
+			ItemIdIsRedirected(itemid))
+			continue;
+
+		Assert(ItemIdIsNormal(itemid));
+		ItemPointerSet(&(tuple.t_self), blkno, offnum);
+
+		/* Initialize a HeapTupleData structure. */
+		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+		tuple.t_len = ItemIdGetLength(itemid);
+		tuple.t_tableOid = RelationRelationId;
+
+		/* Skip tuples that are not visible to this snapshot. */
+		if (HeapTupleSatisfiesVisibility(&tuple, snapshot, buf))
+		{
+			CreateDBRelInfo *relinfo;
+
+			/*
+			 * ScanSourceDatabasePgClassTuple is in charge of constructing
+			 * a CreateDBRelInfo object for this tuple, but can also decide
+			 * that this tuple isn't something we need to copy. If we do need
+			 * to copy the relation, add it to the list.
+			 */
+			relinfo = ScanSourceDatabasePgClassTuple(&tuple, tbid, dbid,
+													 srcpath);
+			if (relinfo != NULL)
+				rnodelist = lappend(rnodelist, relinfo);
+		}
+	}
+
+	return rnodelist;
+}
+
+/*
+ * Decide whether a certain pg_class tuple represents something that
+ * needs to be copied from the source database to the destination database,
+ * and if so, construct a CreateDBRelInfo for it.
+ *
+ * Visibility checks are handled by the caller, so our job here is just
+ * to assess the data stored in the tuple.
+ */
+static CreateDBRelInfo *
+ScanSourceDatabasePgClassTuple(HeapTupleData *tuple, Oid tbid, Oid dbid,
+							   char *srcpath)
+{
+	CreateDBRelInfo	   *relinfo;
+	Form_pg_class		classForm;
+	Oid					relfilenode = InvalidOid;
+
+	classForm = (Form_pg_class) GETSTRUCT(tuple);
+
+	/*
+	 * Return NULL if this object does not need to be copied.
+	 *
+	 * Shared objects don't need to be copied, because they are shared.
+	 * Objects without storage can't be copied, because there's nothing to
+	 * copy. Temporary relations don't need to be copied either, because
+	 * they are inaccessible outside of the session that created them,
+	 * which must be gone already, and couldn't connect to a different database
+	 * if it still existed. autovacuum will eventually remove the pg_class
+	 * entries as well.
+	 */
+	if (classForm->reltablespace == GLOBALTABLESPACE_OID ||
+		!RELKIND_HAS_STORAGE(classForm->relkind) ||
+		classForm->relpersistence == RELPERSISTENCE_TEMP)
+		return NULL;
+
+	/*
+	 * If relfilenode is valid then directly use it.  Otherwise, consult the
+	 * relmap.
+	 */
+	if (OidIsValid(classForm->relfilenode))
+		relfilenode = classForm->relfilenode;
+	else
+		relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+														  classForm->oid);
+
+	/* We must have a valid relfilenode oid. */
+	if (!OidIsValid(relfilenode))
+		elog(ERROR, "relation with OID %u does not have a valid relfilenode",
+			 classForm->oid);
+
+	/* Prepare a rel info element; the caller adds it to the list. */
+	relinfo = (CreateDBRelInfo *) palloc(sizeof(CreateDBRelInfo));
+	if (OidIsValid(classForm->reltablespace))
+		relinfo->rnode.spcNode = classForm->reltablespace;
+	else
+		relinfo->rnode.spcNode = tbid;
+
+	relinfo->rnode.dbNode = dbid;
+	relinfo->rnode.relNode = relfilenode;
+	relinfo->reloid = classForm->oid;
+
+	/* Temporary relations were rejected above. */
+	Assert(classForm->relpersistence != RELPERSISTENCE_TEMP);
+	relinfo->permanent =
+		(classForm->relpersistence == RELPERSISTENCE_PERMANENT);
+
+	return relinfo;
+}
+
+/*
+ * Create database directory and write out the PG_VERSION file in the database
+ * path.  If isRedo is true, it's okay for the database directory to exist
+ * already.
+ */
+static void
+CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
+{
+	int			fd;
+	int			nbytes;
+	char		versionfile[MAXPGPATH];
+	char		buf[16];
+
+	/*
+	 * Prepare version data before starting a critical section.
+	 *
+	 * Note that we don't have to copy this from the source database; there's
+	 * only one legal value.
+	 */
+	snprintf(buf, sizeof(buf), "%s\n", PG_MAJORVERSION);
+	nbytes = strlen(PG_MAJORVERSION) + 1;
+
+	/* If we are not in WAL replay then write the WAL. */
+	if (!isRedo)
+	{
+		xl_dbase_create_wal_log_rec xlrec;
+		XLogRecPtr	lsn;
+
+		START_CRIT_SECTION();
+
+		xlrec.db_id = dbid;
+		xlrec.tablespace_id = tsid;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec),
+						 sizeof(xl_dbase_create_wal_log_rec));
+
+		lsn = XLogInsert(RM_DBASE_ID, XLOG_DBASE_CREATE_WAL_LOG);
+
+		/* As always, WAL must hit the disk before the data update does. */
+		XLogFlush(lsn);
+	}
+
+	/* Create database directory. */
+	if (MakePGDirectory(dbpath) < 0)
+	{
+		/* Failure other than already exists or not in WAL replay? */
+		if (errno != EEXIST || !isRedo)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create directory \"%s\": %m", dbpath)));
+	}
+
+	/*
+	 * Create PG_VERSION file in the database path.  If the file already
+	 * exists and we are in WAL replay then try again to open it in write
+	 * mode.
+	 */
+	snprintf(versionfile, sizeof(versionfile), "%s/%s", dbpath, "PG_VERSION");
+
+	fd = OpenTransientFile(versionfile, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	if (fd < 0 && errno == EEXIST && isRedo)
+		fd = OpenTransientFile(versionfile, O_WRONLY | O_TRUNC | PG_BINARY);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m", versionfile)));
+
+	/* Write PG_MAJORVERSION in the PG_VERSION file. */
+	pgstat_report_wait_start(WAIT_EVENT_VERSION_FILE_WRITE);
+	errno = 0;
+	if ((int) write(fd, buf, nbytes) != nbytes)
+	{
+		/* If write didn't set errno, assume problem is no disk space. */
+		if (errno == 0)
+			errno = ENOSPC;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", versionfile)));
+	}
+	pgstat_report_wait_end();
+
+	/* Close the version file. */
+	CloseTransientFile(fd);
+
+	/* Critical section done. */
+	if (!isRedo)
+		END_CRIT_SECTION();
+}
+
+/*
+ * Create a new database using the FILE_COPY strategy.
+ *
+ * Copy each tablespace at the filesystem level, and log a single WAL record
+ * for each tablespace copied.  This requires a checkpoint before and after the
+ * copy, which may be expensive, but it does greatly reduce WAL generation
+ * if the copied database is large.
+ */
+static void
+CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dst_dboid, Oid src_tsid,
+							Oid dst_tsid)
+{
+	TableScanDesc scan;
+	Relation	rel;
+	HeapTuple	tuple;
+
+	/*
+	 * Force a checkpoint before starting the copy. This will force all dirty
+	 * buffers, including those of unlogged tables, out to disk, to ensure
+	 * source database is up-to-date on disk for the copy.
+	 * FlushDatabaseBuffers() would suffice for that, but we also want to
+	 * process any pending unlink requests. Otherwise, if a checkpoint
+	 * happened while we're copying files, a file might be deleted just when
+	 * we're about to copy it, causing the lstat() call in copydir() to fail
+	 * with ENOENT.
+	 */
+	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
+					  CHECKPOINT_WAIT | CHECKPOINT_FLUSH_ALL);
+
+	/*
+	 * Iterate through all tablespaces of the template database, and copy each
+	 * one to the new database.
+	 */
+	rel = table_open(TableSpaceRelationId, AccessShareLock);
+	scan = table_beginscan_catalog(rel, 0, NULL);
+	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	{
+		Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
+		Oid			srctablespace = spaceform->oid;
+		Oid			dsttablespace;
+		char	   *srcpath;
+		char	   *dstpath;
+		struct stat st;
+
+		/* No need to copy global tablespace */
+		if (srctablespace == GLOBALTABLESPACE_OID)
+			continue;
+
+		srcpath = GetDatabasePath(src_dboid, srctablespace);
+
+		if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
+			directory_is_empty(srcpath))
+		{
+			/* Assume we can ignore it */
+			pfree(srcpath);
+			continue;
+		}
+
+		if (srctablespace == src_tsid)
+			dsttablespace = dst_tsid;
+		else
+			dsttablespace = srctablespace;
+
+		dstpath = GetDatabasePath(dst_dboid, dsttablespace);
+
+		/*
+		 * Copy this subdirectory to the new location
+		 *
+		 * We don't need to copy subdirectories
+		 */
+		copydir(srcpath, dstpath, false);
+
+		/* Record the filesystem change in XLOG */
+		{
+			xl_dbase_create_file_copy_rec xlrec;
+
+			xlrec.db_id = dst_dboid;
+			xlrec.tablespace_id = dsttablespace;
+			xlrec.src_db_id = src_dboid;
+			xlrec.src_tablespace_id = srctablespace;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec,
+							 sizeof(xl_dbase_create_file_copy_rec));
+
+			(void) XLogInsert(RM_DBASE_ID,
+							  XLOG_DBASE_CREATE_FILE_COPY | XLR_SPECIAL_REL_UPDATE);
+		}
+	}
+	table_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	/*
+	 * We force a checkpoint before committing.  This effectively means that
+	 * committed XLOG_DBASE_CREATE_FILE_COPY operations will never need to be
+	 * replayed (at least not in ordinary crash recovery; we still have to
+	 * make the XLOG entry for the benefit of PITR operations). This avoids
+	 * two nasty scenarios:
+	 *
+	 * #1: When PITR is off, we don't XLOG the contents of newly created
+	 * indexes; therefore the drop-and-recreate-whole-directory behavior of
+	 * DBASE_CREATE replay would lose such indexes.
+	 *
+	 * #2: Since we have to recopy the source database during DBASE_CREATE
+	 * replay, we run the risk of copying changes in it that were committed
+	 * after the original CREATE DATABASE command but before the system crash
+	 * that led to the replay.  This is at least unexpected and at worst could
+	 * lead to inconsistencies, eg duplicate table names.
+	 *
+	 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
+	 *
+	 * In PITR replay, the first of these isn't an issue, and the second is
+	 * only a risk if the CREATE DATABASE and subsequent template database
+	 * change both occur while a base backup is being taken. There doesn't
+	 * seem to be much we can do about that except document it as a
+	 * limitation.
+	 *
+	 * See CreateDatabaseUsingWalLog() for a less cheesy CREATE DATABASE
+	 * strategy that avoids these problems.
+	 */
+	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+}
 
 /*
  * CREATE DATABASE
@@ -102,8 +670,6 @@ static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
 Oid
 createdb(ParseState *pstate, const CreatedbStmt *stmt)
 {
-	TableScanDesc scan;
-	Relation	rel;
 	Oid			src_dboid;
 	Oid			src_owner;
 	int			src_encoding = -1;
@@ -138,6 +704,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	DefElem    *dallowconnections = NULL;
 	DefElem    *dconnlimit = NULL;
 	DefElem    *dcollversion = NULL;
+	DefElem    *dstrategy = NULL;
 	char	   *dbname = stmt->dbname;
 	char	   *dbowner = NULL;
 	const char *dbtemplate = NULL;
@@ -153,6 +720,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	char	   *dbcollversion = NULL;
 	int			notherbackends;
 	int			npreparedxacts;
+	CreateDBStrategy dbstrategy = CREATEDB_WAL_LOG;
 	createdb_failure_params fparms;
 
 	/* Extract options from the statement node tree */
@@ -270,6 +838,12 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 						(errcode(ERRCODE_INVALID_PARAMETER_VALUE)),
 						errmsg("OIDs less than %u are reserved for system objects", FirstNormalObjectId));
 		}
+		else if (strcmp(defel->defname, "strategy") == 0)
+		{
+			if (dstrategy)
+				errorConflictingDefElem(defel, pstate);
+			dstrategy = defel;
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -414,6 +988,23 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 							dbtemplate)));
 	}
 
+	/* Validate the database creation strategy. */
+	if (dstrategy && dstrategy->arg)
+	{
+		char	   *strategy;
+
+		strategy = defGetString(dstrategy);
+		if (strcmp(strategy, "wal_log") == 0)
+			dbstrategy = CREATEDB_WAL_LOG;
+		else if (strcmp(strategy, "file_copy") == 0)
+			dbstrategy = CREATEDB_FILE_COPY;
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("invalid create database strategy %s", strategy),
+					 errhint("Valid strategies are \"wal_log\" and \"file_copy\".")));
+	}
+
 	/* If encoding or locales are defaulted, use source's setting */
 	if (encoding < 0)
 		encoding = src_encoding;
@@ -754,17 +1345,18 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	InvokeObjectPostCreateHook(DatabaseRelationId, dboid, 0);
 
 	/*
-	 * Force a checkpoint before starting the copy. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, if a checkpoint
-	 * happened while we're copying files, a file might be deleted just when
-	 * we're about to copy it, causing the lstat() call in copydir() to fail
-	 * with ENOENT.
+	 * If we're going to be reading data for the to-be-created database
+	 * into shared_buffers, take a lock on it. Nobody should know that this
+	 * database exists yet, but it's good to maintain the invariant that an
+	 * AccessExclusiveLock on the database is sufficient to drop all of its
+	 * buffers without worrying about more being read later.
+	 *
+	 * Note that we need to do this before entering the PG_ENSURE_ERROR_CLEANUP
+	 * block below, because createdb_failure_callback expects this lock to
+	 * be held already.
 	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
+	if (dbstrategy == CREATEDB_WAL_LOG)
+		LockSharedObject(DatabaseRelationId, dboid, 0, AccessShareLock);
 
 	/*
 	 * Once we start copying subdirectories, we need to be able to clean 'em
@@ -775,101 +1367,24 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	 */
 	fparms.src_dboid = src_dboid;
 	fparms.dest_dboid = dboid;
+	fparms.strategy = dbstrategy;
+
 	PG_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
 		/*
-		 * Iterate through all tablespaces of the template database, and copy
-		 * each one to the new database.
+		 * With the WAL_LOG strategy (the default), call
+		 * CreateDatabaseUsingWalLog, which copies the database block by
+		 * block and WAL-logs each copied block.  Otherwise, call
+		 * CreateDatabaseUsingFileCopy, which copies the database file by
+		 * file.
 		 */
-		rel = table_open(TableSpaceRelationId, AccessShareLock);
-		scan = table_beginscan_catalog(rel, 0, NULL);
-		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
-		{
-			Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
-			Oid			srctablespace = spaceform->oid;
-			Oid			dsttablespace;
-			char	   *srcpath;
-			char	   *dstpath;
-			struct stat st;
-
-			/* No need to copy global tablespace */
-			if (srctablespace == GLOBALTABLESPACE_OID)
-				continue;
-
-			srcpath = GetDatabasePath(src_dboid, srctablespace);
-
-			if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
-				directory_is_empty(srcpath))
-			{
-				/* Assume we can ignore it */
-				pfree(srcpath);
-				continue;
-			}
-
-			if (srctablespace == src_deftablespace)
-				dsttablespace = dst_deftablespace;
-			else
-				dsttablespace = srctablespace;
-
-			dstpath = GetDatabasePath(dboid, dsttablespace);
-
-			/*
-			 * Copy this subdirectory to the new location
-			 *
-			 * We don't need to copy subdirectories
-			 */
-			copydir(srcpath, dstpath, false);
-
-			/* Record the filesystem change in XLOG */
-			{
-				xl_dbase_create_rec xlrec;
-
-				xlrec.db_id = dboid;
-				xlrec.tablespace_id = dsttablespace;
-				xlrec.src_db_id = src_dboid;
-				xlrec.src_tablespace_id = srctablespace;
-
-				XLogBeginInsert();
-				XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-				(void) XLogInsert(RM_DBASE_ID,
-								  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-			}
-		}
-		table_endscan(scan);
-		table_close(rel, AccessShareLock);
-
-		/*
-		 * We force a checkpoint before committing.  This effectively means
-		 * that committed XLOG_DBASE_CREATE operations will never need to be
-		 * replayed (at least not in ordinary crash recovery; we still have to
-		 * make the XLOG entry for the benefit of PITR operations). This
-		 * avoids two nasty scenarios:
-		 *
-		 * #1: When PITR is off, we don't XLOG the contents of newly created
-		 * indexes; therefore the drop-and-recreate-whole-directory behavior
-		 * of DBASE_CREATE replay would lose such indexes.
-		 *
-		 * #2: Since we have to recopy the source database during DBASE_CREATE
-		 * replay, we run the risk of copying changes in it that were
-		 * committed after the original CREATE DATABASE command but before the
-		 * system crash that led to the replay.  This is at least unexpected
-		 * and at worst could lead to inconsistencies, eg duplicate table
-		 * names.
-		 *
-		 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
-		 *
-		 * In PITR replay, the first of these isn't an issue, and the second
-		 * is only a risk if the CREATE DATABASE and subsequent template
-		 * database change both occur while a base backup is being taken.
-		 * There doesn't seem to be much we can do about that except document
-		 * it as a limitation.
-		 *
-		 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
-		 * we can avoid this.
-		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+		if (dbstrategy == CREATEDB_WAL_LOG)
+			CreateDatabaseUsingWalLog(src_dboid, dboid, src_deftablespace,
+									  dst_deftablespace);
+		else
+			CreateDatabaseUsingFileCopy(src_dboid, dboid, src_deftablespace,
+										dst_deftablespace);
 
 		/*
 		 * Close pg_database, but keep lock till commit.
@@ -956,6 +1471,25 @@ createdb_failure_callback(int code, Datum arg)
 	createdb_failure_params *fparms = (createdb_failure_params *) DatumGetPointer(arg);
 
 	/*
+	 * If we were copying the database at the block level, drop any pages
+	 * for the destination database that are in the shared buffer cache, and
+	 * tell the checkpointer to forget any pending fsync and unlink requests
+	 * for files in the database.  The reasoning is the same as explained in
+	 * dropdb().  Unlike dropdb(), we don't need to call
+	 * pgstat_drop_database, because the database has not been created yet
+	 * and so cannot have any stats.
+	 */
+	if (fparms->strategy == CREATEDB_WAL_LOG)
+	{
+		DropDatabaseBuffers(fparms->dest_dboid);
+		ForgetDatabaseSyncRequests(fparms->dest_dboid);
+
+		/* Release lock on the target database. */
+		UnlockSharedObject(DatabaseRelationId, fparms->dest_dboid, 0,
+						   AccessShareLock);
+	}
+
+	/*
 	 * Release lock on source database before doing recursive remove. This is
 	 * not essential but it seems desirable to release the lock as soon as
 	 * possible.
@@ -1479,7 +2013,7 @@ movedb(const char *dbname, const char *tblspcname)
 		 * Record the filesystem change in XLOG
 		 */
 		{
-			xl_dbase_create_rec xlrec;
+			xl_dbase_create_file_copy_rec xlrec;
 
 			xlrec.db_id = db_id;
 			xlrec.tablespace_id = dst_tblspcoid;
@@ -1487,10 +2021,11 @@ movedb(const char *dbname, const char *tblspcname)
 			xlrec.src_tablespace_id = src_tblspcoid;
 
 			XLogBeginInsert();
-			XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
+			XLogRegisterData((char *) &xlrec,
+							 sizeof(xl_dbase_create_file_copy_rec));
 
 			(void) XLogInsert(RM_DBASE_ID,
-							  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
+							  XLOG_DBASE_CREATE_FILE_COPY | XLR_SPECIAL_REL_UPDATE);
 		}
 
 		/*
@@ -1526,9 +2061,10 @@ movedb(const char *dbname, const char *tblspcname)
 
 		/*
 		 * Force another checkpoint here.  As in CREATE DATABASE, this is to
-		 * ensure that we don't have to replay a committed XLOG_DBASE_CREATE
-		 * operation, which would cause us to lose any unlogged operations
-		 * done in the new DB tablespace before the next checkpoint.
+		 * ensure that we don't have to replay a committed
+		 * XLOG_DBASE_CREATE_FILE_COPY operation, which would cause us to lose
+		 * any unlogged operations done in the new DB tablespace before the
+		 * next checkpoint.
 		 */
 		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
 
@@ -2479,9 +3015,10 @@ dbase_redo(XLogReaderState *record)
 	/* Backup blocks are not used in dbase records */
 	Assert(!XLogRecHasAnyBlockRefs(record));
 
-	if (info == XLOG_DBASE_CREATE)
+	if (info == XLOG_DBASE_CREATE_FILE_COPY)
 	{
-		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
+		xl_dbase_create_file_copy_rec *xlrec =
+		(xl_dbase_create_file_copy_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
 		char	   *parent_path;
@@ -2568,6 +3105,44 @@ dbase_redo(XLogReaderState *record)
 		 */
 		copydir(src_path, dst_path, false);
 	}
+	else if (info == XLOG_DBASE_CREATE_WAL_LOG)
+	{
+		xl_dbase_create_wal_log_rec *xlrec =
+		(xl_dbase_create_wal_log_rec *) XLogRecGetData(record);
+		char	   *dbpath;
+
+		dbpath = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		if (!reachedConsistency)
+		{
+			char	   *parent_path;
+			struct stat st;
+
+			/*
+			 * Skip the replay of database directory creation if the parent
+			 * tablespace directory is missing.  For more details, refer to
+			 * the comments in the XLOG_DBASE_CREATE_FILE_COPY case above.
+			 */
+			parent_path = pstrdup(dbpath);
+			get_parent_directory(parent_path);
+			if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+			{
+				XLogRememberMissingDir(xlrec->tablespace_id, InvalidOid, parent_path);
+				ereport(WARNING,
+						(errmsg("skipping replay of database creation WAL record"),
+						 errdetail("The target tablespace \"%s\" directory was not found.",
+								   parent_path),
+						 errhint("A future WAL record that removes the directory before reaching consistent mode is expected.")));
+				pfree(parent_path);
+
+				return;
+			}
+			pfree(parent_path);
+		}
+
+		/* Create the database directory with the version file. */
+		CreateDirAndVersionFile(dbpath, xlrec->db_id, xlrec->tablespace_id,
+								true);
+	}
 	else if (info == XLOG_DBASE_DROP)
 	{
 		xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) XLogRecGetData(record);
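The redo branch above replays XLOG_DBASE_CREATE_WAL_LOG by calling CreateDirAndVersionFile() with isRedo = true, so that call has to tolerate a directory and a PG_VERSION file left behind by an earlier, interrupted replay. A minimal standalone sketch of that idempotent-create pattern in plain POSIX C might look like the following. The function name create_dir_and_version_file, its return convention, and the buffer sizes are assumptions made for illustration only; the sketch does no WAL logging, uses no critical section, and is not the patch's code.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for CreateDirAndVersionFile().
 *
 * Creates dbpath and writes "<version>\n" into dbpath/PG_VERSION.
 * Returns 0 on success, -1 on failure with errno set.  When is_redo is
 * true, an already existing directory or version file is not an error.
 */
static int
create_dir_and_version_file(const char *dbpath, const char *version,
							bool is_redo)
{
	char		versionfile[1024];
	char		buf[32];
	int			fd;
	int			len;
	int			nbytes;

	assert(dbpath != NULL && version != NULL);

	/* Prepare the file contents and the file name, rejecting truncation. */
	nbytes = snprintf(buf, sizeof(buf), "%s\n", version);
	len = snprintf(versionfile, sizeof(versionfile), "%s/PG_VERSION", dbpath);
	if (nbytes < 0 || (size_t) nbytes >= sizeof(buf) ||
		len < 0 || (size_t) len >= sizeof(versionfile))
	{
		errno = ENAMETOOLONG;
		return -1;
	}

	/* Outside replay, an existing directory is an error; in replay it's OK. */
	if (mkdir(dbpath, 0700) < 0 && (errno != EEXIST || !is_redo))
		return -1;

	/* Try an exclusive create first; in replay, fall back to overwriting. */
	fd = open(versionfile, O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (fd < 0 && errno == EEXIST && is_redo)
		fd = open(versionfile, O_WRONLY | O_TRUNC);
	if (fd < 0)
		return -1;

	/* If write() comes up short without setting errno, assume ENOSPC. */
	errno = 0;
	if (write(fd, buf, (size_t) nbytes) != (ssize_t) nbytes)
	{
		int			save_errno = (errno != 0) ? errno : ENOSPC;

		close(fd);
		errno = save_errno;
		return -1;
	}

	return close(fd);
}
```

Outside replay (is_redo = false), a second call on the same path fails with EEXIST. With is_redo = true the same call succeeds and rewrites the file, which is what makes replaying the record more than once harmless.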
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 124b996..51b4a00 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14626,7 +14626,7 @@ index_copy_data(Relation rel, RelFileNode newrnode)
 	 * NOTE: any conflict in relfilenode value will be caught in
 	 * RelationCreateStorage().
 	 */
-	RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
+	RelationCreateStorage(newrnode, rel->rd_rel->relpersistence, true);
 
 	/* copy main fork */
 	RelationCopyStorage(RelationGetSmgr(rel), dstrel, MAIN_FORKNUM,
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 11005ed..d73a40c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -38,6 +38,7 @@
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
 #include "executor/instrument.h"
 #include "lib/binaryheap.h"
 #include "miscadmin.h"
@@ -486,6 +487,9 @@ static void FindAndDropRelFileNodeBuffers(RelFileNode rnode,
 										  ForkNumber forkNum,
 										  BlockNumber nForkBlock,
 										  BlockNumber firstDelBlock);
+static void RelationCopyStorageUsingBuffer(Relation src, Relation dst,
+										   ForkNumber forkNum,
+										   bool permanent);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int	rnode_comparator(const void *p1, const void *p2);
@@ -772,23 +776,23 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
  * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
  *		a relcache entry for the relation.
  *
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay.  If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
+ * Pass permanent = true for a RELPERSISTENCE_PERMANENT relation, and
+ * permanent = false for a RELPERSISTENCE_UNLOGGED relation. This function
+ * cannot be used for temporary relations (and making that work might be
+ * difficult, unless we only want to read temporary relations for our own
+ * BackendId).
  */
 Buffer
 ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
-						  BufferAccessStrategy strategy)
+						  BufferAccessStrategy strategy, bool permanent)
 {
 	bool		hit;
 
 	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
 
-	Assert(InRecovery);
-
-	return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
+	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -3677,6 +3681,158 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
 }
 
 /* ---------------------------------------------------------------------
+ *		RelationCopyStorageUsingBuffer
+ *
+ *		Copy a fork's data using bufmgr.  Same as RelationCopyStorage, but
+ *		copies via bufmgr APIs instead of smgrread and smgrextend.
+ *
+ *		Refer to the comments atop CreateAndCopyRelationData() for details
+ *		about the 'permanent' parameter.
+ * --------------------------------------------------------------------
+ */
+static void
+RelationCopyStorageUsingBuffer(Relation src, Relation dst, ForkNumber forkNum,
+							   bool permanent)
+{
+	Buffer		srcBuf;
+	Buffer		dstBuf;
+	Page		srcPage;
+	Page		dstPage;
+	bool		use_wal;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	BufferAccessStrategy bstrategy_src;
+	BufferAccessStrategy bstrategy_dst;
+
+	/*
+	 * In general, we want to write WAL whenever wal_level > 'minimal', but
+	 * we can skip it when copying any fork of an unlogged relation other
+	 * than the init fork.
+	 */
+	use_wal = XLogIsNeeded() && (permanent || forkNum == INIT_FORKNUM);
+
+	/* Get number of blocks in the source relation. */
+	nblocks = smgrnblocks(RelationGetSmgr(src), forkNum);
+
+	/* Nothing to copy; just return. */
+	if (nblocks == 0)
+		return;
+
+	/* This is a bulk operation, so use buffer access strategies. */
+	bstrategy_src = GetAccessStrategy(BAS_BULKREAD);
+	bstrategy_dst = GetAccessStrategy(BAS_BULKWRITE);
+
+	/* Iterate over each block of the source relation file. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		/* Read block from source relation. */
+		srcBuf = ReadBufferWithoutRelcache(src->rd_node, forkNum, blkno,
+										   RBM_NORMAL, bstrategy_src,
+										   permanent);
+		srcPage = BufferGetPage(srcBuf);
+		if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
+		{
+			ReleaseBuffer(srcBuf);
+			continue;
+		}
+
+		/* Use P_NEW to extend the destination relation. */
+		dstBuf = ReadBufferWithoutRelcache(dst->rd_node, forkNum, P_NEW,
+										   RBM_NORMAL, bstrategy_dst,
+										   permanent);
+		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
+
+		START_CRIT_SECTION();
+
+		/* Copy page data from the source to the destination. */
+		dstPage = BufferGetPage(dstBuf);
+		memcpy(dstPage, srcPage, BLCKSZ);
+		MarkBufferDirty(dstBuf);
+
+		/* WAL-log the copied page. */
+		if (use_wal)
+			log_newpage_buffer(dstBuf, true);
+
+		END_CRIT_SECTION();
+
+		UnlockReleaseBuffer(dstBuf);
+		ReleaseBuffer(srcBuf);
+	}
+}
+
+/* ---------------------------------------------------------------------
+ *		CreateAndCopyRelationData
+ *
+ *		Create destination relation storage and copy all forks from the
+ *		source relation to the destination.
+ *
+ *		Pass permanent as true for permanent relations and false for
+ *		unlogged relations.  Currently this API is not supported for
+ *		temporary relations.
+ * --------------------------------------------------------------------
+ */
+void
+CreateAndCopyRelationData(RelFileNode src_rnode, RelFileNode dst_rnode,
+						  bool permanent)
+{
+	Relation		src_rel;
+	Relation		dst_rel;
+	char			relpersistence;
+
+	/* Set the relpersistence. */
+	relpersistence = permanent ?
+		RELPERSISTENCE_PERMANENT : RELPERSISTENCE_UNLOGGED;
+
+	/*
+	 * We can't use a real relcache entry for a relation in some other
+	 * database, but since we're only going to access the fields related
+	 * to physical storage, a fake one is good enough. If we didn't do this
+	 * and used the smgr layer directly, we would have to worry about
+	 * invalidations.
+	 */
+	src_rel = CreateFakeRelcacheEntry(src_rnode);
+	dst_rel = CreateFakeRelcacheEntry(dst_rnode);
+
+	/*
+	 * Create and copy all forks of the relation.  During create database we
+	 * have a separate cleanup mechanism which deletes complete database
+	 * directory.  Therefore, each individual relation doesn't need to be
+	 * registered for cleanup.
+	 */
+	RelationCreateStorage(dst_rnode, relpersistence, false);
+
+	/* copy main fork. */
+	RelationCopyStorageUsingBuffer(src_rel, dst_rel, MAIN_FORKNUM, permanent);
+
+	/* copy those extra forks that exist */
+	for (ForkNumber forkNum = MAIN_FORKNUM + 1;
+		 forkNum <= MAX_FORKNUM; forkNum++)
+	{
+		if (smgrexists(RelationGetSmgr(src_rel), forkNum))
+		{
+			smgrcreate(RelationGetSmgr(dst_rel), forkNum, false);
+
+			/*
+			 * WAL log creation if the relation is persistent, or this is the
+			 * init fork of an unlogged relation.
+			 */
+			if (permanent || forkNum == INIT_FORKNUM)
+				log_smgrcreate(&dst_rnode, forkNum);
+
+			/* Copy a fork's data, block by block. */
+			RelationCopyStorageUsingBuffer(src_rel, dst_rel, forkNum,
+										   permanent);
+		}
+	}
+
+	/* Release fake relcache entries. */
+	FreeFakeRelcacheEntry(src_rel);
+	FreeFakeRelcacheEntry(dst_rel);
+}
+
+/* ---------------------------------------------------------------------
  *		FlushDatabaseBuffers
  *
  *		This function writes all dirty pages of a database out to disk
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 5ae52dd..1543da6 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -176,6 +176,34 @@ ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode)
 }
 
 /*
+ *		LockRelationId
+ *
+ * Lock, given a LockRelId.  Same as LockRelationOid but takes a LockRelId as
+ * input.
+ */
+void
+LockRelationId(LockRelId *relid, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+	LOCALLOCK  *locallock;
+	LockAcquireResult res;
+
+	SET_LOCKTAG_RELATION(tag, relid->dbId, relid->relId);
+
+	res = LockAcquireExtended(&tag, lockmode, false, false, true, &locallock);
+
+	/*
+	 * Now that we have the lock, check for invalidation messages; see notes
+	 * in LockRelationOid.
+	 */
+	if (res != LOCKACQUIRE_ALREADY_CLEAR)
+	{
+		AcceptInvalidationMessages();
+		MarkLockClear(locallock);
+	}
+}
+
+/*
  *		UnlockRelationId
  *
  * Unlock, given a LockRelId.  This is preferred over UnlockRelationOid
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index ff46a0e..1c8aba4 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -705,6 +705,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_TWOPHASE_FILE_WRITE:
 			event_name = "TwophaseFileWrite";
 			break;
+		case WAIT_EVENT_VERSION_FILE_WRITE:
+			event_name = "VersionFileWrite";
+			break;
 		case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ:
 			event_name = "WALSenderTimelineHistoryRead";
 			break;
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 4f3fe11..b819d7c 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -3746,7 +3746,7 @@ RelationSetNewRelfilenode(Relation relation, char persistence)
 		/* handle these directly, at least for now */
 		SMgrRelation srel;
 
-		srel = RelationCreateStorage(newrnode, persistence);
+		srel = RelationCreateStorage(newrnode, persistence, true);
 		smgrclose(srel);
 	}
 	else
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 4d0718f..742856e 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -46,6 +46,8 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/pg_tablespace.h"
 #include "catalog/storage.h"
@@ -252,6 +254,63 @@ RelationMapFilenodeToOid(Oid filenode, bool shared)
 }
 
 /*
+ * RelationMapOidToFilenodeForDatabase
+ *
+ * Like RelationMapOidToFilenode, but reads the mapping from the indicated
+ * path instead of using the one for the current database.
+ */
+Oid
+RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId)
+{
+	RelMapFile	map;
+	int			i;
+
+	/* Read the relmap file from the source database. */
+	read_relmap_file(&map, dbpath, false, ERROR);
+
+	/* Iterate over the relmap entries to find the input relation OID. */
+	for (i = 0; i < map.num_mappings; i++)
+	{
+		if (relationId == map.mappings[i].mapoid)
+			return map.mappings[i].mapfilenode;
+	}
+
+	return InvalidOid;
+}
+
+/*
+ * RelationMapCopy
+ *
+ * Copy relmapfile from source db path to the destination db path and WAL log
+ * the operation. This is intended for use in creating a new relmap file
+ * for a database that doesn't have one yet, not for replacing an existing
+ * relmap file.
+ */
+void
+RelationMapCopy(Oid dbid, Oid tsid, char *srcdbpath, char *dstdbpath)
+{
+	RelMapFile map;
+
+	/*
+	 * Read the relmap file from the source database.
+	 */
+	read_relmap_file(&map, srcdbpath, false, ERROR);
+
+	/*
+	 * Write the same data into the destination database's relmap file.
+	 *
+	 * No sinval is needed because no one can be connected to the destination
+	 * database yet. For the same reason, there is no need to acquire
+	 * RelationMappingLock.
+	 *
+	 * There's no point in trying to preserve files here. The new database
+	 * isn't usable yet anyway, and won't ever be if we can't install a
+	 * relmap file.
+	 */
+	write_relmap_file(&map, true, false, false, dbid, tsid, dstdbpath);
+}
+
+/*
  * RelationMapUpdateMap
  *
  * Install a new relfilenode mapping for the specified relation.
@@ -1023,6 +1082,37 @@ relmap_redo(XLogReaderState *record)
 
 		/* We need to construct the pathname for this database */
 		dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
+		if (!reachedConsistency)
+		{
+			char	   *parent_path;
+			struct stat st;
+
+			/*
+			 * Skip replaying relmap file writes if the parent tablespace
+			 * directory isn't present.  The reason we need to skip it is that
+			 * if we build the database using the wal_log strategy, then we
+			 * will be creating a new relmap file, and if we skipped creating
+			 * the database directory due to a missing tablespace directory,
+			 * then we will also need to skip this step.  For more details on
+			 * why the database directory creation WAL is skipped, refer to
+			 * comments in dbase_redo().
+			 */
+			parent_path = pstrdup(dbpath);
+			get_parent_directory(parent_path);
+			if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+			{
+				XLogRememberMissingDir(xlrec->tsid, InvalidOid, parent_path);
+				ereport(WARNING,
+						(errmsg("skipping replay of relmap file write WAL record"),
+						 errdetail("The target tablespace \"%s\" directory was not found.",
+								   parent_path),
+						 errhint("A future WAL record that removes the directory before reaching consistent mode is expected.")));
+				pfree(parent_path);
+
+				return;
+			}
+			pfree(parent_path);
+		}
 
 		/*
 		 * Write out the new map and send sinval, but of course don't write a
@@ -1031,6 +1121,13 @@ relmap_redo(XLogReaderState *record)
 		 *
 		 * There shouldn't be anyone else updating relmaps during WAL replay,
 		 * but grab the lock to interlock against load_relmap_file().
+		 *
+		 * Note that we use the same WAL record for updating the relmap of
+		 * an existing database as we do for creating a new database. In
+		 * the latter case, taking the relmap lock and sending sinval messages
+		 * is unnecessary, but harmless. If we wanted to avoid it, we could
+		 * add a flag to the WAL record to indicate which operation is being
+		 * performed.
 		 */
 		LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
 		write_relmap_file(&newmap, false, true, false,
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 3ed2a2e..49966e7 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -372,7 +372,7 @@ extractPageInfo(XLogReaderState *record)
 
 	/* Is this a special record type that I recognize? */
 
-	if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_CREATE)
+	if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_CREATE_FILE_COPY)
 	{
 		/*
 		 * New databases can be safely ignored. It won't be present in the
@@ -384,6 +384,13 @@ extractPageInfo(XLogReaderState *record)
 		 * overwriting the database created in the target system.
 		 */
 	}
+	else if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_CREATE_WAL_LOG)
+	{
+		/*
+		 * New databases can be safely ignored. It won't be present in the
+		 * source system, so it will be deleted.
+		 */
+	}
 	else if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_DROP)
 	{
 		/*
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 63bfdf1..fc7cbcd 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2780,13 +2780,15 @@ psql_completion(const char *text, int start, int end)
 	/* CREATE DATABASE */
 	else if (Matches("CREATE", "DATABASE", MatchAny))
 		COMPLETE_WITH("OWNER", "TEMPLATE", "ENCODING", "TABLESPACE",
-					  "IS_TEMPLATE",
+					  "IS_TEMPLATE", "STRATEGY",
 					  "ALLOW_CONNECTIONS", "CONNECTION LIMIT",
 					  "LC_COLLATE", "LC_CTYPE", "LOCALE", "OID",
 					  "LOCALE_PROVIDER", "ICU_LOCALE");
 
 	else if (Matches("CREATE", "DATABASE", MatchAny, "TEMPLATE"))
 		COMPLETE_WITH_QUERY(Query_for_list_of_template_databases);
+	else if (Matches("CREATE", "DATABASE", MatchAny, "STRATEGY"))
+		COMPLETE_WITH("WAL_LOG", "FILE_COPY");
 
 	/* CREATE DOMAIN */
 	else if (Matches("CREATE", "DOMAIN", MatchAny))
diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c
index 6f612ab..0bffa2f 100644
--- a/src/bin/scripts/createdb.c
+++ b/src/bin/scripts/createdb.c
@@ -34,6 +34,7 @@ main(int argc, char *argv[])
 		{"tablespace", required_argument, NULL, 'D'},
 		{"template", required_argument, NULL, 'T'},
 		{"encoding", required_argument, NULL, 'E'},
+		{"strategy", required_argument, NULL, 'S'},
 		{"lc-collate", required_argument, NULL, 1},
 		{"lc-ctype", required_argument, NULL, 2},
 		{"locale", required_argument, NULL, 'l'},
@@ -60,6 +61,7 @@ main(int argc, char *argv[])
 	char	   *tablespace = NULL;
 	char	   *template = NULL;
 	char	   *encoding = NULL;
+	char	   *strategy = NULL;
 	char	   *lc_collate = NULL;
 	char	   *lc_ctype = NULL;
 	char	   *locale = NULL;
@@ -77,7 +79,7 @@ main(int argc, char *argv[])
 
 	handle_help_version_opts(argc, argv, "createdb", help);
 
-	while ((c = getopt_long(argc, argv, "h:p:U:wWeO:D:T:E:l:", long_options, &optindex)) != -1)
+	while ((c = getopt_long(argc, argv, "h:p:U:wWeO:D:T:E:l:S:", long_options, &optindex)) != -1)
 	{
 		switch (c)
 		{
@@ -111,6 +113,9 @@ main(int argc, char *argv[])
 			case 'E':
 				encoding = pg_strdup(optarg);
 				break;
+			case 'S':
+				strategy = pg_strdup(optarg);
+				break;
 			case 1:
 				lc_collate = pg_strdup(optarg);
 				break;
@@ -215,6 +220,8 @@ main(int argc, char *argv[])
 		appendPQExpBufferStr(&sql, " ENCODING ");
 		appendStringLiteralConn(&sql, encoding, conn);
 	}
+	if (strategy)
+		appendPQExpBuffer(&sql, " STRATEGY %s", fmtId(strategy));
 	if (template)
 		appendPQExpBuffer(&sql, " TEMPLATE %s", fmtId(template));
 	if (lc_collate)
@@ -294,6 +301,7 @@ help(const char *progname)
 	printf(_("      --locale-provider={libc|icu}\n"
 			 "                               locale provider for the database's default collation\n"));
 	printf(_("  -O, --owner=OWNER            database user to own the new database\n"));
+	printf(_("  -S, --strategy=STRATEGY      database creation strategy wal_log or file_copy\n"));
 	printf(_("  -T, --template=TEMPLATE      template database to copy\n"));
 	printf(_("  -V, --version                output version information, then exit\n"));
 	printf(_("  -?, --help                   show this help, then exit\n"));
diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index 35deec9..14d3a95 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -104,4 +104,24 @@ $node->command_checks_all(
 	],
 	'createdb with incorrect --lc-ctype');
 
+$node->command_checks_all(
+	[ 'createdb', '--strategy', "foo", 'foobar2' ],
+	1,
+	[qr/^$/],
+	[
+		qr/^createdb: error: database creation failed: ERROR:  invalid create database strategy|^createdb: error: database creation failed: ERROR:  invalid create database strategy foo/s
+	],
+	'createdb with incorrect --strategy');
+
+# Check database creation strategy
+$node->issues_sql_like(
+	[ 'createdb', '-T', 'foobar2', 'foobar6', '-S', 'wal_log'],
+	qr/statement: CREATE DATABASE foobar6 STRATEGY wal_log TEMPLATE foobar2/,
+	'create database with WAL_LOG strategy');
+
+$node->issues_sql_like(
+	[ 'createdb', '-T', 'foobar2', 'foobar7', '-S', 'file_copy'],
+	qr/statement: CREATE DATABASE foobar7 STRATEGY file_copy TEMPLATE foobar2/,
+	'create database with FILE_COPY strategy');
+
 done_testing();
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 9ffc741..844a023 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -22,7 +22,9 @@
 /* GUC variables */
 extern int	wal_skip_threshold;
 
-extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
+extern SMgrRelation RelationCreateStorage(RelFileNode rnode,
+										  char relpersistence,
+										  bool register_delete);
 extern void RelationDropStorage(Relation rel);
 extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
 extern void RelationPreTruncate(Relation rel);
diff --git a/src/include/commands/dbcommands_xlog.h b/src/include/commands/dbcommands_xlog.h
index 593a857..0ee2452 100644
--- a/src/include/commands/dbcommands_xlog.h
+++ b/src/include/commands/dbcommands_xlog.h
@@ -18,17 +18,32 @@
 #include "lib/stringinfo.h"
 
 /* record types */
-#define XLOG_DBASE_CREATE		0x00
-#define XLOG_DBASE_DROP			0x10
+#define XLOG_DBASE_CREATE_FILE_COPY		0x00
+#define XLOG_DBASE_CREATE_WAL_LOG		0x10
+#define XLOG_DBASE_DROP					0x20
 
-typedef struct xl_dbase_create_rec
+/*
+ * Single WAL record for an entire CREATE DATABASE operation. This is used
+ * by the FILE_COPY strategy.
+ */
+typedef struct xl_dbase_create_file_copy_rec
 {
-	/* Records copying of a single subdirectory incl. contents */
 	Oid			db_id;
 	Oid			tablespace_id;
 	Oid			src_db_id;
 	Oid			src_tablespace_id;
-} xl_dbase_create_rec;
+} xl_dbase_create_file_copy_rec;
+
+/*
+ * WAL record for the beginning of a CREATE DATABASE operation, when the
+ * WAL_LOG strategy is used. Each individual block will be logged separately
+ * afterward.
+ */
+typedef struct xl_dbase_create_wal_log_rec
+{
+	Oid			db_id;
+	Oid			tablespace_id;
+} xl_dbase_create_wal_log_rec;
 
 typedef struct xl_dbase_drop_rec
 {
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index dd01841..a6b657f 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -184,7 +184,8 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BufferAccessStrategy strategy);
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 										ForkNumber forkNum, BlockNumber blockNum,
-										ReadBufferMode mode, BufferAccessStrategy strategy);
+										ReadBufferMode mode, BufferAccessStrategy strategy,
+										bool permanent);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
@@ -203,6 +204,9 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
 extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels);
+extern void CreateAndCopyRelationData(RelFileNode src_rnode,
+									  RelFileNode dst_rnode,
+									  bool permanent);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
 								   int nforks, BlockNumber *firstDelBlock);
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 49edbcc..be1d2c9 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -38,6 +38,7 @@ extern void RelationInitLockInfo(Relation relation);
 
 /* Lock a relation */
 extern void LockRelationOid(Oid relid, LOCKMODE lockmode);
+extern void LockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern bool ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode);
 extern void UnlockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern void UnlockRelationOid(Oid relid, LOCKMODE lockmode);
diff --git a/src/include/utils/relmapper.h b/src/include/utils/relmapper.h
index 9fbb5a7..f10353e 100644
--- a/src/include/utils/relmapper.h
+++ b/src/include/utils/relmapper.h
@@ -38,7 +38,9 @@ typedef struct xl_relmap_update
 extern Oid	RelationMapOidToFilenode(Oid relationId, bool shared);
 
 extern Oid	RelationMapFilenodeToOid(Oid relationId, bool shared);
-
+extern Oid RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId);
+extern void RelationMapCopy(Oid dbid, Oid tsid, char *srcdbpath,
+							char *dstdbpath);
 extern void RelationMapUpdateMap(Oid relationId, Oid fileNode, bool shared,
 								 bool immediate);
 
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 1c39ce0..d870c59 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -218,6 +218,7 @@ typedef enum
 	WAIT_EVENT_TWOPHASE_FILE_READ,
 	WAIT_EVENT_TWOPHASE_FILE_SYNC,
 	WAIT_EVENT_TWOPHASE_FILE_WRITE,
+	WAIT_EVENT_VERSION_FILE_WRITE,
 	WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ,
 	WAIT_EVENT_WAL_BOOTSTRAP_SYNC,
 	WAIT_EVENT_WAL_BOOTSTRAP_WRITE,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 85c808a..e0544b7 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -461,6 +461,8 @@ CoverPos
 CreateAmStmt
 CreateCastStmt
 CreateConversionStmt
+CreateDBRelInfo
+CreateDBStrategy
 CreateDomainStmt
 CreateEnumStmt
 CreateEventTrigStmt
@@ -3701,7 +3703,8 @@ xl_btree_update
 xl_btree_vacuum
 xl_clog_truncate
 xl_commit_ts_truncate
-xl_dbase_create_rec
+xl_dbase_create_file_copy_rec
+xl_dbase_create_wal_log_rec
 xl_dbase_drop_rec
 xl_end_of_recovery
 xl_hash_add_ovfl_page
-- 
1.8.3.1

#199

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#198)

1 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Sat, Mar 26, 2022 at 5:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Mar 25, 2022 at 8:16 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On a quick look, I'm guessing that XLOG_DBASE_CREATE_WAL_LOG will need
to mirror some of the logic that was added to the replay code for the
existing strategy, but I haven't figured out the details.

Yeah, I think I got it. For XLOG_DBASE_CREATE_WAL_LOG we will now have
to handle the missing parent directory case, as Alvaro did for
the XLOG_DBASE_CREATE(_FILE_COPY) case.

I have updated the patch so that we now skip XLOG_DBASE_CREATE_WAL_LOG
as well if the tablespace directory is missing. But with our new
wal_log method there will be other follow-up WAL records, such as
XLOG_RELMAP_UPDATE, XLOG_SMGR_CREATE and XLOG_FPI.

I have put similar logic in place for relmap_update WAL replay as well.

There was a mistake in the last patch: for the relmap update I
checked for a missing tablespace directory, but I should have
checked for a missing database directory. I have fixed that.

Now, is it possible to get the FPI without the smgr_create WAL record in
other cases? If it is, then that problem is orthogonal to this patch, but
anyway I could not find any such scenario.

I have dug further into it and tried manually removing the directory
before XLOG_FPI, but I noticed that during FPI replay
XLogReadBufferExtended() also takes care of creating the missing files
using smgrcreate(), which in turn takes care of creating the missing
directory, so I don't think we have any problem here.
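
For illustration only, the replay-side check for a missing database
directory could be sketched roughly as below. This is a standalone
approximation and not the code in the attached patch: directory_exists()
and should_skip_relmap_replay() are hypothetical names, and the
reached_consistency argument merely stands in for the reachedConsistency
flag that recovery maintains.

```c
#include <stdbool.h>
#include <sys/stat.h>

/* Hypothetical helper: true if 'path' exists and is a directory. */
bool
directory_exists(const char *path)
{
	struct stat st;

	return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}

/*
 * Hypothetical stand-in for the decision made while replaying a relmap
 * write.  Before consistency is reached, a missing database directory
 * means the earlier "create database" record was skipped, so this record
 * is skipped too.  Once consistency is reached we never skip.
 */
bool
should_skip_relmap_replay(const char *dbpath, bool reached_consistency)
{
	if (reached_consistency)
		return false;

	return !directory_exists(dbpath);
}
```

The real redo routine would also have to remember the missing directory
and emit a warning; the sketch only captures the skip decision itself.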

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v8-0001-Add-new-block-by-block-strategy-for-CREATE-DATABA.patch (text/x-patch; charset=US-ASCII)

From c35ba040770184d3048d9529668e30f8a59f5b75 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Sat, 26 Mar 2022 10:35:44 +0530
Subject: [PATCH v8] Add new block-by-block strategy for CREATE DATABASE.

Because this strategy logs changes on a block-by-block basis, it
avoids the need to checkpoint before and after the operation.
However, because it logs each changed block individually, it might
generate a lot of extra write-ahead logging if the template database
is large. Therefore, the older strategy remains available via a new
STRATEGY parameter to CREATE DATABASE, and a corresponding --strategy
option to createdb.

Somewhat controversially, this patch assembles the list of relations
to be copied to the new database by reading the pg_class relation of
the template database. Cross-database access like this isn't normally
possible, but it can be made to work here because there can't be any
connections to the database being copied, nor can it contain any
in-doubt transactions. Even so, we have to use lower-level interfaces
than normal, since the table scan and relcache interfaces will not
work for a database to which we're not connected. The advantage of
this approach is that we do not need to rely on the filesystem to
determine what ought to be copied, but instead on PostgreSQL's own
knowledge of the database structure. This avoids, for example,
copying stray files that happen to be located in the source database
directory.

Dilip Kumar, with a fairly large number of cosmetic changes by me.
---
 contrib/bloom/blinsert.c                 |   2 +-
 doc/src/sgml/monitoring.sgml             |   4 +
 doc/src/sgml/ref/create_database.sgml    |  22 +
 doc/src/sgml/ref/createdb.sgml           |  11 +
 src/backend/access/heap/heapam_handler.c |   6 +-
 src/backend/access/nbtree/nbtree.c       |   2 +-
 src/backend/access/rmgrdesc/dbasedesc.c  |  20 +-
 src/backend/access/transam/xlogutils.c   |   6 +-
 src/backend/catalog/heap.c               |   2 +-
 src/backend/catalog/storage.c            |  34 +-
 src/backend/commands/dbcommands.c        | 795 ++++++++++++++++++++++++++-----
 src/backend/commands/tablecmds.c         |   2 +-
 src/backend/storage/buffer/bufmgr.c      | 172 ++++++-
 src/backend/storage/lmgr/lmgr.c          |  28 ++
 src/backend/utils/activity/wait_event.c  |   3 +
 src/backend/utils/cache/relcache.c       |   2 +-
 src/backend/utils/cache/relmapper.c      |  93 ++++
 src/bin/pg_rewind/parsexlog.c            |   9 +-
 src/bin/psql/tab-complete.c              |   4 +-
 src/bin/scripts/createdb.c               |  10 +-
 src/bin/scripts/t/020_createdb.pl        |  20 +
 src/include/catalog/storage.h            |   4 +-
 src/include/commands/dbcommands_xlog.h   |  25 +-
 src/include/storage/bufmgr.h             |   6 +-
 src/include/storage/lmgr.h               |   1 +
 src/include/utils/relmapper.h            |   4 +-
 src/include/utils/wait_event.h           |   1 +
 src/tools/pgindent/typedefs.list         |   5 +-
 28 files changed, 1136 insertions(+), 157 deletions(-)

diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index c94cf34..82378db 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -173,7 +173,7 @@ blbuildempty(Relation index)
 	 * Write the page and log it.  It might seem that an immediate sync would
 	 * be sufficient to guarantee that the file exists on disk, but recovery
 	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
+	 * XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
 	PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 35b2923..562f59f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1503,6 +1503,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry>Waiting for a write of a two phase state file.</entry>
      </row>
      <row>
+      <entry><literal>VersionFileWrite</literal></entry>
+      <entry>Waiting for the version file to be written while creating a database.</entry>
+     </row>
+     <row>
       <entry><literal>WALBootstrapSync</literal></entry>
       <entry>Waiting for WAL to reach durable storage during
        bootstrapping.</entry>
diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index 5ae785a..255ad3a 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -25,6 +25,7 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
     [ [ WITH ] [ OWNER [=] <replaceable class="parameter">user_name</replaceable> ]
            [ TEMPLATE [=] <replaceable class="parameter">template</replaceable> ]
            [ ENCODING [=] <replaceable class="parameter">encoding</replaceable> ]
+           [ STRATEGY [=] <replaceable class="parameter">strategy</replaceable> ]
            [ LOCALE [=] <replaceable class="parameter">locale</replaceable> ]
            [ LC_COLLATE [=] <replaceable class="parameter">lc_collate</replaceable> ]
            [ LC_CTYPE [=] <replaceable class="parameter">lc_ctype</replaceable> ]
@@ -118,6 +119,27 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
        </para>
       </listitem>
      </varlistentry>
+     <varlistentry id="create-database-strategy" xreflabel="CREATE DATABASE STRATEGY">
+      <term><replaceable class="parameter">strategy</replaceable></term>
+      <listitem>
+       <para>
+        Strategy to be used in creating the new database.  If
+        the <literal>WAL_LOG</literal> strategy is used, the database will be
+        copied block by block and each block will be separately written
+        to the write-ahead log. This is the most efficient strategy in
+        cases where the template database is small, and therefore it is the
+        default. The older <literal>FILE_COPY</literal> strategy is also
+        available. This strategy writes a small record to the write-ahead log
+        for each tablespace used by the target database. Each such record
+        represents copying an entire directory to a new location at the
+        filesystem level. While this does reduce the write-ahead
+        log volume substantially, especially if the template database is large,
+        it also forces the system to perform a checkpoint both before and
+        after the creation of the new database. In some situations, this may
+        have a noticeable negative impact on overall system performance.
+       </para>
+      </listitem>
+     </varlistentry>
      <varlistentry>
       <term><replaceable class="parameter">locale</replaceable></term>
       <listitem>
diff --git a/doc/src/sgml/ref/createdb.sgml b/doc/src/sgml/ref/createdb.sgml
index be42e50..671cd362 100644
--- a/doc/src/sgml/ref/createdb.sgml
+++ b/doc/src/sgml/ref/createdb.sgml
@@ -178,6 +178,17 @@ PostgreSQL documentation
      </varlistentry>
 
      <varlistentry>
+      <term><option>-S <replaceable class="parameter">strategy</replaceable></option></term>
+      <term><option>--strategy=<replaceable class="parameter">strategy</replaceable></option></term>
+      <listitem>
+       <para>
+        Specifies the database creation strategy.  See
+        <xref linkend="create-database-strategy" /> for more details.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><option>-T <replaceable class="parameter">template</replaceable></option></term>
       <term><option>--template=<replaceable class="parameter">template</replaceable></option></term>
       <listitem>
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 39ef8a0..dee264e 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -593,7 +593,7 @@ heapam_relation_set_new_filenode(Relation rel,
 	 */
 	*minmulti = GetOldestMultiXactId();
 
-	srel = RelationCreateStorage(*newrnode, persistence);
+	srel = RelationCreateStorage(*newrnode, persistence, true);
 
 	/*
 	 * If required, set up an init fork for an unlogged table so that it can
@@ -601,7 +601,7 @@ heapam_relation_set_new_filenode(Relation rel,
 	 * even if the page has been logged, because the write did not go through
 	 * shared_buffers and therefore a concurrent checkpoint may have moved the
 	 * redo pointer past our xlog record.  Recovery may as well remove it
-	 * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
+	 * while replaying, for example, XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE
 	 * record. Therefore, logging is necessary even if wal_level=minimal.
 	 */
 	if (persistence == RELPERSISTENCE_UNLOGGED)
@@ -645,7 +645,7 @@ heapam_relation_copy_data(Relation rel, const RelFileNode *newrnode)
 	 * NOTE: any conflict in relfilenode value will be caught in
 	 * RelationCreateStorage().
 	 */
-	RelationCreateStorage(*newrnode, rel->rd_rel->relpersistence);
+	RelationCreateStorage(*newrnode, rel->rd_rel->relpersistence, true);
 
 	/* copy main fork */
 	RelationCopyStorage(RelationGetSmgr(rel), dstrel, MAIN_FORKNUM,
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c9b4964..dacf3f7 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -161,7 +161,7 @@ btbuildempty(Relation index)
 	 * Write the page and log it.  It might seem that an immediate sync would
 	 * be sufficient to guarantee that the file exists on disk, but recovery
 	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
+	 * XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
 	PageSetChecksumInplace(metapage, BTREE_METAPAGE);
diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index 03af3fd..523d0b3 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -24,14 +24,23 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 	char	   *rec = XLogRecGetData(record);
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
 
-	if (info == XLOG_DBASE_CREATE)
+	if (info == XLOG_DBASE_CREATE_FILE_COPY)
 	{
-		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;
+		xl_dbase_create_file_copy_rec *xlrec =
+		(xl_dbase_create_file_copy_rec *) rec;
 
 		appendStringInfo(buf, "copy dir %u/%u to %u/%u",
 						 xlrec->src_tablespace_id, xlrec->src_db_id,
 						 xlrec->tablespace_id, xlrec->db_id);
 	}
+	else if (info == XLOG_DBASE_CREATE_WAL_LOG)
+	{
+		xl_dbase_create_wal_log_rec *xlrec =
+		(xl_dbase_create_wal_log_rec *) rec;
+
+		appendStringInfo(buf, "create dir %u/%u",
+						 xlrec->tablespace_id, xlrec->db_id);
+	}
 	else if (info == XLOG_DBASE_DROP)
 	{
 		xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) rec;
@@ -51,8 +60,11 @@ dbase_identify(uint8 info)
 
 	switch (info & ~XLR_INFO_MASK)
 	{
-		case XLOG_DBASE_CREATE:
-			id = "CREATE";
+		case XLOG_DBASE_CREATE_FILE_COPY:
+			id = "CREATE_FILE_COPY";
+			break;
+		case XLOG_DBASE_CREATE_WAL_LOG:
+			id = "CREATE_WAL_LOG";
 			break;
 		case XLOG_DBASE_DROP:
 			id = "DROP";
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 8c1b821..cb745ab 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -641,7 +641,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	{
 		/* page exists in file */
 		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-										   mode, NULL);
+										   mode, NULL, true);
 	}
 	else
 	{
@@ -666,7 +666,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				ReleaseBuffer(buffer);
 			}
 			buffer = ReadBufferWithoutRelcache(rnode, forknum,
-											   P_NEW, mode, NULL);
+											   P_NEW, mode, NULL, true);
 		}
 		while (BufferGetBlockNumber(buffer) < blkno);
 		/* Handle the corner case that P_NEW returns non-consecutive pages */
@@ -676,7 +676,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			ReleaseBuffer(buffer);
 			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-											   mode, NULL);
+											   mode, NULL, true);
 		}
 	}
 
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 696fd59..6eb78a9 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -387,7 +387,7 @@ heap_create(const char *relname,
 											relpersistence,
 											relfrozenxid, relminmxid);
 		else if (RELKIND_HAS_STORAGE(rel->rd_rel->relkind))
-			RelationCreateStorage(rel->rd_node, relpersistence);
+			RelationCreateStorage(rel->rd_node, relpersistence, true);
 		else
 			Assert(false);
 	}
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index ce5568f..9898701 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -112,12 +112,14 @@ AddPendingSync(const RelFileNode *rnode)
  * modules that need them.
  *
  * This function is transactional. The creation is WAL-logged, and if the
- * transaction aborts later on, the storage will be destroyed.
+ * transaction aborts later on, the storage will be destroyed.  A caller
+ * that does not want the storage to be destroyed in case of an abort may
+ * pass register_delete = false.
  */
 SMgrRelation
-RelationCreateStorage(RelFileNode rnode, char relpersistence)
+RelationCreateStorage(RelFileNode rnode, char relpersistence,
+					  bool register_delete)
 {
-	PendingRelDelete *pending;
 	SMgrRelation srel;
 	BackendId	backend;
 	bool		needs_wal;
@@ -149,15 +151,23 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
 	if (needs_wal)
 		log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM);
 
-	/* Add the relation to the list of stuff to delete at abort */
-	pending = (PendingRelDelete *)
-		MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
-	pending->relnode = rnode;
-	pending->backend = backend;
-	pending->atCommit = false;	/* delete if abort */
-	pending->nestLevel = GetCurrentTransactionNestLevel();
-	pending->next = pendingDeletes;
-	pendingDeletes = pending;
+	/*
+	 * Add the relation to the list of stuff to delete at abort, if we are
+	 * asked to do so.
+	 */
+	if (register_delete)
+	{
+		PendingRelDelete *pending;
+
+		pending = (PendingRelDelete *)
+			MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+		pending->relnode = rnode;
+		pending->backend = backend;
+		pending->atCommit = false;	/* delete if abort */
+		pending->nestLevel = GetCurrentTransactionNestLevel();
+		pending->next = pendingDeletes;
+		pendingDeletes = pending;
+	}
 
 	if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
 	{
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 95771b0..75f011e 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -64,13 +64,31 @@
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/pg_locale.h"
+#include "utils/relmapper.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
+/*
+ * Create database strategy.
+ *
+ * CREATEDB_WAL_LOG will copy the database at the block level and WAL log each
+ * copied block.
+ *
+ * CREATEDB_FILE_COPY will simply perform a file system level copy of the
+ * database and log a single record for each tablespace copied. To make this
+ * safe, it also triggers checkpoints before and after the operation.
+ */
+typedef enum CreateDBStrategy
+{
+	CREATEDB_WAL_LOG,
+	CREATEDB_FILE_COPY
+} CreateDBStrategy;
+
 typedef struct
 {
 	Oid			src_dboid;		/* source (template) DB */
 	Oid			dest_dboid;		/* DB we are trying to create */
+	CreateDBStrategy strategy;	/* create db strategy */
 } createdb_failure_params;
 
 typedef struct
@@ -79,6 +97,17 @@ typedef struct
 	Oid			dest_tsoid;		/* tablespace we are trying to move to */
 } movedb_failure_params;
 
+/*
+ * Information about a relation to be copied when creating a database.
+ */
+typedef struct CreateDBRelInfo
+{
+	RelFileNode rnode;			/* physical relation identifier */
+	Oid			reloid;			/* relation oid */
+	bool		permanent;		/* relation is permanent or unlogged */
+} CreateDBRelInfo;
+
+
 /* non-export function prototypes */
 static void createdb_failure_callback(int code, Datum arg);
 static void movedb(const char *dbname, const char *tblspcname);
@@ -94,7 +123,546 @@ static bool have_createdb_privilege(void);
 static void remove_dbtablespaces(Oid db_id);
 static bool check_db_file_conflict(Oid db_id);
 static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
+static void CreateDatabaseUsingWalLog(Oid src_dboid, Oid dboid, Oid src_tsid,
+									  Oid dst_tsid);
+static List *ScanSourceDatabasePgClass(Oid srctbid, Oid srcdbid, char *srcpath);
+static List *ScanSourceDatabasePgClassPage(Page page, Buffer buf, Oid tbid,
+										   Oid dbid, char *srcpath,
+										   List *rnodelist, Snapshot snapshot);
+static CreateDBRelInfo *ScanSourceDatabasePgClassTuple(HeapTupleData *tuple,
+													   Oid tbid, Oid dbid,
+													   char *srcpath);
+static void CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid,
+									bool isRedo);
+static void CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dboid, Oid src_tsid,
+										Oid dst_tsid);
+
+/*
+ * Create a new database using the WAL_LOG strategy.
+ *
+ * Each copied block is separately written to the write-ahead log.
+ */
+static void
+CreateDatabaseUsingWalLog(Oid src_dboid, Oid dst_dboid,
+						  Oid src_tsid, Oid dst_tsid)
+{
+	char	   *srcpath;
+	char	   *dstpath;
+	List	   *rnodelist = NIL;
+	ListCell   *cell;
+	LockRelId	srcrelid;
+	LockRelId	dstrelid;
+	RelFileNode srcrnode;
+	RelFileNode dstrnode;
+	CreateDBRelInfo *relinfo;
+
+	/* Get source and destination database paths. */
+	srcpath = GetDatabasePath(src_dboid, src_tsid);
+	dstpath = GetDatabasePath(dst_dboid, dst_tsid);
+
+	/* Create database directory and write PG_VERSION file. */
+	CreateDirAndVersionFile(dstpath, dst_dboid, dst_tsid, false);
+
+	/* Copy relmap file from source database to the destination database. */
+	RelationMapCopy(dst_dboid, dst_tsid, srcpath, dstpath);
+
+	/* Get list of relfilenodes to copy from the source database. */
+	rnodelist = ScanSourceDatabasePgClass(src_tsid, src_dboid, srcpath);
+	Assert(rnodelist != NIL);
+
+	/*
+	 * Database IDs will be the same for all relations so set them before
+	 * entering the loop.
+	 */
+	srcrelid.dbId = src_dboid;
+	dstrelid.dbId = dst_dboid;
+
+	/* Loop over our list of relfilenodes and copy each one. */
+	foreach(cell, rnodelist)
+	{
+		relinfo = lfirst(cell);
+		srcrnode = relinfo->rnode;
+
+		/*
+		 * If the relation is from the source db's default tablespace then we
+		 * need to create it in the destinations db's default tablespace.
+		 * Otherwise, we need to create in the same tablespace as it is in the
+		 * source database.
+		 */
+		if (srcrnode.spcNode == src_tsid)
+			dstrnode.spcNode = dst_tsid;
+		else
+			dstrnode.spcNode = srcrnode.spcNode;
+
+		dstrnode.dbNode = dst_dboid;
+		dstrnode.relNode = srcrnode.relNode;
+
+		/*
+		 * Acquire locks on source and target relations before copying.
+		 *
+		 * We typically do not read relation data into shared_buffers without
+		 * holding a relation lock. It's unclear what could go wrong if we
+		 * skipped it in this case, because nobody can be modifying either
+		 * the source or destination database at this point, and we have locks
+		 * on both databases, too, but let's take the conservative route.
+		 */
+		dstrelid.relId = srcrelid.relId = relinfo->reloid;
+		LockRelationId(&srcrelid, AccessShareLock);
+		LockRelationId(&dstrelid, AccessShareLock);
+
+		/* Copy relation storage from source to the destination. */
+		CreateAndCopyRelationData(srcrnode, dstrnode, relinfo->permanent);
 
+		/* Release the relation locks. */
+		UnlockRelationId(&srcrelid, AccessShareLock);
+		UnlockRelationId(&dstrelid, AccessShareLock);
+	}
+
+	list_free_deep(rnodelist);
+}
+
+/*
+ * Scan the pg_class table in the source database to identify the relations
+ * that need to be copied to the destination database.
+ *
+ * This is an exception to the usual rule that cross-database access is
+ * not possible. We can make it work here because we know that there are no
+ * connections to the source database and (since there can't be prepared
+ * transactions touching that database) no in-doubt tuples either. This
+ * means that we don't need to worry about pruning removing anything from
+ * under us, and we don't need to be too picky about our snapshot either.
+ * As long as it sees all previously-committed XIDs as committed and all
+ * aborted XIDs as aborted, we should be fine: nothing else is possible
+ * here.
+ *
+ * We can't rely on the relcache for anything here, because that only knows
+ * about the database to which we are connected, and can't handle access to
+ * other databases. That also means we can't rely on the heap scan
+ * infrastructure, which would be a bad idea anyway since it might try
+ * to do things like HOT pruning which we definitely can't do safely in
+ * a database to which we're not even connected.
+ */
+static List *
+ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
+{
+	RelFileNode rnode;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	Buffer		buf;
+	Oid			relfilenode;
+	Page		page;
+	List	   *rnodelist = NIL;
+	LockRelId	relid;
+	Relation	rel;
+	Snapshot	snapshot;
+	BufferAccessStrategy bstrategy;
+
+	/* Get pg_class relfilenode. */
+	relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+													  RelationRelationId);
+
+	/* Don't read data into shared_buffers without holding a relation lock. */
+	relid.dbId = dbid;
+	relid.relId = RelationRelationId;
+	LockRelationId(&relid, AccessShareLock);
+
+	/* Prepare a RelFileNode for the pg_class relation. */
+	rnode.spcNode = tbid;
+	rnode.dbNode = dbid;
+	rnode.relNode = relfilenode;
+
+	/*
+	 * We can't use a real relcache entry for a relation in some other
+	 * database, but since we're only going to access the fields related
+	 * to physical storage, a fake one is good enough. If we didn't do this
+	 * and used the smgr layer directly, we would have to worry about
+	 * invalidations.
+	 */
+	rel = CreateFakeRelcacheEntry(rnode);
+	nblocks = smgrnblocks(RelationGetSmgr(rel), MAIN_FORKNUM);
+	FreeFakeRelcacheEntry(rel);
+
+	/* Use a buffer access strategy since this is a bulk read operation. */
+	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	/*
+	 * As explained in the function header comments, we need a snapshot that
+	 * will see all committed transactions as committed, and our transaction
+	 * snapshot - or the active snapshot - might not be new enough for that,
+	 * but the return value of GetLatestSnapshot() should work fine.
+	 */
+	snapshot = GetLatestSnapshot();
+
+	/* Process the relation block by block. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
+										RBM_NORMAL, bstrategy, false);
+
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buf);
+		if (PageIsNew(page) || PageIsEmpty(page))
+		{
+			UnlockReleaseBuffer(buf);
+			continue;
+		}
+
+		/* Append relevant pg_class tuples for current page to rnodelist. */
+		rnodelist = ScanSourceDatabasePgClassPage(page, buf, tbid, dbid,
+												  srcpath, rnodelist,
+												  snapshot);
+
+		UnlockReleaseBuffer(buf);
+	}
+
+	/* Release relation lock. */
+	UnlockRelationId(&relid, AccessShareLock);
+
+	return rnodelist;
+}
+
+/*
+ * Scan one page of the source database's pg_class relation and add relevant
+ * entries to rnodelist. The return value is the updated list.
+ */
+static List *
+ScanSourceDatabasePgClassPage(Page page, Buffer buf, Oid tbid, Oid dbid,
+							  char *srcpath, List *rnodelist,
+							  Snapshot snapshot)
+{
+	BlockNumber		blkno = BufferGetBlockNumber(buf);
+	OffsetNumber	offnum;
+	OffsetNumber	maxoff;
+	HeapTupleData	tuple;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Loop over offsets. */
+	for (offnum = FirstOffsetNumber;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid;
+
+		itemid = PageGetItemId(page, offnum);
+
+		/* Nothing to do if slot is empty or already dead. */
+		if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid) ||
+			ItemIdIsRedirected(itemid))
+			continue;
+
+		Assert(ItemIdIsNormal(itemid));
+		ItemPointerSet(&(tuple.t_self), blkno, offnum);
+
+		/* Initialize a HeapTupleData structure. */
+		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+		tuple.t_len = ItemIdGetLength(itemid);
+		tuple.t_tableOid = RelationRelationId;
+
+		/* Skip tuples that are not visible to this snapshot. */
+		if (HeapTupleSatisfiesVisibility(&tuple, snapshot, buf))
+		{
+			CreateDBRelInfo *relinfo;
+
+			/*
+			 * ScanSourceDatabasePgClassTuple is in charge of constructing
+			 * a CreateDBRelInfo object for this tuple, but can also decide
+			 * that this tuple isn't something we need to copy. If we do need
+			 * to copy the relation, add it to the list.
+			 */
+			relinfo = ScanSourceDatabasePgClassTuple(&tuple, tbid, dbid,
+													 srcpath);
+			if (relinfo != NULL)
+				rnodelist = lappend(rnodelist, relinfo);
+		}
+	}
+
+	return rnodelist;
+}
+
+/*
+ * Decide whether a certain pg_class tuple represents something that
+ * needs to be copied from the source database to the destination database,
+ * and if so, construct a CreateDBRelInfo for it.
+ *
+ * Visibility checks are handled by the caller, so our job here is just
+ * to assess the data stored in the tuple.
+ */
+static CreateDBRelInfo *
+ScanSourceDatabasePgClassTuple(HeapTupleData *tuple, Oid tbid, Oid dbid,
+							   char *srcpath)
+{
+	CreateDBRelInfo	   *relinfo;
+	Form_pg_class		classForm;
+	Oid					relfilenode = InvalidOid;
+
+	classForm = (Form_pg_class) GETSTRUCT(tuple);
+
+	/*
+	 * Return NULL if this object does not need to be copied.
+	 *
+	 * Shared objects don't need to be copied, because they are shared.
+	 * Objects without storage can't be copied, because there's nothing to
+	 * copy. Temporary relations don't need to be copied either, because
+	 * they are inaccessible outside of the session that created them,
+	 * which must be gone already, and couldn't connect to a different database
+	 * if it still existed. autovacuum will eventually remove the pg_class
+	 * entries as well.
+	 */
+	if (classForm->reltablespace == GLOBALTABLESPACE_OID ||
+		!RELKIND_HAS_STORAGE(classForm->relkind) ||
+		classForm->relpersistence == RELPERSISTENCE_TEMP)
+		return NULL;
+
+	/*
+	 * If relfilenode is valid then directly use it.  Otherwise, consult the
+	 * relmap.
+	 */
+	if (OidIsValid(classForm->relfilenode))
+		relfilenode = classForm->relfilenode;
+	else
+		relfilenode = RelationMapOidToFilenodeForDatabase(srcpath,
+														  classForm->oid);
+
+	/* We must have a valid relfilenode oid. */
+	if (!OidIsValid(relfilenode))
+		elog(ERROR, "relation with OID %u does not have a valid relfilenode",
+			 classForm->oid);
+
+	/* Prepare a rel info element; the caller adds it to the list. */
+	relinfo = (CreateDBRelInfo *) palloc(sizeof(CreateDBRelInfo));
+	if (OidIsValid(classForm->reltablespace))
+		relinfo->rnode.spcNode = classForm->reltablespace;
+	else
+		relinfo->rnode.spcNode = tbid;
+
+	relinfo->rnode.dbNode = dbid;
+	relinfo->rnode.relNode = relfilenode;
+	relinfo->reloid = classForm->oid;
+
+	/* Temporary relations were rejected above. */
+	Assert(classForm->relpersistence != RELPERSISTENCE_TEMP);
+	relinfo->permanent =
+		(classForm->relpersistence == RELPERSISTENCE_PERMANENT);
+
+	return relinfo;
+}
+
+/*
+ * Create database directory and write out the PG_VERSION file in the database
+ * path.  If isRedo is true, it's okay for the database directory to exist
+ * already.
+ */
+static void
+CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
+{
+	int			fd;
+	int			nbytes;
+	char		versionfile[MAXPGPATH];
+	char		buf[16];
+
+	/*
+	 * Prepare version data before starting a critical section.
+	 *
+	 * Note that we don't have to copy this from the source database; there's
+	 * only one legal value.
+	 */
+	snprintf(buf, sizeof(buf), "%s\n", PG_MAJORVERSION);
+	nbytes = strlen(PG_MAJORVERSION) + 1;
+
+	/* If we are not in WAL replay then write the WAL. */
+	if (!isRedo)
+	{
+		xl_dbase_create_wal_log_rec xlrec;
+		XLogRecPtr	lsn;
+
+		START_CRIT_SECTION();
+
+		xlrec.db_id = dbid;
+		xlrec.tablespace_id = tsid;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec),
+						 sizeof(xl_dbase_create_wal_log_rec));
+
+		lsn = XLogInsert(RM_DBASE_ID, XLOG_DBASE_CREATE_WAL_LOG);
+
+		/* As always, WAL must hit the disk before the data update does. */
+		XLogFlush(lsn);
+	}
+
+	/* Create database directory. */
+	if (MakePGDirectory(dbpath) < 0)
+	{
+		/* Failure other than already exists or not in WAL replay? */
+		if (errno != EEXIST || !isRedo)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create directory \"%s\": %m", dbpath)));
+	}
+
+	/*
+	 * Create PG_VERSION file in the database path.  If the file already
+	 * exists and we are in WAL replay then try again to open it in write
+	 * mode.
+	 */
+	snprintf(versionfile, sizeof(versionfile), "%s/%s", dbpath, "PG_VERSION");
+
+	fd = OpenTransientFile(versionfile, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	if (fd < 0 && errno == EEXIST && isRedo)
+		fd = OpenTransientFile(versionfile, O_WRONLY | O_TRUNC | PG_BINARY);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m", versionfile)));
+
+	/* Write PG_MAJORVERSION in the PG_VERSION file. */
+	pgstat_report_wait_start(WAIT_EVENT_VERSION_FILE_WRITE);
+	errno = 0;
+	if ((int) write(fd, buf, nbytes) != nbytes)
+	{
+		/* If write didn't set errno, assume problem is no disk space. */
+		if (errno == 0)
+			errno = ENOSPC;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", versionfile)));
+	}
+	pgstat_report_wait_end();
+
+	/* Close the version file. */
+	CloseTransientFile(fd);
+
+	/* Critical section done. */
+	if (!isRedo)
+		END_CRIT_SECTION();
+}
+
+/*
+ * Create a new database using the FILE_COPY strategy.
+ *
+ * Copy each tablespace at the filesystem level, and log a single WAL record
+ * for each tablespace copied.  This requires a checkpoint before and after the
+ * copy, which may be expensive, but it does greatly reduce WAL generation
+ * if the copied database is large.
+ */
+static void
+CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dst_dboid, Oid src_tsid,
+							Oid dst_tsid)
+{
+	TableScanDesc scan;
+	Relation	rel;
+	HeapTuple	tuple;
+
+	/*
+	 * Force a checkpoint before starting the copy. This will force all dirty
+	 * buffers, including those of unlogged tables, out to disk, to ensure
+	 * source database is up-to-date on disk for the copy.
+	 * FlushDatabaseBuffers() would suffice for that, but we also want to
+	 * process any pending unlink requests. Otherwise, if a checkpoint
+	 * happened while we're copying files, a file might be deleted just when
+	 * we're about to copy it, causing the lstat() call in copydir() to fail
+	 * with ENOENT.
+	 */
+	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
+					  CHECKPOINT_WAIT | CHECKPOINT_FLUSH_ALL);
+
+	/*
+	 * Iterate through all tablespaces of the template database, and copy each
+	 * one to the new database.
+	 */
+	rel = table_open(TableSpaceRelationId, AccessShareLock);
+	scan = table_beginscan_catalog(rel, 0, NULL);
+	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	{
+		Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
+		Oid			srctablespace = spaceform->oid;
+		Oid			dsttablespace;
+		char	   *srcpath;
+		char	   *dstpath;
+		struct stat st;
+
+		/* No need to copy global tablespace */
+		if (srctablespace == GLOBALTABLESPACE_OID)
+			continue;
+
+		srcpath = GetDatabasePath(src_dboid, srctablespace);
+
+		if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
+			directory_is_empty(srcpath))
+		{
+			/* Assume we can ignore it */
+			pfree(srcpath);
+			continue;
+		}
+
+		if (srctablespace == src_tsid)
+			dsttablespace = dst_tsid;
+		else
+			dsttablespace = srctablespace;
+
+		dstpath = GetDatabasePath(dst_dboid, dsttablespace);
+
+		/*
+		 * Copy this subdirectory to the new location
+		 *
+		 * We don't need to copy subdirectories
+		 */
+		copydir(srcpath, dstpath, false);
+
+		/* Record the filesystem change in XLOG */
+		{
+			xl_dbase_create_file_copy_rec xlrec;
+
+			xlrec.db_id = dst_dboid;
+			xlrec.tablespace_id = dsttablespace;
+			xlrec.src_db_id = src_dboid;
+			xlrec.src_tablespace_id = srctablespace;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec,
+							 sizeof(xl_dbase_create_file_copy_rec));
+
+			(void) XLogInsert(RM_DBASE_ID,
+							  XLOG_DBASE_CREATE_FILE_COPY | XLR_SPECIAL_REL_UPDATE);
+		}
+	}
+	table_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	/*
+	 * We force a checkpoint before committing.  This effectively means that
+	 * committed XLOG_DBASE_CREATE_FILE_COPY operations will never need to be
+	 * replayed (at least not in ordinary crash recovery; we still have to
+	 * make the XLOG entry for the benefit of PITR operations). This avoids
+	 * two nasty scenarios:
+	 *
+	 * #1: When PITR is off, we don't XLOG the contents of newly created
+	 * indexes; therefore the drop-and-recreate-whole-directory behavior of
+	 * DBASE_CREATE replay would lose such indexes.
+	 *
+	 * #2: Since we have to recopy the source database during DBASE_CREATE
+	 * replay, we run the risk of copying changes in it that were committed
+	 * after the original CREATE DATABASE command but before the system crash
+	 * that led to the replay.  This is at least unexpected and at worst could
+	 * lead to inconsistencies, eg duplicate table names.
+	 *
+	 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
+	 *
+	 * In PITR replay, the first of these isn't an issue, and the second is
+	 * only a risk if the CREATE DATABASE and subsequent template database
+	 * change both occur while a base backup is being taken. There doesn't
+	 * seem to be much we can do about that except document it as a
+	 * limitation.
+	 *
+	 * See CreateDatabaseUsingWalLog() for a less cheesy CREATE DATABASE
+	 * strategy that avoids these problems.
+	 */
+	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+}
 
 /*
  * CREATE DATABASE
@@ -102,8 +670,6 @@ static int	errdetail_busy_db(int notherbackends, int npreparedxacts);
 Oid
 createdb(ParseState *pstate, const CreatedbStmt *stmt)
 {
-	TableScanDesc scan;
-	Relation	rel;
 	Oid			src_dboid;
 	Oid			src_owner;
 	int			src_encoding = -1;
@@ -138,6 +704,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	DefElem    *dallowconnections = NULL;
 	DefElem    *dconnlimit = NULL;
 	DefElem    *dcollversion = NULL;
+	DefElem    *dstrategy = NULL;
 	char	   *dbname = stmt->dbname;
 	char	   *dbowner = NULL;
 	const char *dbtemplate = NULL;
@@ -153,6 +720,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	char	   *dbcollversion = NULL;
 	int			notherbackends;
 	int			npreparedxacts;
+	CreateDBStrategy dbstrategy = CREATEDB_WAL_LOG;
 	createdb_failure_params fparms;
 
 	/* Extract options from the statement node tree */
@@ -270,6 +838,12 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 						(errcode(ERRCODE_INVALID_PARAMETER_VALUE)),
 						errmsg("OIDs less than %u are reserved for system objects", FirstNormalObjectId));
 		}
+		else if (strcmp(defel->defname, "strategy") == 0)
+		{
+			if (dstrategy)
+				errorConflictingDefElem(defel, pstate);
+			dstrategy = defel;
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -414,6 +988,23 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 							dbtemplate)));
 	}
 
+	/* Validate the database creation strategy. */
+	if (dstrategy && dstrategy->arg)
+	{
+		char	   *strategy;
+
+		strategy = defGetString(dstrategy);
+		if (strcmp(strategy, "wal_log") == 0)
+			dbstrategy = CREATEDB_WAL_LOG;
+		else if (strcmp(strategy, "file_copy") == 0)
+			dbstrategy = CREATEDB_FILE_COPY;
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("invalid create database strategy %s", strategy),
+					 errhint("Valid strategies are \"wal_log\" and \"file_copy\".")));
+	}
+
 	/* If encoding or locales are defaulted, use source's setting */
 	if (encoding < 0)
 		encoding = src_encoding;
@@ -754,17 +1345,18 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	InvokeObjectPostCreateHook(DatabaseRelationId, dboid, 0);
 
 	/*
-	 * Force a checkpoint before starting the copy. This will force all dirty
-	 * buffers, including those of unlogged tables, out to disk, to ensure
-	 * source database is up-to-date on disk for the copy.
-	 * FlushDatabaseBuffers() would suffice for that, but we also want to
-	 * process any pending unlink requests. Otherwise, if a checkpoint
-	 * happened while we're copying files, a file might be deleted just when
-	 * we're about to copy it, causing the lstat() call in copydir() to fail
-	 * with ENOENT.
+	 * If we're going to be reading data for the to-be-created database
+	 * into shared_buffers, take a lock on it. Nobody should know that this
+	 * database exists yet, but it's good to maintain the invariant that an
+	 * AccessExclusiveLock on the database is sufficient to drop all of its
+	 * buffers without worrying about more being read later.
+	 *
+	 * Note that we need to do this before entering the PG_ENSURE_ERROR_CLEANUP
+	 * block below, because createdb_failure_callback expects this lock to
+	 * be held already.
 	 */
-	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-					  | CHECKPOINT_FLUSH_ALL);
+	if (dbstrategy == CREATEDB_WAL_LOG)
+		LockSharedObject(DatabaseRelationId, dboid, 0, AccessShareLock);
 
 	/*
 	 * Once we start copying subdirectories, we need to be able to clean 'em
@@ -775,101 +1367,24 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	 */
 	fparms.src_dboid = src_dboid;
 	fparms.dest_dboid = dboid;
+	fparms.strategy = dbstrategy;
+
 	PG_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 							PointerGetDatum(&fparms));
 	{
 		/*
-		 * Iterate through all tablespaces of the template database, and copy
-		 * each one to the new database.
+		 * If the user has asked to create a database with the WAL_LOG
+		 * strategy, call CreateDatabaseUsingWalLog, which copies the
+		 * database at the block level and WAL-logs each copied block.
+		 * Otherwise, call CreateDatabaseUsingFileCopy, which copies the
+		 * database file by file.
 		 */
-		rel = table_open(TableSpaceRelationId, AccessShareLock);
-		scan = table_beginscan_catalog(rel, 0, NULL);
-		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
-		{
-			Form_pg_tablespace spaceform = (Form_pg_tablespace) GETSTRUCT(tuple);
-			Oid			srctablespace = spaceform->oid;
-			Oid			dsttablespace;
-			char	   *srcpath;
-			char	   *dstpath;
-			struct stat st;
-
-			/* No need to copy global tablespace */
-			if (srctablespace == GLOBALTABLESPACE_OID)
-				continue;
-
-			srcpath = GetDatabasePath(src_dboid, srctablespace);
-
-			if (stat(srcpath, &st) < 0 || !S_ISDIR(st.st_mode) ||
-				directory_is_empty(srcpath))
-			{
-				/* Assume we can ignore it */
-				pfree(srcpath);
-				continue;
-			}
-
-			if (srctablespace == src_deftablespace)
-				dsttablespace = dst_deftablespace;
-			else
-				dsttablespace = srctablespace;
-
-			dstpath = GetDatabasePath(dboid, dsttablespace);
-
-			/*
-			 * Copy this subdirectory to the new location
-			 *
-			 * We don't need to copy subdirectories
-			 */
-			copydir(srcpath, dstpath, false);
-
-			/* Record the filesystem change in XLOG */
-			{
-				xl_dbase_create_rec xlrec;
-
-				xlrec.db_id = dboid;
-				xlrec.tablespace_id = dsttablespace;
-				xlrec.src_db_id = src_dboid;
-				xlrec.src_tablespace_id = srctablespace;
-
-				XLogBeginInsert();
-				XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
-
-				(void) XLogInsert(RM_DBASE_ID,
-								  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
-			}
-		}
-		table_endscan(scan);
-		table_close(rel, AccessShareLock);
-
-		/*
-		 * We force a checkpoint before committing.  This effectively means
-		 * that committed XLOG_DBASE_CREATE operations will never need to be
-		 * replayed (at least not in ordinary crash recovery; we still have to
-		 * make the XLOG entry for the benefit of PITR operations). This
-		 * avoids two nasty scenarios:
-		 *
-		 * #1: When PITR is off, we don't XLOG the contents of newly created
-		 * indexes; therefore the drop-and-recreate-whole-directory behavior
-		 * of DBASE_CREATE replay would lose such indexes.
-		 *
-		 * #2: Since we have to recopy the source database during DBASE_CREATE
-		 * replay, we run the risk of copying changes in it that were
-		 * committed after the original CREATE DATABASE command but before the
-		 * system crash that led to the replay.  This is at least unexpected
-		 * and at worst could lead to inconsistencies, eg duplicate table
-		 * names.
-		 *
-		 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
-		 *
-		 * In PITR replay, the first of these isn't an issue, and the second
-		 * is only a risk if the CREATE DATABASE and subsequent template
-		 * database change both occur while a base backup is being taken.
-		 * There doesn't seem to be much we can do about that except document
-		 * it as a limitation.
-		 *
-		 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
-		 * we can avoid this.
-		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+		if (dbstrategy == CREATEDB_WAL_LOG)
+			CreateDatabaseUsingWalLog(src_dboid, dboid, src_deftablespace,
+									  dst_deftablespace);
+		else
+			CreateDatabaseUsingFileCopy(src_dboid, dboid, src_deftablespace,
+										dst_deftablespace);
 
 		/*
 		 * Close pg_database, but keep lock till commit.
@@ -956,6 +1471,25 @@ createdb_failure_callback(int code, Datum arg)
 	createdb_failure_params *fparms = (createdb_failure_params *) DatumGetPointer(arg);
 
 	/*
+	 * If we were copying the database at block level, then drop pages for the
+	 * destination database that are in the shared buffer cache, and tell the
+	 * checkpointer to forget any pending fsync and unlink requests for files
+	 * in the database.  The reasoning behind doing this is the same as
+	 * explained in dropdb().  But unlike dropdb(), we don't need to call
+	 * pgstat_drop_database, because this database has not been created yet,
+	 * so there should not be any stats for it.
+	 */
+	if (fparms->strategy == CREATEDB_WAL_LOG)
+	{
+		DropDatabaseBuffers(fparms->dest_dboid);
+		ForgetDatabaseSyncRequests(fparms->dest_dboid);
+
+		/* Release lock on the target database. */
+		UnlockSharedObject(DatabaseRelationId, fparms->dest_dboid, 0,
+						   AccessShareLock);
+	}
+
+	/*
 	 * Release lock on source database before doing recursive remove. This is
 	 * not essential but it seems desirable to release the lock as soon as
 	 * possible.
@@ -1479,7 +2013,7 @@ movedb(const char *dbname, const char *tblspcname)
 		 * Record the filesystem change in XLOG
 		 */
 		{
-			xl_dbase_create_rec xlrec;
+			xl_dbase_create_file_copy_rec xlrec;
 
 			xlrec.db_id = db_id;
 			xlrec.tablespace_id = dst_tblspcoid;
@@ -1487,10 +2021,11 @@ movedb(const char *dbname, const char *tblspcname)
 			xlrec.src_tablespace_id = src_tblspcoid;
 
 			XLogBeginInsert();
-			XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_create_rec));
+			XLogRegisterData((char *) &xlrec,
+							 sizeof(xl_dbase_create_file_copy_rec));
 
 			(void) XLogInsert(RM_DBASE_ID,
-							  XLOG_DBASE_CREATE | XLR_SPECIAL_REL_UPDATE);
+							  XLOG_DBASE_CREATE_FILE_COPY | XLR_SPECIAL_REL_UPDATE);
 		}
 
 		/*
@@ -1526,9 +2061,10 @@ movedb(const char *dbname, const char *tblspcname)
 
 		/*
 		 * Force another checkpoint here.  As in CREATE DATABASE, this is to
-		 * ensure that we don't have to replay a committed XLOG_DBASE_CREATE
-		 * operation, which would cause us to lose any unlogged operations
-		 * done in the new DB tablespace before the next checkpoint.
+		 * ensure that we don't have to replay a committed
+		 * XLOG_DBASE_CREATE_FILE_COPY operation, which would cause us to lose
+		 * any unlogged operations done in the new DB tablespace before the
+		 * next checkpoint.
 		 */
 		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
 
@@ -2479,9 +3015,10 @@ dbase_redo(XLogReaderState *record)
 	/* Backup blocks are not used in dbase records */
 	Assert(!XLogRecHasAnyBlockRefs(record));
 
-	if (info == XLOG_DBASE_CREATE)
+	if (info == XLOG_DBASE_CREATE_FILE_COPY)
 	{
-		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
+		xl_dbase_create_file_copy_rec *xlrec =
+		(xl_dbase_create_file_copy_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
 		char	   *parent_path;
@@ -2568,6 +3105,44 @@ dbase_redo(XLogReaderState *record)
 		 */
 		copydir(src_path, dst_path, false);
 	}
+	else if (info == XLOG_DBASE_CREATE_WAL_LOG)
+	{
+		xl_dbase_create_wal_log_rec *xlrec =
+		(xl_dbase_create_wal_log_rec *) XLogRecGetData(record);
+		char	   *dbpath;
+
+		dbpath = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+		if (!reachedConsistency)
+		{
+			char	   *parent_path;
+			struct stat st;
+
+			/*
+			 * Skip the replay of database directory creation if parent
+			 * tablespace directory is missing.  For more details, refer to the
+			 * comments in the XLOG_DBASE_CREATE_FILE_COPY case above.
+			 */
+			parent_path = pstrdup(dbpath);
+			get_parent_directory(parent_path);
+			if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+			{
+				XLogRememberMissingDir(xlrec->tablespace_id, InvalidOid, parent_path);
+				ereport(WARNING,
+						(errmsg("skipping replay of database creation WAL record"),
+						 errdetail("The target tablespace \"%s\" directory was not found.",
+								   parent_path),
+						 errhint("A future WAL record that removes the directory before reaching consistent mode is expected.")));
+				pfree(parent_path);
+
+				return;
+			}
+			pfree(parent_path);
+		}
+
+		/* Create the database directory with the version file. */
+		CreateDirAndVersionFile(dbpath, xlrec->db_id, xlrec->tablespace_id,
+								true);
+	}
 	else if (info == XLOG_DBASE_DROP)
 	{
 		xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) XLogRecGetData(record);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 124b996..51b4a00 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14626,7 +14626,7 @@ index_copy_data(Relation rel, RelFileNode newrnode)
 	 * NOTE: any conflict in relfilenode value will be caught in
 	 * RelationCreateStorage().
 	 */
-	RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
+	RelationCreateStorage(newrnode, rel->rd_rel->relpersistence, true);
 
 	/* copy main fork */
 	RelationCopyStorage(RelationGetSmgr(rel), dstrel, MAIN_FORKNUM,
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 11005ed..d73a40c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -38,6 +38,7 @@
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
 #include "executor/instrument.h"
 #include "lib/binaryheap.h"
 #include "miscadmin.h"
@@ -486,6 +487,9 @@ static void FindAndDropRelFileNodeBuffers(RelFileNode rnode,
 										  ForkNumber forkNum,
 										  BlockNumber nForkBlock,
 										  BlockNumber firstDelBlock);
+static void RelationCopyStorageUsingBuffer(Relation src, Relation dst,
+										   ForkNumber forkNum,
+										   bool permanent);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int	rnode_comparator(const void *p1, const void *p2);
@@ -772,23 +776,23 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
  * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
  *		a relcache entry for the relation.
  *
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay.  If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
+ * Pass permanent = true for a RELPERSISTENCE_PERMANENT relation, and
+ * permanent = false for a RELPERSISTENCE_UNLOGGED relation. This function
+ * cannot be used for temporary relations (and making that work might be
+ * difficult, unless we only want to read temporary relations for our own
+ * BackendId).
  */
 Buffer
 ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
-						  BufferAccessStrategy strategy)
+						  BufferAccessStrategy strategy, bool permanent)
 {
 	bool		hit;
 
 	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
 
-	Assert(InRecovery);
-
-	return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
+	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -3677,6 +3681,158 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
 }
 
 /* ---------------------------------------------------------------------
+ *		RelationCopyStorageUsingBuffer
+ *
+ *		Copy fork's data using bufmgr.  Same as RelationCopyStorage but instead
+ *		of using smgrread and smgrextend this will copy using bufmgr APIs.
+ *
+ *		Refer to the comments atop CreateAndCopyRelationData() for details
+ *		about the 'permanent' parameter.
+ * --------------------------------------------------------------------
+ */
+static void
+RelationCopyStorageUsingBuffer(Relation src, Relation dst, ForkNumber forkNum,
+							   bool permanent)
+{
+	Buffer		srcBuf;
+	Buffer		dstBuf;
+	Page		srcPage;
+	Page		dstPage;
+	bool		use_wal;
+	BlockNumber nblocks;
+	BlockNumber blkno;
+	BufferAccessStrategy bstrategy_src;
+	BufferAccessStrategy bstrategy_dst;
+
+	/*
+	 * In general, we want to write WAL whenever wal_level > 'minimal', but
+	 * we can skip it when copying any fork of an unlogged relation other
+	 * than the init fork.
+	 */
+	use_wal = XLogIsNeeded() && (permanent || forkNum == INIT_FORKNUM);
+
+	/* Get number of blocks in the source relation. */
+	nblocks = smgrnblocks(RelationGetSmgr(src), forkNum);
+
+	/* Nothing to copy; just return. */
+	if (nblocks == 0)
+		return;
+
+	/* This is a bulk operation, so use buffer access strategies. */
+	bstrategy_src = GetAccessStrategy(BAS_BULKREAD);
+	bstrategy_dst = GetAccessStrategy(BAS_BULKWRITE);
+
+	/* Iterate over each block of the source relation file. */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		/* Read block from source relation. */
+		srcBuf = ReadBufferWithoutRelcache(src->rd_node, forkNum, blkno,
+										   RBM_NORMAL, bstrategy_src,
+										   permanent);
+		srcPage = BufferGetPage(srcBuf);
+		if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
+		{
+			ReleaseBuffer(srcBuf);
+			continue;
+		}
+
+		/* Use P_NEW to extend the destination relation. */
+		dstBuf = ReadBufferWithoutRelcache(dst->rd_node, forkNum, P_NEW,
+										   RBM_NORMAL, bstrategy_dst,
+										   permanent);
+		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
+
+		START_CRIT_SECTION();
+
+		/* Copy page data from the source to the destination. */
+		dstPage = BufferGetPage(dstBuf);
+		memcpy(dstPage, srcPage, BLCKSZ);
+		MarkBufferDirty(dstBuf);
+
+		/* WAL-log the copied page. */
+		if (use_wal)
+			log_newpage_buffer(dstBuf, true);
+
+		END_CRIT_SECTION();
+
+		UnlockReleaseBuffer(dstBuf);
+		ReleaseBuffer(srcBuf);
+	}
+}
+
+/* ---------------------------------------------------------------------
+ *		CreateAndCopyRelationData
+ *
+ *		Create destination relation storage and copy all forks from the
+ *		source relation to the destination.
+ *
+ *		Pass permanent as true for permanent relations and false for
+ *		unlogged relations.  Currently this API is not supported for
+ *		temporary relations.
+ * --------------------------------------------------------------------
+ */
+void
+CreateAndCopyRelationData(RelFileNode src_rnode, RelFileNode dst_rnode,
+						  bool permanent)
+{
+	Relation		src_rel;
+	Relation		dst_rel;
+	char			relpersistence;
+
+	/* Set the relpersistence. */
+	relpersistence = permanent ?
+		RELPERSISTENCE_PERMANENT : RELPERSISTENCE_UNLOGGED;
+
+	/*
+	 * We can't use a real relcache entry for a relation in some other
+	 * database, but since we're only going to access the fields related
+	 * to physical storage, a fake one is good enough. If we didn't do this
+	 * and used the smgr layer directly, we would have to worry about
+	 * invalidations.
+	 */
+	src_rel = CreateFakeRelcacheEntry(src_rnode);
+	dst_rel = CreateFakeRelcacheEntry(dst_rnode);
+
+	/*
+	 * Create and copy all forks of the relation.  During create database we
+	 * have a separate cleanup mechanism which deletes complete database
+	 * directory.  Therefore, each individual relation doesn't need to be
+	 * registered for cleanup.
+	 */
+	RelationCreateStorage(dst_rnode, relpersistence, false);
+
+	/* copy main fork. */
+	RelationCopyStorageUsingBuffer(src_rel, dst_rel, MAIN_FORKNUM, permanent);
+
+	/* copy those extra forks that exist */
+	for (ForkNumber forkNum = MAIN_FORKNUM + 1;
+		 forkNum <= MAX_FORKNUM; forkNum++)
+	{
+		if (smgrexists(RelationGetSmgr(src_rel), forkNum))
+		{
+			smgrcreate(RelationGetSmgr(dst_rel), forkNum, false);
+
+			/*
+			 * WAL log creation if the relation is persistent, or this is the
+			 * init fork of an unlogged relation.
+			 */
+			if (permanent || forkNum == INIT_FORKNUM)
+				log_smgrcreate(&dst_rnode, forkNum);
+
+			/* Copy a fork's data, block by block. */
+			RelationCopyStorageUsingBuffer(src_rel, dst_rel, forkNum,
+										   permanent);
+		}
+	}
+
+	/* Release fake relcache entries. */
+	FreeFakeRelcacheEntry(src_rel);
+	FreeFakeRelcacheEntry(dst_rel);
+}
+
+/* ---------------------------------------------------------------------
  *		FlushDatabaseBuffers
  *
  *		This function writes all dirty pages of a database out to disk
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 5ae52dd..1543da6 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -176,6 +176,34 @@ ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode)
 }
 
 /*
+ *		LockRelationId
+ *
+ * Lock, given a LockRelId.  Same as LockRelationOid but takes a LockRelId as
+ * input.
+ */
+void
+LockRelationId(LockRelId *relid, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+	LOCALLOCK  *locallock;
+	LockAcquireResult res;
+
+	SET_LOCKTAG_RELATION(tag, relid->dbId, relid->relId);
+
+	res = LockAcquireExtended(&tag, lockmode, false, false, true, &locallock);
+
+	/*
+	 * Now that we have the lock, check for invalidation messages; see notes
+	 * in LockRelationOid.
+	 */
+	if (res != LOCKACQUIRE_ALREADY_CLEAR)
+	{
+		AcceptInvalidationMessages();
+		MarkLockClear(locallock);
+	}
+}
+
+/*
  *		UnlockRelationId
  *
  * Unlock, given a LockRelId.  This is preferred over UnlockRelationOid
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index ff46a0e..1c8aba4 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -705,6 +705,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_TWOPHASE_FILE_WRITE:
 			event_name = "TwophaseFileWrite";
 			break;
+		case WAIT_EVENT_VERSION_FILE_WRITE:
+			event_name = "VersionFileWrite";
+			break;
 		case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ:
 			event_name = "WALSenderTimelineHistoryRead";
 			break;
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index d47fac7..a15ce9e 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -3746,7 +3746,7 @@ RelationSetNewRelfilenode(Relation relation, char persistence)
 		/* handle these directly, at least for now */
 		SMgrRelation srel;
 
-		srel = RelationCreateStorage(newrnode, persistence);
+		srel = RelationCreateStorage(newrnode, persistence, true);
 		smgrclose(srel);
 	}
 	else
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 4d0718f..8b6b878 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -46,6 +46,8 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/pg_tablespace.h"
 #include "catalog/storage.h"
@@ -252,6 +254,63 @@ RelationMapFilenodeToOid(Oid filenode, bool shared)
 }
 
 /*
+ * RelationMapOidToFilenodeForDatabase
+ *
+ * Like RelationMapOidToFilenode, but reads the mapping from the indicated
+ * path instead of using the one for the current database.
+ */
+Oid
+RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId)
+{
+	RelMapFile	map;
+	int			i;
+
+	/* Read the relmap file from the source database. */
+	read_relmap_file(&map, dbpath, false, ERROR);
+
+	/* Iterate over the relmap entries to find the input relation OID. */
+	for (i = 0; i < map.num_mappings; i++)
+	{
+		if (relationId == map.mappings[i].mapoid)
+			return map.mappings[i].mapfilenode;
+	}
+
+	return InvalidOid;
+}
+
+/*
+ * RelationMapCopy
+ *
+ * Copy relmapfile from source db path to the destination db path and WAL log
+ * the operation. This is intended for use in creating a new relmap file
+ * for a database that doesn't have one yet, not for replacing an existing
+ * relmap file.
+ */
+void
+RelationMapCopy(Oid dbid, Oid tsid, char *srcdbpath, char *dstdbpath)
+{
+	RelMapFile map;
+
+	/*
+	 * Read the relmap file from the source database.
+	 */
+	read_relmap_file(&map, srcdbpath, false, ERROR);
+
+	/*
+	 * Write the same data into the destination database's relmap file.
+	 *
+	 * No sinval is needed because no one can be connected to the destination
+	 * database yet. For the same reason, there is no need to acquire
+	 * RelationMappingLock.
+	 *
+	 * There's no point in trying to preserve files here. The new database
+	 * isn't usable yet anyway, and won't ever be if we can't install a
+	 * relmap file.
+	 */
+	write_relmap_file(&map, true, false, false, dbid, tsid, dstdbpath);
+}
+
+/*
  * RelationMapUpdateMap
  *
  * Install a new relfilenode mapping for the specified relation.
@@ -1023,6 +1082,33 @@ relmap_redo(XLogReaderState *record)
 
 		/* We need to construct the pathname for this database */
 		dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
+		if (!reachedConsistency)
+		{
+			struct stat st;
+
+			/*
+			 * Skip replaying relmap file writes if the parent database
+			 * directory isn't present.  The reason we need to skip it is that
+			 * if we build the database using the wal_log strategy, then we
+			 * will be creating a new relmap file, and if we skipped creating
+			 * the database directory due to a missing tablespace directory,
+			 * then we will also need to skip this step.  For more details on
+			 * why the database directory creation WAL is skipped, refer to
+			 * comments in dbase_redo().
+			 */
+			if (!(stat(dbpath, &st) == 0 && S_ISDIR(st.st_mode)))
+			{
+				XLogRememberMissingDir(xlrec->tsid, xlrec->dbid, dbpath);
+				ereport(WARNING,
+						(errmsg("skipping replay of relmap file write WAL record"),
+						 errdetail("The target database \"%s\" directory was not found.",
+								   dbpath),
+						 errhint("A future WAL record that removes the directory before reaching consistent mode is expected.")));
+				pfree(dbpath);
+
+				return;
+			}
+		}
 
 		/*
 		 * Write out the new map and send sinval, but of course don't write a
@@ -1031,6 +1117,13 @@ relmap_redo(XLogReaderState *record)
 		 *
 		 * There shouldn't be anyone else updating relmaps during WAL replay,
 		 * but grab the lock to interlock against load_relmap_file().
+		 *
+		 * Note that we use the same WAL record for updating the relmap of
+		 * an existing database as we do for creating a new database. In
+		 * the latter case, taking the relmap lock and sending sinval messages
+		 * is unnecessary, but harmless. If we wanted to avoid it, we could
+		 * add a flag to the WAL record to indicate which operation is being
+		 * performed.
 		 */
 		LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
 		write_relmap_file(&newmap, false, true, false,
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 3ed2a2e..49966e7 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -372,7 +372,7 @@ extractPageInfo(XLogReaderState *record)
 
 	/* Is this a special record type that I recognize? */
 
-	if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_CREATE)
+	if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_CREATE_FILE_COPY)
 	{
 		/*
 		 * New databases can be safely ignored. It won't be present in the
@@ -384,6 +384,13 @@ extractPageInfo(XLogReaderState *record)
 		 * overwriting the database created in the target system.
 		 */
 	}
+	else if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_CREATE_WAL_LOG)
+	{
+		/*
+		 * New databases can be safely ignored. It won't be present in the
+		 * source system, so it will be deleted.
+		 */
+	}
 	else if (rmid == RM_DBASE_ID && rminfo == XLOG_DBASE_DROP)
 	{
 		/*
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 63bfdf1..fc7cbcd 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2780,13 +2780,15 @@ psql_completion(const char *text, int start, int end)
 	/* CREATE DATABASE */
 	else if (Matches("CREATE", "DATABASE", MatchAny))
 		COMPLETE_WITH("OWNER", "TEMPLATE", "ENCODING", "TABLESPACE",
-					  "IS_TEMPLATE",
+					  "IS_TEMPLATE", "STRATEGY",
 					  "ALLOW_CONNECTIONS", "CONNECTION LIMIT",
 					  "LC_COLLATE", "LC_CTYPE", "LOCALE", "OID",
 					  "LOCALE_PROVIDER", "ICU_LOCALE");
 
 	else if (Matches("CREATE", "DATABASE", MatchAny, "TEMPLATE"))
 		COMPLETE_WITH_QUERY(Query_for_list_of_template_databases);
+	else if (Matches("CREATE", "DATABASE", MatchAny, "STRATEGY"))
+		COMPLETE_WITH("WAL_LOG", "FILE_COPY");
 
 	/* CREATE DOMAIN */
 	else if (Matches("CREATE", "DOMAIN", MatchAny))
diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c
index 6f612ab..0bffa2f 100644
--- a/src/bin/scripts/createdb.c
+++ b/src/bin/scripts/createdb.c
@@ -34,6 +34,7 @@ main(int argc, char *argv[])
 		{"tablespace", required_argument, NULL, 'D'},
 		{"template", required_argument, NULL, 'T'},
 		{"encoding", required_argument, NULL, 'E'},
+		{"strategy", required_argument, NULL, 'S'},
 		{"lc-collate", required_argument, NULL, 1},
 		{"lc-ctype", required_argument, NULL, 2},
 		{"locale", required_argument, NULL, 'l'},
@@ -60,6 +61,7 @@ main(int argc, char *argv[])
 	char	   *tablespace = NULL;
 	char	   *template = NULL;
 	char	   *encoding = NULL;
+	char	   *strategy = NULL;
 	char	   *lc_collate = NULL;
 	char	   *lc_ctype = NULL;
 	char	   *locale = NULL;
@@ -77,7 +79,7 @@ main(int argc, char *argv[])
 
 	handle_help_version_opts(argc, argv, "createdb", help);
 
-	while ((c = getopt_long(argc, argv, "h:p:U:wWeO:D:T:E:l:", long_options, &optindex)) != -1)
+	while ((c = getopt_long(argc, argv, "h:p:U:wWeO:D:T:E:l:S:", long_options, &optindex)) != -1)
 	{
 		switch (c)
 		{
@@ -111,6 +113,9 @@ main(int argc, char *argv[])
 			case 'E':
 				encoding = pg_strdup(optarg);
 				break;
+			case 'S':
+				strategy = pg_strdup(optarg);
+				break;
 			case 1:
 				lc_collate = pg_strdup(optarg);
 				break;
@@ -215,6 +220,8 @@ main(int argc, char *argv[])
 		appendPQExpBufferStr(&sql, " ENCODING ");
 		appendStringLiteralConn(&sql, encoding, conn);
 	}
+	if (strategy)
+		appendPQExpBuffer(&sql, " STRATEGY %s", fmtId(strategy));
 	if (template)
 		appendPQExpBuffer(&sql, " TEMPLATE %s", fmtId(template));
 	if (lc_collate)
@@ -294,6 +301,7 @@ help(const char *progname)
 	printf(_("      --locale-provider={libc|icu}\n"
 			 "                               locale provider for the database's default collation\n"));
 	printf(_("  -O, --owner=OWNER            database user to own the new database\n"));
+	printf(_("  -S, --strategy=STRATEGY      database creation strategy wal_log or file_copy\n"));
 	printf(_("  -T, --template=TEMPLATE      template database to copy\n"));
 	printf(_("  -V, --version                output version information, then exit\n"));
 	printf(_("  -?, --help                   show this help, then exit\n"));
diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index 35deec9..14d3a95 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -104,4 +104,24 @@ $node->command_checks_all(
 	],
 	'createdb with incorrect --lc-ctype');
 
+$node->command_checks_all(
+	[ 'createdb', '--strategy', "foo", 'foobar2' ],
+	1,
+	[qr/^$/],
+	[
+		qr/^createdb: error: database creation failed: ERROR:  invalid create database strategy|^createdb: error: database creation failed: ERROR:  invalid create database strategy foo/s
+	],
+	'createdb with incorrect --strategy');
+
+# Check database creation strategy
+$node->issues_sql_like(
+	[ 'createdb', '-T', 'foobar2', 'foobar6', '-S', 'wal_log'],
+	qr/statement: CREATE DATABASE foobar6 STRATEGY wal_log TEMPLATE foobar2/,
+	'create database with WAL_LOG strategy');
+
+$node->issues_sql_like(
+	[ 'createdb', '-T', 'foobar2', 'foobar7', '-S', 'file_copy'],
+	qr/statement: CREATE DATABASE foobar7 STRATEGY file_copy TEMPLATE foobar2/,
+	'create database with FILE_COPY strategy');
+
 done_testing();
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 9ffc741..844a023 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -22,7 +22,9 @@
 /* GUC variables */
 extern int	wal_skip_threshold;
 
-extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
+extern SMgrRelation RelationCreateStorage(RelFileNode rnode,
+										  char relpersistence,
+										  bool register_delete);
 extern void RelationDropStorage(Relation rel);
 extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
 extern void RelationPreTruncate(Relation rel);
diff --git a/src/include/commands/dbcommands_xlog.h b/src/include/commands/dbcommands_xlog.h
index 593a857..0ee2452 100644
--- a/src/include/commands/dbcommands_xlog.h
+++ b/src/include/commands/dbcommands_xlog.h
@@ -18,17 +18,32 @@
 #include "lib/stringinfo.h"
 
 /* record types */
-#define XLOG_DBASE_CREATE		0x00
-#define XLOG_DBASE_DROP			0x10
+#define XLOG_DBASE_CREATE_FILE_COPY		0x00
+#define XLOG_DBASE_CREATE_WAL_LOG		0x10
+#define XLOG_DBASE_DROP					0x20
 
-typedef struct xl_dbase_create_rec
+/*
+ * Single WAL record for an entire CREATE DATABASE operation. This is used
+ * by the FILE_COPY strategy.
+ */
+typedef struct xl_dbase_create_file_copy_rec
 {
-	/* Records copying of a single subdirectory incl. contents */
 	Oid			db_id;
 	Oid			tablespace_id;
 	Oid			src_db_id;
 	Oid			src_tablespace_id;
-} xl_dbase_create_rec;
+} xl_dbase_create_file_copy_rec;
+
+/*
+ * WAL record for the beginning of a CREATE DATABASE operation, when the
+ * WAL_LOG strategy is used. Each individual block will be logged separately
+ * afterward.
+ */
+typedef struct xl_dbase_create_wal_log_rec
+{
+	Oid			db_id;
+	Oid			tablespace_id;
+} xl_dbase_create_wal_log_rec;
 
 typedef struct xl_dbase_drop_rec
 {
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index dd01841..a6b657f 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -184,7 +184,8 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BufferAccessStrategy strategy);
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 										ForkNumber forkNum, BlockNumber blockNum,
-										ReadBufferMode mode, BufferAccessStrategy strategy);
+										ReadBufferMode mode, BufferAccessStrategy strategy,
+										bool permanent);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
@@ -203,6 +204,9 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
 extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels);
+extern void CreateAndCopyRelationData(RelFileNode src_rnode,
+									  RelFileNode dst_rnode,
+									  bool permanent);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
 								   int nforks, BlockNumber *firstDelBlock);
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 49edbcc..be1d2c9 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -38,6 +38,7 @@ extern void RelationInitLockInfo(Relation relation);
 
 /* Lock a relation */
 extern void LockRelationOid(Oid relid, LOCKMODE lockmode);
+extern void LockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern bool ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode);
 extern void UnlockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern void UnlockRelationOid(Oid relid, LOCKMODE lockmode);
diff --git a/src/include/utils/relmapper.h b/src/include/utils/relmapper.h
index 9fbb5a7..f10353e 100644
--- a/src/include/utils/relmapper.h
+++ b/src/include/utils/relmapper.h
@@ -38,7 +38,9 @@ typedef struct xl_relmap_update
 extern Oid	RelationMapOidToFilenode(Oid relationId, bool shared);
 
 extern Oid	RelationMapFilenodeToOid(Oid relationId, bool shared);
-
+extern Oid RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId);
+extern void RelationMapCopy(Oid dbid, Oid tsid, char *srcdbpath,
+							char *dstdbpath);
 extern void RelationMapUpdateMap(Oid relationId, Oid fileNode, bool shared,
 								 bool immediate);
 
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 1c39ce0..d870c59 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -218,6 +218,7 @@ typedef enum
 	WAIT_EVENT_TWOPHASE_FILE_READ,
 	WAIT_EVENT_TWOPHASE_FILE_SYNC,
 	WAIT_EVENT_TWOPHASE_FILE_WRITE,
+	WAIT_EVENT_VERSION_FILE_WRITE,
 	WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ,
 	WAIT_EVENT_WAL_BOOTSTRAP_SYNC,
 	WAIT_EVENT_WAL_BOOTSTRAP_WRITE,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 85c808a..e0544b7 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -461,6 +461,8 @@ CoverPos
 CreateAmStmt
 CreateCastStmt
 CreateConversionStmt
+CreateDBRelInfo
+CreateDBStrategy
 CreateDomainStmt
 CreateEnumStmt
 CreateEventTrigStmt
@@ -3701,7 +3703,8 @@ xl_btree_update
 xl_btree_vacuum
 xl_clog_truncate
 xl_commit_ts_truncate
-xl_dbase_create_rec
+xl_dbase_create_file_copy_rec
+xl_dbase_create_wal_log_rec
 xl_dbase_drop_rec
 xl_end_of_recovery
 xl_hash_add_ovfl_page
-- 
1.8.3.1
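
As a reading aid for the patch above: RelationCopyStorageUsingBuffer() and CreateAndCopyRelationData() both decide whether to write WAL from the same three inputs, namely whether WAL is needed at all, whether the relation is permanent, and whether the fork being copied is the init fork. The following is a minimal standalone sketch of that rule, not code from the patch. `ForkKind` and `copy_needs_wal` are hypothetical names, and `wal_needed` stands in for the result of XLogIsNeeded().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for PostgreSQL's ForkNumber. */
typedef enum
{
	FORK_MAIN,
	FORK_FSM,
	FORK_VM,
	FORK_INIT
} ForkKind;

/*
 * Hypothetical helper mirroring the rule used in the patch:
 * WAL-log the copy of any fork of a permanent relation, but for an
 * unlogged relation WAL-log only the init fork.
 */
bool
copy_needs_wal(bool wal_needed, bool permanent, ForkKind fork)
{
	return wal_needed && (permanent || fork == FORK_INIT);
}
```

Under this rule the main fork of an unlogged relation is copied without WAL, while its init fork is still logged so that the relation can be reset after a crash.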

#200

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#199)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Mon, Mar 28, 2022 at 2:18 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have put similar logic in place for relmap_update WAL replay as well,

There was a mistake in the last patch: for the relmap update I had
also checked for a missing tablespace directory, but I should have
checked for a missing database directory, so I have fixed that.

Now, is it possible to get an FPI without an smgr_create WAL record in
other cases? If it is, then that problem is orthogonal to this patch,
but in any case I could not find any such scenario.

I have dug further into it and tried manually removing the directory
before XLOG_FPI, but I noticed that during FPI replay
XLogReadBufferExtended() also takes care of creating the missing files
using smgrcreate(), which in turn takes care of creating the missing
directory, so I don't think we have any problem here.

I don't understand whether XLOG_RELMAP_UPDATE should be just doing
smgrcreate() as we would for most WAL records or whether it should be
adopting the new system introduced by
49d9cfc68bf4e0d32a948fe72d5a0ef7f464944e. I wrote about this concern
over here:

/messages/by-id/CA+TgmoYcUPL+WOJL2ZzhH=zmrhj0iOQ=iCFM0SuYqBbqZEamEg@mail.gmail.com

But apart from that question your adaptations here look reasonable to me.
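
For readers following along, the directory check that the patch above adds to relmap_redo() and to the XLOG_DBASE_CREATE_WAL_LOG branch of dbase_redo() boils down to the test below. This is a minimal standalone sketch, not the patch itself: `dir_exists` and `should_skip_replay` are hypothetical names, and the real code additionally records the missing directory with XLogRememberMissingDir() and emits a WARNING before returning.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>

/* Hypothetical helper: true if 'path' names an existing directory. */
bool
dir_exists(const char *path)
{
	struct stat st;

	return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}

/*
 * Hypothetical helper: replay is skipped only while recovery has not yet
 * reached consistency and the directory the record needs is absent.
 * Once consistency is reached, the record is always replayed.
 */
bool
should_skip_replay(bool reached_consistency, const char *required_dir)
{
	if (reached_consistency)
		return false;

	return !dir_exists(required_dir);
}
```

For the database creation record the required directory is the parent tablespace directory; for the relmap record it is the database directory itself, which is the distinction corrected in the quoted text.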

--
Robert Haas
EDB: http://www.enterprisedb.com

#201

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Robert Haas (#200)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Tue, Mar 29, 2022 at 12:38 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Mar 28, 2022 at 2:18 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have added similar logic for relmap_update WAL replay as well,

There was a mistake in the last patch: for the relmap update I was
checking for a missing tablespace directory, but I should have been
checking for a missing database directory, so I have fixed that.

Now, is it possible to get the FPI without smgr_create wal in other
cases? If it is then that problem is orthogonal to this patch, but
anyway I could not find any such scenario.

I have dug further into it and tried manually removing the directory
before XLOG_FPI, but I noticed that for an FPI as well,
XLogReadBufferExtended() takes care of creating the missing files
using smgrcreate(), which in turn takes care of creating the missing
directory, so I don't think we have any problem here.

I don't understand whether XLOG_RELMAP_UPDATE should be just doing
smgrcreate()

XLOG_RELMAP_UPDATE applies to the complete database, so for which
relnode would it create the smgr? I think you probably meant
TablespaceCreateDbspace()?

as we would for most WAL records or whether it should be

adopting the new system introduced by
49d9cfc68bf4e0d32a948fe72d5a0ef7f464944e. I wrote about this concern
over here:

okay, thanks.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#202

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Robert Haas (#200)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Mon, Mar 28, 2022 at 3:08 PM Robert Haas <robertmhaas@gmail.com> wrote:

smgrcreate() as we would for most WAL records or whether it should be
adopting the new system introduced by
49d9cfc68bf4e0d32a948fe72d5a0ef7f464944e. I wrote about this concern
over here:

/messages/by-id/CA+TgmoYcUPL+WOJL2ZzhH=zmrhj0iOQ=iCFM0SuYqBbqZEamEg@mail.gmail.com

But apart from that question your adaptations here look reasonable to me.

That commit having been reverted, I committed v6 instead. Let's see
what breaks...

--
Robert Haas
EDB: http://www.enterprisedb.com

#203

Andres Freund

andres@anarazel.de

almost 4 years ago

In reply to: Robert Haas (#202)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On 2022-03-29 11:55:05 -0400, Robert Haas wrote:

That commit having been reverted, I committed v6 instead. Let's see
what breaks...

It fails in CI (for the mirror of the postgres repo on github):
https://cirrus-ci.com/task/6279465603956736?logs=test_bin#L121
tap test log: https://api.cirrus-ci.com/v1/artifact/task/6279465603956736/log/src/bin/scripts/tmp_check/log/regress_log_020_createdb
postmaster log: https://api.cirrus-ci.com/v1/artifact/task/6279465603956736/log/src/bin/scripts/tmp_check/log/020_createdb_main.log

recent versions failed similarly on cfbot:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/37/3192
https://cirrus-ci.com/task/5217140407009280?logs=test_bin#L121

# Running: createdb -T foobar2 foobar6 -S wal_log
createdb: error: too many command-line arguments (first is "wal_log")
Try "createdb --help" for more information.
not ok 31 - createdb -T foobar2 foobar6 -S wal_log exit code 0

# Failed test 'createdb -T foobar2 foobar6 -S wal_log exit code 0'
# at t/020_createdb.pl line 117.
not ok 32 - create database with WAL_LOG strategy: SQL found in server log

# Failed test 'create database with WAL_LOG strategy: SQL found in server log'
# at t/020_createdb.pl line 117.
# ''
# doesn't match '(?^:statement: CREATE DATABASE foobar6 STRATEGY wal_log TEMPLATE foobar2)'
# Running: createdb -T foobar2 foobar7 -S file_copy
createdb: error: too many command-line arguments (first is "file_copy")
Try "createdb --help" for more information.
not ok 33 - createdb -T foobar2 foobar7 -S file_copy exit code 0

# Failed test 'createdb -T foobar2 foobar7 -S file_copy exit code 0'
# at t/020_createdb.pl line 122.
not ok 34 - create database with FILE_COPY strategy: SQL found in server log

# Failed test 'create database with FILE_COPY strategy: SQL found in server log'
# at t/020_createdb.pl line 122.
# ''
# doesn't match '(?^:statement: CREATE DATABASE foobar7 STRATEGY file_copy TEMPLATE foobar2)'

Looks like there's some problem with commandline parsing?

Greetings,

Andres Freund

#204

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Andres Freund (#203)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Tue, Mar 29, 2022 at 1:35 PM Andres Freund <andres@anarazel.de> wrote:

# Running: createdb -T foobar2 foobar6 -S wal_log
createdb: error: too many command-line arguments (first is "wal_log")
Try "createdb --help" for more information.
not ok 31 - createdb -T foobar2 foobar6 -S wal_log exit code 0

Looks like there's some problem with commandline parsing?

Apparently getopt_long() is fussier on Windows. I have committed a fix.

--
Robert Haas
EDB: http://www.enterprisedb.com

#205

Tom Lane

tgl@sss.pgh.pa.us

almost 4 years ago

In reply to: Andres Freund (#203)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Andres Freund <andres@anarazel.de> writes:

Looks like there's some problem with commandline parsing?

That test script is expecting glibc-like laxness of switch
parsing. Put the switches before the non-switch arguments.
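
To make the failure mode concrete, here is a minimal sketch in C of
POSIX-style scanning, in which option processing stops at the first
non-option argument. It is a hand-written model, not the code in
src/port/getopt_long.c; the function name strict_first_operand and the
takes_arg parameter are hypothetical, and a long option is treated as a
single word for simplicity.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model of strict (POSIX-style) option scanning.
 *
 * takes_arg lists the single-letter switches that consume the next word.
 * Returns the index of the first leftover (non-option) argument, or argc
 * if every argument was consumed as a switch or a switch argument.
 */
static int
strict_first_operand(int argc, const char *const *argv, const char *takes_arg)
{
    int i = 1;

    while (i < argc)
    {
        const char *a = argv[i];

        if (strcmp(a, "--") == 0)
            return i + 1;       /* explicit end of options */
        if (a[0] != '-' || a[1] == '\0')
            return i;           /* first non-option: scanning stops here */

        /* A switch; skip its argument too if it takes a detached one. */
        if (strchr(takes_arg, a[1]) != NULL && a[2] == '\0')
            i += 2;
        else
            i += 1;
    }
    return argc;
}
```

With the argument order from the failing test (createdb -T foobar2
foobar6 -S wal_log) and "TS" as takes_arg, this model stops at index 3,
so foobar6, -S and wal_log are all left over as plain arguments, which
is consistent with the "too many command-line arguments" error. With
the switches first, only foobar6 is left over.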

regards, tom lane

#206

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Tom Lane (#205)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Tue, Mar 29, 2022 at 1:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andres Freund <andres@anarazel.de> writes:

Looks like there's some problem with commandline parsing?

That test script is expecting glibc-like laxness of switch
parsing. Put the switches before the non-switch arguments.

I just did that. :-)

--
Robert Haas
EDB: http://www.enterprisedb.com

#207

Tom Lane

tgl@sss.pgh.pa.us

almost 4 years ago

In reply to: Robert Haas (#206)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Mar 29, 2022 at 1:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

That test script is expecting glibc-like laxness of switch
parsing. Put the switches before the non-switch arguments.

I just did that. :-)

Yup, you pushed while I was typing.

FWIW, I don't think it's "Windows" enforcing this, it's our own
src/port/getopt[_long].c. If there were a well-defined spec
for what glibc does with such cases, it might be interesting to
try to make our version bug-compatible with theirs. But AFAIK
it's some random algorithm that they probably feel at liberty
to change.

regards, tom lane

#208

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Tom Lane (#207)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Tue, Mar 29, 2022 at 2:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Mar 29, 2022 at 1:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

That test script is expecting glibc-like laxness of switch
parsing. Put the switches before the non-switch arguments.

I just did that. :-)

Yup, you pushed while I was typing.

FWIW, I don't think it's "Windows" enforcing this, it's our own
src/port/getopt[_long].c. If there were a well-defined spec
for what glibc does with such cases, it might be interesting to
try to make our version bug-compatible with theirs. But AFAIK
it's some random algorithm that they probably feel at liberty
to change.

I guess that characterization surprises me. The man page for
getopt_long() says this, and has for a long time at least on systems
I've used:

ENVIRONMENT
     POSIXLY_CORRECT  If set, option processing stops when the first
                      non-option is found and a leading `-' or `+' in
                      the optstring is ignored.

And also this:

BUGS
The argv argument is not really const as its elements may be permuted
(unless POSIXLY_CORRECT is set).

Doesn't that make it pretty clear what the GNU version is doing?

--
Robert Haas
EDB: http://www.enterprisedb.com

#209

Tom Lane

tgl@sss.pgh.pa.us

almost 4 years ago

In reply to: Robert Haas (#208)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Mar 29, 2022 at 2:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

it's some random algorithm that they probably feel at liberty
to change.

I guess that characterization surprises me. The man page for
getopt_long() says this, and has for a long time at least on systems
I've used:

Yeah, they say they follow the POSIX spec when you set POSIXLY_CORRECT.
What they don't spell out in any detail is what they do when you don't.
We know that it involves rearranging the argv[] array behind the
application's back, but not what the rules are for doing that. In
particular, they must have some undocumented and probably not very safe
method for deciding which arguments are neither switches nor switch
arguments.

(Actually, if I recall previous discussions properly, another stumbling
block to doing anything here is that we'd also have to change all our
documentation to explain it. Fixing the command line synopses would
be a mess already, and explaining the rules would be worse.)

regards, tom lane

#210

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Tom Lane (#209)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Tue, Mar 29, 2022 at 2:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Mar 29, 2022 at 2:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

it's some random algorithm that they probably feel at liberty
to change.

I guess that characterization surprises me. The man page for
getopt_long() says this, and has for a long time at least on systems
I've used:

Yeah, they say they follow the POSIX spec when you set POSIXLY_CORRECT.
What they don't spell out in any detail is what they do when you don't.
We know that it involves rearranging the argv[] array behind the
application's back, but not what the rules are for doing that. In
particular, they must have some undocumented and probably not very safe
method for deciding which arguments are neither switches nor switch
arguments.

I mean, I think of an option as something that starts with '-'. The
documentation contains a caveat that says: "The special argument ‘--’
forces in all cases the end of option scanning." So I think I would
expect it just looks for arguments starting with '-' that do not
follow an argument that is exactly "--".

https://github.com/gcc-mirror/gcc/blob/master/libiberty/getopt.c

If an element of ARGV starts with '-', and is not exactly "-" or "--",
then it is an option element. The characters of this element
(aside from the initial '-') are option characters. If `getopt'
is called repeatedly, it returns successively each of the option characters
from each of the option elements.

OK - so I was off slightly. Either "-" or "--" terminates the options
list. Apart from that anything starting with "-" is an option.
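
As a minimal sketch, the rule in the quoted comment can be written as a
small predicate in C. The function name is_option_element is
hypothetical, and this covers only the classification step described in
that comment, not the permutation of argv or the handling of option
arguments.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: an element is an option element if it starts
 * with '-' and is not exactly "-" or "--".
 */
static bool
is_option_element(const char *arg)
{
    return arg[0] == '-' &&
        strcmp(arg, "-") != 0 &&
        strcmp(arg, "--") != 0;
}
```

Under this rule "-S" and "--strategy" are option elements, while "-",
"--" and "wal_log" are not. Deciding whether "wal_log" is the argument
of a preceding switch additionally requires the option string.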

I think you're overestimating the level of mystery that's present
here, as well as the likelihood that the rules could ever be changed.

--
Robert Haas
EDB: http://www.enterprisedb.com

#211

Andres Freund

andres@anarazel.de

almost 4 years ago

In reply to: Robert Haas (#202)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hi,

On 2022-03-29 11:55:05 -0400, Robert Haas wrote:

I committed v6 instead.

Just noticed that it makes initdb a bit slower / the cluster a bit bigger,
because now there's WAL traffic from creating the databases. There's an
optimization (albeit insufficient) to reduce WAL traffic in bootstrap mode,
but not for single user mode when the CREATE DATABASEs happen.

In an optimized build, with wal-segsize 1 (the most extreme case) using
FILE_COPY vs WAL_LOG:

perf stat ~/build/postgres/dev-optimize/install/bin/initdb /tmp/initdb/ --wal-segsize=1
WAL_LOG:

         487.58 msec task-clock        #  0.848 CPUs utilized
          2,874      context-switches  #  5.894 K/sec
              0      cpu-migrations    #  0.000 /sec
         10,209      page-faults       # 20.938 K/sec
  1,550,483,095      cycles            #  3.180 GHz
  2,537,618,094      instructions      #  1.64 insn per cycle
    492,780,121      branches          #  1.011 G/sec
      7,384,884      branch-misses     #  1.50% of all branches

    0.575213800 seconds time elapsed

    0.349812000 seconds user
    0.133225000 seconds sys

FILE_COPY:

         476.54 msec task-clock        #  0.854 CPUs utilized
          3,005      context-switches  #  6.306 K/sec
              0      cpu-migrations    #  0.000 /sec
         10,050      page-faults       # 21.090 K/sec
  1,516,058,200      cycles            #  3.181 GHz
  2,504,126,907      instructions      #  1.65 insn per cycle
    488,042,856      branches          #  1.024 G/sec
      7,327,364      branch-misses     #  1.50% of all branches

    0.557934976 seconds time elapsed

    0.360473000 seconds user
    0.112109000 seconds sys

the numbers are similar if repeated.

du -s /tmp/initdb/
WAL_LOG: 35112
FILE_COPY: 29288

So it seems we should specify a strategy in initdb? It kind of makes sense -
we're not going to read anything from those databases. And because of the
ringbuffer of 256kB, we'll not even reduce IO meaningfully.
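
For a sense of scale, the relative overhead in the figures above can be
worked out with a few lines of C. This is only arithmetic on the
numbers quoted in this message; the helper name pct_increase and the
constant names are hypothetical.

```c
/* Percentage by which value exceeds base. */
static double
pct_increase(double base, double value)
{
    return (value - base) / base * 100.0;
}

/* du -s totals and elapsed seconds, copied from the measurements above. */
static const double du_file_copy = 29288.0;
static const double du_wal_log = 35112.0;
static const double secs_file_copy = 0.557934976;
static const double secs_wal_log = 0.575213800;
```

pct_increase(du_file_copy, du_wal_log) comes to roughly 19.9 and
pct_increase(secs_file_copy, secs_wal_log) to roughly 3.1, so in this
single run WAL_LOG made the new cluster about 20% bigger and initdb
about 3% slower.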

- Andres

#212

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Andres Freund (#211)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Mar 30, 2022 at 6:47 AM Andres Freund <andres@anarazel.de> wrote:

du -s /tmp/initdb/
WAL_LOG: 35112
FILE_COPY: 29288

So it seems we should specify a strategy in initdb? It kind of makes sense -
we're not going to read anything from those databases. And because of the
ringbuffer of 256kB, we'll not even reduce IO meaningfully.

I think this makes sense. Do you mean that initdb should always use
file_copy, or do we want to give initdb a command-line option?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#213

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Robert Haas (#202)

1 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Tue, Mar 29, 2022 at 9:25 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Mar 28, 2022 at 3:08 PM Robert Haas <robertmhaas@gmail.com> wrote:

smgrcreate() as we would for most WAL records or whether it should be
adopting the new system introduced by
49d9cfc68bf4e0d32a948fe72d5a0ef7f464944e. I wrote about this concern
over here:

/messages/by-id/CA+TgmoYcUPL+WOJL2ZzhH=zmrhj0iOQ=iCFM0SuYqBbqZEamEg@mail.gmail.com

But apart from that question your adaptations here look reasonable to me.

That commit having been reverted, I committed v6 instead. Let's see
what breaks...

There was a duplicate error check for the invalid createdb strategy
option in the test case. It would not cause any problem, but it is
redundant, so I have removed it in the attached patch.
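
Why the second alternative was redundant can be shown with a small
sketch using POSIX regcomp()/regexec() in C. The helper name matches
and the shortened message strings are hypothetical stand-ins for the
real createdb output; the point is only that in a pattern of the form
"A|A foo", anything matched by the second alternative is already
matched by the first.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Shortened, hypothetical versions of the two patterns in the test. */
static const char *const old_pattern =
    "^invalid create database strategy|^invalid create database strategy foo";
static const char *const new_pattern =
    "^invalid create database strategy foo";

/* Return true if text matches the POSIX extended regex pattern. */
static bool
matches(const char *pattern, const char *text)
{
    regex_t re;
    bool    ok;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return false;           /* treat a bad pattern as "no match" */
    ok = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

In this model the old two-way pattern also accepts a message naming any
other strategy, whereas the single remaining alternative requires the
text "strategy foo".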

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

remove_duplicate_error_check.patch (text/x-patch; charset=US-ASCII)

diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index b81f06a..18f6e31 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -109,7 +109,7 @@ $node->command_checks_all(
 	1,
 	[qr/^$/],
 	[
-		qr/^createdb: error: database creation failed: ERROR:  invalid create database strategy|^createdb: error: database creation failed: ERROR:  invalid create database strategy foo/s
+		qr/^createdb: error: database creation failed: ERROR:  invalid create database strategy foo/s
 	],
 	'createdb with incorrect --strategy');

#214

Andres Freund

andres@anarazel.de

almost 4 years ago

In reply to: Dilip Kumar (#212)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hi,

On 2022-03-30 09:28:58 +0530, Dilip Kumar wrote:

On Wed, Mar 30, 2022 at 6:47 AM Andres Freund <andres@anarazel.de> wrote:

du -s /tmp/initdb/
WAL_LOG: 35112
FILE_COPY: 29288

So it seems we should specify a strategy in initdb? It kind of makes sense -
we're not going to read anything from those databases. And because of the
ringbuffer of 256kB, we'll not even reduce IO meaningfully.

I think this makes sense. Do you mean that initdb should always use
file_copy, or do we want to give initdb a command-line option?

Don't see a need for a commandline option / a situation where using WAL_LOG
would be preferable for initdb.

Greetings,

Andres Freund

#215

Andres Freund

andres@anarazel.de

almost 4 years ago

In reply to: Robert Haas (#202)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hi,

On 2022-03-29 11:55:05 -0400, Robert Haas wrote:

I committed v6 instead.

I was just discussing the WAL prefetching patch with Thomas. A question in
that discussion made me look at the coverage of REDO for CREATE DATABASE:
https://coverage.postgresql.org/src/backend/commands/dbcommands.c.gcov.html

Seems there's currently nothing hitting the REDO for
XLOG_DBASE_CREATE_FILE_COPY (currently line 3019). I think it'd be good to
keep coverage for that. How about adding a
CREATE DATABASE ... STRATEGY file_copy
to 001_stream_rep.pl?

Might be worth adding a test for ALTER DATABASE ... SET TABLESPACE at the same
time, this patch did affect that path in some minor ways. And, somewhat
shockingly, we don't have a single test for it.

- Andres

#216

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Andres Freund (#215)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Mar 31, 2022 at 5:07 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2022-03-29 11:55:05 -0400, Robert Haas wrote:

I committed v6 instead.

I was just discussing the WAL prefetching patch with Thomas. A question in
that discussion made me look at the coverage of REDO for CREATE DATABASE:
https://coverage.postgresql.org/src/backend/commands/dbcommands.c.gcov.html

Seems there's currently nothing hitting the REDO for
XLOG_DBASE_CREATE_FILE_COPY (currently line 3019). I think it'd be good to
keep coverage for that. How about adding a
CREATE DATABASE ... STRATEGY file_copy
to 001_stream_rep.pl?

Might be worth adding a test for ALTER DATABASE ... SET TABLESPACE at the same
time, this patch did affect that path in some minor ways. And, somewhat
shockingly, we don't have a single test for it.

I will add tests for both of these cases and send the patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#217

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#216)

2 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Mar 31, 2022 at 9:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Mar 31, 2022 at 5:07 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2022-03-29 11:55:05 -0400, Robert Haas wrote:

I committed v6 instead.

I was just discussing the WAL prefetching patch with Thomas. A question in
that discussion made me look at the coverage of REDO for CREATE DATABASE:
https://coverage.postgresql.org/src/backend/commands/dbcommands.c.gcov.html

Seems there's currently nothing hitting the REDO for
XLOG_DBASE_CREATE_FILE_COPY (currently line 3019). I think it'd be good to
keep coverage for that. How about adding a
CREATE DATABASE ... STRATEGY file_copy
to 001_stream_rep.pl?

Might be worth adding a test for ALTER DATABASE ... SET TABLESPACE at the same
time, this patch did affect that path in some minor ways. And, somewhat
shockingly, we don't have a single test for it.

I will add tests for both of these cases and send the patch.

0001 is changing the strategy to file copy during initdb and 0002
patch adds the test cases for both these cases.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0002-Create-database-test-coverage.patch (text/x-patch; charset=US-ASCII)

From d0759bcfc4fed674e938e4a03159f5953ca9718d Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Thu, 31 Mar 2022 12:07:19 +0530
Subject: [PATCH 2/2] Create database test coverage

Test create database strategy wal replay and alter database
set tablespace.
---
 src/test/modules/test_misc/t/002_tablespace.pl | 12 ++++++++++++
 src/test/recovery/t/001_stream_rep.pl          | 24 ++++++++++++++++++++++++
 2 files changed, 36 insertions(+)

diff --git a/src/test/modules/test_misc/t/002_tablespace.pl b/src/test/modules/test_misc/t/002_tablespace.pl
index 04e5439..f3bbddc 100644
--- a/src/test/modules/test_misc/t/002_tablespace.pl
+++ b/src/test/modules/test_misc/t/002_tablespace.pl
@@ -83,7 +83,19 @@ $result = $node->psql('postgres',
 	"ALTER TABLE t SET tablespace regress_ts1");
 ok($result == 0, 'move table in-place->abs');
 
+# Test ALTER DATABASE SET TABLESPACE
+$result = $node->psql('postgres',
+	"CREATE DATABASE testdb TABLESPACE regress_ts1");
+ok($result == 0, 'create database in tablespace 1');
+$result = $node->psql('testdb',
+	"CREATE TABLE t ()");
+ok($result == 0, 'create table in testdb database');
+$result = $node->psql('postgres',
+	"ALTER DATABASE testdb SET TABLESPACE regress_ts2");
+ok($result == 0, 'move database to tablespace 2');
+
 # Drop everything
+$result = $node->psql('postgres', "DROP DATABASE testdb");
 $result = $node->psql('postgres',
 	"DROP TABLE t");
 ok($result == 0, 'create table in tablespace 1');
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 583ee87..3f1dd59 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -78,6 +78,30 @@ $result = $node_standby_2->safe_psql('postgres', "SELECT * FROM seq1");
 print "standby 2: $result\n";
 is($result, qq(33|0|t), 'check streamed sequence content on standby 2');
 
+# Create database with different strategies and check its presence in standby
+$node_primary->safe_psql('postgres',
+	"CREATE DATABASE testdb1 STRATEGY = FILE_COPY; ");
+$node_primary->safe_psql('testdb1',
+	"CREATE TABLE tab_int AS SELECT generate_series(1,10) AS a");
+$node_primary->safe_psql('postgres',
+	"CREATE DATABASE testdb2 STRATEGY = WAL_LOG; ");
+$node_primary->safe_psql('testdb2',
+	"CREATE TABLE tab_int AS SELECT generate_series(1,10) AS a");
+
+# Wait for standbys to catch up
+$primary_lsn = $node_primary->lsn('write');
+$node_primary->wait_for_catchup($node_standby_1, 'replay', $primary_lsn);
+
+$result =
+  $node_standby_1->safe_psql('testdb1', "SELECT count(*) FROM tab_int");
+print "standby 1: $result\n";
+is($result, qq(10), 'check streamed content on standby 1');
+
+$result =
+  $node_standby_1->safe_psql('testdb2', "SELECT count(*) FROM tab_int");
+print "standby 1: $result\n";
+is($result, qq(10), 'check streamed content on standby 1');
+
 # Check that only READ-only queries can run on standbys
 is($node_standby_1->psql('postgres', 'INSERT INTO tab_int VALUES (1)'),
 	3, 'read-only queries on standby 1');
-- 
1.8.3.1

0001-Use-file_copy-strategy-during-initdb.patch (text/x-patch; charset=US-ASCII)

From 4a997e2a95074a520777cd2b369f9c728b360969 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Thu, 31 Mar 2022 10:43:16 +0530
Subject: [PATCH 1/2] Use file_copy strategy during initdb

Because skipping the checkpoint during initdb will not result
in significant savings, so there is no point in using wal_log
as that will simply increase the cluster size by generating
extra wal.
---
 src/bin/initdb/initdb.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 5e36943..1256082 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -1856,6 +1856,11 @@ make_template0(FILE *cmdfd)
 	 * it would fail. To avoid that, assign a fixed OID to template0 rather
 	 * than letting the server choose one.
 	 *
+	 * Using file_copy strategy is preferable over wal_log here because
+	 * skipping the checkpoint during initdb will not result in significant
+	 * savings, so there is no point in using wal_log as that will simply
+	 * increase the cluster size by generating extra wal.
+	 *
 	 * (Note that, while the user could have dropped and recreated these
 	 * objects in the old cluster, the problem scenario only exists if the OID
 	 * that is in use in the old cluster is also used in the new cluster - and
@@ -1863,7 +1868,7 @@ make_template0(FILE *cmdfd)
 	 */
 	static const char *const template0_setup[] = {
 		"CREATE DATABASE template0 IS_TEMPLATE = true ALLOW_CONNECTIONS = false OID = "
-		CppAsString2(Template0ObjectId) ";\n\n",
+		CppAsString2(Template0ObjectId) " STRATEGY = file_copy;\n\n",
 
 		/*
 		 * template0 shouldn't have any collation-dependent objects, so unset
@@ -1899,7 +1904,10 @@ make_template0(FILE *cmdfd)
 }
 
 /*
- * copy template1 to postgres
+ * copy template1 to postgres.
+ *
+ * Use file_copy for creating the database; the reason for this is explained in
+ * comments atop template0_setup.
  */
 static void
 make_postgres(FILE *cmdfd)
@@ -1908,7 +1916,7 @@ make_postgres(FILE *cmdfd)
 
 	/* Assign a fixed OID to postgres, for the same reasons as template0 */
 	static const char *const postgres_setup[] = {
-		"CREATE DATABASE postgres OID = " CppAsString2(PostgresObjectId) ";\n\n",
+		"CREATE DATABASE postgres OID = " CppAsString2(PostgresObjectId) " STRATEGY = FILE_COPY;\n\n",
 		"COMMENT ON DATABASE postgres IS 'default administrative connection database';\n\n",
 		NULL
 	};
-- 
1.8.3.1

#218

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Dilip Kumar (#217)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Mar 31, 2022 at 3:52 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

0001 is changing the strategy to file copy during initdb and 0002
patch adds the test cases for both these cases.

IMHO, 0001 looks fine, except for needing some adjustments to the wording.

I'm less sure about 0002. It's testing the stuff Andres mentioned, but
I'm not sure how good the tests are.

Andres, thoughts? Do you want me to polish and commit 0001?

--
Robert Haas
EDB: http://www.enterprisedb.com

#219

Andres Freund

andres@anarazel.de

almost 4 years ago

In reply to: Dilip Kumar (#217)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hi,

On 2022-03-31 13:22:24 +0530, Dilip Kumar wrote:

0001 is changing the strategy to file copy during initdb and 0002
patch adds the test cases for both these cases.

Thanks!

From 4a997e2a95074a520777cd2b369f9c728b360969 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Thu, 31 Mar 2022 10:43:16 +0530
Subject: [PATCH 1/2] Use file_copy strategy during initdb

Because skipping the checkpoint during initdb will not result
in significant savings, so there is no point in using wal_log
as that will simply increase the cluster size by generating
extra wal.
---
src/bin/initdb/initdb.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 5e36943..1256082 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -1856,6 +1856,11 @@ make_template0(FILE *cmdfd)
* it would fail. To avoid that, assign a fixed OID to template0 rather
* than letting the server choose one.
*
+	 * Using file_copy strategy is preferable over wal_log here because
+	 * skipping the checkpoint during initdb will not result in significant
+	 * savings, so there is no point in using wal_log as that will simply
+	 * increase the cluster size by generating extra wal.

It's not just the increase in size, it's also the increase in time due to WAL logging.

* (Note that, while the user could have dropped and recreated these
* objects in the old cluster, the problem scenario only exists if the OID
* that is in use in the old cluster is also used in the new cluster - and
@@ -1863,7 +1868,7 @@ make_template0(FILE *cmdfd)
*/
static const char *const template0_setup[] = {
"CREATE DATABASE template0 IS_TEMPLATE = true ALLOW_CONNECTIONS = false OID = "
-		CppAsString2(Template0ObjectId) ";\n\n",
+		CppAsString2(Template0ObjectId) " STRATEGY = file_copy;\n\n",

I'd perhaps break this into a separate line, but...

From d0759bcfc4fed674e938e4a03159f5953ca9718d Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Thu, 31 Mar 2022 12:07:19 +0530
Subject: [PATCH 2/2] Create database test coverage

Test create database strategy wal replay and alter database
set tablespace.
---
src/test/modules/test_misc/t/002_tablespace.pl | 12 ++++++++++++
src/test/recovery/t/001_stream_rep.pl | 24 ++++++++++++++++++++++++
2 files changed, 36 insertions(+)
diff --git a/src/test/modules/test_misc/t/002_tablespace.pl b/src/test/modules/test_misc/t/002_tablespace.pl
index 04e5439..f3bbddc 100644
--- a/src/test/modules/test_misc/t/002_tablespace.pl
+++ b/src/test/modules/test_misc/t/002_tablespace.pl
@@ -83,7 +83,19 @@ $result = $node->psql('postgres',
"ALTER TABLE t SET tablespace regress_ts1");
ok($result == 0, 'move table in-place->abs');
+# Test ALTER DATABASE SET TABLESPACE
+$result = $node->psql('postgres',
+	"CREATE DATABASE testdb TABLESPACE regress_ts1");
+ok($result == 0, 'create database in tablespace 1');
+$result = $node->psql('testdb',
+	"CREATE TABLE t ()");
+ok($result == 0, 'create table in testdb database');
+$result = $node->psql('postgres',
+	"ALTER DATABASE testdb SET TABLESPACE regress_ts2");
+ok($result == 0, 'move database to tablespace 2');

This just tests that the command doesn't fail, not whether it actually did
something useful. Seems we should at least insert a row or two into the
table, and verify they can be accessed?

+# Create database with different strategies and check its presence in standby
+$node_primary->safe_psql('postgres',
+	"CREATE DATABASE testdb1 STRATEGY = FILE_COPY; ");
+$node_primary->safe_psql('testdb1',
+	"CREATE TABLE tab_int AS SELECT generate_series(1,10) AS a");
+$node_primary->safe_psql('postgres',
+	"CREATE DATABASE testdb2 STRATEGY = WAL_LOG; ");
+$node_primary->safe_psql('testdb2',
+	"CREATE TABLE tab_int AS SELECT generate_series(1,10) AS a");
+
+# Wait for standbys to catch up
+$primary_lsn = $node_primary->lsn('write');
+$node_primary->wait_for_catchup($node_standby_1, 'replay', $primary_lsn);
+
+$result =
+  $node_standby_1->safe_psql('testdb1', "SELECT count(*) FROM tab_int");
+print "standby 1: $result\n";
+is($result, qq(10), 'check streamed content on standby 1');
+
+$result =
+  $node_standby_1->safe_psql('testdb2', "SELECT count(*) FROM tab_int");
+print "standby 1: $result\n";
+is($result, qq(10), 'check streamed content on standby 1');
+
# Check that only READ-only queries can run on standbys
is($node_standby_1->psql('postgres', 'INSERT INTO tab_int VALUES (1)'),
3, 'read-only queries on standby 1');

I'd probably add a function for creating database / table and then testing it,
with a strategy parameter. That way we can afterwards add more tests verifying
that everything worked.

Greetings,

Andres Freund

#220

Andres Freund

andres@anarazel.de

almost 4 years ago

In reply to: Robert Haas (#218)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hi,

On 2022-03-31 10:05:10 -0400, Robert Haas wrote:

On Thu, Mar 31, 2022 at 3:52 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

0001 is changing the strategy to file copy during initdb and 0002
patch adds the test cases for both these cases.

IMHO, 0001 looks fine, except for needing some adjustments to the wording.

Agreed.

I'm less sure about 0002. It's testing the stuff Andres mentioned, but
I'm not sure how good the tests are.

I came to a similar conclusion. It's still better than nothing, but it's just
a small bit of additional effort to do some basic testing that e.g. the move
actually worked...

Andres, thoughts? Do you want me to polish and commit 0001?

Yes please!

FWIW, once the freeze is done I'm planning to set up scripting to see which
parts of the code we whacked around don't have test coverage...

Greetings,

Andres Freund

#221

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Andres Freund (#220)

1 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Mar 31, 2022 at 12:25 PM Andres Freund <andres@anarazel.de> wrote:

Andres, thoughts? Do you want me to polish and commit 0001?

Yes please!

Here is a polished version. Comments?

FWIW, once the freeze is done I'm planning to set up scripting to see which
parts of the code we whacked around don't have test coverage...

Sounds terrifying.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

0001-initdb-When-running-CREATE-DATABASE-use-STRATEGY-WAL.patch (application/octet-stream)

From 2dcffc2f24c93c6d06b1e39a9d13d5bcddcc49cf Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 31 Mar 2022 14:29:05 -0400
Subject: [PATCH] initdb: When running CREATE DATABASE, use STRATEGY =
 WAL_COPY.

Dilip Kumar, reviewed by Andres Freund and by me.

Discussion: http://postgr.es/m/20220330011757.wr544o5y5my7ssoa@alap3.anarazel.de
---
 src/bin/initdb/initdb.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 5e36943ef3..9dd4a8de9a 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -1860,10 +1860,15 @@ make_template0(FILE *cmdfd)
 	 * objects in the old cluster, the problem scenario only exists if the OID
 	 * that is in use in the old cluster is also used in the new cluster - and
 	 * the new cluster should be the result of a fresh initdb.)
+	 *
+	 * We use "STRATEGY = file_copy" here because checkpoints during initdb
+	 * are cheap. "STRATEGY = wal_log" would generate more WAL, which would
+	 * be a little bit slower and make the new cluster a little bit bigger.
 	 */
 	static const char *const template0_setup[] = {
 		"CREATE DATABASE template0 IS_TEMPLATE = true ALLOW_CONNECTIONS = false OID = "
-		CppAsString2(Template0ObjectId) ";\n\n",
+		CppAsString2(Template0ObjectId)
+		" STRATEGY = file_copy;\n\n",
 
 		/*
 		 * template0 shouldn't have any collation-dependent objects, so unset
@@ -1906,9 +1911,12 @@ make_postgres(FILE *cmdfd)
 {
 	const char *const *line;
 
-	/* Assign a fixed OID to postgres, for the same reasons as template0 */
+	/*
+	 * Just as we did for template0, and for the same reasons, assign a fixed
+	 * OID to postgres and select the file_copy strategy.
+	 */
 	static const char *const postgres_setup[] = {
-		"CREATE DATABASE postgres OID = " CppAsString2(PostgresObjectId) ";\n\n",
+		"CREATE DATABASE postgres OID = " CppAsString2(PostgresObjectId) " STRATEGY = file_copy;\n\n",
 		"COMMENT ON DATABASE postgres IS 'default administrative connection database';\n\n",
 		NULL
 	};
-- 
2.24.3 (Apple Git-128)

#222

Andres Freund

andres@anarazel.de

almost 4 years ago

In reply to: Robert Haas (#221)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On 2022-03-31 14:31:43 -0400, Robert Haas wrote:

On Thu, Mar 31, 2022 at 12:25 PM Andres Freund <andres@anarazel.de> wrote:

Andres, thoughts? Do you want me to polish and commit 0001?

Yes please!

Here is a polished version. Comments?

LGTM.

#223

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Andres Freund (#222)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Mar 31, 2022 at 2:44 PM Andres Freund <andres@anarazel.de> wrote:

On 2022-03-31 14:31:43 -0400, Robert Haas wrote:

On Thu, Mar 31, 2022 at 12:25 PM Andres Freund <andres@anarazel.de> wrote:

Andres, thoughts? Do you want me to polish and commit 0001?

Yes please!

Here is a polished version. Comments?

LGTM.

Committed.

--
Robert Haas
EDB: http://www.enterprisedb.com

#224

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Andres Freund (#219)

1 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Mar 31, 2022 at 9:52 PM Andres Freund <andres@anarazel.de> wrote:

+     "ALTER DATABASE testdb SET TABLESPACE regress_ts2");
+ok($result == 0, 'move database to tablespace 2');
This just tests that the command doesn't fail, not whether it actually did
something useful. It seems we should at least insert a row or two into the
table and verify they can be accessed?

Now, added some tuples and verified them.

# Check that only READ-only queries can run on standbys
is($node_standby_1->psql('postgres', 'INSERT INTO tab_int VALUES (1)'),
3, 'read-only queries on standby 1');

I'd probably add a function for creating database / table and then testing it,
with a strategy parameter. That way we can afterwards add more tests verifying
that everything worked.

I have created a function to create a database and table and verify
the content in it. Another option is we can just keep the database
and table creation inside the function and the verification part
outside it so that if some future test case wants to create some extra
content and verify it then they can do so. But with the current
tests in mind the way I got it in the attached patch has less
duplicate code so I preferred it this way.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v2-0001-Create-database-test-coverage.patch (text/x-patch; charset=US-ASCII)

From a672a27a9e502331a7c4ca2a16f3f660c8ed3fcd Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 1 Apr 2022 13:15:57 +0530
Subject: [PATCH v2] Create database test coverage

Test create database strategy wal replay and alter database
set tablespace.
---
 src/test/modules/test_misc/t/002_tablespace.pl | 14 ++++++++++++++
 src/test/recovery/t/001_stream_rep.pl          | 25 +++++++++++++++++++++++++
 2 files changed, 39 insertions(+)

diff --git a/src/test/modules/test_misc/t/002_tablespace.pl b/src/test/modules/test_misc/t/002_tablespace.pl
index 04e5439..6ac6f79 100644
--- a/src/test/modules/test_misc/t/002_tablespace.pl
+++ b/src/test/modules/test_misc/t/002_tablespace.pl
@@ -83,7 +83,21 @@ $result = $node->psql('postgres',
 	"ALTER TABLE t SET tablespace regress_ts1");
 ok($result == 0, 'move table in-place->abs');
 
+# Test ALTER DATABASE SET TABLESPACE
+$result = $node->psql('postgres',
+	"CREATE DATABASE testdb TABLESPACE regress_ts1");
+ok($result == 0, 'create database in tablespace 1');
+$result = $node->psql('testdb',
+	"CREATE TABLE tab_int AS SELECT generate_series(1,10) AS a");
+ok($result == 0, 'create table in testdb database');
+$result = $node->psql('postgres',
+	"ALTER DATABASE testdb SET TABLESPACE regress_ts2");
+ok($result == 0, 'move database to tablespace 2');
+$result = $node->safe_psql('testdb', "SELECT count(*) FROM tab_int");
+is($result, qq(10), 'check contents after moving to a different tablespace');
+
 # Drop everything
+$result = $node->psql('postgres', "DROP DATABASE testdb");
 $result = $node->psql('postgres',
 	"DROP TABLE t");
 ok($result == 0, 'create table in tablespace 1');
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 583ee87..cf4041a 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -78,6 +78,31 @@ $result = $node_standby_2->safe_psql('postgres', "SELECT * FROM seq1");
 print "standby 2: $result\n";
 is($result, qq(33|0|t), 'check streamed sequence content on standby 2');
 
+# Create database with some contents for a given strategy and check on standby
+sub test_createdb_strategy_and_check
+{
+	my $node1       = shift;
+	my $node2       = shift;
+	my $dbname      = shift;
+	my $strategy    = shift;
+
+	$node1->safe_psql('postgres',
+		"CREATE DATABASE $dbname STRATEGY = $strategy; ");
+	$node1->safe_psql($dbname,
+		"CREATE TABLE tab_int AS SELECT generate_series(1,10) AS a");
+	my $lsn = $node1->lsn('write');
+	$node1->wait_for_catchup($node2, 'replay', $lsn);
+
+	my $result = $node2->safe_psql($dbname, "SELECT count(*) FROM tab_int");
+	is($result, qq(10), 'check streamed content on standby');
+}
+# Test replication of file_copy strategy
+test_createdb_strategy_and_check($node_primary, $node_standby_1, "testdb1",
+	"file_copy");
+# Test replication of wal_log strategy
+test_createdb_strategy_and_check($node_primary, $node_standby_1, "testdb2",
+	"wal_log");
+
 # Check that only READ-only queries can run on standbys
 is($node_standby_1->psql('postgres', 'INSERT INTO tab_int VALUES (1)'),
 	3, 'read-only queries on standby 1');
-- 
1.8.3.1

#225

Andres Freund

andres@anarazel.de

almost 4 years ago

In reply to: Robert Haas (#202)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hi,

On 2022-03-29 11:55:05 -0400, Robert Haas wrote:

I committed v6 instead.

Coverity complains that this patch added GetDatabasePath() calls without
freeing its return value. Normally that'd be easy to dismiss, due to memory
contexts, but there are no granular resets in CreateDatabaseUsingFileCopy(). And
obviously there can be a lot of relations in one database - we shouldn't hold
onto the same path over and over again.

The case in recovery is worse, because there we don't have a memory context to
reset afaics. Oddly enough, it sure looks like we have an existing version of
this bug in the file-copy path?
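
A minimal sketch of the pattern being discussed, written as standalone C rather than PostgreSQL code. `make_db_path()` and `free_db_path()` are hypothetical stand-ins for a function that returns a freshly allocated path and for `pfree()`, and the path format is invented for illustration. The point is only that a loop which allocates a path per iteration should release it before the next iteration when nothing else resets the memory in between.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Number of paths allocated and not yet freed (illustration only). */
static int live_paths = 0;

/*
 * Hypothetical stand-in for a path-building helper: returns a newly
 * allocated string that the caller owns. The format is made up.
 */
static char *
make_db_path(unsigned db_oid, unsigned ts_oid)
{
    size_t len = 64;
    char  *path = malloc(len);

    if (path == NULL)
        abort();
    snprintf(path, len, "ts_%u/db_%u", ts_oid, db_oid);
    live_paths++;
    return path;
}

/* Hypothetical stand-in for pfree() that also updates the counter. */
static void
free_db_path(char *path)
{
    free(path);
    live_paths--;
}

/*
 * Visit n locations, releasing both paths at the end of every iteration.
 * Returns the number of paths still allocated when the loop finishes.
 */
static int
copy_all_locations(unsigned src_db, unsigned dst_db, int n)
{
    for (int i = 0; i < n; i++)
    {
        char *srcpath = make_db_path(src_db, 1000u + (unsigned) i);
        char *dstpath = make_db_path(dst_db, 1000u + (unsigned) i);

        /* ... real code would copy from srcpath to dstpath here ... */

        free_db_path(srcpath);
        free_db_path(dstpath);
    }
    return live_paths;
}
```

Without the two `free_db_path()` calls, the counter would grow by two per iteration, which is the kind of accumulation the Coverity report points at.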

Greetings,

Andres Freund

#226

Dilip Kumar

dilipbalaut@gmail.com

almost 4 years ago

In reply to: Andres Freund (#225)

1 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Sun, Apr 3, 2022 at 9:52 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2022-03-29 11:55:05 -0400, Robert Haas wrote:

I committed v6 instead.

Coverity complains that this patch added GetDatabasePath() calls without
freeing its return value. Normally that'd be easy to dismiss, due to memory
contexts, but there are no granular resets in CreateDatabaseUsingFileCopy(). And
obviously there can be a lot of relations in one database - we shouldn't hold
onto the same path over and over again.

The case in recovery is worse, because there we don't have a memory context to
reset afaics. Oddly enough, it sure looks like we have an existing version of
this bug in the file-copy path?

Yeah, I see that createdb() and dbase_redo() already had this
problem, and this patch added a few more such occurrences.
The attached patch fixes it.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

fix_memory_leak.patch (text/x-patch; charset=US-ASCII)

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index df16533..ff81c48 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -218,6 +218,8 @@ CreateDatabaseUsingWalLog(Oid src_dboid, Oid dst_dboid,
 	}
 
 	list_free_deep(rnodelist);
+	pfree(srcpath);
+	pfree(dstpath);
 }
 
 /*
@@ -628,6 +630,9 @@ CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dst_dboid, Oid src_tsid,
 			(void) XLogInsert(RM_DBASE_ID,
 							  XLOG_DBASE_CREATE_FILE_COPY | XLR_SPECIAL_REL_UPDATE);
 		}
+
+		pfree(srcpath);
+		pfree(dstpath);
 	}
 	table_endscan(scan);
 	table_close(rel, AccessShareLock);
@@ -3051,6 +3056,8 @@ dbase_redo(XLogReaderState *record)
 		 * We don't need to copy subdirectories
 		 */
 		copydir(src_path, dst_path, false);
+		pfree(src_path);
+		pfree(dst_path);
 	}
 	else if (info == XLOG_DBASE_CREATE_WAL_LOG)
 	{
@@ -3063,6 +3070,7 @@ dbase_redo(XLogReaderState *record)
 		/* Create the database directory with the version file. */
 		CreateDirAndVersionFile(dbpath, xlrec->db_id, xlrec->tablespace_id,
 								true);
+		pfree(dbpath);
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{

#227

Justin Pryzby

pryzby@telsasoft.com

over 3 years ago

In reply to: Robert Haas (#202)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Tue, Mar 29, 2022 at 11:55:05AM -0400, Robert Haas wrote:

On Mon, Mar 28, 2022 at 3:08 PM Robert Haas <robertmhaas@gmail.com> wrote:

smgrcreate() as we would for most WAL records or whether it should be
adopting the new system introduced by
49d9cfc68bf4e0d32a948fe72d5a0ef7f464944e. I wrote about this concern
over here:

/messages/by-id/CA+TgmoYcUPL+WOJL2ZzhH=zmrhj0iOQ=iCFM0SuYqBbqZEamEg@mail.gmail.com

But apart from that question your adaptations here look reasonable to me.

That commit having been reverted, I committed v6 instead. Let's see
what breaks...

There's a crash

2022-07-31 01:22:51.437 CDT client backend[13362] [unknown] PANIC: could not open critical system index 2662

(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007efe27999801 in __GI_abort () at abort.c:79
#2 0x00005583891941dc in errfinish (filename=<optimized out>, filename@entry=0x558389420437 "relcache.c", lineno=lineno@entry=4328, funcname=funcname@entry=0x558389421680 <__func__.33178> "load_critical_index") at elog.c:675
#3 0x00005583891713ef in load_critical_index (indexoid=indexoid@entry=2662, heapoid=heapoid@entry=1259) at relcache.c:4328
#4 0x0000558389172667 in RelationCacheInitializePhase3 () at relcache.c:4103
#5 0x00005583891b93a4 in InitPostgres (in_dbname=in_dbname@entry=0x55838a50d468 "a", dboid=dboid@entry=0, username=username@entry=0x55838a50d448 "pryzbyj", useroid=useroid@entry=0, load_session_libraries=<optimized out>, override_allow_connections=override_allow_connections@entry=false, out_dbname=0x0) at postinit.c:1087
#6 0x0000558388daa7bb in PostgresMain (dbname=0x55838a50d468 "a", username=username@entry=0x55838a50d448 "pryzbyj") at postgres.c:4081
#7 0x0000558388b9f423 in BackendRun (port=port@entry=0x55838a505dd0) at postmaster.c:4490
#8 0x0000558388ba6e07 in BackendStartup (port=port@entry=0x55838a505dd0) at postmaster.c:4218
#9 0x0000558388ba747f in ServerLoop () at postmaster.c:1808
#10 0x0000558388ba8f93 in PostmasterMain (argc=7, argv=<optimized out>) at postmaster.c:1480
#11 0x0000558388840e1f in main (argc=7, argv=0x55838a4dc000) at main.c:197

while :; do psql -qh /tmp postgres -c "DROP DATABASE a" -c "CREATE DATABASE a TEMPLATE postgres STRATEGY wal_log"; done
# Run this for a few loops and then ^C or hold down ^C until it stops,
# and then connect to postgres and try to connect to 'a':
postgres=# \c a
2022-07-31 01:22:51.437 CDT client backend[13362] [unknown] PANIC: could not open critical system index 2662

Unfortunately, that isn't very consistent, and you have to run it a bunch
of times...

I don't know if it's an issue of any significance that CREATE DATABASE / ^C
leaves behind a broken database, but it is an issue that the cluster crashes.

While struggling to reproduce that problem, I also hit this warning, which may
or may not be the same. I added an abort() after WARNING in aset.c to get a
backtrace.

WARNING: problem in alloc set PortalContext: bogus aset link in block 0x55a63f2f9d60, chunk 0x55a63f2fb138

Program terminated with signal SIGABRT, Aborted.
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007f81144f1801 in __GI_abort () at abort.c:79
#2 0x000055a63c834c5d in AllocSetCheck (context=context@entry=0x55a63f26fea0) at aset.c:1491
#3 0x000055a63c835b09 in AllocSetDelete (context=0x55a63f26fea0) at aset.c:638
#4 0x000055a63c854322 in MemoryContextDelete (context=0x55a63f26fea0) at mcxt.c:252
#5 0x000055a63c8591d5 in PortalDrop (portal=portal@entry=0x55a63f2bb7a0, isTopCommit=isTopCommit@entry=false) at portalmem.c:596
#6 0x000055a63c3e4a7b in exec_simple_query (query_string=query_string@entry=0x55a63f24db90 "CREATE DATABASE a TEMPLATE postgres STRATEGY wal_log ;") at postgres.c:1253
#7 0x000055a63c3e7fc1 in PostgresMain (dbname=<optimized out>, username=username@entry=0x55a63f279448 "pryzbyj") at postgres.c:4505
#8 0x000055a63c1dc423 in BackendRun (port=port@entry=0x55a63f271dd0) at postmaster.c:4490
#9 0x000055a63c1e3e07 in BackendStartup (port=port@entry=0x55a63f271dd0) at postmaster.c:4218
#10 0x000055a63c1e447f in ServerLoop () at postmaster.c:1808
#11 0x000055a63c1e5f93 in PostmasterMain (argc=7, argv=<optimized out>) at postmaster.c:1480
#12 0x000055a63be7de1f in main (argc=7, argv=0x55a63f248000) at main.c:197

I reproduced that by running this a couple dozen times in an interactive psql.
It doesn't seem to affect STRATEGY=file_copy.

SET statement_timeout=0; DROP DATABASE a; SET statement_timeout='60ms'; CREATE DATABASE a TEMPLATE postgres STRATEGY wal_log ; \c a \c postgres

Also, if I understand correctly, this patch seems to assume that nobody is
connected to the source database. But what's actually enforced is just that
nobody *else* is connected. Is it any issue that the current DB can be used as
a source? Anyway, both of the above problems are reproducible using a
different database.

|postgres=# CREATE DATABASE new TEMPLATE postgres STRATEGY wal_log;
|CREATE DATABASE

--
Justin

#228

Robert Haas

robertmhaas@gmail.com

over 3 years ago

In reply to: Justin Pryzby (#227)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Tue, Aug 2, 2022 at 1:50 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

Unfortunately, that isn't very consistent, and you have to run it a bunch
of times...

I was eventually able to reproduce this in part by using the
interactive psql method you describe. It didn't crash, but it did spit
out a bunch of funny error messages:

postgres=# SET statement_timeout=0; DROP DATABASE a; SET statement_timeout='60ms'; CREATE DATABASE a TEMPLATE postgres STRATEGY wal_log ; \c a \c postgres
SET
ERROR: database "a" does not exist
SET
ERROR: canceling statement due to statement timeout
WARNING: problem in alloc set PortalContext: req size > alloc size for chunk 0x7f99508911f0 in block 0x7f9950890800
WARNING: problem in alloc set PortalContext: bad size 0 for chunk 0x7f99508911f0 in block 0x7f9950890800
WARNING: problem in alloc set PortalContext: bad single-chunk 0x7f9950891208 in block 0x7f9950890800
WARNING: problem in alloc set PortalContext: found inconsistent memory block 0x7f9950890800
WARNING: problem in alloc set PortalContext: req size > alloc size for chunk 0x7f99508911f0 in block 0x7f9950890800
WARNING: problem in alloc set PortalContext: bad size 0 for chunk 0x7f99508911f0 in block 0x7f9950890800
WARNING: problem in alloc set PortalContext: bad single-chunk 0x7f9950891208 in block 0x7f9950890800
WARNING: problem in alloc set PortalContext: found inconsistent memory block 0x7f9950890800
connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL: database "a" does not exist
Previous connection kept
postgres=# select * from pg_database;
 oid |  datname  | datdba | encoding | datlocprovider | datistemplate | datallowconn | datconnlimit | datfrozenxid | datminmxid | dattablespace | datcollate  |  datctype   | daticulocale | datcollversion |           datacl
-----+-----------+--------+----------+----------------+---------------+--------------+--------------+--------------+------------+---------------+-------------+-------------+--------------+----------------+----------------------------
   5 | postgres  |     10 |        6 | c              | f             | t            |           -1 |          718 |          1 |          1663 | en_US.UTF-8 | en_US.UTF-8 |              |                |
   1 | template1 |     10 |        6 | c              | t             | t            |           -1 |          718 |          1 |          1663 | en_US.UTF-8 | en_US.UTF-8 |              |                | {=c/rhaas,rhaas=CTc/rhaas}
   4 | template0 |     10 |        6 | c              | t             | f            |           -1 |          718 |          1 |          1663 | en_US.UTF-8 | en_US.UTF-8 |              |                | {=c/rhaas,rhaas=CTc/rhaas}
(3 rows)

I then set backtrace_functions='AllocSetCheck' and reproduced it
again, which led to stack traces like this:

2022-08-02 16:50:32.490 EDT [98814] WARNING: problem in alloc set PortalContext: bad single-chunk 0x7f9950886608 in block 0x7f9950885c00
2022-08-02 16:50:32.490 EDT [98814] BACKTRACE:
2 postgres 0x000000010cd37ef5 AllocSetCheck + 549
3 postgres 0x000000010cd37730 AllocSetReset + 48
4 postgres 0x000000010cd3f6f1 MemoryContextResetOnly + 81
5 postgres 0x000000010cd378b9 AllocSetDelete + 73
6 postgres 0x000000010cd41e09 PortalDrop + 425
7 postgres 0x000000010cd427bb AtCleanup_Portals + 203
8 postgres 0x000000010c86476d CleanupTransaction + 29
9 postgres 0x000000010c865d4f AbortCurrentTransaction + 63
10 postgres 0x000000010cba1395 PostgresMain + 885
11 postgres 0x000000010caf5472 PostmasterMain + 7586
12 postgres 0x000000010ca31e3d main + 2205
13 libdyld.dylib 0x00007fff699afcc9 start + 1
14 ??? 0x0000000000000001 0x0 + 1

I recompiled with -O0 and hacked the code that emits the BACKTRACE:
bit to go into an infinite loop if it's hit, which enabled me to hook
up a debugger at the point of the failure. The debugger says:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
frame #0: 0x000000010e98a157 postgres`send_message_to_server_log(edata=0x000000010ec0f658) at elog.c:2916:4
frame #1: 0x000000010e9866d6 postgres`EmitErrorReport at elog.c:1537:3
frame #2: 0x000000010e986016 postgres`errfinish(filename="aset.c", lineno=1470, funcname="AllocSetCheck") at elog.c:592:2
frame #3: 0x000000010e9c8465 postgres`AllocSetCheck(context=0x00007ff77c80d200) at aset.c:1469:5
frame #4: 0x000000010e9c7c05 postgres`AllocSetDelete(context=0x00007ff77c80d200) at aset.c:638:2
frame #5: 0x000000010e9d368b postgres`MemoryContextDelete(context=0x00007ff77c80d200) at mcxt.c:252:2
* frame #6: 0x000000010e9d705b postgres`PortalDrop(portal=0x00007ff77e028920, isTopCommit=false) at portalmem.c:596:2
frame #7: 0x000000010e9d7e0e postgres`AtCleanup_Portals at portalmem.c:907:3
frame #8: 0x000000010e22030d postgres`CleanupTransaction at xact.c:2890:2
frame #9: 0x000000010e2219da postgres`AbortCurrentTransaction at xact.c:3328:4
frame #10: 0x000000010e763237 postgres`PostgresMain(dbname="postgres", username="rhaas") at postgres.c:4232:3
frame #11: 0x000000010e6625aa postgres`BackendRun(port=0x00007ff77c1042c0) at postmaster.c:4490:2
frame #12: 0x000000010e661b18 postgres`BackendStartup(port=0x00007ff77c1042c0) at postmaster.c:4218:3
frame #13: 0x000000010e66088a postgres`ServerLoop at postmaster.c:1808:7
frame #14: 0x000000010e65def2 postgres`PostmasterMain(argc=1, argv=0x00007ff77ae05cf0) at postmaster.c:1480:11
frame #15: 0x000000010e50521f postgres`main(argc=1, argv=0x00007ff77ae05cf0) at main.c:197:3
frame #16: 0x00007fff699afcc9 libdyld.dylib`start + 1
(lldb) fr sel 6
frame #6: 0x000000010e9d705b postgres`PortalDrop(portal=0x00007ff77e028920, isTopCommit=false) at portalmem.c:596:2
593 MemoryContextDelete(portal->holdContext);
594
595 /* release subsidiary storage */
-> 596 MemoryContextDelete(portal->portalContext);
597
598 /* release portal struct (it's in TopPortalContext) */
599 pfree(portal);
(lldb) fr sel 3
frame #3: 0x000000010e9c8465 postgres`AllocSetCheck(context=0x00007ff77c80d200) at aset.c:1469:5
1466 * Check chunk size
1467 */
1468 if (dsize > chsize)
-> 1469 elog(WARNING, "problem in alloc set %s: req size > alloc size for chunk %p in block %p",
1470 name, chunk, block);
1471 if (chsize < (1 << ALLOC_MINBITS))
1472 elog(WARNING, "problem in alloc set %s: bad size %zu for chunk %p in block %p",
(lldb) p dsize
(Size) $3 = 20
(lldb) p chsize
(Size) $4 = 0

It seems like CreateDatabaseUsingWalLog() must be doing something that
corrupts PortalContext, but at the moment I'm not sure what that thing
could be.

--
Robert Haas
EDB: http://www.enterprisedb.com

#229

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Robert Haas (#228)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Robert Haas <robertmhaas@gmail.com> writes:

WARNING: problem in alloc set PortalContext: req size > alloc size
for chunk 0x7f99508911f0 in block 0x7f9950890800
WARNING: problem in alloc set PortalContext: bad size 0 for chunk
0x7f99508911f0 in block 0x7f9950890800
WARNING: problem in alloc set PortalContext: bad single-chunk
0x7f9950891208 in block 0x7f9950890800
WARNING: problem in alloc set PortalContext: found inconsistent
memory block 0x7f9950890800
WARNING: problem in alloc set PortalContext: req size > alloc size
for chunk 0x7f99508911f0 in block 0x7f9950890800
WARNING: problem in alloc set PortalContext: bad size 0 for chunk
0x7f99508911f0 in block 0x7f9950890800
WARNING: problem in alloc set PortalContext: bad single-chunk
0x7f9950891208 in block 0x7f9950890800
WARNING: problem in alloc set PortalContext: found inconsistent
memory block 0x7f9950890800

This looks like nothing so much as the fallout from something scribbling
past the end of an allocated palloc chunk, or perhaps writing on
already-freed space. Perhaps running the test case under valgrind
would help to finger the culprit.

regards, tom lane

#230

Justin Pryzby

pryzby@telsasoft.com

over 3 years ago

In reply to: Tom Lane (#229)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Tue, Aug 02, 2022 at 05:46:34PM -0400, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

WARNING: problem in alloc set PortalContext: req size > alloc size for chunk 0x7f99508911f0 in block 0x7f9950890800

This looks like nothing so much as the fallout from something scribbling
past the end of an allocated palloc chunk, or perhaps writing on
already-freed space. Perhaps running the test case under valgrind
would help to finger the culprit.

Yeah but my test case is so poor that it's a chore ...

(Sorry for that, but it took me 2 days to be able to reproduce the problem so I
sent it sooner rather than looking for a better way ... )

I got this interesting looking thing.

==11628== Invalid write of size 8
==11628== at 0x1D12B3A: smgrsetowner (smgr.c:213)
==11628== by 0x1C7C224: RelationGetSmgr (rel.h:572)
==11628== by 0x1C7C224: RelationCopyStorageUsingBuffer (bufmgr.c:3725)
==11628== by 0x1C7C7A6: CreateAndCopyRelationData (bufmgr.c:3817)
==11628== by 0x14A4518: CreateDatabaseUsingWalLog (dbcommands.c:221)
==11628== by 0x14AB009: createdb (dbcommands.c:1393)
==11628== by 0x1D2B9AF: standard_ProcessUtility (utility.c:776)
==11628== by 0x1D2C46A: ProcessUtility (utility.c:530)
==11628== by 0x1D265F5: PortalRunUtility (pquery.c:1158)
==11628== by 0x1D27089: PortalRunMulti (pquery.c:1315)
==11628== by 0x1D27A7C: PortalRun (pquery.c:791)
==11628== by 0x1D1E33D: exec_simple_query (postgres.c:1243)
==11628== by 0x1D218BC: PostgresMain (postgres.c:4505)
==11628== Address 0x1025bc18 is 2,712 bytes inside a block of size 8,192 free'd
==11628== at 0x4033A3F: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==11628== by 0x217D7C2: AllocSetReset (aset.c:608)
==11628== by 0x219B57A: MemoryContextResetOnly (mcxt.c:181)
==11628== by 0x217DBD5: AllocSetDelete (aset.c:654)
==11628== by 0x219C1EC: MemoryContextDelete (mcxt.c:252)
==11628== by 0x21A109F: PortalDrop (portalmem.c:596)
==11628== by 0x21A269C: AtCleanup_Portals (portalmem.c:907)
==11628== by 0x11FEAB1: CleanupTransaction (xact.c:2890)
==11628== by 0x120A74C: AbortCurrentTransaction (xact.c:3328)
==11628== by 0x1D2158C: PostgresMain (postgres.c:4232)
==11628== by 0x1B15DB5: BackendRun (postmaster.c:4490)
==11628== by 0x1B1D799: BackendStartup (postmaster.c:4218)
==11628== Block was alloc'd at
==11628== at 0x40327F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==11628== by 0x217F0DC: AllocSetAlloc (aset.c:920)
==11628== by 0x219E4D2: palloc (mcxt.c:1082)
==11628== by 0x14A14BE: ScanSourceDatabasePgClassTuple (dbcommands.c:444)
==11628== by 0x14A1CD8: ScanSourceDatabasePgClassPage (dbcommands.c:384)
==11628== by 0x14A20BF: ScanSourceDatabasePgClass (dbcommands.c:322)
==11628== by 0x14A4348: CreateDatabaseUsingWalLog (dbcommands.c:177)
==11628== by 0x14AB009: createdb (dbcommands.c:1393)
==11628== by 0x1D2B9AF: standard_ProcessUtility (utility.c:776)
==11628== by 0x1D2C46A: ProcessUtility (utility.c:530)
==11628== by 0x1D265F5: PortalRunUtility (pquery.c:1158)
==11628== by 0x1D27089: PortalRunMulti (pquery.c:1315)

--
Justin

#231

Andres Freund

andres@anarazel.de

over 3 years ago

In reply to: Justin Pryzby (#230)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On 2022-08-02 17:04:16 -0500, Justin Pryzby wrote:

I got this interesting looking thing.

==11628== Invalid write of size 8
==11628== at 0x1D12B3A: smgrsetowner (smgr.c:213)
==11628== by 0x1C7C224: RelationGetSmgr (rel.h:572)
==11628== by 0x1C7C224: RelationCopyStorageUsingBuffer (bufmgr.c:3725)
==11628== by 0x1C7C7A6: CreateAndCopyRelationData (bufmgr.c:3817)
==11628== by 0x14A4518: CreateDatabaseUsingWalLog (dbcommands.c:221)
==11628== by 0x14AB009: createdb (dbcommands.c:1393)
==11628== by 0x1D2B9AF: standard_ProcessUtility (utility.c:776)
==11628== by 0x1D2C46A: ProcessUtility (utility.c:530)
==11628== by 0x1D265F5: PortalRunUtility (pquery.c:1158)
==11628== by 0x1D27089: PortalRunMulti (pquery.c:1315)
==11628== by 0x1D27A7C: PortalRun (pquery.c:791)
==11628== by 0x1D1E33D: exec_simple_query (postgres.c:1243)
==11628== by 0x1D218BC: PostgresMain (postgres.c:4505)
==11628== Address 0x1025bc18 is 2,712 bytes inside a block of size 8,192 free'd
==11628== at 0x4033A3F: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==11628== by 0x217D7C2: AllocSetReset (aset.c:608)
==11628== by 0x219B57A: MemoryContextResetOnly (mcxt.c:181)
==11628== by 0x217DBD5: AllocSetDelete (aset.c:654)
==11628== by 0x219C1EC: MemoryContextDelete (mcxt.c:252)
==11628== by 0x21A109F: PortalDrop (portalmem.c:596)
==11628== by 0x21A269C: AtCleanup_Portals (portalmem.c:907)
==11628== by 0x11FEAB1: CleanupTransaction (xact.c:2890)
==11628== by 0x120A74C: AbortCurrentTransaction (xact.c:3328)
==11628== by 0x1D2158C: PostgresMain (postgres.c:4232)
==11628== by 0x1B15DB5: BackendRun (postmaster.c:4490)
==11628== by 0x1B1D799: BackendStartup (postmaster.c:4218)
==11628== Block was alloc'd at
==11628== at 0x40327F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==11628== by 0x217F0DC: AllocSetAlloc (aset.c:920)
==11628== by 0x219E4D2: palloc (mcxt.c:1082)
==11628== by 0x14A14BE: ScanSourceDatabasePgClassTuple (dbcommands.c:444)
==11628== by 0x14A1CD8: ScanSourceDatabasePgClassPage (dbcommands.c:384)
==11628== by 0x14A20BF: ScanSourceDatabasePgClass (dbcommands.c:322)
==11628== by 0x14A4348: CreateDatabaseUsingWalLog (dbcommands.c:177)
==11628== by 0x14AB009: createdb (dbcommands.c:1393)
==11628== by 0x1D2B9AF: standard_ProcessUtility (utility.c:776)
==11628== by 0x1D2C46A: ProcessUtility (utility.c:530)
==11628== by 0x1D265F5: PortalRunUtility (pquery.c:1158)
==11628== by 0x1D27089: PortalRunMulti (pquery.c:1315)

Ick. That looks like somehow we end up with smgr entries still pointing to
fake relcache entries, created in a prior attempt at create database.

Looks like you'd need error trapping to call FreeFakeRelcacheEntry() (or just
smgrclearowner()) in case of error.

Or perhaps we can instead prevent the fake relcache entry being set as the
owner in the first place?

Why do we even need fake relcache entries here? Looks like all that they're
used for is a bunch of RelationGetSmgr() calls? Can't we instead just pass the
rnode to smgropen()? Given that we're doing that once for every buffer in the
body of RelationCopyStorageUsingBuffer(), doing it in a bunch of other
less-frequent places can't be a problem.
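
To make the failure mode concrete, here is a simplified standalone model of the ownership link, not the PostgreSQL implementation. All names (`Storage`, `FakeRel`, `storage_set_owner`, and so on) are hypothetical. A long-lived entry remembers the address of the pointer that refers back to it. If the short-lived object holding that pointer is freed without clearing the link, the next set-owner call writes through a dangling address, which matches the shape of the invalid write valgrind reported. The sketch shows the safe variant, where the link is cleared before the short-lived object goes away.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Long-lived cache entry, loosely modelled on a storage-manager entry. */
typedef struct Storage
{
    struct Storage **owner;     /* address of the back-pointer, or NULL */
} Storage;

/* Short-lived object, loosely modelled on a fake relcache entry. */
typedef struct FakeRel
{
    Storage *smgr;
} FakeRel;

static void
storage_set_owner(Storage **owner, Storage *s)
{
    /*
     * Unhook any previous owner first. This writes through s->owner, so
     * that address must still be valid memory.
     */
    if (s->owner != NULL)
        *(s->owner) = NULL;
    s->owner = owner;
    *owner = s;
}

static void
storage_clear_owner(Storage **owner, Storage *s)
{
    if (s->owner != owner)
        return;                 /* someone else owns it; nothing to do */
    *owner = NULL;
    s->owner = NULL;
}

static FakeRel *
fake_rel_create(Storage *s)
{
    FakeRel *rel = malloc(sizeof(FakeRel));

    if (rel == NULL)
        abort();
    rel->smgr = NULL;
    storage_set_owner(&rel->smgr, s);
    return rel;
}

/* Clear the ownership link before releasing the memory it points into. */
static void
fake_rel_free(FakeRel *rel)
{
    if (rel->smgr != NULL)
        storage_clear_owner(&rel->smgr, rel->smgr);
    free(rel);
}

/*
 * Simulate two attempts against the same long-lived entry. Returns 1 if
 * the second attempt linked correctly and no owner remains afterwards.
 */
static int
run_two_attempts(void)
{
    Storage  s = {NULL};
    FakeRel *first = fake_rel_create(&s);
    FakeRel *second;
    int      linked;

    fake_rel_free(first);       /* an error path has to reach this too */

    second = fake_rel_create(&s);
    linked = (second->smgr == &s && s.owner == &second->smgr);
    fake_rel_free(second);

    return linked && s.owner == NULL;
}
```

In this model, skipping `fake_rel_free(first)` on an error path and simply releasing the memory would leave `s.owner` dangling. Avoiding the short-lived owner entirely, as suggested above, removes the need for that cleanup.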

Greetings,

Andres Freund

#232

Dilip Kumar

dilipbalaut@gmail.com

over 3 years ago

In reply to: Andres Freund (#231)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Aug 3, 2022 at 3:53 AM Andres Freund <andres@anarazel.de> wrote:

On 2022-08-02 17:04:16 -0500, Justin Pryzby wrote:

I got this interesting looking thing.

==11628== Invalid write of size 8
==11628== at 0x1D12B3A: smgrsetowner (smgr.c:213)
==11628== by 0x1C7C224: RelationGetSmgr (rel.h:572)
==11628== by 0x1C7C224: RelationCopyStorageUsingBuffer (bufmgr.c:3725)
==11628== by 0x1C7C7A6: CreateAndCopyRelationData (bufmgr.c:3817)
==11628== by 0x14A4518: CreateDatabaseUsingWalLog (dbcommands.c:221)
==11628== by 0x14AB009: createdb (dbcommands.c:1393)
==11628== by 0x1D2B9AF: standard_ProcessUtility (utility.c:776)
==11628== by 0x1D2C46A: ProcessUtility (utility.c:530)
==11628== by 0x1D265F5: PortalRunUtility (pquery.c:1158)
==11628== by 0x1D27089: PortalRunMulti (pquery.c:1315)
==11628== by 0x1D27A7C: PortalRun (pquery.c:791)
==11628== by 0x1D1E33D: exec_simple_query (postgres.c:1243)
==11628== by 0x1D218BC: PostgresMain (postgres.c:4505)
==11628== Address 0x1025bc18 is 2,712 bytes inside a block of size 8,192 free'd
==11628== at 0x4033A3F: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==11628== by 0x217D7C2: AllocSetReset (aset.c:608)
==11628== by 0x219B57A: MemoryContextResetOnly (mcxt.c:181)
==11628== by 0x217DBD5: AllocSetDelete (aset.c:654)
==11628== by 0x219C1EC: MemoryContextDelete (mcxt.c:252)
==11628== by 0x21A109F: PortalDrop (portalmem.c:596)
==11628== by 0x21A269C: AtCleanup_Portals (portalmem.c:907)
==11628== by 0x11FEAB1: CleanupTransaction (xact.c:2890)
==11628== by 0x120A74C: AbortCurrentTransaction (xact.c:3328)
==11628== by 0x1D2158C: PostgresMain (postgres.c:4232)
==11628== by 0x1B15DB5: BackendRun (postmaster.c:4490)
==11628== by 0x1B1D799: BackendStartup (postmaster.c:4218)
==11628== Block was alloc'd at
==11628== at 0x40327F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==11628== by 0x217F0DC: AllocSetAlloc (aset.c:920)
==11628== by 0x219E4D2: palloc (mcxt.c:1082)
==11628== by 0x14A14BE: ScanSourceDatabasePgClassTuple (dbcommands.c:444)
==11628== by 0x14A1CD8: ScanSourceDatabasePgClassPage (dbcommands.c:384)
==11628== by 0x14A20BF: ScanSourceDatabasePgClass (dbcommands.c:322)
==11628== by 0x14A4348: CreateDatabaseUsingWalLog (dbcommands.c:177)
==11628== by 0x14AB009: createdb (dbcommands.c:1393)
==11628== by 0x1D2B9AF: standard_ProcessUtility (utility.c:776)
==11628== by 0x1D2C46A: ProcessUtility (utility.c:530)
==11628== by 0x1D265F5: PortalRunUtility (pquery.c:1158)
==11628== by 0x1D27089: PortalRunMulti (pquery.c:1315)

Ick. That looks like somehow we end up with smgr entries still pointing to
fake relcache entries, created in a prior attempt at create database.

The surprising thing is that the smgr entry survived the transaction
abort; AtEOXact_SMgr should have closed the smgr and removed it from
the smgr cache.

Looks like you'd need error trapping to call FreeFakeRelcacheEntry() (or just
smgrclearowner()) in case of error.

Or perhaps we can instead prevent the fake relcache entry being set as the
owner in the first place?

Why do we even need fake relcache entries here? Looks like all that they're
used for is a bunch of RelationGetSmgr() calls? Can't we instead just pass the
rnode to smgropen()? Given that we're doing that once for every buffer in the
body of RelationCopyStorageUsingBuffer(), doing it in a bunch of other
less-frequent places can't be a problem.

I think some of the previous versions of the patch used smgropen()
directly, but we changed that so that we do not reuse the smgr after it
gets removed during interrupt processing; see the discussion here [1].

[1] /messages/by-id/CA+TgmoYKovODW2Y7rQmmRFaKu445p9uAahjpgfbY8eyeL07BXA@mail.gmail.com

From the Valgrind report, it is clear that we are getting an smgr
entry whose smgr->smgr_owner points into the fake relcache entry.
So I am investigating further how an smgr created during a previous
CREATE DATABASE attempt can survive beyond transaction abort.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#233

Dilip Kumar

dilipbalaut@gmail.com

over 3 years ago

In reply to: Dilip Kumar (#232)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Aug 3, 2022 at 11:28 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Aug 3, 2022 at 3:53 AM Andres Freund <andres@anarazel.de> wrote:

On 2022-08-02 17:04:16 -0500, Justin Pryzby wrote:

I got this interesting looking thing.

==11628== Invalid write of size 8
==11628== at 0x1D12B3A: smgrsetowner (smgr.c:213)
==11628== by 0x1C7C224: RelationGetSmgr (rel.h:572)
==11628== by 0x1C7C224: RelationCopyStorageUsingBuffer (bufmgr.c:3725)
==11628== by 0x1C7C7A6: CreateAndCopyRelationData (bufmgr.c:3817)
==11628== by 0x14A4518: CreateDatabaseUsingWalLog (dbcommands.c:221)
==11628== by 0x14AB009: createdb (dbcommands.c:1393)
==11628== by 0x1D2B9AF: standard_ProcessUtility (utility.c:776)
==11628== by 0x1D2C46A: ProcessUtility (utility.c:530)
==11628== by 0x1D265F5: PortalRunUtility (pquery.c:1158)
==11628== by 0x1D27089: PortalRunMulti (pquery.c:1315)
==11628== by 0x1D27A7C: PortalRun (pquery.c:791)
==11628== by 0x1D1E33D: exec_simple_query (postgres.c:1243)
==11628== by 0x1D218BC: PostgresMain (postgres.c:4505)
==11628== Address 0x1025bc18 is 2,712 bytes inside a block of size 8,192 free'd
==11628== at 0x4033A3F: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==11628== by 0x217D7C2: AllocSetReset (aset.c:608)
==11628== by 0x219B57A: MemoryContextResetOnly (mcxt.c:181)
==11628== by 0x217DBD5: AllocSetDelete (aset.c:654)
==11628== by 0x219C1EC: MemoryContextDelete (mcxt.c:252)
==11628== by 0x21A109F: PortalDrop (portalmem.c:596)
==11628== by 0x21A269C: AtCleanup_Portals (portalmem.c:907)
==11628== by 0x11FEAB1: CleanupTransaction (xact.c:2890)
==11628== by 0x120A74C: AbortCurrentTransaction (xact.c:3328)
==11628== by 0x1D2158C: PostgresMain (postgres.c:4232)
==11628== by 0x1B15DB5: BackendRun (postmaster.c:4490)
==11628== by 0x1B1D799: BackendStartup (postmaster.c:4218)
==11628== Block was alloc'd at
==11628== at 0x40327F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==11628== by 0x217F0DC: AllocSetAlloc (aset.c:920)
==11628== by 0x219E4D2: palloc (mcxt.c:1082)
==11628== by 0x14A14BE: ScanSourceDatabasePgClassTuple (dbcommands.c:444)
==11628== by 0x14A1CD8: ScanSourceDatabasePgClassPage (dbcommands.c:384)
==11628== by 0x14A20BF: ScanSourceDatabasePgClass (dbcommands.c:322)
==11628== by 0x14A4348: CreateDatabaseUsingWalLog (dbcommands.c:177)
==11628== by 0x14AB009: createdb (dbcommands.c:1393)
==11628== by 0x1D2B9AF: standard_ProcessUtility (utility.c:776)
==11628== by 0x1D2C46A: ProcessUtility (utility.c:530)
==11628== by 0x1D265F5: PortalRunUtility (pquery.c:1158)
==11628== by 0x1D27089: PortalRunMulti (pquery.c:1315)

Ick. That looks like somehow we end up with smgr entries still pointing to
fake relcache entries, created in a prior attempt at create database.

The surprising thing is that the smgr entry survived the transaction
abort; AtEOXact_SMgr should have closed the smgr and removed it from
the smgr cache.

Okay, so AtEOXact_SMgr only gets rid of unowned smgr entries, and ours
are owned by a fake relcache entry whose lifetime is just the portal
memory context, which goes away at transaction end. So, as Andres
suggested, the options are: a) catch the error and call
FreeFakeRelcacheEntry(), or b) use smgropen() directly instead of
RelationGetSmgr(), because we do not actually want the owner to be set
for these smgrs.

Option b) looks better to me. I will prepare a patch and test whether
the error goes away with it.
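
To make the failure mode concrete, here is a toy model, not PostgreSQL code: `ToySmgr`, `toy_smgropen()`, `toy_smgrsetowner()` and `toy_at_eoxact()` are hypothetical stand-ins that only mimic the ownership rule described above. End-of-transaction cleanup closes unowned entries and skips owned ones, so an entry owned by something short-lived keeps a back-pointer to it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an smgr cache entry; NOT the PostgreSQL struct. */
typedef struct ToySmgr
{
	bool		open;			/* is this cache slot in use? */
	struct ToySmgr **owner;		/* where the owner keeps its pointer, or NULL */
} ToySmgr;

#define TOY_CACHE_SIZE 4
static ToySmgr toy_cache[TOY_CACHE_SIZE];

/* Return the cache entry for "key", opening it (unowned) if necessary. */
static ToySmgr *
toy_smgropen(int key)
{
	assert(key >= 0 && key < TOY_CACHE_SIZE);
	if (!toy_cache[key].open)
	{
		toy_cache[key].open = true;
		toy_cache[key].owner = NULL;
	}
	return &toy_cache[key];
}

/* Record "ownerref" as the owner; the entry keeps a pointer back to it. */
static void
toy_smgrsetowner(ToySmgr **ownerref, ToySmgr *reln)
{
	/*
	 * Unhook any previous owner first.  If that owner's memory has already
	 * been freed, this write goes into freed memory.
	 */
	if (reln->owner != NULL)
		*(reln->owner) = NULL;
	reln->owner = ownerref;
	*ownerref = reln;
}

/* End-of-transaction cleanup: close unowned entries only. */
static int
toy_at_eoxact(void)
{
	int			closed = 0;

	for (int i = 0; i < TOY_CACHE_SIZE; i++)
	{
		if (toy_cache[i].open && toy_cache[i].owner == NULL)
		{
			toy_cache[i].open = false;
			closed++;
		}
	}
	return closed;
}
```

If an entry is opened and given an owner, and a second entry is opened without one, `toy_at_eoxact()` closes only the second. The first stays in the cache with its `owner` pointer intact, which is harmless only as long as the owner is still alive.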

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#234

Dilip Kumar

dilipbalaut@gmail.com

over 3 years ago

In reply to: Dilip Kumar (#233)

1 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Aug 3, 2022 at 12:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Okay, so AtEOXact_SMgr only gets rid of unowned smgr entries, and ours
are owned by a fake relcache entry whose lifetime is just the portal
memory context, which goes away at transaction end. So, as Andres
suggested, the options are: a) catch the error and call
FreeFakeRelcacheEntry(), or b) use smgropen() directly instead of
RelationGetSmgr(), because we do not actually want the owner to be set
for these smgrs.

Option b) looks better to me. I will prepare a patch and test whether
the error goes away with it.

Here is a patch that uses smgropen() directly instead of a fake
relcache entry. We don't keep the smgr pointer; whenever we need it,
we call smgropen() again.

This patch resolves the problems that I was able to reproduce: both the
WARNING messages that Robert got and the valgrind error that Justin
got.

@Justin, can you help verify the original issue?

Another alternative would be to keep using the fake relcache entry but,
instead of RelationGetSmgr(), add a new function that doesn't set the
owner in the smgr.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0001-BugfixInWalLogCreateDB.patch (text/x-patch; charset=US-ASCII)

From aa1f6ff66f1c4bc41ffbc066a5543e7030a4501f Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 3 Aug 2022 13:28:46 +0530
Subject: [PATCH] BugfixInWalLogCreateDB

---
 src/backend/commands/dbcommands.c   | 12 +----------
 src/backend/storage/buffer/bufmgr.c | 43 +++++++++++++------------------------
 2 files changed, 16 insertions(+), 39 deletions(-)

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 7bc53f3..0423831 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -258,7 +258,6 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
 	Page		page;
 	List	   *rlocatorlist = NIL;
 	LockRelId	relid;
-	Relation	rel;
 	Snapshot	snapshot;
 	BufferAccessStrategy bstrategy;
 
@@ -276,16 +275,7 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
 	rlocator.dbOid = dbid;
 	rlocator.relNumber = relfilenumber;
 
-	/*
-	 * We can't use a real relcache entry for a relation in some other
-	 * database, but since we're only going to access the fields related to
-	 * physical storage, a fake one is good enough. If we didn't do this and
-	 * used the smgr layer directly, we would have to worry about
-	 * invalidations.
-	 */
-	rel = CreateFakeRelcacheEntry(rlocator);
-	nblocks = smgrnblocks(RelationGetSmgr(rel), MAIN_FORKNUM);
-	FreeFakeRelcacheEntry(rel);
+	nblocks = smgrnblocks(smgropen(rlocator, InvalidBackendId), MAIN_FORKNUM);
 
 	/* Use a buffer access strategy since this is a bulk read operation. */
 	bstrategy = GetAccessStrategy(BAS_BULKREAD);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6b30138..5f6e242 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -487,9 +487,9 @@ static void FindAndDropRelationBuffers(RelFileLocator rlocator,
 									   ForkNumber forkNum,
 									   BlockNumber nForkBlock,
 									   BlockNumber firstDelBlock);
-static void RelationCopyStorageUsingBuffer(Relation src, Relation dst,
-										   ForkNumber forkNum,
-										   bool isunlogged);
+static void RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
+										   RelFileLocator dstlocator,
+										   ForkNumber forkNum, bool permanent);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int	rlocator_comparator(const void *p1, const void *p2);
@@ -3701,8 +3701,9 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
  * --------------------------------------------------------------------
  */
 static void
-RelationCopyStorageUsingBuffer(Relation src, Relation dst, ForkNumber forkNum,
-							   bool permanent)
+RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
+							   RelFileLocator dstlocator,
+							   ForkNumber forkNum, bool permanent)
 {
 	Buffer		srcBuf;
 	Buffer		dstBuf;
@@ -3722,7 +3723,8 @@ RelationCopyStorageUsingBuffer(Relation src, Relation dst, ForkNumber forkNum,
 	use_wal = XLogIsNeeded() && (permanent || forkNum == INIT_FORKNUM);
 
 	/* Get number of blocks in the source relation. */
-	nblocks = smgrnblocks(RelationGetSmgr(src), forkNum);
+	nblocks = smgrnblocks(smgropen(srclocator, InvalidBackendId),
+						  forkNum);
 
 	/* Nothing to copy; just return. */
 	if (nblocks == 0)
@@ -3738,7 +3740,7 @@ RelationCopyStorageUsingBuffer(Relation src, Relation dst, ForkNumber forkNum,
 		CHECK_FOR_INTERRUPTS();
 
 		/* Read block from source relation. */
-		srcBuf = ReadBufferWithoutRelcache(src->rd_locator, forkNum, blkno,
+		srcBuf = ReadBufferWithoutRelcache(srclocator, forkNum, blkno,
 										   RBM_NORMAL, bstrategy_src,
 										   permanent);
 		srcPage = BufferGetPage(srcBuf);
@@ -3749,7 +3751,7 @@ RelationCopyStorageUsingBuffer(Relation src, Relation dst, ForkNumber forkNum,
 		}
 
 		/* Use P_NEW to extend the destination relation. */
-		dstBuf = ReadBufferWithoutRelcache(dst->rd_locator, forkNum, P_NEW,
+		dstBuf = ReadBufferWithoutRelcache(dstlocator, forkNum, P_NEW,
 										   RBM_NORMAL, bstrategy_dst,
 										   permanent);
 		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
@@ -3787,8 +3789,6 @@ void
 CreateAndCopyRelationData(RelFileLocator src_rlocator,
 						  RelFileLocator dst_rlocator, bool permanent)
 {
-	Relation	src_rel;
-	Relation	dst_rel;
 	char		relpersistence;
 
 	/* Set the relpersistence. */
@@ -3796,16 +3796,6 @@ CreateAndCopyRelationData(RelFileLocator src_rlocator,
 		RELPERSISTENCE_PERMANENT : RELPERSISTENCE_UNLOGGED;
 
 	/*
-	 * We can't use a real relcache entry for a relation in some other
-	 * database, but since we're only going to access the fields related to
-	 * physical storage, a fake one is good enough. If we didn't do this and
-	 * used the smgr layer directly, we would have to worry about
-	 * invalidations.
-	 */
-	src_rel = CreateFakeRelcacheEntry(src_rlocator);
-	dst_rel = CreateFakeRelcacheEntry(dst_rlocator);
-
-	/*
 	 * Create and copy all forks of the relation.  During create database we
 	 * have a separate cleanup mechanism which deletes complete database
 	 * directory.  Therefore, each individual relation doesn't need to be
@@ -3814,15 +3804,16 @@ CreateAndCopyRelationData(RelFileLocator src_rlocator,
 	RelationCreateStorage(dst_rlocator, relpersistence, false);
 
 	/* copy main fork. */
-	RelationCopyStorageUsingBuffer(src_rel, dst_rel, MAIN_FORKNUM, permanent);
+	RelationCopyStorageUsingBuffer(src_rlocator, dst_rlocator, MAIN_FORKNUM,
+								   permanent);
 
 	/* copy those extra forks that exist */
 	for (ForkNumber forkNum = MAIN_FORKNUM + 1;
 		 forkNum <= MAX_FORKNUM; forkNum++)
 	{
-		if (smgrexists(RelationGetSmgr(src_rel), forkNum))
+		if (smgrexists(smgropen(src_rlocator, InvalidBackendId), forkNum))
 		{
-			smgrcreate(RelationGetSmgr(dst_rel), forkNum, false);
+			smgrcreate(smgropen(dst_rlocator, InvalidBackendId), forkNum, false);
 
 			/*
 			 * WAL log creation if the relation is persistent, or this is the
@@ -3832,14 +3823,10 @@ CreateAndCopyRelationData(RelFileLocator src_rlocator,
 				log_smgrcreate(&dst_rlocator, forkNum);
 
 			/* Copy a fork's data, block by block. */
-			RelationCopyStorageUsingBuffer(src_rel, dst_rel, forkNum,
+			RelationCopyStorageUsingBuffer(src_rlocator, dst_rlocator, forkNum,
 										   permanent);
 		}
 	}
-
-	/* Release fake relcache entries. */
-	FreeFakeRelcacheEntry(src_rel);
-	FreeFakeRelcacheEntry(dst_rel);
 }
 
 /* ---------------------------------------------------------------------
-- 
1.8.3.1

#235

Dilip Kumar

dilipbalaut@gmail.com

over 3 years ago

In reply to: Dilip Kumar (#234)

1 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Aug 3, 2022 at 1:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Aug 3, 2022 at 12:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Okay, so AtEOXact_SMgr only gets rid of unowned smgr entries, and ours
are owned by a fake relcache entry whose lifetime is just the portal
memory context, which goes away at transaction end. So, as Andres
suggested, the options are: a) catch the error and call
FreeFakeRelcacheEntry(), or b) use smgropen() directly instead of
RelationGetSmgr(), because we do not actually want the owner to be set
for these smgrs.

Option b) looks better to me. I will prepare a patch and test whether
the error goes away with it.

Here is a patch that uses smgropen() directly instead of a fake
relcache entry. We don't keep the smgr pointer; whenever we need it,
we call smgropen() again.

This patch resolves the problems that I was able to reproduce: both the
WARNING messages that Robert got and the valgrind error that Justin
got.

Here is another version of the patch, which closes the smgr at the end
using smgrcloserellocator(). I have also added a commit message.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v1-0001-Avoid-setting-the-fake-relcache-entry-as-smgr-own.patch (text/x-patch; charset=UTF-8)

From b151b54880dd17c94a25e8de908e30fe0d9a8542 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 3 Aug 2022 13:28:46 +0530
Subject: [PATCH v1] Avoid setting the fake relcache entry as smgr owner
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

During CREATE DATABASE, we are not connected to the source or the
destination database, so in order to operate on the storage we use a
FakeRelCacheEntry and call RelationGetSmgr() on it.

The problem is that this function sets the temporary
FakeRelCacheEntry as the owner of the smgr.  If any error occurs
before we free the FakeRelCacheEntry, the memory of the fake relcache
entry is released at transaction abort but the smgr survives the
transaction.  The smgr then points to already-released memory, and
the behavior is unpredictable the next time we access the smgr.

To fix the issue, call smgropen() directly instead of using the
FakeRelCacheEntry, and do not keep a reference to the smgr; call
smgropen() again whenever we need it.  This ensures that we do not
access an smgr pointer that might already have been closed during
interrupt processing.
---
 src/backend/commands/dbcommands.c   | 15 +++--------
 src/backend/storage/buffer/bufmgr.c | 51 +++++++++++++++++--------------------
 2 files changed, 28 insertions(+), 38 deletions(-)

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 7bc53f3..9342e8e 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -258,8 +258,8 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
 	Page		page;
 	List	   *rlocatorlist = NIL;
 	LockRelId	relid;
-	Relation	rel;
 	Snapshot	snapshot;
+	SMgrRelation	smgr;
 	BufferAccessStrategy bstrategy;
 
 	/* Get pg_class relfilenumber. */
@@ -276,16 +276,9 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
 	rlocator.dbOid = dbid;
 	rlocator.relNumber = relfilenumber;
 
-	/*
-	 * We can't use a real relcache entry for a relation in some other
-	 * database, but since we're only going to access the fields related to
-	 * physical storage, a fake one is good enough. If we didn't do this and
-	 * used the smgr layer directly, we would have to worry about
-	 * invalidations.
-	 */
-	rel = CreateFakeRelcacheEntry(rlocator);
-	nblocks = smgrnblocks(RelationGetSmgr(rel), MAIN_FORKNUM);
-	FreeFakeRelcacheEntry(rel);
+	smgr = smgropen(rlocator, InvalidBackendId);
+	nblocks = smgrnblocks(smgr, MAIN_FORKNUM);
+	smgrclose(smgr);
 
 	/* Use a buffer access strategy since this is a bulk read operation. */
 	bstrategy = GetAccessStrategy(BAS_BULKREAD);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6b30138..8a7ccf5 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -487,9 +487,9 @@ static void FindAndDropRelationBuffers(RelFileLocator rlocator,
 									   ForkNumber forkNum,
 									   BlockNumber nForkBlock,
 									   BlockNumber firstDelBlock);
-static void RelationCopyStorageUsingBuffer(Relation src, Relation dst,
-										   ForkNumber forkNum,
-										   bool isunlogged);
+static void RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
+										   RelFileLocator dstlocator,
+										   ForkNumber forkNum, bool permanent);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int	rlocator_comparator(const void *p1, const void *p2);
@@ -3701,8 +3701,9 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
  * --------------------------------------------------------------------
  */
 static void
-RelationCopyStorageUsingBuffer(Relation src, Relation dst, ForkNumber forkNum,
-							   bool permanent)
+RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
+							   RelFileLocator dstlocator,
+							   ForkNumber forkNum, bool permanent)
 {
 	Buffer		srcBuf;
 	Buffer		dstBuf;
@@ -3722,7 +3723,8 @@ RelationCopyStorageUsingBuffer(Relation src, Relation dst, ForkNumber forkNum,
 	use_wal = XLogIsNeeded() && (permanent || forkNum == INIT_FORKNUM);
 
 	/* Get number of blocks in the source relation. */
-	nblocks = smgrnblocks(RelationGetSmgr(src), forkNum);
+	nblocks = smgrnblocks(smgropen(srclocator, InvalidBackendId),
+						  forkNum);
 
 	/* Nothing to copy; just return. */
 	if (nblocks == 0)
@@ -3738,7 +3740,7 @@ RelationCopyStorageUsingBuffer(Relation src, Relation dst, ForkNumber forkNum,
 		CHECK_FOR_INTERRUPTS();
 
 		/* Read block from source relation. */
-		srcBuf = ReadBufferWithoutRelcache(src->rd_locator, forkNum, blkno,
+		srcBuf = ReadBufferWithoutRelcache(srclocator, forkNum, blkno,
 										   RBM_NORMAL, bstrategy_src,
 										   permanent);
 		srcPage = BufferGetPage(srcBuf);
@@ -3749,7 +3751,7 @@ RelationCopyStorageUsingBuffer(Relation src, Relation dst, ForkNumber forkNum,
 		}
 
 		/* Use P_NEW to extend the destination relation. */
-		dstBuf = ReadBufferWithoutRelcache(dst->rd_locator, forkNum, P_NEW,
+		dstBuf = ReadBufferWithoutRelcache(dstlocator, forkNum, P_NEW,
 										   RBM_NORMAL, bstrategy_dst,
 										   permanent);
 		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
@@ -3787,8 +3789,7 @@ void
 CreateAndCopyRelationData(RelFileLocator src_rlocator,
 						  RelFileLocator dst_rlocator, bool permanent)
 {
-	Relation	src_rel;
-	Relation	dst_rel;
+	RelFileLocatorBackend rlocator;
 	char		relpersistence;
 
 	/* Set the relpersistence. */
@@ -3796,16 +3797,6 @@ CreateAndCopyRelationData(RelFileLocator src_rlocator,
 		RELPERSISTENCE_PERMANENT : RELPERSISTENCE_UNLOGGED;
 
 	/*
-	 * We can't use a real relcache entry for a relation in some other
-	 * database, but since we're only going to access the fields related to
-	 * physical storage, a fake one is good enough. If we didn't do this and
-	 * used the smgr layer directly, we would have to worry about
-	 * invalidations.
-	 */
-	src_rel = CreateFakeRelcacheEntry(src_rlocator);
-	dst_rel = CreateFakeRelcacheEntry(dst_rlocator);
-
-	/*
 	 * Create and copy all forks of the relation.  During create database we
 	 * have a separate cleanup mechanism which deletes complete database
 	 * directory.  Therefore, each individual relation doesn't need to be
@@ -3814,15 +3805,16 @@ CreateAndCopyRelationData(RelFileLocator src_rlocator,
 	RelationCreateStorage(dst_rlocator, relpersistence, false);
 
 	/* copy main fork. */
-	RelationCopyStorageUsingBuffer(src_rel, dst_rel, MAIN_FORKNUM, permanent);
+	RelationCopyStorageUsingBuffer(src_rlocator, dst_rlocator, MAIN_FORKNUM,
+								   permanent);
 
 	/* copy those extra forks that exist */
 	for (ForkNumber forkNum = MAIN_FORKNUM + 1;
 		 forkNum <= MAX_FORKNUM; forkNum++)
 	{
-		if (smgrexists(RelationGetSmgr(src_rel), forkNum))
+		if (smgrexists(smgropen(src_rlocator, InvalidBackendId), forkNum))
 		{
-			smgrcreate(RelationGetSmgr(dst_rel), forkNum, false);
+			smgrcreate(smgropen(dst_rlocator, InvalidBackendId), forkNum, false);
 
 			/*
 			 * WAL log creation if the relation is persistent, or this is the
@@ -3832,14 +3824,19 @@ CreateAndCopyRelationData(RelFileLocator src_rlocator,
 				log_smgrcreate(&dst_rlocator, forkNum);
 
 			/* Copy a fork's data, block by block. */
-			RelationCopyStorageUsingBuffer(src_rel, dst_rel, forkNum,
+			RelationCopyStorageUsingBuffer(src_rlocator, dst_rlocator, forkNum,
 										   permanent);
 		}
 	}
 
-	/* Release fake relcache entries. */
-	FreeFakeRelcacheEntry(src_rel);
-	FreeFakeRelcacheEntry(dst_rel);
+	/* close source and destination smgr if exists. */
+	rlocator.backend = InvalidBackendId;
+
+	rlocator.locator = src_rlocator;
+	smgrcloserellocator(rlocator);
+
+	rlocator.locator = dst_rlocator;
+	smgrcloserellocator(rlocator);
 }
 
 /* ---------------------------------------------------------------------
-- 
1.8.3.1

#236

Robert Haas

robertmhaas@gmail.com

over 3 years ago

In reply to: Dilip Kumar (#235)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Aug 3, 2022 at 7:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Here is another version of the patch, which closes the smgr at the end
using smgrcloserellocator(). I have also added a commit message.

Hmm, but didn't we decide against doing it that way intentionally? The
comment you're deleting says "If we didn't do this and used the smgr
layer directly, we would have to worry about invalidations."

--
Robert Haas
EDB: http://www.enterprisedb.com

#237

Dilip Kumar

dilipbalaut@gmail.com

over 3 years ago

In reply to: Robert Haas (#236)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, 3 Aug 2022 at 9:22 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Aug 3, 2022 at 7:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Here is another version of the patch, which closes the smgr at the end
using smgrcloserellocator(). I have also added a commit message.

Hmm, but didn't we decide against doing it that way intentionally? The
comment you're deleting says "If we didn't do this and used the smgr
layer directly, we would have to worry about invalidations."

I think we only need to worry if we keep the smgr reference around and try
to reuse it. But in this patch I am not keeping the reference to the smgr.
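
A toy illustration of that distinction, with hypothetical names throughout (`DemoSmgr`, `demo_open()`, `demo_invalidate_all()`, `demo_still_valid()`); it does not model the real smgr cache, only the idea that a pointer held across an invalidation goes stale while a fresh lookup by key does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_CACHE_SIZE 4

/* Toy cache entry; NOT the PostgreSQL struct. */
typedef struct DemoSmgr
{
	bool		open;			/* false once the entry has been closed */
	unsigned	opened_in;		/* invalidation epoch in which it was opened */
} DemoSmgr;

static DemoSmgr demo_cache[DEMO_CACHE_SIZE];
static unsigned demo_epoch = 1;

/* Look the entry up by key, (re)opening it if it is not currently open. */
static DemoSmgr *
demo_open(int key)
{
	DemoSmgr   *entry;

	assert(key >= 0 && key < DEMO_CACHE_SIZE);
	entry = &demo_cache[key];
	if (!entry->open)
	{
		entry->open = true;
		entry->opened_in = demo_epoch;
	}
	return entry;
}

/* Stand-in for an invalidation processed at an interrupt check. */
static void
demo_invalidate_all(void)
{
	for (int i = 0; i < DEMO_CACHE_SIZE; i++)
		demo_cache[i].open = false;
	demo_epoch++;
}

/* True if a pointer obtained earlier may still be used. */
static bool
demo_still_valid(const DemoSmgr *entry)
{
	return entry->open && entry->opened_in == demo_epoch;
}
```

A pointer taken before `demo_invalidate_all()` fails `demo_still_valid()` afterwards, whereas calling `demo_open()` again with the same key always yields a usable entry. In the toy the stale entry is merely marked closed; in a real cache the memory behind a stale pointer could be gone entirely.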

—
Dilip

--

Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#238

Justin Pryzby

pryzby@telsasoft.com

over 3 years ago

In reply to: Dilip Kumar (#235)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Aug 03, 2022 at 04:45:23PM +0530, Dilip Kumar wrote:

Here is another version of the patch, which closes the smgr at the end
using smgrcloserellocator(). I have also added a commit message.

Thanks for providing a patch.
This seems to fix the second problem with accessing freed memory.

But I reproduced the first problem with a handful of tries interrupting the
while loop:

2022-08-03 10:39:50.129 CDT client backend[5530] [unknown] PANIC: could not open critical system index 2662

In the failure, when trying to connect to the new "a" DB, it does this:

[pid 10700] openat(AT_FDCWD, "base/17003/pg_filenode.map", O_RDONLY) = 11
[pid 10700] read(11, "\27'Y\0\21\0\0\0\353\4\0\0\353\4\0\0\341\4\0\0\341\4\0\0\347\4\0\0\347\4\0\0\337\4\0\0\337\4\0\0\24\v\0\0\24\v\0\0\25\v\0\0\25\v\0\0K\20\0\0K\20\0\0L\20\0\0L\20\0\0\202\n\0\0\202\n\0\0\203\n\0\0\203\n\0\0\217\n\0\0\217\n\0\0\220\n\0\0\220\n\0\0b\n\0\0b\n\0\0c\n\0\0c\n\0\0f\n\0\0f\n\0\0g\n\0\0g\n\0\0\177\r\0\0\177\r\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\362\366\252\337", 524) = 524
[pid 10700] close(11) = 0
[pid 10700] openat(AT_FDCWD, "base/17003/pg_internal.init", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 10700] openat(AT_FDCWD, "base/17003/1259", O_RDWR) = 11
[pid 10700] lseek(11, 0, SEEK_END) = 106496
[pid 10700] lseek(11, 0, SEEK_END) = 106496

And then reads nothing but zero bytes from FD 11 (rel 1259/pg_class)
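
For reading the pg_filenode.map bytes in the strace output above, a small decoding sketch may help. It rests on assumptions not stated in this thread: that the file starts with a 32-bit magic number, then a 32-bit mapping count, then (OID, file number) pairs, that the expected magic is 0x592717, and that the machine is little-endian. `RelmapHeaderSketch`, `read_le32()` and `decode_relmap_header()` are hypothetical helpers, not PostgreSQL functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed magic value for the relation map file. */
#define ASSUMED_RELMAP_MAGIC 0x592717

/* First few fields of the map file, under the assumed layout. */
typedef struct RelmapHeaderSketch
{
	uint32_t	magic;				/* file format identifier */
	uint32_t	num_mappings;		/* number of valid (OID, file) pairs */
	uint32_t	first_oid;			/* catalog OID of the first mapping */
	uint32_t	first_filenumber;	/* its on-disk file number */
} RelmapHeaderSketch;

/* Read a 32-bit little-endian value from four bytes. */
static uint32_t
read_le32(const unsigned char *p)
{
	return (uint32_t) p[0] |
		((uint32_t) p[1] << 8) |
		((uint32_t) p[2] << 16) |
		((uint32_t) p[3] << 24);
}

/* Decode the header and first mapping; returns 0 on success, -1 if short. */
static int
decode_relmap_header(const unsigned char *buf, size_t len,
					 RelmapHeaderSketch *out)
{
	assert(buf != NULL && out != NULL);
	if (len < 16)
		return -1;
	out->magic = read_le32(buf);
	out->num_mappings = read_le32(buf + 4);
	out->first_oid = read_le32(buf + 8);
	out->first_filenumber = read_le32(buf + 12);
	return 0;
}
```

Fed the first 16 bytes of the read() above (octal `\27 ' Y \0 \21 \0 \0 \0 \353 \4 \0 \0 \353 \4 \0 \0`), this gives magic 0x592717, 17 mappings, and a first mapping of OID 1259 to file 1259, which matches the subsequent openat() of base/17003/1259. Under these assumptions the map file itself looks sane and the zeroed data is in the relation file.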

So far, I haven't succeeded in eliciting anything useful from valgrind.

--
Justin

#239

Dilip Kumar

dilipbalaut@gmail.com

over 3 years ago

In reply to: Justin Pryzby (#238)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Aug 3, 2022 at 9:32 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Wed, Aug 03, 2022 at 04:45:23PM +0530, Dilip Kumar wrote:

Here is another version of the patch, which closes the smgr at the end
using smgrcloserellocator(). I have also added a commit message.

Thanks for providing a patch.
This seems to fix the second problem with accessing freed memory.

Thanks for the confirmation.

But I reproduced the first problem with a handful of tries interrupting the
while loop:

2022-08-03 10:39:50.129 CDT client backend[5530] [unknown] PANIC: could not open critical system index 2662

In the failure, when trying to connect to the new "a" DB, it does this:

[pid 10700] openat(AT_FDCWD, "base/17003/pg_filenode.map", O_RDONLY) = 11
[pid 10700] read(11, "\27'Y\0\21\0\0\0\353\4\0\0\353\4\0\0\341\4\0\0\341\4\0\0\347\4\0\0\347\4\0\0\337\4\0\0\337\4\0\0\24\v\0\0\24\v\0\0\25\v\0\0\25\v\0\0K\20\0\0K\20\0\0L\20\0\0L\20\0\0\202\n\0\0\202\n\0\0\203\n\0\0\203\n\0\0\217\n\0\0\217\n\0\0\220\n\0\0\220\n\0\0b\n\0\0b\n\0\0c\n\0\0c\n\0\0f\n\0\0f\n\0\0g\n\0\0g\n\0\0\177\r\0\0\177\r\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\362\366\252\337", 524) = 524
[pid 10700] close(11) = 0
[pid 10700] openat(AT_FDCWD, "base/17003/pg_internal.init", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 10700] openat(AT_FDCWD, "base/17003/1259", O_RDWR) = 11
[pid 10700] lseek(11, 0, SEEK_END) = 106496
[pid 10700] lseek(11, 0, SEEK_END) = 106496

And then reads nothing but zero bytes from FD 11 (rel 1259/pg_class)

So far, I haven't succeeded in eliciting anything useful from valgrind.

I tried multiple times but had no luck reproducing this issue.
Do you have logs showing what the last query executed just before ^C
was, and whether it was canceled or ran to completion? Theoretically,
if CREATE DATABASE fails partway through, it should at least clean up
the directory of the newly created database. But there seems to be
corrupted data in that directory, so this does not look like a symptom
of the CREATE DATABASE failure alone, but of some combination of
multiple things. I will put more thought into this tomorrow.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#240

Justin Pryzby

pryzby@telsasoft.com

over 3 years ago

In reply to: Justin Pryzby (#238)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Aug 03, 2022 at 11:02:00AM -0500, Justin Pryzby wrote:

But I reproduced the first problem with a handful of tries interrupting the
while loop:

2022-08-03 10:39:50.129 CDT client backend[5530] [unknown] PANIC: could not open critical system index 2662

...

So far, I haven't succeeded in eliciting anything useful from valgrind.

Now, I've reproduced the problem under valgrind, but it doesn't show anything
useful:

pryzbyj@pryzbyj:~$ while :; do psql -h /tmp template1 -c "DROP DATABASE a" -c "CREATE DATABASE a TEMPLATE postgres STRATEGY wal_log"; done
ERROR: database "a" does not exist
CREATE DATABASE
^CCancel request sent
ERROR: canceling statement due to user request
ERROR: database "a" already exists
^C
pryzbyj@pryzbyj:~$ ^C
pryzbyj@pryzbyj:~$ ^C
pryzbyj@pryzbyj:~$ ^C
pryzbyj@pryzbyj:~$ psql -h /tmp a -c ''
2022-08-03 11:57:39.178 CDT client backend[31321] [unknown] PANIC: could not open critical system index 2662
psql: error: falló la conexión al servidor en el socket «/tmp/.s.PGSQL.5432»: PANIC: could not open critical system index 2662

On the server process, nothing interesting but the backtrace (the error was
before this, while writing the relation file, but there's nothing suspicious).

2022-08-03 11:08:06.628 CDT client backend[2841] [unknown] PANIC: could not open critical system index 2662
==2841==
==2841== Process terminating with default action of signal 6 (SIGABRT)
==2841== at 0x5FBBE97: raise (raise.c:51)
==2841== by 0x5FBD800: abort (abort.c:79)
==2841== by 0x2118DEF: errfinish (elog.c:675)
==2841== by 0x20F6002: load_critical_index (relcache.c:4328)
==2841== by 0x20F727A: RelationCacheInitializePhase3 (relcache.c:4103)
==2841== by 0x213DFA5: InitPostgres (postinit.c:1087)
==2841== by 0x1D20D72: PostgresMain (postgres.c:4081)
==2841== by 0x1B15CFD: BackendRun (postmaster.c:4490)
==2841== by 0x1B1D6E1: BackendStartup (postmaster.c:4218)
==2841== by 0x1B1DD59: ServerLoop (postmaster.c:1808)
==2841== by 0x1B1F86D: PostmasterMain (postmaster.c:1480)
==2841== by 0x17B7150: main (main.c:197)

Below, I reproduced it without valgrind (and without LANG):

pryzbyj@pryzbyj:~/src/postgres$ while :; do psql -qh /tmp template1 -c "DROP DATABASE a" -c "CREATE DATABASE a TEMPLATE postgres STRATEGY wal_log"; done
2022-08-03 11:59:52.675 CDT checkpointer[1881] LOG: checkpoint starting: immediate force wait
2022-08-03 11:59:52.862 CDT checkpointer[1881] LOG: checkpoint complete: wrote 4 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.045 s, sync=0.038 s, total=0.188 s; sync files=3, longest=0.019 s, average=0.013 s; distance=3 kB, estimate=3 kB; lsn=0/24862508, redo lsn=0/248624D0
2022-08-03 11:59:53.213 CDT checkpointer[1881] LOG: checkpoint starting: immediate force wait
2022-08-03 11:59:53.409 CDT checkpointer[1881] LOG: checkpoint complete: wrote 4 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.030 s, sync=0.054 s, total=0.196 s; sync files=4, longest=0.029 s, average=0.014 s; distance=4042 kB, estimate=4042 kB; lsn=0/24C54D88, redo lsn=0/24C54D50
^CCancel request sent
2022-08-03 11:59:53.750 CDT checkpointer[1881] LOG: checkpoint starting: immediate force wait
2022-08-03 11:59:53.930 CDT checkpointer[1881] LOG: checkpoint complete: wrote 4 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.029 s, sync=0.027 s, total=0.181 s; sync files=4, longest=0.022 s, average=0.007 s; distance=4042 kB, estimate=4042 kB; lsn=0/250476D0, redo lsn=0/25047698
2022-08-03 11:59:54.270 CDT checkpointer[1881] LOG: checkpoint starting: immediate force wait
^C2022-08-03 11:59:54.294 CDT client backend[1903] psql ERROR: canceling statement due to user request
2022-08-03 11:59:54.294 CDT client backend[1903] psql STATEMENT: DROP DATABASE a
Cancel request sent
ERROR: canceling statement due to user request
2022-08-03 11:59:54.296 CDT client backend[1903] psql ERROR: database "a" already exists
2022-08-03 11:59:54.296 CDT client backend[1903] psql STATEMENT: CREATE DATABASE a TEMPLATE postgres STRATEGY wal_log
ERROR: database "a" already exists
^C
pryzbyj@pryzbyj:~/src/postgres$ ^C
pryzbyj@pryzbyj:~/src/postgres$ ^C
pryzbyj@pryzbyj:~/src/postgres$ 2022-08-03 11:59:54.427 CDT checkpointer[1881] LOG: checkpoint complete: wrote 4 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.024 s, sync=0.036 s, total=0.158 s; sync files=4, longest=0.027 s, average=0.009 s; distance=4042 kB, estimate=4042 kB; lsn=0/2543A108, redo lsn=0/2543A0A8
^C
pryzbyj@pryzbyj:~/src/postgres$ ^C
pryzbyj@pryzbyj:~/src/postgres$ ^C
pryzbyj@pryzbyj:~/src/postgres$ psql -h /tmp a -c ''
2022-08-03 11:59:56.617 CDT client backend[1914] [unknown] PANIC: could not open critical system index 2662

#241

Andres Freund

andres@anarazel.de

over 3 years ago

In reply to: Justin Pryzby (#240)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hi,

On 2022-08-03 12:01:18 -0500, Justin Pryzby wrote:

Now, I've reproduced the problem under valgrind, but it doesn't show anything
useful

Yea, that looks like an issue on a different level.

pryzbyj@pryzbyj:~$ while :; do psql -h /tmp template1 -c "DROP DATABASE a" -c "CREATE DATABASE a TEMPLATE postgres STRATEGY wal_log"; done
ERROR: database "a" does not exist
CREATE DATABASE
^CCancel request sent
ERROR: canceling statement due to user request
ERROR: database "a" already exists
^C

Hm. This looks more like an issue of DROP DATABASE not being interruptible. I
suspect this isn't actually related to STRATEGY wal_log and could likely be
reproduced in older versions too.

It's pretty obvious that dropdb() isn't safe against being interrupted. We
delete the data before we have committed the deletion of the pg_database
entry.

Seems like we should hold interrupts across the remove_dbtablespaces() until
*after* we've committed the transaction?

Greetings,

Andres Freund

#242

Justin Pryzby

pryzby@telsasoft.com

over 3 years ago

In reply to: Andres Freund (#241)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Aug 03, 2022 at 11:26:43AM -0700, Andres Freund wrote:

Hm. This looks more like an issue of DROP DATABASE not being interruptible. I
suspect this isn't actually related to STRATEGY wal_log and could likely be
reproduced in older versions too.

I couldn't reproduce it with file_copy, but my recipe isn't exactly reliable.
That may just mean that it's easier to hit now.

--
Justin

#243

Dilip Kumar

dilipbalaut@gmail.com

over 3 years ago

In reply to: Justin Pryzby (#242)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Aug 4, 2022 at 12:18 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Wed, Aug 03, 2022 at 11:26:43AM -0700, Andres Freund wrote:

Hm. This looks more like an issue of DROP DATABASE not being interruptible. I
suspect this isn't actually related to STRATEGY wal_log and could likely be
reproduced in older versions too.

I couldn't reproduce it with file_copy, but my recipe isn't exactly reliable.
That may just mean that it's easier to hit now.

I think this looks like a problem with DROP DATABASE, but IMHO you are
seeing this behavior only when the database was created using WAL_LOG,
because that strategy writes the destination database pages through
shared buffers, so some dirty buffers and sync requests might still be
pending. When we then try to drop the database, it discards all the
dirty buffers and all pending sync requests, and if it gets interrupted
before it actually removes the directory, you are left with a database
directory on disk that is partially corrupted. See the sequence in
dropdb() below:

dropdb()
{
...
DropDatabaseBuffers(db_id);
...
ForgetDatabaseSyncRequests(db_id);
...
RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);

WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));
-- Inside this it can process the cancel query and get interrupted
remove_dbtablespaces(db_id);
..
}

I reproduced the same error by injecting an error just before
WaitForProcSignalBarrier.

postgres[14968]=# CREATE DATABASE a STRATEGY WAL_LOG ; drop database a;
CREATE DATABASE
ERROR: XX000: test error
LOCATION: dropdb, dbcommands.c:1684
postgres[14968]=# \c a
connection to server on socket "/tmp/.s.PGSQL.5432" failed: PANIC:
could not open critical system index 2662
Previous connection kept
postgres[14968]=#

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#244

Dilip Kumar

dilipbalaut@gmail.com

over 3 years ago

In reply to: Dilip Kumar (#243)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Aug 4, 2022 at 9:41 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Aug 4, 2022 at 12:18 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Wed, Aug 03, 2022 at 11:26:43AM -0700, Andres Freund wrote:

Hm. This looks more like an issue of DROP DATABASE not being interruptible. I
suspect this isn't actually related to STRATEGY wal_log and could likely be
reproduced in older versions too.

I couldn't reproduce it with file_copy, but my recipe isn't exactly reliable.
That may just mean that it's easier to hit now.

I think this looks like a problem with DROP DATABASE, but IMHO you are
seeing this behavior only when the database was created using WAL_LOG,
because that strategy writes the destination database pages through
shared buffers, so some dirty buffers and sync requests might still be
pending. When we then try to drop the database, it discards all the
dirty buffers and all pending sync requests, and if it gets interrupted
before it actually removes the directory, you are left with a database
directory on disk that is partially corrupted. See the sequence in
dropdb() below:

dropdb()
{
...
DropDatabaseBuffers(db_id);
...
ForgetDatabaseSyncRequests(db_id);
...
RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);

WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));
-- Inside this it can process the cancel query and get interrupted
remove_dbtablespaces(db_id);
..
}

I reproduced the same error by injecting an error just before
WaitForProcSignalBarrier.

postgres[14968]=# CREATE DATABASE a STRATEGY WAL_LOG ; drop database a;
CREATE DATABASE
ERROR: XX000: test error
LOCATION: dropdb, dbcommands.c:1684
postgres[14968]=# \c a
connection to server on socket "/tmp/.s.PGSQL.5432" failed: PANIC:
could not open critical system index 2662
Previous connection kept
postgres[14968]=#

So basically, from this we can say it is entirely a problem with
DROP DATABASE; I can produce all sorts of broken states by interrupting
it:
1. If we created some tables or inserted data and the DROP DATABASE
then got cancelled, the database directory might still exist but those
objects are lost.
2. Or the database directory can be removed and the command cancelled
before the pg_database entry is deleted; then you also end up with a
corrupted database, regardless of whether it was created with WAL_LOG
or FILE_COPY.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#245

Robert Haas

robertmhaas@gmail.com

over 3 years ago

In reply to: Dilip Kumar (#235)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Aug 3, 2022 at 7:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Another version of the patch which closes the smgr at the end using
smgrcloserellocator() and I have also added a commit message.

I have reviewed this patch and I don't see a problem with it. However,
it would be nice if Andres or someone else who understands this area
well (Tom? Thomas?) would also review it, because I also reviewed
what's in the tree now and that turns out to be buggy, which leads me
to conclude that I don't understand this area as well as would be
desirable.

I'm inclined to hold off on committing this until next week, not only
for that reason, but also because there's a wrap planned on Monday,
and committing anything now seems like it might have too much of a
risk of upsetting that plan.

--
Robert Haas
EDB: http://www.enterprisedb.com

#246

Justin Pryzby

pryzby@telsasoft.com

over 3 years ago

In reply to: Robert Haas (#245)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Aug 04, 2022 at 04:07:01PM -0400, Robert Haas wrote:

I'm inclined to hold off on committing this until next week, not only

I don't see any reason to hurry to fix problems that occur when DROP DATABASE
is interrupted.

Sorry to beat up your patches so much and for such crappy test cases^C

--
Justin

#247

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Justin Pryzby (#246)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Justin Pryzby <pryzby@telsasoft.com> writes:

On Thu, Aug 04, 2022 at 04:07:01PM -0400, Robert Haas wrote:

I'm inclined to hold off on committing this until next week, not only

+1

+1 ... there are some other v15 open items that I don't think we'll
see fixed for beta3, either.

regards, tom lane

#248

Andres Freund

andres@anarazel.de

over 3 years ago

In reply to: Dilip Kumar (#235)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hi,

On 2022-08-03 16:45:23 +0530, Dilip Kumar wrote:

Another version of the patch which closes the smgr at the end using
smgrcloserellocator() and I have also added a commit message.

What's the motivation behind the explicit close?

@@ -258,8 +258,8 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
Page page;
List *rlocatorlist = NIL;
LockRelId relid;
- Relation rel;
Snapshot snapshot;
+ SMgrRelation smgr;
BufferAccessStrategy bstrategy;

/* Get pg_class relfilenumber. */
@@ -276,16 +276,9 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
rlocator.dbOid = dbid;
rlocator.relNumber = relfilenumber;
-	/*
-	 * We can't use a real relcache entry for a relation in some other
-	 * database, but since we're only going to access the fields related to
-	 * physical storage, a fake one is good enough. If we didn't do this and
-	 * used the smgr layer directly, we would have to worry about
-	 * invalidations.
-	 */
-	rel = CreateFakeRelcacheEntry(rlocator);
-	nblocks = smgrnblocks(RelationGetSmgr(rel), MAIN_FORKNUM);
-	FreeFakeRelcacheEntry(rel);
+	smgr = smgropen(rlocator, InvalidBackendId);
+	nblocks = smgrnblocks(smgr, MAIN_FORKNUM);
+	smgrclose(smgr);

Why are you opening and then closing again? Part of the motivation for the
question is that a local SMgrRelation variable may lead to it being used
further, opening up interrupt processing issues.

+	rlocator.locator = src_rlocator;
+	smgrcloserellocator(rlocator);
+
+	rlocator.locator = dst_rlocator;
+	smgrcloserellocator(rlocator);

As mentioned above, it's not clear to me why this is now done...

Otherwise looks good to me.

Greetings,

Andres Freund

#249

Andres Freund

andres@anarazel.de

over 3 years ago

In reply to: Robert Haas (#245)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hi,

On 2022-08-04 16:07:01 -0400, Robert Haas wrote:

On Wed, Aug 3, 2022 at 7:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Another version of the patch which closes the smgr at the end using
smgrcloserellocator() and I have also added a commit message.

I have reviewed this patch and I don't see a problem with it. However,
it would be nice if Andres or someone else who understands this area
well (Tom? Thomas?) would also review it, because I also reviewed
what's in the tree now and that turns out to be buggy, which leads me
to conclude that I don't understand this area as well as would be
desirable.

I don't think this issue is something I'd have caught "originally"
either. It's probably more of a "fake relcache entry" issue (or undocumented
limitation) than a bug in the new code.

I did a quick review upthread - some minor quibbles aside, I think it looks
good.

I'm inclined to hold off on committing this until next week, not only
for that reason, but also because there's a wrap planned on Monday,
and committing anything now seems like it might have too much of a
risk of upsetting that plan.

Makes sense. Unlikely to be a blocker for anybody.

Greetings,

Andres Freund

#250

Andres Freund

andres@anarazel.de

over 3 years ago

In reply to: Dilip Kumar (#244)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hi,

On 2022-08-04 16:38:35 +0530, Dilip Kumar wrote:

So basically, from this we can say it is entirely a problem with
DROP DATABASE; I can produce all sorts of broken states by interrupting
it:
1. If we created some tables or inserted data and the DROP DATABASE
then got cancelled, the database directory might still exist but those
objects are lost.
2. Or the database directory can be removed and the command cancelled
before the pg_database entry is deleted; then you also end up with a
corrupted database, regardless of whether it was created with WAL_LOG
or FILE_COPY.

Yea. I think at the very least we need to start holding interrupts before the
DropDatabaseBuffers() and do so until commit. That's probably best done by
doing the transaction commit inside dropdb.
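
As an illustration of the holdoff idea, here is a minimal, self-contained
toy model in C. It is not PostgreSQL code: holdoff_count only mimics the
role of InterruptHoldoffCount behind HOLD_INTERRUPTS() and
RESUME_INTERRUPTS(), and ToyDb, toy_dropdb() and the other names are
hypothetical. The sketch assumes a cancel arrives right after the buffers
have been discarded.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for the backend's interrupt state; not the real globals. */
static int  holdoff_count = 0;       /* plays the role of InterruptHoldoffCount */
static bool cancel_pending = false;  /* a query cancel has been requested */

static void hold_interrupts(void)
{
    holdoff_count++;
}

static void resume_interrupts(void)
{
    assert(holdoff_count > 0);
    holdoff_count--;
}

/* Returns true if a pending cancel would be acted on at this point. */
static bool check_for_interrupts(void)
{
    if (cancel_pending && holdoff_count == 0)
    {
        cancel_pending = false;
        return true;
    }
    return false;
}

typedef struct ToyDb
{
    bool catalog_row;  /* pg_database entry still visible */
    bool data_valid;   /* buffered and on-disk contents still usable */
} ToyDb;

/*
 * Hypothetical drop sequence. With hold set, interrupts are held from
 * before the data is discarded until the catalog change is committed.
 * Returns true if the drop completed, false if it was cancelled midway.
 */
static bool toy_dropdb(ToyDb *db, bool hold, bool cancel_arrives)
{
    if (hold)
        hold_interrupts();

    db->data_valid = false;           /* models DropDatabaseBuffers() and friends */
    cancel_pending = cancel_arrives;  /* the user hits ^C here */

    if (check_for_interrupts())
        return false;                 /* error out: row remains, data is gone */

    db->catalog_row = false;          /* commit of the pg_database deletion */

    if (hold)
        resume_interrupts();
    (void) check_for_interrupts();    /* a held cancel is serviced only now */
    return true;
}
```

Without the holdoff, the cancelled run leaves catalog_row set while
data_valid is already false, which corresponds to the "could not open
critical system index" state reported above. With the holdoff, the cancel
is noticed only after the commit.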

Greetings,

Andres Freund

#251

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Robert Haas (#245)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Robert Haas <robertmhaas@gmail.com> writes:

I have reviewed this patch and I don't see a problem with it. However,
it would be nice if Andres or someone else who understands this area
well (Tom? Thomas?) would also review it, because I also reviewed
what's in the tree now and that turns out to be buggy, which leads me
to conclude that I don't understand this area as well as would be
desirable.

FWIW, I approve of getting rid of the use of CreateFakeRelcacheEntry
here, because I do not think that mechanism is meant to be used
outside of WAL replay. However, this patch fails to remove it from
CreateAndCopyRelationData, which seems likely to be just as much
at risk.

The "invalidation" comment bothered me for awhile, but I think it's
fine: we know that no other backend can connect to the source DB
because we have it locked, and we know that no other backend can
connect to the destination DB because it doesn't exist yet according
to the catalogs, so nothing could possibly occur to invalidate our
idea of where the physical files are. It would be nice to document
these assumptions, though, rather than merely remove all the relevant
commentary.

While I'm at it, I would like to strenuously object to the current
framing of CreateAndCopyRelationData as a general-purpose copying
mechanism. Because of the above assumptions, I think it's utterly
unsafe to use anywhere except in CREATE DATABASE. The header comment
fails to warn about that at all, and placing it in bufmgr.c rather
than static in dbcommands.c is just an invitation to future misuse.
Perhaps I'm overly sensitive to that because I just finished cleaning
up somebody's misuse of non-general-purpose code (1aa8dad41), but
as this stands I think it's positively dangerous.

regards, tom lane

#252

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Andres Freund (#250)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Andres Freund <andres@anarazel.de> writes:

Yea. I think at the very least we need to start holding interrupts before the
DropDatabaseBuffers() and do so until commit. That's probably best done by
doing the transaction commit inside dropdb.

We've talked before about ignoring interrupts across commit, but
I find the idea a bit scary. In any case, DROP DATABASE is far
from the only place with a problem.

regards, tom lane

#253

Justin Pryzby

pryzby@telsasoft.com

over 3 years ago

In reply to: Tom Lane (#251)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Aug 04, 2022 at 06:02:50PM -0400, Tom Lane wrote:

The "invalidation" comment bothered me for awhile, but I think it's
fine: we know that no other backend can connect to the source DB
because we have it locked,

About that - is it any problem that the currently-connected db can be used as a
template? It's no issue for 2-phase commit, because "create database" cannot
run in a txn.

--
Justin

#254

Andres Freund

andres@anarazel.de

over 3 years ago

In reply to: Tom Lane (#252)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hi,

On 2022-08-04 18:05:25 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

Yea. I think at the very least we need to start holding interrupts before the
DropDatabaseBuffers() and do so until commit. That's probably best done by
doing the transaction commit inside dropdb.

We've talked before about ignoring interrupts across commit, but
I find the idea a bit scary.

I'm not actually suggesting to do so across commit, just until the
commit. Maintaining that seems easiest if dropdb() does the commit internally.

In any case, DROP DATABASE is far from the only place with a problem.

What other place has a database corrupting potential of this magnitude just
because interrupts are accepted? We throw valid s_b contents away and then
accept interrupts before committing - with predictable results. We also accept
interrupts as part of deleting the db data dir (due to catalog access).

Greetings,

Andres Freund

#255

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Tom Lane (#251)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

I wrote:

While I'm at it, I would like to strenuously object to the current
framing of CreateAndCopyRelationData as a general-purpose copying
mechanism.

And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer
not completely broken?

/* Read block from source relation. */
srcBuf = ReadBufferWithoutRelcache(src->rd_locator, forkNum, blkno,
RBM_NORMAL, bstrategy_src,
permanent);
srcPage = BufferGetPage(srcBuf);
if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
{
ReleaseBuffer(srcBuf);
continue;
}

/* Use P_NEW to extend the destination relation. */
dstBuf = ReadBufferWithoutRelcache(dst->rd_locator, forkNum, P_NEW,
RBM_NORMAL, bstrategy_dst,
permanent);

You can't skip pages just because they are empty. Well, maybe you could
if you were doing something to ensure that you zero-fill the corresponding
blocks on the destination side. But this isn't doing that. It's using
P_NEW for dstBuf, which will have the effect of silently collapsing out
such pages. Maybe in isolation a heap could withstand that, but its
indexes won't be happy (and I guess t_ctid chain links won't either).

I think you should just lose the if() stanza. There's no optimization to
be had here that's worth any extra complication.

(This seems worth fixing before beta3, as it looks like a rather
nasty data corruption hazard.)
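
The collapsing effect can be shown with a small, self-contained toy model
in C. Nothing here is PostgreSQL code: ToyRel, toy_extend() and toy_copy()
are hypothetical names, toy_extend() merely imitates "append at the end"
as P_NEW does, and a payload of 0 stands for an empty or new page.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BLOCKS 16

typedef struct ToyRel
{
    int nblocks;
    int payload[MAX_BLOCKS];  /* 0 models an empty or new page */
} ToyRel;

/* Append one page at the end, as extending with P_NEW would; returns its block number. */
static int toy_extend(ToyRel *rel, int payload)
{
    assert(rel->nblocks < MAX_BLOCKS);
    rel->payload[rel->nblocks] = payload;
    return rel->nblocks++;
}

/*
 * Copy src into dst. With skip_empty set, empty pages are skipped the way
 * the quoted if() stanza does. Returns the number of pages that landed at
 * a different block number than they had in the source.
 */
static int toy_copy(const ToyRel *src, ToyRel *dst, bool skip_empty)
{
    int misplaced = 0;

    dst->nblocks = 0;
    for (int blkno = 0; blkno < src->nblocks; blkno++)
    {
        if (skip_empty && src->payload[blkno] == 0)
            continue;
        if (toy_extend(dst, src->payload[blkno]) != blkno)
            misplaced++;  /* the point where a block-number check would fire */
    }
    return misplaced;
}
```

For a four-block source whose block 1 is empty, the skipping variant
produces a three-block destination in which the last two pages have each
moved down by one, so any index entry or t_ctid link that names them by
block number would point at the wrong page.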

regards, tom lane

#256

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Tom Lane (#255)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

I wrote:

And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer
not completely broken?

[pile^2] Also, what is the rationale for locking the target buffer
but not the source buffer? That seems pretty hard to justify from
here, even granting the assumption that we don't expect any other
processes to be interested in these buffers (which I don't grant,
because checkpointer).

regards, tom lane

#257

Andres Freund

andres@anarazel.de

over 3 years ago

In reply to: Tom Lane (#255)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hi,

On 2022-08-04 19:01:06 -0400, Tom Lane wrote:

And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer
not completely broken?

/* Read block from source relation. */
srcBuf = ReadBufferWithoutRelcache(src->rd_locator, forkNum, blkno,
RBM_NORMAL, bstrategy_src,
permanent);
srcPage = BufferGetPage(srcBuf);
if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
{
ReleaseBuffer(srcBuf);
continue;
}

/* Use P_NEW to extend the destination relation. */
dstBuf = ReadBufferWithoutRelcache(dst->rd_locator, forkNum, P_NEW,
RBM_NORMAL, bstrategy_dst,
permanent);

You can't skip pages just because they are empty. Well, maybe you could
if you were doing something to ensure that you zero-fill the corresponding
blocks on the destination side. But this isn't doing that. It's using
P_NEW for dstBuf, which will have the effect of silently collapsing out
such pages. Maybe in isolation a heap could withstand that, but its
indexes won't be happy (and I guess t_ctid chain links won't either).

I think you should just lose the if() stanza. There's no optimization to
be had here that's worth any extra complication.

(This seems worth fixing before beta3, as it looks like a rather
nasty data corruption hazard.)

Ugh, yes. And even with this fixed I think this should grow at least an
assertion that the block numbers match, probably even an elog.

Greetings,

Andres

#258

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Andres Freund (#254)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Andres Freund <andres@anarazel.de> writes:

On 2022-08-04 18:05:25 -0400, Tom Lane wrote:

In any case, DROP DATABASE is far from the only place with a problem.

What other place has a database corrupting potential of this magnitude just
because interrupts are accepted? We throw valid s_b contents away and then
accept interrupts before committing - with predictable results. We also accept
interrupts as part of deleting the db data dir (due to catalog access).

Those things would be better handled by moving the data-discarding
steps to post-commit. Maybe that argues for having an internal
commit halfway through DROP DATABASE: remove pg_database row,
commit, start new transaction, clean up.
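
To compare the two orderings, here is a hedged toy model in C that
enumerates where an interrupt could land. It is only a sketch: ToyState
and the function names are hypothetical, each ordering is reduced to two
steps, and "corrupt" is taken to mean that a committed pg_database row
still exists while the data has already been discarded.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct ToyState
{
    bool catalog_row;   /* committed pg_database entry exists */
    bool files_intact;  /* data directory and buffers are complete */
} ToyState;

/* A database is "connectable but broken" if the row exists but the data does not. */
static bool is_corrupt(ToyState s)
{
    return s.catalog_row && !s.files_intact;
}

/*
 * Run a two-step drop, stopping after steps_done steps (0, 1 or 2) to
 * model an interrupt. commit_first selects the ordering suggested above:
 * remove the row and commit, then clean up.
 */
static ToyState toy_drop(bool commit_first, int steps_done)
{
    ToyState s = { true, true };

    assert(steps_done >= 0 && steps_done <= 2);
    for (int step = 0; step < steps_done; step++)
    {
        bool do_commit = commit_first ? (step == 0) : (step == 1);

        if (do_commit)
            s.catalog_row = false;
        else
            s.files_intact = false;
    }
    return s;
}

/* Count interruption points that leave a corrupt, still-visible database. */
static int count_corrupt_outcomes(bool commit_first)
{
    int n = 0;

    for (int steps_done = 0; steps_done <= 2; steps_done++)
        if (is_corrupt(toy_drop(commit_first, steps_done)))
            n++;
    return n;
}
```

In this model the discard-then-commit ordering has one corrupting
interruption point, while the commit-first ordering has none: its worst
case is a removed row with files still on disk, which wastes space but
cannot be connected to.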

regards, tom lane

#259

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Andres Freund (#257)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Andres Freund <andres@anarazel.de> writes:

On 2022-08-04 19:01:06 -0400, Tom Lane wrote:

(This seems worth fixing before beta3, as it looks like a rather
nasty data corruption hazard.)

Ugh, yes. And even with this fixed I think this should grow at least an
assertion that the block numbers match, probably even an elog.

Yeah, the assumption that P_NEW would automatically match the source block
was making me itchy too. An explicit test-and-elog seems worth the
cycles.

regards, tom lane

#260

Andres Freund

andres@anarazel.de

over 3 years ago

In reply to: Tom Lane (#256)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hi,

On August 4, 2022 4:11:13 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer
not completely broken?

[pile^2] Also, what is the rationale for locking the target buffer
but not the source buffer? That seems pretty hard to justify from
here, even granting the assumption that we don't expect any other
processes to be interested in these buffers (which I don't grant,
because checkpointer).

I'm not arguing it's good or should stay that way, but it's probably okayish that checkpointer / bgwriter have access, given that they will never modify buffers. They just take a lock to prevent concurrent modifications, which RelationCopyStorageUsingBuffer hopefully doesn't do.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

#261

Andres Freund

andres@anarazel.de

over 3 years ago

In reply to: Tom Lane (#259)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hi,

On August 4, 2022 4:20:16 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Yeah, the assumption that P_NEW would automatically match the source block
was making me itchy too. An explicit test-and-elog seems worth the
cycles.

Is there a good reason to rely on P_NEW at all? Both from an efficiency and robustness POV it seems like it'd be better to use smgrextend to bulk extend just after smgrcreate() and then fill s_b using RBM_ZERO (or whatever it is called). That bulk smgrextend would later be a good point to use fallocate so the FS can immediately size the file correctly, without a lot of pointless writes of zeroes.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

#262

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Andres Freund (#260)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Andres Freund <andres@anarazel.de> writes:

On August 4, 2022 4:11:13 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:

[pile^2] Also, what is the rationale for locking the target buffer
but not the source buffer? That seems pretty hard to justify from
here, even granting the assumption that we don't expect any other
processes to be interested in these buffers (which I don't grant,
because checkpointer).

I'm not arguing it's good or should stay that way, but it's probably okayish that checkpointer / bgwriter have access, given that they will never modify buffers. They just take a lock to prevent concurrent modifications, which RelationCopyStorageUsingBuffer hopefully doesn't do.

I'm not arguing that it's actively broken today --- but AFAIR,
every other access to a shared buffer takes a buffer lock.
It does not seem to me to be very future-proof for this code to
decide it's exempt from that rule, without so much as a comment
justifying it. Furthermore, what's the gain? We aren't expecting
contention here, I think. If we were, then it probably *would* be
actively broken.

regards, tom lane

#263

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Andres Freund (#261)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Andres Freund <andres@anarazel.de> writes:

Is there a good reason to rely on P_NEW at all? Both from an efficiency
and robustness POV it seems like it'd be better to use smgrextend to
bulk extend just after smgrcreate() and then fill s_b using RBM_ZERO
(or whatever it is called). That bulk smgrextend would later be a good
point to use fallocate so the FS can immediately size the file
correctly, without a lot of pointless writes of zeroes.

Hmm ... makes sense. We'd be using smgrextend to write just the last page
of the file, relying on its API spec "Note that we assume writing a block
beyond current EOF causes intervening file space to become filled with
zeroes". I don't know that we're using that assumption aggressively
today, but as long as it doesn't confuse the kernel it'd probably be fine.
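
The file-system behaviour that this relies on can be checked outside
PostgreSQL with plain POSIX calls. The following is a minimal sketch, not
smgr code: extend_by_writing_last_block() and TOY_BLCKSZ are hypothetical
names, the 8192-byte block size is the PostgreSQL default, and a writable
/tmp is assumed.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define TOY_BLCKSZ 8192  /* PostgreSQL's default block size */

/*
 * Create a temporary file, write only block last_blkno, and check that the
 * file size and every earlier block match the zero-fill assumption.
 * Returns true on success, false on any I/O error or mismatch.
 */
static bool extend_by_writing_last_block(int last_blkno)
{
    char        path[] = "/tmp/toy_extend_XXXXXX";
    char        block[TOY_BLCKSZ];
    char        zeroes[TOY_BLCKSZ];
    struct stat st;
    bool        ok = false;
    int         fd;

    assert(last_blkno >= 0);

    fd = mkstemp(path);
    if (fd < 0)
        return false;
    unlink(path);  /* the file goes away once fd is closed */

    memset(block, 0x5A, sizeof(block));
    memset(zeroes, 0, sizeof(zeroes));

    /* Write a single block well past the current EOF. */
    if (pwrite(fd, block, sizeof(block),
               (off_t) last_blkno * TOY_BLCKSZ) != (ssize_t) sizeof(block))
        goto done;

    /* The file should now span last_blkno + 1 blocks. */
    if (fstat(fd, &st) != 0 ||
        st.st_size != (off_t) (last_blkno + 1) * TOY_BLCKSZ)
        goto done;

    /* Every block before the written one should read back as zeroes. */
    for (int blkno = 0; blkno < last_blkno; blkno++)
    {
        if (pread(fd, block, sizeof(block),
                  (off_t) blkno * TOY_BLCKSZ) != (ssize_t) sizeof(block))
            goto done;
        if (memcmp(block, zeroes, sizeof(block)) != 0)
            goto done;
    }
    ok = true;

done:
    close(fd);
    return ok;
}
```

Calling extend_by_writing_last_block(12) exercises a 13-block file, that
is 13 * 8192 = 106496 bytes, the size lseek reported for pg_class earlier
in the thread. Whether the intervening range is stored sparsely or as real
zero blocks is up to the file system, which is where fallocate would come
in.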

regards, tom lane

#264

Robert Haas

robertmhaas@gmail.com

over 3 years ago

In reply to: Tom Lane (#251)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Aug 4, 2022 at 6:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

I have reviewed this patch and I don't see a problem with it. However,
it would be nice if Andres or someone else who understands this area
well (Tom? Thomas?) would also review it, because I also reviewed
what's in the tree now and that turns out to be buggy, which leads me
to conclude that I don't understand this area as well as would be
desirable.

FWIW, I approve of getting rid of the use of CreateFakeRelcacheEntry
here, because I do not think that mechanism is meant to be used
outside of WAL replay. However, this patch fails to remove it from
CreateAndCopyRelationData, which seems likely to be just as much
at risk.

It looks to me like it does?

The "invalidation" comment bothered me for awhile, but I think it's
fine: we know that no other backend can connect to the source DB
because we have it locked, and we know that no other backend can
connect to the destination DB because it doesn't exist yet according
to the catalogs, so nothing could possibly occur to invalidate our
idea of where the physical files are. It would be nice to document
these assumptions, though, rather than merely remove all the relevant
commentary.

I don't think that's the point. We could always suffer a sinval reset
or a PROCSIGNAL_BARRIER_SMGRRELEASE. But since the code avoids ever
reusing the smgr, it should be OK. I think.

While I'm at it, I would like to strenuously object to the current
framing of CreateAndCopyRelationData as a general-purpose copying
mechanism. Because of the above assumptions, I think it's utterly
unsafe to use anywhere except in CREATE DATABASE. The header comment
fails to warn about that at all, and placing it in bufmgr.c rather
than static in dbcommands.c is just an invitation to future misuse.
Perhaps I'm overly sensitive to that because I just finished cleaning
up somebody's misuse of non-general-purpose code (1aa8dad41), but
as this stands I think it's positively dangerous.

OK. No objection to you revising the comments however you feel is appropriate.

--
Robert Haas
EDB: http://www.enterprisedb.com

#265

Robert Haas

robertmhaas@gmail.com

over 3 years ago

In reply to: Tom Lane (#255)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Aug 4, 2022 at 7:01 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer
not completely broken?

Ouch. That's pretty bad.

--
Robert Haas
EDB: http://www.enterprisedb.com

#266

Robert Haas

robertmhaas@gmail.com

over 3 years ago

In reply to: Tom Lane (#256)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Aug 4, 2022 at 7:11 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

[pile^2] Also, what is the rationale for locking the target buffer
but not the source buffer? That seems pretty hard to justify from
here, even granting the assumption that we don't expect any other
processes to be interested in these buffers (which I don't grant,
because checkpointer).

Ooph. I agree.

--
Robert Haas
EDB: http://www.enterprisedb.com

#267

Dilip Kumar

dilipbalaut@gmail.com

over 3 years ago

In reply to: Tom Lane (#255)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Fri, Aug 5, 2022 at 4:31 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

While I'm at it, I would like to strenuously object to the current
framing of CreateAndCopyRelationData as a general-purpose copying
mechanism.

And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer
not completely broken?

    /* Read block from source relation. */
    srcBuf = ReadBufferWithoutRelcache(src->rd_locator, forkNum, blkno,
                                       RBM_NORMAL, bstrategy_src,
                                       permanent);
    srcPage = BufferGetPage(srcBuf);
    if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
    {
        ReleaseBuffer(srcBuf);
        continue;
    }

    /* Use P_NEW to extend the destination relation. */
    dstBuf = ReadBufferWithoutRelcache(dst->rd_locator, forkNum, P_NEW,
                                       RBM_NORMAL, bstrategy_dst,
                                       permanent);

You can't skip pages just because they are empty. Well, maybe you could
if you were doing something to ensure that you zero-fill the corresponding
blocks on the destination side. But this isn't doing that. It's using
P_NEW for dstBuf, which will have the effect of silently collapsing out
such pages. Maybe in isolation a heap could withstand that, but its
indexes won't be happy (and I guess t_ctid chain links won't either).

I think you should just lose the if() stanza. There's no optimization to
be had here that's worth any extra complication.

(This seems worth fixing before beta3, as it looks like a rather
nasty data corruption hazard.)

Yeah this is broken.
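
To make the failure mode concrete, here is a toy model of the two copy loops. It is a sketch only: ToyPage, ToyRel and the toy_* functions are hypothetical names, the "relation" is just an in-memory array, and appending to the destination stands in for P_NEW.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_BLOCKS 16

typedef struct ToyPage
{
	bool		empty;			/* stand-in for PageIsNew/PageIsEmpty */
	int			payload;
} ToyPage;

typedef struct ToyRel
{
	size_t		nblocks;
	ToyPage		pages[TOY_MAX_BLOCKS];
} ToyRel;

/* Append-style copy that skips empty pages, like the broken loop. */
static void
toy_copy_skipping_empty(const ToyRel *src, ToyRel *dst)
{
	dst->nblocks = 0;
	for (size_t blkno = 0; blkno < src->nblocks; blkno++)
	{
		if (src->pages[blkno].empty)
			continue;			/* later pages silently shift down */
		dst->pages[dst->nblocks++] = src->pages[blkno];
	}
}

/* Copy every page and verify the new block number matches the source. */
static void
toy_copy_all(const ToyRel *src, ToyRel *dst)
{
	dst->nblocks = 0;
	for (size_t blkno = 0; blkno < src->nblocks; blkno++)
	{
		size_t		newblk = dst->nblocks++;

		assert(newblk == blkno);	/* stand-in for a test-and-elog */
		dst->pages[newblk] = src->pages[blkno];
	}
}

/* Three-block source: block 1 is empty, block 2 holds payload 42. */
static ToyRel
toy_make_source(void)
{
	ToyRel		rel = {0};

	rel.nblocks = 3;
	rel.pages[0].payload = 7;
	rel.pages[1].empty = true;
	rel.pages[2].payload = 42;
	return rel;
}
```

In the skipping variant the page that lived at block 2 ends up at block 1, so anything that refers to it by block number (an index entry, a t_ctid link) would now point at the wrong place.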

--
Dilip
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#268

Dilip Kumar

dilipbalaut@gmail.com

over 3 years ago

In reply to: Andres Freund (#261)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Fri, Aug 5, 2022 at 5:36 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On August 4, 2022 4:20:16 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Yeah, the assumption that P_NEW would automatically match the source block
was making me itchy too. An explicit test-and-elog seems worth the
cycles.

Is there a good reason to rely on P_NEW at all?

I think there were 2 arguments for which we used bufmgr instead of
smgrextend for the destination database

1) (Comment from Andres) The big benefit would be that the *target*
database does not have to be written out / shared buffer is
immediately populated. [1]
2) (Comment from Robert) We wanted to avoid writing new code which
bypasses the shared buffers.

[1]: /messages/by-id/20210905202800.ji4fnfs3xzhvo7l6@alap3.anarazel.de

Both from an efficiency and robustness POV it seems like it'd be
better to use smgrextend to bulk extend just after smgrcreate() and
then fill s_b using RBM_ZERO (or whatever it is called). That bulk
smgrextend would later be a good point to use fallocate so the FS can
immediately size the file correctly, without a lot of pointless writes
of zeroes.

Yeah okay, so you mean that since we already know nblocks for the
source file, we can do a single bulk smgrextend before the copy loop
and then copy block by block through bufmgr using the proper blkno
instead of P_NEW. I think this looks better optimized, and it still
satisfies the two points I mentioned above: the target database pages
end up in shared buffers, and we do not bypass the shared buffers.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#269

Dilip Kumar

dilipbalaut@gmail.com

over 3 years ago

In reply to: Andres Freund (#248)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Fri, Aug 5, 2022 at 2:59 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2022-08-03 16:45:23 +0530, Dilip Kumar wrote:

Another version of the patch which closes the smgr at the end using
smgrcloserellocator() and I have also added a commit message.

What's the motivation behind the explicit close?

@@ -258,8 +258,8 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
Page page;
List *rlocatorlist = NIL;
LockRelId relid;
- Relation rel;
Snapshot snapshot;
+ SMgrRelation smgr;
BufferAccessStrategy bstrategy;

/* Get pg_class relfilenumber. */
@@ -276,16 +276,9 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
rlocator.dbOid = dbid;
rlocator.relNumber = relfilenumber;
-     /*
-      * We can't use a real relcache entry for a relation in some other
-      * database, but since we're only going to access the fields related to
-      * physical storage, a fake one is good enough. If we didn't do this and
-      * used the smgr layer directly, we would have to worry about
-      * invalidations.
-      */
-     rel = CreateFakeRelcacheEntry(rlocator);
-     nblocks = smgrnblocks(RelationGetSmgr(rel), MAIN_FORKNUM);
-     FreeFakeRelcacheEntry(rel);
+     smgr = smgropen(rlocator, InvalidBackendId);
+     nblocks = smgrnblocks(smgr, MAIN_FORKNUM);
+     smgrclose(smgr);

Why are you opening and then closing again? Part of the motivation for the
question is that a local SMgrRelation variable may lead to it being used
further, opening up interrupt processing issues.

Yeah okay, I think there is no reason to close, in the previous
version I had like below and I think that's a better idea.

nblocks = smgrnblocks(smgropen(rlocator, InvalidBackendId), MAIN_FORKNUM)

+     rlocator.locator = src_rlocator;
+     smgrcloserellocator(rlocator);
+
+     rlocator.locator = dst_rlocator;
+     smgrcloserellocator(rlocator);

As mentioned above, it's not clear to me why this is now done...

Otherwise looks good to me.

Yeah maybe it is not necessary to close as these unowned smgr will
automatically get closed on the transaction end. Actually, the
previous version of the patch had both of these comments fixed. The
reason for explicitly closing is that I noticed most places that call
smgropen also close the smgr explicitly, e.g. index_copy_data() and
heapam_relation_copy_data(); OTOH some places don't close it, e.g.
IssuePendingWritebacks(). So now I think that in our case it is better
not to close it, because I do not like this specific code at the end
just to close the smgr.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#270

Dilip Kumar

dilipbalaut@gmail.com

over 3 years ago

In reply to: Dilip Kumar (#269)

3 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Fri, Aug 5, 2022 at 10:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Yeah maybe it is not necessary to close as these unowned smgr will
automatically get closed on the transaction end. Actually the
previous version of the patch had both these comments fixed. The
reason for explicitly closing it is that I have noticed that most of
the places we explicitly close the smgr where we do smgropen e.g.
index_copy_data(), heapam_relation_copy_data() OTOH some places we
don't close it e.g. IssuePendingWritebacks(). So now I think that in
our case better we do not close it because I do not like this specific
code at the end to close the smgr.

PFA patches for different problems discussed in the thread

0001 - Fix the problem of skipping the empty block and buffer lock on
source buffer
0002 - Remove fake relcache entry (same as 0001-BugfixInWalLogCreateDB.patch)
0003 - Optimization to avoid extending block by block

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v2-0003-Optimize-copy-storage-from-source-to-destination.patch (text/x-patch; charset=US-ASCII)

From 8a71e5dd10ff65d250815dc17f8f64212c2e57b0 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 5 Aug 2022 11:25:23 +0530
Subject: [PATCH v2 3/3] Optimize copy storage from source to destination

Instead of extending one block at a time, bulk-extend the destination
relation up front and then just perform the block-level copy.
---
 src/backend/storage/buffer/bufmgr.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b488306..b7df980 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3710,6 +3710,7 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
 	Page		srcPage;
 	Page		dstPage;
 	bool		use_wal;
+	char		buffer[BLCKSZ];
 	BlockNumber nblocks;
 	BlockNumber blkno;
 	BufferAccessStrategy bstrategy_src;
@@ -3730,6 +3731,14 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
 	if (nblocks == 0)
 		return;
 
+	/*
+	 * Bulk extend the destination relation of the same size as the source
+	 * relation before starting to copy block by block.
+	 */
+	memset(buffer, 0, BLCKSZ);
+	smgrextend(smgropen(dstlocator, InvalidBackendId), forkNum, nblocks - 1,
+			   buffer, true);
+
 	/* This is a bulk operation, so use buffer access strategies. */
 	bstrategy_src = GetAccessStrategy(BAS_BULKREAD);
 	bstrategy_dst = GetAccessStrategy(BAS_BULKWRITE);
@@ -3748,7 +3757,7 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
 		srcPage = BufferGetPage(srcBuf);
 
 		/* Use P_NEW to extend the destination relation. */
-		dstBuf = ReadBufferWithoutRelcache(dstlocator, forkNum, P_NEW,
+		dstBuf = ReadBufferWithoutRelcache(dstlocator, forkNum, blkno,
 										   RBM_NORMAL, bstrategy_dst,
 										   permanent);
 		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
-- 
1.8.3.1

v2-0002-Avoid-setting-the-fake-relcache-entry-as-smgr-own.patch (text/x-patch; charset=UTF-8)

From 85d17fb7c91fdb6b40f09d00e9a5606ba2c90e57 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 5 Aug 2022 10:59:18 +0530
Subject: [PATCH v2 2/3] Avoid setting the fake relcache entry as smgr owner
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

During CREATE DATABASE, we are not connected to the source and the
destination DB so in order to operate on the storage we are using
FakeRelCacheEntry and by using that we are calling RelationGetSmgr().

So the problem is that this function will set the temporary
FakeRelCacheEntry as an owner of the smgr.  Now if there is any
error before we close the FakeRelCacheEntry then the memory of the
fake relcache entry will be released at the transaction abort but
the smgr will survive the transaction.  So now smgr is pointing
to some already released memory and it will have random behavior
when we try to access the smgr next time.

For fixing the issue instead of using the FakeRelCacheEntry, directly
call the smgropen() but do not keep the reference to the smgr.
So every time call smgropen() whenever we need it.  This is required to
ensure that we do not access the smgr pointer which might have already
been closed during interrupt processing.
---
 src/backend/commands/dbcommands.c   | 12 +----------
 src/backend/storage/buffer/bufmgr.c | 43 +++++++++++++------------------------
 2 files changed, 16 insertions(+), 39 deletions(-)

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 9f990a8..88d4fe1 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -258,7 +258,6 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
 	Page		page;
 	List	   *rlocatorlist = NIL;
 	LockRelId	relid;
-	Relation	rel;
 	Snapshot	snapshot;
 	BufferAccessStrategy bstrategy;
 
@@ -276,16 +275,7 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
 	rlocator.dbOid = dbid;
 	rlocator.relNumber = relfilenumber;
 
-	/*
-	 * We can't use a real relcache entry for a relation in some other
-	 * database, but since we're only going to access the fields related to
-	 * physical storage, a fake one is good enough. If we didn't do this and
-	 * used the smgr layer directly, we would have to worry about
-	 * invalidations.
-	 */
-	rel = CreateFakeRelcacheEntry(rlocator);
-	nblocks = smgrnblocks(RelationGetSmgr(rel), MAIN_FORKNUM);
-	FreeFakeRelcacheEntry(rel);
+	nblocks = smgrnblocks(smgropen(rlocator, InvalidBackendId), MAIN_FORKNUM);
 
 	/* Use a buffer access strategy since this is a bulk read operation. */
 	bstrategy = GetAccessStrategy(BAS_BULKREAD);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7f992c3..b488306 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -487,9 +487,9 @@ static void FindAndDropRelationBuffers(RelFileLocator rlocator,
 									   ForkNumber forkNum,
 									   BlockNumber nForkBlock,
 									   BlockNumber firstDelBlock);
-static void RelationCopyStorageUsingBuffer(Relation src, Relation dst,
-										   ForkNumber forkNum,
-										   bool isunlogged);
+static void RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
+										   RelFileLocator dstlocator,
+										   ForkNumber forkNum, bool permanent);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int	rlocator_comparator(const void *p1, const void *p2);
@@ -3701,8 +3701,9 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
  * --------------------------------------------------------------------
  */
 static void
-RelationCopyStorageUsingBuffer(Relation src, Relation dst, ForkNumber forkNum,
-							   bool permanent)
+RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
+							   RelFileLocator dstlocator,
+							   ForkNumber forkNum, bool permanent)
 {
 	Buffer		srcBuf;
 	Buffer		dstBuf;
@@ -3722,7 +3723,8 @@ RelationCopyStorageUsingBuffer(Relation src, Relation dst, ForkNumber forkNum,
 	use_wal = XLogIsNeeded() && (permanent || forkNum == INIT_FORKNUM);
 
 	/* Get number of blocks in the source relation. */
-	nblocks = smgrnblocks(RelationGetSmgr(src), forkNum);
+	nblocks = smgrnblocks(smgropen(srclocator, InvalidBackendId),
+						  forkNum);
 
 	/* Nothing to copy; just return. */
 	if (nblocks == 0)
@@ -3738,7 +3740,7 @@ RelationCopyStorageUsingBuffer(Relation src, Relation dst, ForkNumber forkNum,
 		CHECK_FOR_INTERRUPTS();
 
 		/* Read block from source relation. */
-		srcBuf = ReadBufferWithoutRelcache(src->rd_locator, forkNum, blkno,
+		srcBuf = ReadBufferWithoutRelcache(srclocator, forkNum, blkno,
 										   RBM_NORMAL, bstrategy_src,
 										   permanent);
 
@@ -3746,7 +3748,7 @@ RelationCopyStorageUsingBuffer(Relation src, Relation dst, ForkNumber forkNum,
 		srcPage = BufferGetPage(srcBuf);
 
 		/* Use P_NEW to extend the destination relation. */
-		dstBuf = ReadBufferWithoutRelcache(dst->rd_locator, forkNum, P_NEW,
+		dstBuf = ReadBufferWithoutRelcache(dstlocator, forkNum, P_NEW,
 										   RBM_NORMAL, bstrategy_dst,
 										   permanent);
 		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
@@ -3784,8 +3786,6 @@ void
 CreateAndCopyRelationData(RelFileLocator src_rlocator,
 						  RelFileLocator dst_rlocator, bool permanent)
 {
-	Relation	src_rel;
-	Relation	dst_rel;
 	char		relpersistence;
 
 	/* Set the relpersistence. */
@@ -3793,16 +3793,6 @@ CreateAndCopyRelationData(RelFileLocator src_rlocator,
 		RELPERSISTENCE_PERMANENT : RELPERSISTENCE_UNLOGGED;
 
 	/*
-	 * We can't use a real relcache entry for a relation in some other
-	 * database, but since we're only going to access the fields related to
-	 * physical storage, a fake one is good enough. If we didn't do this and
-	 * used the smgr layer directly, we would have to worry about
-	 * invalidations.
-	 */
-	src_rel = CreateFakeRelcacheEntry(src_rlocator);
-	dst_rel = CreateFakeRelcacheEntry(dst_rlocator);
-
-	/*
 	 * Create and copy all forks of the relation.  During create database we
 	 * have a separate cleanup mechanism which deletes complete database
 	 * directory.  Therefore, each individual relation doesn't need to be
@@ -3811,15 +3801,16 @@ CreateAndCopyRelationData(RelFileLocator src_rlocator,
 	RelationCreateStorage(dst_rlocator, relpersistence, false);
 
 	/* copy main fork. */
-	RelationCopyStorageUsingBuffer(src_rel, dst_rel, MAIN_FORKNUM, permanent);
+	RelationCopyStorageUsingBuffer(src_rlocator, dst_rlocator, MAIN_FORKNUM,
+								   permanent);
 
 	/* copy those extra forks that exist */
 	for (ForkNumber forkNum = MAIN_FORKNUM + 1;
 		 forkNum <= MAX_FORKNUM; forkNum++)
 	{
-		if (smgrexists(RelationGetSmgr(src_rel), forkNum))
+		if (smgrexists(smgropen(src_rlocator, InvalidBackendId), forkNum))
 		{
-			smgrcreate(RelationGetSmgr(dst_rel), forkNum, false);
+			smgrcreate(smgropen(dst_rlocator, InvalidBackendId), forkNum, false);
 
 			/*
 			 * WAL log creation if the relation is persistent, or this is the
@@ -3829,14 +3820,10 @@ CreateAndCopyRelationData(RelFileLocator src_rlocator,
 				log_smgrcreate(&dst_rlocator, forkNum);
 
 			/* Copy a fork's data, block by block. */
-			RelationCopyStorageUsingBuffer(src_rel, dst_rel, forkNum,
+			RelationCopyStorageUsingBuffer(src_rlocator, dst_rlocator, forkNum,
 										   permanent);
 		}
 	}
-
-	/* Release fake relcache entries. */
-	FreeFakeRelcacheEntry(src_rel);
-	FreeFakeRelcacheEntry(dst_rel);
 }
 
 /* ---------------------------------------------------------------------
-- 
1.8.3.1

v2-0001-Assorted-bug-fixes-while-coying-the-storage-durin.patch (text/x-patch; charset=US-ASCII)

From fb4bd9f9aff0c51e5f576742a84b40f7cebb6872 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 5 Aug 2022 10:51:04 +0530
Subject: [PATCH v2 1/3] Assorted bug fixes while coying the storage during
 create database

While copying the storage, the code skips new/empty pages, which
could create corrupted storage because of broken ctid links and
other such issues.  Also fix the missing buffer lock on the
source buffer.
---
 src/backend/storage/buffer/bufmgr.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6b30138..7f992c3 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3741,23 +3741,20 @@ RelationCopyStorageUsingBuffer(Relation src, Relation dst, ForkNumber forkNum,
 		srcBuf = ReadBufferWithoutRelcache(src->rd_locator, forkNum, blkno,
 										   RBM_NORMAL, bstrategy_src,
 										   permanent);
+
+		LockBuffer(srcBuf, BUFFER_LOCK_SHARE);
 		srcPage = BufferGetPage(srcBuf);
-		if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
-		{
-			ReleaseBuffer(srcBuf);
-			continue;
-		}
 
 		/* Use P_NEW to extend the destination relation. */
 		dstBuf = ReadBufferWithoutRelcache(dst->rd_locator, forkNum, P_NEW,
 										   RBM_NORMAL, bstrategy_dst,
 										   permanent);
 		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
+		dstPage = BufferGetPage(dstBuf);
 
 		START_CRIT_SECTION();
 
 		/* Copy page data from the source to the destination. */
-		dstPage = BufferGetPage(dstBuf);
 		memcpy(dstPage, srcPage, BLCKSZ);
 		MarkBufferDirty(dstBuf);
 
@@ -3767,8 +3764,8 @@ RelationCopyStorageUsingBuffer(Relation src, Relation dst, ForkNumber forkNum,
 
 		END_CRIT_SECTION();
 
+		UnlockReleaseBuffer(srcBuf);
 		UnlockReleaseBuffer(dstBuf);
-		ReleaseBuffer(srcBuf);
 	}
 }
 
-- 
1.8.3.1

#271

Andres Freund

andres@anarazel.de

over 3 years ago

In reply to: Tom Lane (#258)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hi,

On 2022-08-04 19:14:08 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2022-08-04 18:05:25 -0400, Tom Lane wrote:

In any case, DROP DATABASE is far from the only place with a problem.

What other place has a database corrupting potential of this magnitude just
because interrupts are accepted? We throw valid s_b contents away and then
accept interrupts before committing - with predictable results. We also accept
interrupts as part of deleting the db data dir (due to catalog access).

Those things would be better handled by moving the data-discarding
steps to post-commit. Maybe that argues for having an internal
commit halfway through DROP DATABASE: remove pg_database row,
commit, start new transaction, clean up.

That'd still require holding interrupts, I think. We shouldn't accept
interrupts until the on-disk data is actually deleted.

In theory I think we should have a pg_database column indicating whether the
database is valid or not. For database creation, insert the pg_database row
with valid=false, commit, then do the filesystem operation, then mark as
valid, commit. For database drop, mark as invalid, commit, remove filesystem
stuff, delete row, commit. With dropdb allowed against an invalid database,
but obviously nothing else. But clearly this isn't a short term /
backpatchable thing.
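
A rough sketch of that ordering, as a toy state machine rather than anything resembling real catalog code. The "valid" flag, the ToyDb struct and the toy_* functions are all hypothetical; a crash is modeled simply as stopping after a given number of completed steps, with each step treated as atomic.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum ToyRowState
{
	TOY_ROW_ABSENT,				/* no pg_database row */
	TOY_ROW_INVALID,			/* row present, marked valid = false */
	TOY_ROW_VALID				/* row present, marked valid = true */
} ToyRowState;

typedef struct ToyDb
{
	ToyRowState row;
	bool		files_exist;
} ToyDb;

/* Create: insert invalid row + commit, copy files, mark valid + commit. */
static ToyDb
toy_create_database(int steps_before_crash)
{
	ToyDb		db = {TOY_ROW_ABSENT, false};

	if (steps_before_crash < 1)
		return db;
	db.row = TOY_ROW_INVALID;
	if (steps_before_crash < 2)
		return db;
	db.files_exist = true;
	if (steps_before_crash < 3)
		return db;
	db.row = TOY_ROW_VALID;
	return db;
}

/* Drop: mark invalid + commit, remove files, delete row + commit. */
static ToyDb
toy_drop_database(ToyDb db, int steps_before_crash)
{
	assert(db.row != TOY_ROW_ABSENT);	/* allowed on valid or invalid rows */

	if (steps_before_crash < 1)
		return db;
	db.row = TOY_ROW_INVALID;
	if (steps_before_crash < 2)
		return db;
	db.files_exist = false;
	if (steps_before_crash < 3)
		return db;
	db.row = TOY_ROW_ABSENT;
	return db;
}

/* The invariant: a row marked valid always has its files on disk. */
static bool
toy_state_is_safe(ToyDb db)
{
	return db.row != TOY_ROW_VALID || db.files_exist;
}
```

Whatever step the crash lands on, the leftover state is either fully usable or an invalid row that a later drop can finish cleaning up.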

Greetings,

Andres Freund

#272

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Dilip Kumar (#270)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Dilip Kumar <dilipbalaut@gmail.com> writes:

PFA patches for different problems discussed in the thread

0001 - Fix the problem of skipping the empty block and buffer lock on
source buffer
0002 - Remove fake relcache entry (same as 0001-BugfixInWalLogCreateDB.patch)
0003 - Optimization to avoid extending block by block

I pushed 0001, because it seems fairly critical to get that in before
beta3. The others can stand more leisurely discussion.

I note from
https://coverage.postgresql.org/src/backend/storage/buffer/bufmgr.c.gcov.html
that the block-skipping path is actually taken in our tests (this won't be
visible there for very much longer of course). So we actually *are*
making a corrupt copy, and we haven't noticed. This is perhaps not too
surprising, because the only test case that I can find is in
020_createdb.pl:

$node->issues_sql_like(
[ 'createdb', '-T', 'foobar2', '-S', 'wal_log', 'foobar6' ],
qr/statement: CREATE DATABASE foobar6 STRATEGY wal_log TEMPLATE foobar2/,
'create database with WAL_LOG strategy');

which is, um, not exactly a robust test of whether anything happened
at all, let alone whether it was correct. I'm not real sure that
this test would even notice if the CREATE reported failure.

regards, tom lane

#273

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Dilip Kumar (#270)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Dilip Kumar <dilipbalaut@gmail.com> writes:

On Fri, Aug 5, 2022 at 10:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Yeah maybe it is not necessary to close as these unowned smgr will
automatically get closed on the transaction end.

I do not think this is a great idea for the per-relation smgrs created
during RelationCopyStorageUsingBuffer. Yeah, they'll be mopped up at
transaction end, but that doesn't mean that creating possibly tens of
thousands of transient smgrs isn't going to cause performance issues.

I think RelationCopyStorageUsingBuffer needs to open and then close
the smgrs it uses, which means that ReadBufferWithoutRelcache is not the
appropriate API for it to use, either; need to go down another level.

regards, tom lane

#274

Dilip Kumar

dilipbalaut@gmail.com

over 3 years ago

In reply to: Tom Lane (#273)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Sat, Aug 6, 2022 at 9:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Dilip Kumar <dilipbalaut@gmail.com> writes:

On Fri, Aug 5, 2022 at 10:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Yeah maybe it is not necessary to close as these unowned smgr will
automatically get closed on the transaction end.

I do not think this is a great idea for the per-relation smgrs created
during RelationCopyStorageUsingBuffer. Yeah, they'll be mopped up at
transaction end, but that doesn't mean that creating possibly tens of
thousands of transient smgrs isn't going to cause performance issues.

Okay, so for that we can simply call smgrcloserellocator(rlocator);
before exiting the RelationCopyStorageUsingBuffer() right?

I think RelationCopyStorageUsingBuffer needs to open and then close
the smgrs it uses, which means that ReadBufferWithoutRelcache is not the
appropriate API for it to use, either; need to go down another level.

Not sure how going down another level would help, the whole point is
that we don't want to keep the reference of the smgr for a long time
especially in the loop which is interruptible. So every time we need
the smgr we can call smgropen, and if it is already in the smgr cache
we will get it from there. So I think it makes sense to just call
smgrcloserellocator() when exiting the function: if the smgr is open
it will be closed, and otherwise the call does nothing.
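
The pattern I have in mind looks roughly like the toy model below. None of this is the real smgr code: ToyHandle, the fixed-size toy_cache and the toy_* functions are invented for illustration, and the real smgr cache is a hash table that grows rather than running out of slots.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_CACHE_SLOTS 8

typedef struct ToyHandle
{
	bool		in_use;
	unsigned	relnumber;
} ToyHandle;

static ToyHandle toy_cache[TOY_CACHE_SLOTS];

/* Return the cached handle for relnumber, creating it if needed. */
static ToyHandle *
toy_open(unsigned relnumber)
{
	ToyHandle  *free_slot = NULL;

	for (size_t i = 0; i < TOY_CACHE_SLOTS; i++)
	{
		if (toy_cache[i].in_use && toy_cache[i].relnumber == relnumber)
			return &toy_cache[i];
		if (!toy_cache[i].in_use && free_slot == NULL)
			free_slot = &toy_cache[i];
	}
	if (free_slot != NULL)
	{
		free_slot->in_use = true;
		free_slot->relnumber = relnumber;
	}
	return free_slot;			/* NULL when the toy cache is full */
}

/* Close the handle for relnumber if it is open; otherwise do nothing. */
static void
toy_close(unsigned relnumber)
{
	for (size_t i = 0; i < TOY_CACHE_SLOTS; i++)
	{
		if (toy_cache[i].in_use && toy_cache[i].relnumber == relnumber)
			toy_cache[i].in_use = false;
	}
}

/* Models interrupt processing that invalidates every cached handle. */
static void
toy_release_all(void)
{
	for (size_t i = 0; i < TOY_CACHE_SLOTS; i++)
		toy_cache[i].in_use = false;
}

static size_t
toy_open_count(void)
{
	size_t		count = 0;

	for (size_t i = 0; i < TOY_CACHE_SLOTS; i++)
		count += toy_cache[i].in_use ? 1 : 0;
	return count;
}

/* Copy loop: look the handle up again on every iteration, close at exit. */
static int
toy_copy_relation(unsigned relnumber, unsigned nblocks, bool close_at_exit)
{
	int			copied = 0;

	for (unsigned blkno = 0; blkno < nblocks; blkno++)
	{
		ToyHandle  *handle;

		if (blkno % 2 == 1)
			toy_release_all();	/* pretend an interrupt arrived here */

		handle = toy_open(relnumber);	/* never reuse an old pointer */
		if (handle == NULL)
			return -1;
		assert(handle->in_use && handle->relnumber == relnumber);
		copied++;
	}
	if (close_at_exit)
		toy_close(relnumber);
	return copied;
}
```

Because the handle is looked up afresh on each iteration, a release in the middle of the loop is harmless, and closing at exit keeps the cache from accumulating one entry per copied relation.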

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#275

Andres Freund

andres@anarazel.de

over 3 years ago

In reply to: Dilip Kumar (#274)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

Hi,

On 2022-08-07 09:24:40 +0530, Dilip Kumar wrote:

On Sat, Aug 6, 2022 at 9:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Dilip Kumar <dilipbalaut@gmail.com> writes:

On Fri, Aug 5, 2022 at 10:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Yeah maybe it is not necessary to close as these unowned smgr will
automatically get closed on the transaction end.

I do not think this is a great idea for the per-relation smgrs created
during RelationCopyStorageUsingBuffer. Yeah, they'll be mopped up at
transaction end, but that doesn't mean that creating possibly tens of
thousands of transient smgrs isn't going to cause performance issues.

I was assuming that the files would get reopened at the end of the transaction
anyway, but it looks like that's not the case, unless wal_level=minimal.

Hm. CreateAndCopyRelationData() calls RelationCreateStorage() with
register_delete = false, which is ok because createdb_failure_callback will
clean things up. But that's another thing that's not great for a routine with
a general name...

Okay, so for that we can simply call smgrcloserellocator(rlocator);
before exiting the RelationCopyStorageUsingBuffer() right?

Yea, I think so.

I think RelationCopyStorageUsingBuffer needs to open and then close
the smgrs it uses, which means that ReadBufferWithoutRelcache is not the
appropriate API for it to use, either; need to go down another level.

Not sure how going down another level would help, the whole point is
that we don't want to keep the reference of the smgr for a long time
especially in the loop which is interruptible.

Yea, I'm not following either.

Greetings,

Andres Freund

#276

Dilip Kumar

dilipbalaut@gmail.com

over 3 years ago

In reply to: Andres Freund (#275)

2 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Sun, Aug 7, 2022 at 9:47 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2022-08-07 09:24:40 +0530, Dilip Kumar wrote:

On Sat, Aug 6, 2022 at 9:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Dilip Kumar <dilipbalaut@gmail.com> writes:

On Fri, Aug 5, 2022 at 10:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Yeah maybe it is not necessary to close as these unowned smgr will
automatically get closed on the transaction end.

I do not think this is a great idea for the per-relation smgrs created
during RelationCopyStorageUsingBuffer. Yeah, they'll be mopped up at
transaction end, but that doesn't mean that creating possibly tens of
thousands of transient smgrs isn't going to cause performance issues.

I was assuming that the files would get reopened at the end of the transaction
anyway, but it looks like that's not the case, unless wal_level=minimal.

Hm. CreateAndCopyRelationData() calls RelationCreateStorage() with
register_delete = false, which is ok because createdb_failure_callback will
clean things up. But that's another thing that's not great for a routine with
a general name...

Okay, so for that we can simply call smgrcloserellocator(rlocator);
before exiting the RelationCopyStorageUsingBuffer() right?

Yea, I think so.

Done. Along with that, I have also restored the smgropen/smgrclose
hunk in ScanSourceDatabasePgClass() that I had in the v1 patch [1].
There we do not want to reuse the smgr of pg_class again, so instead
of closing at the end with smgrcloserellocator() we can just keep the
smgr reference and close it immediately after getting the number of
blocks. Whereas in CreateAndCopyRelationData() and
RelationCopyStorageUsingBuffer() we use the smgr of the source and
destination relations multiple times, so it makes sense not to close
it immediately; we can close it while exiting the function with
smgrcloserellocator().

[1]
+ smgr = smgropen(rlocator, InvalidBackendId);
+ nblocks = smgrnblocks(smgr, MAIN_FORKNUM);
+ smgrclose(smgr);

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v3-0002-Optimize-copy-storage-from-source-to-destination.patch (application/octet-stream)

From 4324772a967515b4aa097c3a67ed35348373281c Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 5 Aug 2022 11:25:23 +0530
Subject: [PATCH v3 2/2] Optimize copy storage from source to destination

Instead of extending one block at a time, bulk-extend the destination
relation up front and then just perform the block-level copy.
---
 src/backend/storage/buffer/bufmgr.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 9c1bd508d3..377389e762 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3710,6 +3710,7 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
 	Page		srcPage;
 	Page		dstPage;
 	bool		use_wal;
+	char		buffer[BLCKSZ];
 	BlockNumber nblocks;
 	BlockNumber blkno;
 	BufferAccessStrategy bstrategy_src;
@@ -3730,6 +3731,14 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
 	if (nblocks == 0)
 		return;
 
+	/*
+	 * Bulk extend the destination relation of the same size as the source
+	 * relation before starting to copy block by block.
+	 */
+	memset(buffer, 0, BLCKSZ);
+	smgrextend(smgropen(dstlocator, InvalidBackendId), forkNum, nblocks - 1,
+			   buffer, true);
+
 	/* This is a bulk operation, so use buffer access strategies. */
 	bstrategy_src = GetAccessStrategy(BAS_BULKREAD);
 	bstrategy_dst = GetAccessStrategy(BAS_BULKWRITE);
@@ -3747,7 +3756,7 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
 		srcPage = BufferGetPage(srcBuf);
 
 		/* Use P_NEW to extend the destination relation. */
-		dstBuf = ReadBufferWithoutRelcache(dstlocator, forkNum, P_NEW,
+		dstBuf = ReadBufferWithoutRelcache(dstlocator, forkNum, blkno,
 										   RBM_NORMAL, bstrategy_dst,
 										   permanent);
 		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
-- 
2.23.0

v3-0001-Avoid-setting-the-fake-relcache-entry-as-smgr-own.patch (application/octet-stream)

From 139ad5292bb39975d4cfbaa2ab8e6875bf77381d Mon Sep 17 00:00:00 2001
From: dilipkumar <dilipbalaut@gmail.com>
Date: Wed, 10 Aug 2022 09:29:03 +0530
Subject: [PATCH v3 1/2] Avoid setting the fake relcache entry as smgr owner
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

During CREATE DATABASE, we are not connected to the source or the
destination DB, so in order to operate on the storage we are using a
FakeRelCacheEntry and calling RelationGetSmgr() on it.

The problem is that this function will set the temporary
FakeRelCacheEntry as the owner of the smgr.  Now if there is any
error before we close the FakeRelCacheEntry, then the memory of the
fake relcache entry will be released at transaction abort, but
the smgr will survive the transaction.  So now the smgr is pointing
to already-released memory, and its behavior will be unpredictable
when we try to access the smgr next time.

To fix the issue, instead of using the FakeRelCacheEntry, directly
call smgropen() but do not keep the reference to the smgr; call
smgropen() every time we need it.  This is required to
ensure that we do not access an smgr pointer which might have already
been closed during interrupt processing.
---
 src/backend/commands/dbcommands.c   | 15 +++------
 src/backend/storage/buffer/bufmgr.c | 51 ++++++++++++++---------------
 2 files changed, 28 insertions(+), 38 deletions(-)

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 9f990a8d68..b31a30550b 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -258,8 +258,8 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
 	Page		page;
 	List	   *rlocatorlist = NIL;
 	LockRelId	relid;
-	Relation	rel;
 	Snapshot	snapshot;
+	SMgrRelation	smgr;
 	BufferAccessStrategy bstrategy;
 
 	/* Get pg_class relfilenumber. */
@@ -276,16 +276,9 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
 	rlocator.dbOid = dbid;
 	rlocator.relNumber = relfilenumber;
 
-	/*
-	 * We can't use a real relcache entry for a relation in some other
-	 * database, but since we're only going to access the fields related to
-	 * physical storage, a fake one is good enough. If we didn't do this and
-	 * used the smgr layer directly, we would have to worry about
-	 * invalidations.
-	 */
-	rel = CreateFakeRelcacheEntry(rlocator);
-	nblocks = smgrnblocks(RelationGetSmgr(rel), MAIN_FORKNUM);
-	FreeFakeRelcacheEntry(rel);
+	smgr = smgropen(rlocator, InvalidBackendId);
+	nblocks = smgrnblocks(smgr, MAIN_FORKNUM);
+	smgrclose(smgr);
 
 	/* Use a buffer access strategy since this is a bulk read operation. */
 	bstrategy = GetAccessStrategy(BAS_BULKREAD);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8ef0436c52..9c1bd508d3 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -487,9 +487,9 @@ static void FindAndDropRelationBuffers(RelFileLocator rlocator,
 									   ForkNumber forkNum,
 									   BlockNumber nForkBlock,
 									   BlockNumber firstDelBlock);
-static void RelationCopyStorageUsingBuffer(Relation src, Relation dst,
-										   ForkNumber forkNum,
-										   bool isunlogged);
+static void RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
+										   RelFileLocator dstlocator,
+										   ForkNumber forkNum, bool permanent);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int	rlocator_comparator(const void *p1, const void *p2);
@@ -3701,8 +3701,9 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
  * --------------------------------------------------------------------
  */
 static void
-RelationCopyStorageUsingBuffer(Relation src, Relation dst, ForkNumber forkNum,
-							   bool permanent)
+RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
+							   RelFileLocator dstlocator,
+							   ForkNumber forkNum, bool permanent)
 {
 	Buffer		srcBuf;
 	Buffer		dstBuf;
@@ -3722,7 +3723,8 @@ RelationCopyStorageUsingBuffer(Relation src, Relation dst, ForkNumber forkNum,
 	use_wal = XLogIsNeeded() && (permanent || forkNum == INIT_FORKNUM);
 
 	/* Get number of blocks in the source relation. */
-	nblocks = smgrnblocks(RelationGetSmgr(src), forkNum);
+	nblocks = smgrnblocks(smgropen(srclocator, InvalidBackendId),
+						  forkNum);
 
 	/* Nothing to copy; just return. */
 	if (nblocks == 0)
@@ -3738,14 +3740,14 @@ RelationCopyStorageUsingBuffer(Relation src, Relation dst, ForkNumber forkNum,
 		CHECK_FOR_INTERRUPTS();
 
 		/* Read block from source relation. */
-		srcBuf = ReadBufferWithoutRelcache(src->rd_locator, forkNum, blkno,
+		srcBuf = ReadBufferWithoutRelcache(srclocator, forkNum, blkno,
 										   RBM_NORMAL, bstrategy_src,
 										   permanent);
 		LockBuffer(srcBuf, BUFFER_LOCK_SHARE);
 		srcPage = BufferGetPage(srcBuf);
 
 		/* Use P_NEW to extend the destination relation. */
-		dstBuf = ReadBufferWithoutRelcache(dst->rd_locator, forkNum, P_NEW,
+		dstBuf = ReadBufferWithoutRelcache(dstlocator, forkNum, P_NEW,
 										   RBM_NORMAL, bstrategy_dst,
 										   permanent);
 		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
@@ -3783,24 +3785,13 @@ void
 CreateAndCopyRelationData(RelFileLocator src_rlocator,
 						  RelFileLocator dst_rlocator, bool permanent)
 {
-	Relation	src_rel;
-	Relation	dst_rel;
+	RelFileLocatorBackend rlocator;
 	char		relpersistence;
 
 	/* Set the relpersistence. */
 	relpersistence = permanent ?
 		RELPERSISTENCE_PERMANENT : RELPERSISTENCE_UNLOGGED;
 
-	/*
-	 * We can't use a real relcache entry for a relation in some other
-	 * database, but since we're only going to access the fields related to
-	 * physical storage, a fake one is good enough. If we didn't do this and
-	 * used the smgr layer directly, we would have to worry about
-	 * invalidations.
-	 */
-	src_rel = CreateFakeRelcacheEntry(src_rlocator);
-	dst_rel = CreateFakeRelcacheEntry(dst_rlocator);
-
 	/*
 	 * Create and copy all forks of the relation.  During create database we
 	 * have a separate cleanup mechanism which deletes complete database
@@ -3810,15 +3801,16 @@ CreateAndCopyRelationData(RelFileLocator src_rlocator,
 	RelationCreateStorage(dst_rlocator, relpersistence, false);
 
 	/* copy main fork. */
-	RelationCopyStorageUsingBuffer(src_rel, dst_rel, MAIN_FORKNUM, permanent);
+	RelationCopyStorageUsingBuffer(src_rlocator, dst_rlocator, MAIN_FORKNUM,
+								   permanent);
 
 	/* copy those extra forks that exist */
 	for (ForkNumber forkNum = MAIN_FORKNUM + 1;
 		 forkNum <= MAX_FORKNUM; forkNum++)
 	{
-		if (smgrexists(RelationGetSmgr(src_rel), forkNum))
+		if (smgrexists(smgropen(src_rlocator, InvalidBackendId), forkNum))
 		{
-			smgrcreate(RelationGetSmgr(dst_rel), forkNum, false);
+			smgrcreate(smgropen(dst_rlocator, InvalidBackendId), forkNum, false);
 
 			/*
 			 * WAL log creation if the relation is persistent, or this is the
@@ -3828,14 +3820,19 @@ CreateAndCopyRelationData(RelFileLocator src_rlocator,
 				log_smgrcreate(&dst_rlocator, forkNum);
 
 			/* Copy a fork's data, block by block. */
-			RelationCopyStorageUsingBuffer(src_rel, dst_rel, forkNum,
+			RelationCopyStorageUsingBuffer(src_rlocator, dst_rlocator, forkNum,
 										   permanent);
 		}
 	}
 
-	/* Release fake relcache entries. */
-	FreeFakeRelcacheEntry(src_rel);
-	FreeFakeRelcacheEntry(dst_rel);
+	/* close source and destination smgr if exists. */
+	rlocator.backend = InvalidBackendId;
+
+	rlocator.locator = src_rlocator;
+	smgrcloserellocator(rlocator);
+
+	rlocator.locator = dst_rlocator;
+	smgrcloserellocator(rlocator);
 }
 
 /* ---------------------------------------------------------------------
-- 
2.23.0

#277

Robert Haas

robertmhaas@gmail.com

over 3 years ago

In reply to: Dilip Kumar (#276)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Aug 10, 2022 at 1:01 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Done. Along with that, I have also brought back the hunk of smgropen and
smgrclose in ScanSourceDatabasePgClass() which I had in the v1 patch[1].
Here we do not want to reuse the smgr of pg_class again, so
instead of closing at the end with smgrcloserellocator() we can just
keep the smgr reference and close it immediately after getting the number
of blocks. Whereas in CreateAndCopyRelationData() and
RelationCopyStorageUsingBuffer() we are using the smgr of the source
and dest relation multiple times, so it makes sense not to close it
immediately, and we can close it while exiting the function with
smgrcloserellocator().

As far as I know, this 0001 addresses all outstanding comments and
fixes the reported bug.

Does anyone think otherwise?

--
Robert Haas
EDB: http://www.enterprisedb.com

#278

Robert Haas

robertmhaas@gmail.com

over 3 years ago

In reply to: Robert Haas (#277)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Thu, Aug 11, 2022 at 2:15 PM Robert Haas <robertmhaas@gmail.com> wrote:

As far as I know, this 0001 addresses all outstanding comments and
fixes the reported bug.

Does anyone think otherwise?

If they do, they're keeping quiet, so I committed this and
back-patched it to v15.

Regarding 0002 -- should it, perhaps, use PGAlignedBlock?

Although 0002 is formally a performance optimization, I'm inclined to
think that if we're going to commit it, it should also be back-patched
into v15, because letting the code diverge when we're not even out of
beta yet seems painful.

--
Robert Haas
EDB: http://www.enterprisedb.com

#279

Dilip Kumar

dilipbalaut@gmail.com

over 3 years ago

In reply to: Robert Haas (#278)

1 attachment(s)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Fri, Aug 12, 2022 at 6:33 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Aug 11, 2022 at 2:15 PM Robert Haas <robertmhaas@gmail.com> wrote:

As far as I know, this 0001 addresses all outstanding comments and
fixes the reported bug.

Does anyone think otherwise?

If they do, they're keeping quiet, so I committed this and
back-patched it to v15.

Regarding 0002 -- should it, perhaps, use PGAlignedBlock?

Yes, we can do that. Here we are not using this buffer
directly as a "Page", so we do not have any real alignment issue, but I
do not see any problem in using PGAlignedBlock, so I have changed that.
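
As a rough, standalone illustration of the two ideas in play (an
alignment-forcing block buffer, and extending a file to its final size
by writing one zeroed block at the last block position), here is a
minimal sketch on an ordinary stdio file. ToyAlignedBlock, TOY_BLCKSZ,
toy_bulk_extend() and toy_file_blocks() are hypothetical names chosen
for this example; the union is only modeled on the general technique of
adding members to force alignment, and the code assumes a platform
where seeking past end-of-file on a binary stream and then writing
extends the file (as on POSIX systems).

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TOY_BLCKSZ 8192     /* assumed block size for this sketch */

/*
 * A char array alone only needs 1-byte alignment.  Wrapping it in a union
 * with wider members forces the whole buffer to their alignment.
 */
typedef union ToyAlignedBlock
{
    char        data[TOY_BLCKSZ];
    double      force_align_d;
    int64_t     force_align_i64;
} ToyAlignedBlock;

/*
 * Extend fp to nblocks blocks by writing a single zeroed block at block
 * number nblocks - 1.  Returns 0 on success, -1 on failure.
 */
int toy_bulk_extend(FILE *fp, long nblocks)
{
    ToyAlignedBlock buf;

    if (nblocks <= 0)
        return -1;
    memset(buf.data, 0, TOY_BLCKSZ);
    if (fseek(fp, (nblocks - 1) * (long) TOY_BLCKSZ, SEEK_SET) != 0)
        return -1;
    if (fwrite(buf.data, 1, TOY_BLCKSZ, fp) != TOY_BLCKSZ)
        return -1;
    return fflush(fp) == 0 ? 0 : -1;
}

/* Return the file length in whole blocks, or -1 on failure. */
long toy_file_blocks(FILE *fp)
{
    long size;

    if (fseek(fp, 0, SEEK_END) != 0)
        return -1;
    size = ftell(fp);
    return size < 0 ? -1 : size / TOY_BLCKSZ;
}
```

After toy_bulk_extend(fp, n) succeeds, every block number below n
already exists in the file, so a later copy loop can address blocks by
number instead of appending them one at a time.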

Although 0002 is formally a performance optimization, I'm inclined to
think that if we're going to commit it, it should also be back-patched
into v15, because letting the code diverge when we're not even out of
beta yet seems painful.

Yeah that makes sense to me.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v4-0001-Optimize-copy-storage-from-source-to-destination.patch (text/x-patch; charset=US-ASCII)

From 59fadefe04f8f2eeb6bc5e2e02efde56d5ace8aa Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 5 Aug 2022 11:25:23 +0530
Subject: [PATCH v4] Optimize copy storage from source to destination

Instead of extending one block at a time, bulk-extend the destination
relation up front and then just perform the block-level copy.
---
 src/backend/storage/buffer/bufmgr.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 9c1bd50..7a1202c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3712,6 +3712,7 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
 	bool		use_wal;
 	BlockNumber nblocks;
 	BlockNumber blkno;
+	PGAlignedBlock buf;
 	BufferAccessStrategy bstrategy_src;
 	BufferAccessStrategy bstrategy_dst;
 
@@ -3730,6 +3731,14 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
 	if (nblocks == 0)
 		return;
 
+	/*
+	 * Bulk extend the destination relation of the same size as the source
+	 * relation before starting to copy block by block.
+	 */
+	memset(buf.data, 0, BLCKSZ);
+	smgrextend(smgropen(dstlocator, InvalidBackendId), forkNum, nblocks - 1,
+			   buf.data, true);
+
 	/* This is a bulk operation, so use buffer access strategies. */
 	bstrategy_src = GetAccessStrategy(BAS_BULKREAD);
 	bstrategy_dst = GetAccessStrategy(BAS_BULKWRITE);
@@ -3747,7 +3756,7 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
 		srcPage = BufferGetPage(srcBuf);
 
 		/* Use P_NEW to extend the destination relation. */
-		dstBuf = ReadBufferWithoutRelcache(dstlocator, forkNum, P_NEW,
+		dstBuf = ReadBufferWithoutRelcache(dstlocator, forkNum, blkno,
 										   RBM_NORMAL, bstrategy_dst,
 										   permanent);
 		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
-- 
1.8.3.1

#280

Robert Haas

robertmhaas@gmail.com

over 3 years ago

In reply to: Dilip Kumar (#279)

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Wed, Aug 17, 2022 at 12:02 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Although 0002 is formally a performance optimization, I'm inclined to
think that if we're going to commit it, it should also be back-patched
into v15, because letting the code diverge when we're not even out of
beta yet seems painful.

Yeah that makes sense to me.

OK, done.

--
Robert Haas
EDB: http://www.enterprisedb.com