Avoiding smgrimmedsync() during nbtree index builds

[1]: /messages/by-id/20210112193431.2edcz776qjen7kao@alap3.anarazel.de

andres@anarazel.de

almost 5 years ago

In reply to: Heikki Linnakangas (#2)

Re: Avoiding smgrimmedsync() during nbtree index builds

Hi,

On 2021-01-21 23:54:04 +0200, Heikki Linnakangas wrote:

On 21/01/2021 22:36, Andres Freund wrote:

A quick hack (probably not quite correct!) to evaluate the benefit shows
that the attached script takes 2m17.223s with the smgrimmedsync and
0m22.870s passing skipFsync=false to write/extend. Entirely IO bound in
the former case, CPU bound in the latter.

Creating lots of tables with indexes (directly or indirectly through
relations having a toast table) is pretty common, particularly after the
introduction of partitioning.

Thinking through the correctness of replacing smgrimmedsync() with sync
requests, the potential problems that I can see are:

1) redo point falls between the log_newpage() and the
write()/register_dirty_segment() in smgrextend/smgrwrite.
2) redo point falls between write() and register_dirty_segment()

But both of these are fine in the context of initially filling a newly
created relfilenode, as far as I can tell? Otherwise the current
smgrimmedsync() approach wouldn't be safe either, as far as I can tell?

Hmm. If the redo point falls between write() and the
register_dirty_segment(), and the checkpointer finishes the whole checkpoint
before register_dirty_segment(), you are not safe. That can't happen with
write from the buffer manager, because the checkpointer would block waiting
for the flush of the buffer to finish.

Hm, right.

The easiest way to address that race would be to just record the redo
pointer in _bt_leafbuild() and continue to do the smgrimmedsync if a
checkpoint started since the start of the index build.

Another approach would be to utilize PGPROC.delayChkpt, but I would
rather not unnecessarily expand the use of that.

It's kind of interesting - in my aio branch I moved the
register_dirty_segment() to before the actual asynchronous write (due to
availability of the necessary data), which ought to be safe because of
the buffer interlocking. But that doesn't work here, or for other places
doing writes without going through s_b. It'd be great if we could come
up with a general solution, but I don't immediately see anything great.

The best I can come up with is adding helper functions to wrap some of
the complexity for "unbuffered" writes of doing an immedsync iff the
redo pointer changed. Something very roughly like

typedef struct UnbufferedWriteState { XLogRecPtr redo; uint64 numwrites;} UnbufferedWriteState;
void unbuffered_prep(UnbufferedWriteState* state);
void unbuffered_write(UnbufferedWriteState* state, ...);
void unbuffered_extend(UnbufferedWriteState* state, ...);
void unbuffered_finish(UnbufferedWriteState* state);

which wouldn't just do the dance to avoid the immedsync() if possible,
but also took care of PageSetChecksumInplace() (and PageEncryptInplace()
if we get that [1]/messages/by-id/20210112193431.2edcz776qjen7kao@alap3.anarazel.de).

Greetings,

Andres Freund

melanieplageman@gmail.com

over 4 years ago

In reply to: Andres Freund (#3)

1 attachment(s)

Re: Avoiding smgrimmedsync() during nbtree index builds

So, I've written a patch which avoids doing the immediate fsync for
index builds either by using shared buffers or by queueing sync requests
for the checkpointer. If a checkpoint starts during the index build and
the backend is not using shared buffers for the index build, it will
need to do the fsync.

The reviewer will notice that _bt_load() extends the index relation for
the metapage before beginning the actual load of leaf pages but does not
actually write the metapage until the end of the index build. When using
shared buffers, it was difficult to create block 0 of the index after
creating all of the other blocks, as the block number is assigned inside
of ReadBuffer_common(), and it doesn't really work with the current
bufmgr API to extend a relation with a caller-specified block number.

I am not entirely sure of the correctness of doing an smgrextend() (when
not using shared buffers) without writing any WAL. However, the metapage
contents are not written until after WAL logging them later in
_bt_blwritepage(), so, perhaps it is okay?

I am also not fond of the change to the signature of _bt_uppershutdown()
that this implementation forces. Now, I must pass the shared buffer
(when using shared buffers) that I've reserved (pinned and locked) for
the metapage and, if not using shared buffers, the page I've allocated
for the metapage, before doing the index build to _bt_uppershutdown()
after doing the rest of the index build. I don't know that it seems
incorrect -- more that it feels a bit messy (and inefficient) to hold
onto that shared buffer or memory for the duration of the index build,
during which I have no intention of doing anything with that buffer or
memory. However, the alternative I devised was to change
ReadBuffer_common() or to add a new ReadBufferExtended() mode which
indicated that the caller would specify the block number and whether or
not it was an extend, which also didn't seem right.

For the extensions of the index done during index build, I use
ReadBufferExtended() directly instead of _bt_getbuf() for a few reasons.
I thought (am not sure) that I don't need to do
LockRelationForExtension() during index build. Also, I decided to use
RBM_ZERO_AND_LOCK mode so that I had an exclusive lock on the buffer
content instead of doing _bt_lockbuf() (which is what _bt_getbuf()
does). And, most of the places I added the call to ReadBufferExtended(),
the non-shared buffer code path is already initializing the page, so it
made more sense to just share that codepath.

I considered whether or not it made sense to add a new btree utility
function which calls ReadBufferExtended() in this way, however, I wasn't
sure how much that would buy me. The other place it might be able to be
used is btvacuumpage(), but that case is different enough that I'm not
even sure what the function would be called -- basically it would just
be an alternative to _bt_getbuf() for a couple of somewhat unrelated edge
cases.

On Thu, Jan 21, 2021 at 5:51 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2021-01-21 23:54:04 +0200, Heikki Linnakangas wrote:

On 21/01/2021 22:36, Andres Freund wrote:

A quick hack (probably not quite correct!) to evaluate the benefit shows
that the attached script takes 2m17.223s with the smgrimmedsync and
0m22.870s passing skipFsync=false to write/extend. Entirely IO bound in
the former case, CPU bound in the latter.

Creating lots of tables with indexes (directly or indirectly through
relations having a toast table) is pretty common, particularly after the
introduction of partitioning.

Moving index builds of indexes which would fit in shared buffers back
into shared buffers has the benefit of eliminating the need to write
them out and fsync them if they will be subsequently used and thus read
right back into shared buffers. This avoids some of the unnecessary
fsyncs Andres is talking about here as well as avoiding some of the
extra IO required to write them and then read them into shared buffers.

I have dummy criteria for whether or not to use shared buffers (if the
number of tuples to be indexed is > 1000). I am considering using a
threshold of some percentage of the size of shared buffers as the
actual criteria for determining where to do the index build.

Thinking through the correctness of replacing smgrimmedsync() with sync
requests, the potential problems that I can see are:

1) redo point falls between the log_newpage() and the
write()/register_dirty_segment() in smgrextend/smgrwrite.
2) redo point falls between write() and register_dirty_segment()

But both of these are fine in the context of initially filling a newly
created relfilenode, as far as I can tell? Otherwise the current
smgrimmedsync() approach wouldn't be safe either, as far as I can tell?

Hmm. If the redo point falls between write() and the
register_dirty_segment(), and the checkpointer finishes the whole checkpoint
before register_dirty_segment(), you are not safe. That can't happen with
write from the buffer manager, because the checkpointer would block waiting
for the flush of the buffer to finish.

Hm, right.

The easiest way to address that race would be to just record the redo
pointer in _bt_leafbuild() and continue to do the smgrimmedsync if a
checkpoint started since the start of the index build.

Another approach would be to utilize PGPROC.delayChkpt, but I would
rather not unnecessarily expand the use of that.

It's kind of interesting - in my aio branch I moved the
register_dirty_segment() to before the actual asynchronous write (due to
availability of the necessary data), which ought to be safe because of
the buffer interlocking. But that doesn't work here, or for other places
doing writes without going through s_b. It'd be great if we could come
up with a general solution, but I don't immediately see anything great.

The best I can come up with is adding helper functions to wrap some of
the complexity for "unbuffered" writes of doing an immedsync iff the
redo pointer changed. Something very roughly like

typedef struct UnbufferedWriteState { XLogRecPtr redo; uint64 numwrites;} UnbufferedWriteState;
void unbuffered_prep(UnbufferedWriteState* state);
void unbuffered_write(UnbufferedWriteState* state, ...);
void unbuffered_extend(UnbufferedWriteState* state, ...);
void unbuffered_finish(UnbufferedWriteState* state);

which wouldn't just do the dance to avoid the immedsync() if possible,
but also took care of PageSetChecksumInplace() (and PageEncryptInplace()
if we get that [1]).

Regarding the implementation, I think having an API to do these
"unbuffered" or "direct" writes outside of shared buffers is a good
idea. In this specific case, the proposed API would not change the code
much. I would just wrap the small diffs I added to the beginning and end
of _bt_load() in the API calls for unbuffered_prep() and
unbuffered_finish() and then tuck away the second half of
_bt_blwritepage() in unbuffered_write()/unbuffered_extend(). I figured I
would do so after ensuring the correctness of the logic in this patch.
Then I will work on a patch which implements the unbuffered_write() API
and demonstrates its utility with at least a few of the most compelling
most compelling use cases in the code.

- Melanie

Attachments:

v1-0001-Index-build-avoids-immed-fsync.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Index-build-avoids-immed-fsync.patchDownload

From 59837dfabd306bc17dcc02bd5f63c7bf5809f9d0 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 15 Apr 2021 07:01:01 -0400
Subject: [PATCH v1] Index build avoids immed fsync

Avoid immediate fsync for just built indexes either by using shared
buffers or by leveraging checkpointer's SyncRequest queue. When a
checkpoint begins during the index build, if not using shared buffers,
the backend will have to do its own fsync.
---
 src/backend/access/nbtree/nbtree.c  |  39 +++---
 src/backend/access/nbtree/nbtsort.c | 189 +++++++++++++++++++++++-----
 src/backend/access/transam/xlog.c   |  14 +++
 src/include/access/xlog.h           |   1 +
 4 files changed, 190 insertions(+), 53 deletions(-)

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 1360ab80c1..ed3ee8d0e3 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -150,30 +150,29 @@ void
 btbuildempty(Relation index)
 {
 	Page		metapage;
+	Buffer metabuf;
 
-	/* Construct metapage. */
-	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
-
+	// TODO: test this.
 	/*
-	 * Write the page and log it.  It might seem that an immediate sync would
-	 * be sufficient to guarantee that the file exists on disk, but recovery
-	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
-	 * this even when wal_level=minimal.
+	 * Construct metapage.
+	 * Because we don't need to lock the relation for extension (since
+	 * noone knows about it yet) and we don't need to initialize the
+	 * new page, as it is done below by _bt_blnewpage(), _bt_getbuf()
+	 * (with P_NEW and BT_WRITE) is overkill. However, it might be worth
+	 * either modifying it or adding a new helper function instead of
+	 * calling ReadBufferExtended() directly. We pass mode RBM_ZERO_AND_LOCK
+	 * because we want to hold an exclusive lock on the buffer content
 	 */
-	PageSetChecksumInplace(metapage, BTREE_METAPAGE);
-	smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
-			  (char *) metapage, true);
-	log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-				BTREE_METAPAGE, metapage, true);
+	metabuf = ReadBufferExtended(index, MAIN_FORKNUM, P_NEW, RBM_ZERO_AND_LOCK, NULL);
 
-	/*
-	 * An immediate sync is required even if we xlog'd the page, because the
-	 * write did not go through shared_buffers and therefore a concurrent
-	 * checkpoint may have moved the redo pointer past our xlog record.
-	 */
-	smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+	metapage = BufferGetPage(metabuf);
+	_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
+
+	START_CRIT_SECTION();
+	MarkBufferDirty(metabuf);
+	log_newpage_buffer(metabuf, true);
+	END_CRIT_SECTION();
+	_bt_relbuf(index, metabuf);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 2c4d7f6e25..bde02361e1 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -233,6 +233,7 @@ typedef struct BTPageState
 {
 	Page		btps_page;		/* workspace for page building */
 	BlockNumber btps_blkno;		/* block # to write this page at */
+	Buffer btps_buf; /* buffer to write this page to */
 	IndexTuple	btps_lowkey;	/* page's strict lower bound pivot tuple */
 	OffsetNumber btps_lastoff;	/* last item offset loaded */
 	Size		btps_lastextra; /* last item's extra posting list space */
@@ -250,9 +251,11 @@ typedef struct BTWriteState
 	Relation	index;
 	BTScanInsert inskey;		/* generic insertion scankey */
 	bool		btws_use_wal;	/* dump pages to WAL? */
-	BlockNumber btws_pages_alloced; /* # pages allocated */
-	BlockNumber btws_pages_written; /* # pages written out */
+	BlockNumber btws_pages_alloced; /* # pages allocated for index build outside SB */
+	BlockNumber btws_pages_written; /* # pages written out for index build outside SB */
 	Page		btws_zeropage;	/* workspace for filling zeroes */
+	XLogRecPtr redo; /* cached redo pointer to determine if backend fsync is required at end of index build */
+	bool use_shared_buffers;
 } BTWriteState;
 
 
@@ -261,10 +264,11 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2,
+						  bool use_shared_buffers);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
-static Page _bt_blnewpage(uint32 level);
+static Page _bt_blnewpage(uint32 level, Buffer buf);
 static BTPageState *_bt_pagestate(BTWriteState *wstate, uint32 level);
 static void _bt_slideleft(Page rightmostpage);
 static void _bt_sortaddtup(Page page, Size itemsize,
@@ -275,7 +279,8 @@ static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
 										  BTPageState *state,
 										  BTDedupState dstate);
-static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state,
+							  Buffer metabuf, Page metapage);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
 static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -323,13 +328,24 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
+	/*
+	 * Based on the number of tuples, either create a buffered or unbuffered
+	 * write state. if the number of tuples is small, make a buffered write
+	 * if the number of tuples is larger, then we make an unbuffered write state
+	 * and must ensure that we check the redo pointer to know whether or not we
+	 * need to fsync ourselves
+	 */
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2);
+	if (reltuples > 1000)
+		_bt_leafbuild(buildstate.spool, buildstate.spool2, false);
+	else
+		_bt_leafbuild(buildstate.spool, buildstate.spool2, true);
+
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
@@ -535,7 +551,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool use_shared_buffers)
 {
 	BTWriteState wstate;
 
@@ -565,9 +581,11 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 	wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
 
 	/* reserve the metapage */
-	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
+	wstate.btws_pages_alloced = 0;
 	wstate.btws_pages_written = 0;
 	wstate.btws_zeropage = NULL;	/* until needed */
+	wstate.redo = InvalidXLogRecPtr;
+	wstate.use_shared_buffers = use_shared_buffers;
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
@@ -605,14 +623,18 @@ _bt_build_callback(Relation index,
 
 /*
  * allocate workspace for a new, clean btree page, not linked to any siblings.
+ * If index is not built in shared buffers, buf should be InvalidBuffer
  */
 static Page
-_bt_blnewpage(uint32 level)
+_bt_blnewpage(uint32 level, Buffer buf)
 {
 	Page		page;
 	BTPageOpaque opaque;
 
-	page = (Page) palloc(BLCKSZ);
+	if (buf)
+		page = BufferGetPage(buf);
+	else
+		page = (Page) palloc(BLCKSZ);
 
 	/* Zero the page and set up standard page header info */
 	_bt_pageinit(page, BLCKSZ);
@@ -634,8 +656,20 @@ _bt_blnewpage(uint32 level)
  * emit a completed btree page, and release the working storage.
  */
 static void
-_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
+_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf)
 {
+	if (wstate->use_shared_buffers)
+	{
+		Assert(buf);
+		START_CRIT_SECTION();
+		MarkBufferDirty(buf);
+		if (wstate->btws_use_wal)
+			log_newpage_buffer(buf, true);
+		END_CRIT_SECTION();
+		_bt_relbuf(wstate->index, buf);
+		return;
+	}
+
 	/* Ensure rd_smgr is open (could have been closed by relcache flush!) */
 	RelationOpenSmgr(wstate->index);
 
@@ -661,7 +695,7 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 		smgrextend(wstate->index->rd_smgr, MAIN_FORKNUM,
 				   wstate->btws_pages_written++,
 				   (char *) wstate->btws_zeropage,
-				   true);
+				   false);
 	}
 
 	PageSetChecksumInplace(page, blkno);
@@ -674,14 +708,14 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 	{
 		/* extending the file... */
 		smgrextend(wstate->index->rd_smgr, MAIN_FORKNUM, blkno,
-				   (char *) page, true);
+				   (char *) page, false);
 		wstate->btws_pages_written++;
 	}
 	else
 	{
 		/* overwriting a block we zero-filled before */
 		smgrwrite(wstate->index->rd_smgr, MAIN_FORKNUM, blkno,
-				  (char *) page, true);
+				  (char *) page, false);
 	}
 
 	pfree(page);
@@ -694,13 +728,37 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 static BTPageState *
 _bt_pagestate(BTWriteState *wstate, uint32 level)
 {
+	Buffer       buf;
+	BlockNumber blkno;
+
 	BTPageState *state = (BTPageState *) palloc0(sizeof(BTPageState));
 
-	/* create initial page for level */
-	state->btps_page = _bt_blnewpage(level);
+	if (wstate->use_shared_buffers)
+	{
+		/*
+		 * Because we don't need to lock the relation for extension (since
+		 * noone knows about it yet) and we don't need to initialize the
+		 * new page, as it is done below by _bt_blnewpage(), _bt_getbuf()
+		 * (with P_NEW and BT_WRITE) is overkill. However, it might be worth
+		 * either modifying it or adding a new helper function instead of
+		 * calling ReadBufferExtended() directly. We pass mode RBM_ZERO_AND_LOCK
+		 * because we want to hold an exclusive lock on the buffer content
+		 */
+		buf = ReadBufferExtended(wstate->index, MAIN_FORKNUM, P_NEW, RBM_ZERO_AND_LOCK, NULL);
+
+		blkno = BufferGetBlockNumber(buf);
+	}
+	else
+	{
+		buf = InvalidBuffer;
+		blkno = wstate->btws_pages_alloced++;
+	}
 
+	/* create initial page for level */
+	state->btps_page = _bt_blnewpage(level, buf);
 	/* and assign it a page position */
-	state->btps_blkno = wstate->btws_pages_alloced++;
+	state->btps_blkno = blkno;
+	state->btps_buf = buf;
 
 	state->btps_lowkey = NULL;
 	/* initialize lastoff so first item goes into P_FIRSTKEY */
@@ -835,6 +893,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 {
 	Page		npage;
 	BlockNumber nblkno;
+	Buffer nbuf;
 	OffsetNumber last_off;
 	Size		last_truncextra;
 	Size		pgspc;
@@ -849,6 +908,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 
 	npage = state->btps_page;
 	nblkno = state->btps_blkno;
+	nbuf = state->btps_buf;
 	last_off = state->btps_lastoff;
 	last_truncextra = state->btps_lastextra;
 	state->btps_lastextra = truncextra;
@@ -905,16 +965,37 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 		 */
 		Page		opage = npage;
 		BlockNumber oblkno = nblkno;
+		Buffer obuf = nbuf;
 		ItemId		ii;
 		ItemId		hii;
 		IndexTuple	oitup;
 
-		/* Create new page of same level */
-		npage = _bt_blnewpage(state->btps_level);
+		if (wstate->use_shared_buffers)
+		{
+			/*
+			 * Get a new shared buffer.
+			 * Because we don't need to lock the relation for extension (since
+			 * noone knows about it yet) and we don't need to initialize the
+			 * new page, as it is done below by _bt_blnewpage(), _bt_getbuf()
+			 * (with P_NEW and BT_WRITE) is overkill. However, it might be worth
+			 * either modifying it or adding a new helper function instead of
+			 * calling ReadBufferExtended() directly. We pass mode RBM_ZERO_AND_LOCK
+			 * because we want to hold an exclusive lock on the buffer content
+			 */
+			nbuf = ReadBufferExtended(wstate->index, MAIN_FORKNUM, P_NEW, RBM_ZERO_AND_LOCK, NULL);
 
-		/* and assign it a page position */
-		nblkno = wstate->btws_pages_alloced++;
+			/* assign a page position */
+			nblkno = BufferGetBlockNumber(nbuf);
+		}
+		else
+		{
+			nbuf = InvalidBuffer;
+			/* assign a page position */
+			nblkno = wstate->btws_pages_alloced++;
+		}
 
+		/* Create new page of same level */
+		npage = _bt_blnewpage(state->btps_level, nbuf);
 		/*
 		 * We copy the last item on the page into the new page, and then
 		 * rearrange the old page so that the 'last item' becomes its high key
@@ -1023,10 +1104,10 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 
 		/*
 		 * Write out the old page.  We never need to touch it again, so we can
-		 * free the opage workspace too.
+		 * free the opage workspace too. obuf has been released and is no longer
+		 * valid.
 		 */
-		_bt_blwritepage(wstate, opage, oblkno);
-
+		 _bt_blwritepage(wstate, opage, oblkno, obuf);
 		/*
 		 * Reset last_off to point to new page
 		 */
@@ -1060,6 +1141,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 
 	state->btps_page = npage;
 	state->btps_blkno = nblkno;
+	state->btps_buf = nbuf;
 	state->btps_lastoff = last_off;
 }
 
@@ -1105,12 +1187,11 @@ _bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
  * Finish writing out the completed btree.
  */
 static void
-_bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
+_bt_uppershutdown(BTWriteState *wstate, BTPageState *state, Buffer metabuf, Page metapage)
 {
 	BTPageState *s;
 	BlockNumber rootblkno = P_NONE;
 	uint32		rootlevel = 0;
-	Page		metapage;
 
 	/*
 	 * Each iteration of this loop completes one more level of the tree.
@@ -1156,20 +1237,22 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		 * back one slot.  Then we can dump out the page.
 		 */
 		_bt_slideleft(s->btps_page);
-		_bt_blwritepage(wstate, s->btps_page, s->btps_blkno);
+		_bt_blwritepage(wstate, s->btps_page, s->btps_blkno, s->btps_buf);
+		s->btps_buf = InvalidBuffer;
 		s->btps_page = NULL;	/* writepage freed the workspace */
 	}
 
 	/*
-	 * As the last step in the process, construct the metapage and make it
+	 * As the last step in the process, initialize the metapage and make it
 	 * point to the new root (unless we had no data at all, in which case it's
 	 * set to point to "P_NONE").  This changes the index to the "valid" state
 	 * by filling in a valid magic number in the metapage.
+	 * After this, metapage will have been freed or invalid and metabuf, if ever
+	 * valid, will have been released.
 	 */
-	metapage = (Page) palloc(BLCKSZ);
 	_bt_initmetapage(metapage, rootblkno, rootlevel,
 					 wstate->inskey->allequalimage);
-	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
+	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE, metabuf);
 }
 
 /*
@@ -1190,10 +1273,47 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
 	bool		deduplicate;
+	Buffer metabuf;
+	Page metapage;
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
 
+	/*
+	 * Extend the index relation upfront to reserve the metapage
+	 */
+	if (wstate->use_shared_buffers)
+	{
+		/*
+		 * We should not need to LockRelationForExtension() as no one else knows
+		 * about this index yet?
+		 * Extend the index relation by one block for the metapage. _bt_getbuf()
+		 * is not used here as it does _bt_pageinit() which is one later by
+		 * _bt_initmetapage(). We will fill in the metapage and write it out at
+		 * the end of index build when we have all of the information required
+		 * for the metapage. However, we initially extend the relation for it to
+		 * occupy block 0 because it is much easier when using shared buffers to
+		 * extend the relation with a block number that is always increasing by
+		 * 1. Also, by passing RBM_ZERO_AND_LOCK, we have LW_EXCLUSIVE on the
+		 * buffer content and thus don't need _bt_lockbuf().
+		 */
+		metabuf = ReadBufferExtended(wstate->index, MAIN_FORKNUM, P_NEW, RBM_ZERO_AND_LOCK, NULL);
+		metapage = BufferGetPage(metabuf);
+	}
+	else
+	{
+		wstate->redo = GetRedoRecPtr();
+		metabuf = InvalidBuffer;
+		metapage = (Page) palloc(BLCKSZ);
+		RelationOpenSmgr(wstate->index);
+
+		/* extending the file... */
+		smgrextend(wstate->index->rd_smgr, MAIN_FORKNUM, BTREE_METAPAGE,
+		           (char *) metapage, false);
+		wstate->btws_pages_written++;
+		wstate->btws_pages_alloced++;
+	}
+
 	if (merge)
 	{
 		/*
@@ -1415,7 +1535,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	}
 
 	/* Close down final pages and write the metapage */
-	_bt_uppershutdown(wstate, state);
+	_bt_uppershutdown(wstate, state, metabuf, metapage);
 
 	/*
 	 * When we WAL-logged index pages, we must nonetheless fsync index files.
@@ -1428,8 +1548,11 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	 */
 	if (wstate->btws_use_wal)
 	{
-		RelationOpenSmgr(wstate->index);
-		smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
+		if (!wstate->use_shared_buffers && RedoRecPtrChanged(wstate->redo))
+		{
+			RelationOpenSmgr(wstate->index);
+			smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
+		}
 	}
 }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index adfc6f67e2..d3b6c60278 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8546,6 +8546,20 @@ GetRedoRecPtr(void)
 	return RedoRecPtr;
 }
 
+bool
+RedoRecPtrChanged(XLogRecPtr comparator_ptr)
+{
+	XLogRecPtr	ptr;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ptr = XLogCtl->RedoRecPtr;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	if (RedoRecPtr < ptr)
+		RedoRecPtr = ptr;
+	return RedoRecPtr != comparator_ptr;
+}
+
 /*
  * Return information needed to decide whether a modified block needs a
  * full-page image to be included in the WAL record.
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f542af0a26..44e4b01559 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -346,6 +346,7 @@ extern XLogRecPtr XLogRestorePoint(const char *rpName);
 extern void UpdateFullPageWrites(void);
 extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p);
 extern XLogRecPtr GetRedoRecPtr(void);
+extern bool RedoRecPtrChanged(XLogRecPtr comparator_ptr);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
-- 
2.27.0

[1]: /messages/by-id/20200124195226.lth52iydq2n2uilq@alap3.anarazel.de

melanieplageman@gmail.com

over 4 years ago

In reply to: Melanie Plageman (#4)

1 attachment(s)

Re: Avoiding smgrimmedsync() during nbtree index builds

On Mon, May 3, 2021 at 5:24 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Thu, Jan 21, 2021 at 5:51 PM Andres Freund <andres@anarazel.de> wrote:

On 2021-01-21 23:54:04 +0200, Heikki Linnakangas wrote:

On 21/01/2021 22:36, Andres Freund wrote:

Thinking through the correctness of replacing smgrimmedsync() with sync
requests, the potential problems that I can see are:

1) redo point falls between the log_newpage() and the
write()/register_dirty_segment() in smgrextend/smgrwrite.
2) redo point falls between write() and register_dirty_segment()

But both of these are fine in the context of initially filling a newly
created relfilenode, as far as I can tell? Otherwise the current
smgrimmedsync() approach wouldn't be safe either, as far as I can tell?

Hmm. If the redo point falls between write() and the
register_dirty_segment(), and the checkpointer finishes the whole checkpoint
before register_dirty_segment(), you are not safe. That can't happen with
write from the buffer manager, because the checkpointer would block waiting
for the flush of the buffer to finish.

Hm, right.

The easiest way to address that race would be to just record the redo
pointer in _bt_leafbuild() and continue to do the smgrimmedsync if a
checkpoint started since the start of the index build.

Another approach would be to utilize PGPROC.delayChkpt, but I would
rather not unnecessarily expand the use of that.

It's kind of interesting - in my aio branch I moved the
register_dirty_segment() to before the actual asynchronous write (due to
availability of the necessary data), which ought to be safe because of
the buffer interlocking. But that doesn't work here, or for other places
doing writes without going through s_b. It'd be great if we could come
up with a general solution, but I don't immediately see anything great.

The best I can come up with is adding helper functions to wrap some of
the complexity for "unbuffered" writes of doing an immedsync iff the
redo pointer changed. Something very roughly like

typedef struct UnbufferedWriteState { XLogRecPtr redo; uint64 numwrites;} UnbufferedWriteState;
void unbuffered_prep(UnbufferedWriteState* state);
void unbuffered_write(UnbufferedWriteState* state, ...);
void unbuffered_extend(UnbufferedWriteState* state, ...);
void unbuffered_finish(UnbufferedWriteState* state);

which wouldn't just do the dance to avoid the immedsync() if possible,
but also took care of PageSetChecksumInplace() (and PageEncryptInplace()
if we get that [1]).

Regarding the implementation, I think having an API to do these
"unbuffered" or "direct" writes outside of shared buffers is a good
idea. In this specific case, the proposed API would not change the code
much. I would just wrap the small diffs I added to the beginning and end
of _bt_load() in the API calls for unbuffered_prep() and
unbuffered_finish() and then tuck away the second half of
_bt_blwritepage() in unbuffered_write()/unbuffered_extend(). I figured I
would do so after ensuring the correctness of the logic in this patch.
Then I will work on a patch which implements the unbuffered_write() API
and demonstrates its utility with at least a few of the most compelling
most compelling use cases in the code.

I've taken a pass at writing the API for "direct" or "unbuffered" writes
and extends. It introduces the suggested functions: unbuffered_prep(),
unbuffered_finish(), unbuffered_write(), and unbuffered_extend().

This is a rough cut -- corrections welcome and encouraged!

unbuffered_prep() saves the xlog redo pointer at the time it is called.
Then, if the redo pointer hasn't changed by the time unbuffered_finish()
is called, the backend can avoid calling smgrimmedsync(). Note that this
only works if intervening calls to smgrwrite() and smgrextend() pass
skipFsync=False.

unbuffered_write() and unbuffered_extend() might be able to be used even
if unbuffered_prep() and unbuffered_finish() are not used -- for example
hash indexes do something I don't entirely understand in which they call
smgrextend() directly when allocating buckets but then initialize the
new bucket pages using the bufmgr machinery.

I also intend to move accounting of pages written and extended into the
unbuffered_write() and unbuffered_extend() functions using the functions
I propose in [1]/messages/by-id/20200124195226.lth52iydq2n2uilq@alap3.anarazel.de to populate a new view, pg_stat_buffers. Then this
"unbuffered" IO would be counted in stats.

I picked the name "direct" for the directory in /src/backend/storage
because I thought that these functions are analogous to direct IO in
Linux -- in that they are doing IO without going through Postgres bufmgr
-- unPGbuffered, basically. Other suggestions were "raw" and "relIO".
Raw seemed confusing since raw device IO is pretty far from what is
happening here. RelIO didn't seem like it belonged next to bufmgr (to
me). However, direct and unbuffered will both soon become fraught
terminology with the introduction of AIO and direct IO to Postgres...

- Melanie

Attachments:

v1-0001-Add-unbuffered-IO-and-avoid-immed-fsync.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Add-unbuffered-IO-and-avoid-immed-fsync.patchDownload

From f68147500e88741debdae730763fb57d42c6ab79 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 28 Sep 2021 14:51:11 -0400
Subject: [PATCH v1] Add unbuffered IO and avoid immed fsync

Replace unbuffered extends and writes
---
 src/backend/access/gist/gistbuild.c       | 16 +++----
 src/backend/access/hash/hashpage.c        |  9 ++--
 src/backend/access/heap/heapam_handler.c  | 16 ++++---
 src/backend/access/heap/rewriteheap.c     | 25 ++++-------
 src/backend/access/heap/visibilitymap.c   |  8 ++--
 src/backend/access/nbtree/nbtree.c        | 17 ++++---
 src/backend/access/nbtree/nbtsort.c       | 39 +++++-----------
 src/backend/access/spgist/spginsert.c     | 25 +++++------
 src/backend/access/transam/xlog.c         | 13 ++++++
 src/backend/catalog/storage.c             | 25 +++--------
 src/backend/storage/Makefile              |  2 +-
 src/backend/storage/direct/Makefile       | 17 +++++++
 src/backend/storage/direct/directmgr.c    | 55 +++++++++++++++++++++++
 src/backend/storage/freespace/freespace.c | 10 +++--
 src/include/access/xlog.h                 |  1 +
 15 files changed, 171 insertions(+), 107 deletions(-)
 create mode 100644 src/backend/storage/direct/Makefile
 create mode 100644 src/backend/storage/direct/directmgr.c

diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index baad28c09f..34f712590c 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -43,6 +43,7 @@
 #include "miscadmin.h"
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
+#include "storage/directmgr.h"
 #include "storage/smgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -91,6 +92,7 @@ typedef struct
 
 	int64		indtuples;		/* number of tuples indexed */
 
+	UnBufferedWriteState ub_wstate;
 	/*
 	 * Extra data structures used during a buffering build. 'gfbb' contains
 	 * information related to managing the build buffers. 'parentMap' is a
@@ -194,6 +196,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.heaprel = heap;
 	buildstate.sortstate = NULL;
 	buildstate.giststate = initGISTstate(index);
+	buildstate.ub_wstate.smgr_rel = RelationGetSmgr(index);
 
 	/*
 	 * Create a temporary memory context that is reset once for each tuple
@@ -403,14 +406,14 @@ gist_indexsortbuild(GISTBuildState *state)
 	state->pages_allocated = 0;
 	state->pages_written = 0;
 	state->ready_num_pages = 0;
+	unbuffered_prep(&state->ub_wstate);
 
 	/*
 	 * Write an empty page as a placeholder for the root page. It will be
 	 * replaced with the real root page at the end.
 	 */
 	page = palloc0(BLCKSZ);
-	smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
-			   page, true);
+	unbuffered_extend(&state->ub_wstate, MAIN_FORKNUM, GIST_ROOT_BLKNO, page, true);
 	state->pages_allocated++;
 	state->pages_written++;
 
@@ -450,13 +453,12 @@ gist_indexsortbuild(GISTBuildState *state)
 
 	/* Write out the root */
 	PageSetLSN(pagestate->page, GistBuildLSN);
-	PageSetChecksumInplace(pagestate->page, GIST_ROOT_BLKNO);
-	smgrwrite(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
-			  pagestate->page, true);
+	unbuffered_write(&state->ub_wstate, MAIN_FORKNUM, GIST_ROOT_BLKNO, pagestate->page);
 	if (RelationNeedsWAL(state->indexrel))
 		log_newpage(&state->indexrel->rd_node, MAIN_FORKNUM, GIST_ROOT_BLKNO,
 					pagestate->page, true);
 
+	unbuffered_finish(&state->ub_wstate, MAIN_FORKNUM);
 	pfree(pagestate->page);
 	pfree(pagestate);
 }
@@ -570,9 +572,7 @@ gist_indexsortbuild_flush_ready_pages(GISTBuildState *state)
 			elog(ERROR, "unexpected block number to flush GiST sorting build");
 
 		PageSetLSN(page, GistBuildLSN);
-		PageSetChecksumInplace(page, blkno);
-		smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, blkno, page,
-				   true);
+		unbuffered_extend(&state->ub_wstate, MAIN_FORKNUM, blkno, page, false);
 
 		state->pages_written++;
 	}
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 159646c7c3..fcc0e28a36 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -32,6 +32,7 @@
 #include "access/hash_xlog.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
+#include "storage/directmgr.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
 #include "storage/smgr.h"
@@ -990,8 +991,10 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 	PGAlignedBlock zerobuf;
 	Page		page;
 	HashPageOpaque ovflopaque;
+	UnBufferedWriteState ub_wstate;
 
 	lastblock = firstblock + nblocks - 1;
+	ub_wstate.smgr_rel = RelationGetSmgr(rel);
 
 	/*
 	 * Check for overflow in block number calculation; if so, we cannot extend
@@ -1000,6 +1003,8 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 	if (lastblock < firstblock || lastblock == InvalidBlockNumber)
 		return false;
 
+	unbuffered_prep(&ub_wstate);
+
 	page = (Page) zerobuf.data;
 
 	/*
@@ -1024,9 +1029,7 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 					zerobuf.data,
 					true);
 
-	PageSetChecksumInplace(page, lastblock);
-	smgrextend(RelationGetSmgr(rel), MAIN_FORKNUM, lastblock, zerobuf.data,
-			   false);
+	unbuffered_extend(&ub_wstate, MAIN_FORKNUM, lastblock, zerobuf.data, false);
 
 	return true;
 }
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 9befe012a9..fa4780e186 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -38,6 +38,7 @@
 #include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/bufpage.h"
+#include "storage/directmgr.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
 #include "storage/procarray.h"
@@ -575,6 +576,7 @@ heapam_relation_set_new_filenode(Relation rel,
 								 MultiXactId *minmulti)
 {
 	SMgrRelation srel;
+	UnBufferedWriteState ub_wstate;
 
 	/*
 	 * Initialize to the minimum XID that could put tuples in the table. We
@@ -594,15 +596,15 @@ heapam_relation_set_new_filenode(Relation rel,
 	*minmulti = GetOldestMultiXactId();
 
 	srel = RelationCreateStorage(*newrnode, persistence);
+	ub_wstate.smgr_rel = srel;
+	unbuffered_prep(&ub_wstate);
 
 	/*
 	 * If required, set up an init fork for an unlogged table so that it can
-	 * be correctly reinitialized on restart.  An immediate sync is required
-	 * even if the page has been logged, because the write did not go through
-	 * shared_buffers and therefore a concurrent checkpoint may have moved the
-	 * redo pointer past our xlog record.  Recovery may as well remove it
-	 * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
-	 * record. Therefore, logging is necessary even if wal_level=minimal.
+	 * be correctly reinitialized on restart.
+	 * Recovery may as well remove our xlog record while replaying, for
+	 * example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record. Therefore,
+	 * logging is necessary even if wal_level=minimal.
 	 */
 	if (persistence == RELPERSISTENCE_UNLOGGED)
 	{
@@ -611,7 +613,7 @@ heapam_relation_set_new_filenode(Relation rel,
 			   rel->rd_rel->relkind == RELKIND_TOASTVALUE);
 		smgrcreate(srel, INIT_FORKNUM, false);
 		log_smgrcreate(newrnode, INIT_FORKNUM);
-		smgrimmedsync(srel, INIT_FORKNUM);
+		unbuffered_finish(&ub_wstate, INIT_FORKNUM);
 	}
 
 	smgrclose(srel);
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 986a776bbd..6d1f4c8f6b 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -119,6 +119,7 @@
 #include "replication/logical.h"
 #include "replication/slot.h"
 #include "storage/bufmgr.h"
+#include "storage/directmgr.h"
 #include "storage/fd.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
@@ -152,6 +153,7 @@ typedef struct RewriteStateData
 	HTAB	   *rs_old_new_tid_map; /* unmatched B tuples */
 	HTAB	   *rs_logical_mappings;	/* logical remapping files */
 	uint32		rs_num_rewrite_mappings;	/* # in memory mappings */
+	UnBufferedWriteState rs_unbuffered_wstate;
 }			RewriteStateData;
 
 /*
@@ -264,6 +266,9 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
 	state->rs_freeze_xid = freeze_xid;
 	state->rs_cutoff_multi = cutoff_multi;
 	state->rs_cxt = rw_cxt;
+	state->rs_unbuffered_wstate.smgr_rel = RelationGetSmgr(state->rs_new_rel);
+
+	unbuffered_prep(&state->rs_unbuffered_wstate);
 
 	/* Initialize hash tables used to track update chains */
 	hash_ctl.keysize = sizeof(TidHashKey);
@@ -324,21 +329,12 @@ end_heap_rewrite(RewriteState state)
 						state->rs_buffer,
 						true);
 
-		PageSetChecksumInplace(state->rs_buffer, state->rs_blockno);
-
-		smgrextend(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,
-				   state->rs_blockno, (char *) state->rs_buffer, true);
+		unbuffered_extend(&state->rs_unbuffered_wstate, MAIN_FORKNUM,
+				state->rs_blockno, state->rs_buffer, false);
 	}
 
-	/*
-	 * When we WAL-logged rel pages, we must nonetheless fsync them.  The
-	 * reason is the same as in storage.c's RelationCopyStorage(): we're
-	 * writing data that's not in shared buffers, and so a CHECKPOINT
-	 * occurring during the rewriteheap operation won't have fsync'd data we
-	 * wrote before the checkpoint.
-	 */
 	if (RelationNeedsWAL(state->rs_new_rel))
-		smgrimmedsync(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM);
+		unbuffered_finish(&state->rs_unbuffered_wstate, MAIN_FORKNUM);
 
 	logical_end_heap_rewrite(state);
 
@@ -690,10 +686,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
 			 * need for smgr to schedule an fsync for this write; we'll do it
 			 * ourselves in end_heap_rewrite.
 			 */
-			PageSetChecksumInplace(page, state->rs_blockno);
-
-			smgrextend(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,
-					   state->rs_blockno, (char *) page, true);
+			unbuffered_extend(&state->rs_unbuffered_wstate, MAIN_FORKNUM, state->rs_blockno, page, false);
 
 			state->rs_blockno++;
 			state->rs_buffer_valid = false;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 114fbbdd30..0f357fc4ff 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -92,6 +92,7 @@
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/bufmgr.h"
+#include "storage/directmgr.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
 #include "utils/inval.h"
@@ -616,7 +617,9 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
 	BlockNumber vm_nblocks_now;
 	PGAlignedBlock pg;
 	SMgrRelation reln;
+	UnBufferedWriteState ub_wstate;
 
+	ub_wstate.smgr_rel = RelationGetSmgr(rel);
 	PageInit((Page) pg.data, BLCKSZ, 0);
 
 	/*
@@ -654,9 +657,8 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
 	/* Now extend the file */
 	while (vm_nblocks_now < vm_nblocks)
 	{
-		PageSetChecksumInplace((Page) pg.data, vm_nblocks_now);
-
-		smgrextend(reln, VISIBILITYMAP_FORKNUM, vm_nblocks_now, pg.data, false);
+		// TODO: aren't these pages empty? why checksum them
+		unbuffered_extend(&ub_wstate, VISIBILITYMAP_FORKNUM, vm_nblocks_now, (Page) pg.data, false);
 		vm_nblocks_now++;
 	}
 
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 40ad0956e0..c3e3418570 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -29,6 +29,7 @@
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "storage/condition_variable.h"
+#include "storage/directmgr.h"
 #include "storage/indexfsm.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
@@ -150,6 +151,11 @@ void
 btbuildempty(Relation index)
 {
 	Page		metapage;
+	UnBufferedWriteState wstate;
+
+	wstate.smgr_rel = RelationGetSmgr(index);
+
+	unbuffered_prep(&wstate);
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
@@ -162,18 +168,15 @@ btbuildempty(Relation index)
 	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
-	PageSetChecksumInplace(metapage, BTREE_METAPAGE);
-	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BTREE_METAPAGE,
-			  (char *) metapage, true);
+	unbuffered_write(&wstate, INIT_FORKNUM, BTREE_METAPAGE, metapage);
 	log_newpage(&RelationGetSmgr(index)->smgr_rnode.node, INIT_FORKNUM,
 				BTREE_METAPAGE, metapage, true);
 
 	/*
-	 * An immediate sync is required even if we xlog'd the page, because the
-	 * write did not go through shared_buffers and therefore a concurrent
-	 * checkpoint may have moved the redo pointer past our xlog record.
+	 * Even though we xlog'd the page, a concurrent checkpoint may have moved
+	 * the redo pointer past our xlog record, so we may still need to fsync.
 	 */
-	smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
+	unbuffered_finish(&wstate, INIT_FORKNUM);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 54c8eb1289..9cb9757875 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -57,6 +57,7 @@
 #include "executor/instrument.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "storage/directmgr.h"
 #include "storage/smgr.h"
 #include "tcop/tcopprot.h"		/* pgrminclude ignore */
 #include "utils/rel.h"
@@ -253,6 +254,7 @@ typedef struct BTWriteState
 	BlockNumber btws_pages_alloced; /* # pages allocated */
 	BlockNumber btws_pages_written; /* # pages written out */
 	Page		btws_zeropage;	/* workspace for filling zeroes */
+	UnBufferedWriteState ub_wstate;
 } BTWriteState;
 
 
@@ -560,6 +562,8 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
+	wstate.ub_wstate.smgr_rel = RelationGetSmgr(btspool->index);
+	wstate.ub_wstate.redo = InvalidXLogRecPtr;
 	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 	/* _bt_mkscankey() won't set allequalimage without metapage */
 	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
@@ -656,31 +660,19 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 		if (!wstate->btws_zeropage)
 			wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
 		/* don't set checksum for all-zero page */
-		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
-				   wstate->btws_pages_written++,
-				   (char *) wstate->btws_zeropage,
-				   true);
+		unbuffered_extend(&wstate->ub_wstate, MAIN_FORKNUM, wstate->btws_pages_written++, wstate->btws_zeropage, true);
 	}
 
-	PageSetChecksumInplace(page, blkno);
 
-	/*
-	 * Now write the page.  There's no need for smgr to schedule an fsync for
-	 * this write; we'll do it ourselves before ending the build.
-	 */
+	/* Now write the page. Either we are extending the file... */
 	if (blkno == wstate->btws_pages_written)
 	{
-		/* extending the file... */
-		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM, blkno,
-				   (char *) page, true);
+		unbuffered_extend(&wstate->ub_wstate, MAIN_FORKNUM, blkno, page, false);
 		wstate->btws_pages_written++;
 	}
+	/* or we are overwriting a block we zero-filled before. */
 	else
-	{
-		/* overwriting a block we zero-filled before */
-		smgrwrite(RelationGetSmgr(wstate->index), MAIN_FORKNUM, blkno,
-				  (char *) page, true);
-	}
+		unbuffered_write(&wstate->ub_wstate, MAIN_FORKNUM, blkno, page);
 
 	pfree(page);
 }
@@ -1189,6 +1181,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	int64		tuples_done = 0;
 	bool		deduplicate;
 
+
+	unbuffered_prep(&wstate->ub_wstate);
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
 
@@ -1415,17 +1409,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	/* Close down final pages and write the metapage */
 	_bt_uppershutdown(wstate, state);
 
-	/*
-	 * When we WAL-logged index pages, we must nonetheless fsync index files.
-	 * Since we're building outside shared buffers, a CHECKPOINT occurring
-	 * during the build has no way to flush the previously written data to
-	 * disk (indeed it won't know the index even exists).  A crash later on
-	 * would replay WAL from the checkpoint, therefore it wouldn't replay our
-	 * earlier WAL entries. If we do not fsync those pages here, they might
-	 * still not be on disk when the crash occurs.
-	 */
 	if (wstate->btws_use_wal)
-		smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
+		unbuffered_finish(&wstate->ub_wstate, MAIN_FORKNUM);
 }
 
 /*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index cc4394b1c8..1aeb8bc714 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -25,6 +25,7 @@
 #include "catalog/index.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/directmgr.h"
 #include "storage/smgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -156,6 +157,10 @@ void
 spgbuildempty(Relation index)
 {
 	Page		page;
+	UnBufferedWriteState wstate;
+
+	wstate.smgr_rel = RelationGetSmgr(index);
+	unbuffered_prep(&wstate);
 
 	/* Construct metapage. */
 	page = (Page) palloc(BLCKSZ);
@@ -168,36 +173,30 @@ spgbuildempty(Relation index)
 	 * of their existing content when the corresponding create records are
 	 * replayed.
 	 */
-	PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
-	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
-			  (char *) page, true);
+	unbuffered_write(&wstate, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO, page);
 	log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
 				SPGIST_METAPAGE_BLKNO, page, true);
 
 	/* Likewise for the root page. */
 	SpGistInitPage(page, SPGIST_LEAF);
 
-	PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO);
-	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, SPGIST_ROOT_BLKNO,
-			  (char *) page, true);
+	unbuffered_write(&wstate, INIT_FORKNUM, SPGIST_ROOT_BLKNO, page);
 	log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
 				SPGIST_ROOT_BLKNO, page, true);
 
 	/* Likewise for the null-tuples root page. */
 	SpGistInitPage(page, SPGIST_LEAF | SPGIST_NULLS);
 
-	PageSetChecksumInplace(page, SPGIST_NULL_BLKNO);
-	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, SPGIST_NULL_BLKNO,
-			  (char *) page, true);
+	unbuffered_write(&wstate, INIT_FORKNUM, SPGIST_NULL_BLKNO, page);
 	log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
 				SPGIST_NULL_BLKNO, page, true);
 
 	/*
-	 * An immediate sync is required even if we xlog'd the pages, because the
-	 * writes did not go through shared buffers and therefore a concurrent
-	 * checkpoint may have moved the redo pointer past our xlog record.
+	 * Because the writes did not go through shared buffers, if a concurrent
+	 * checkpoint moved the redo pointer past our xlog record, an immediate
+	 * sync is required even if we xlog'd the pages.
 	 */
-	smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
+	unbuffered_finish(&wstate, INIT_FORKNUM);
 }
 
 /*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a749d..d11a928b62 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12941,6 +12941,19 @@ CheckForStandbyTrigger(void)
 	return false;
 }
 
+bool RedoRecPtrChanged(XLogRecPtr comparator_ptr)
+{
+	XLogRecPtr ptr;
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ptr = XLogCtl->RedoRecPtr;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	if (RedoRecPtr < ptr)
+		RedoRecPtr = ptr;
+
+	return RedoRecPtr != comparator_ptr;
+}
+
 /*
  * Remove the files signaling a standby promotion request.
  */
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index c5ad28d71f..c63085b1aa 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -28,6 +28,7 @@
 #include "catalog/storage.h"
 #include "catalog/storage_xlog.h"
 #include "miscadmin.h"
+#include "storage/directmgr.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
 #include "utils/hsearch.h"
@@ -420,6 +421,10 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 	bool		copying_initfork;
 	BlockNumber nblocks;
 	BlockNumber blkno;
+	UnBufferedWriteState wstate;
+
+	wstate.smgr_rel = dst;
+	unbuffered_prep(&wstate);
 
 	page = (Page) buf.data;
 
@@ -477,27 +482,11 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 		if (use_wal)
 			log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page, false);
 
-		PageSetChecksumInplace(page, blkno);
-
-		/*
-		 * Now write the page.  We say skipFsync = true because there's no
-		 * need for smgr to schedule an fsync for this write; we'll do it
-		 * ourselves below.
-		 */
-		smgrextend(dst, forkNum, blkno, buf.data, true);
+		unbuffered_extend(&wstate, forkNum, blkno, buf.data, false);
 	}
 
-	/*
-	 * When we WAL-logged rel pages, we must nonetheless fsync them.  The
-	 * reason is that since we're copying outside shared buffers, a CHECKPOINT
-	 * occurring during the copy has no way to flush the previously written
-	 * data to disk (indeed it won't know the new rel even exists).  A crash
-	 * later on would replay WAL from the checkpoint, therefore it wouldn't
-	 * replay our earlier WAL entries. If we do not fsync those pages here,
-	 * they might still not be on disk when the crash occurs.
-	 */
 	if (use_wal || copying_initfork)
-		smgrimmedsync(dst, forkNum);
+		unbuffered_finish(&wstate, forkNum);
 }
 
 /*
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca2..501fae5f9d 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = buffer direct file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/direct/Makefile b/src/backend/storage/direct/Makefile
new file mode 100644
index 0000000000..d82bbed48c
--- /dev/null
+++ b/src/backend/storage/direct/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for storage/direct
+#
+# IDENTIFICATION
+#    src/backend/storage/direct/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/storage/direct
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = directmgr.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/direct/directmgr.c b/src/backend/storage/direct/directmgr.c
new file mode 100644
index 0000000000..d6f02487f8
--- /dev/null
+++ b/src/backend/storage/direct/directmgr.c
@@ -0,0 +1,55 @@
+/*-------------------------------------------------------------------------
+ *
+ * directmgr.c
+ *	  routines for managing unbuffered IO
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/direct/directmgr.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+
+#include "access/xlog.h"
+#include "storage/directmgr.h"
+#include "utils/rel.h"
+
+void unbuffered_prep(UnBufferedWriteState *wstate)
+{
+	wstate->redo = GetRedoRecPtr();
+}
+
+/*
+ * When writing data outside shared buffers, a concurrent CHECKPOINT can move
+ * the redo pointer past our WAL entries and won't flush our data to disk. If
+ * the database crashes before the data makes it to disk, our WAL won't be
+ * replayed and the data will be lost.
+ * Thus, if a CHECKPOINT begins between unbuffered_prep() and
+ * unbuffered_finish(), the backend must fsync the data itself.
+ */
+void unbuffered_finish(UnBufferedWriteState *wstate, ForkNumber forknum)
+{
+	if (RedoRecPtrChanged(wstate->redo))
+		smgrimmedsync(wstate->smgr_rel, forknum);
+}
+
+void
+unbuffered_write(UnBufferedWriteState *wstate, ForkNumber forknum, BlockNumber blocknum, Page page)
+{
+	PageSetChecksumInplace(page, blocknum);
+	smgrwrite(wstate->smgr_rel, forknum, blocknum, (char *) page, false);
+}
+
+void
+unbuffered_extend(UnBufferedWriteState *wstate, ForkNumber forknum, BlockNumber blocknum, Page page, bool empty)
+{
+	if (!empty)
+		PageSetChecksumInplace(page, blocknum);
+	smgrextend(wstate->smgr_rel, forknum, blocknum, (char *) page, false);
+}
+
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 09d4b16067..26abdfa589 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -26,6 +26,7 @@
 #include "access/htup_details.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
+#include "storage/directmgr.h"
 #include "storage/freespace.h"
 #include "storage/fsm_internals.h"
 #include "storage/lmgr.h"
@@ -608,6 +609,9 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
 	BlockNumber fsm_nblocks_now;
 	PGAlignedBlock pg;
 	SMgrRelation reln;
+	UnBufferedWriteState ub_wstate;
+
+	ub_wstate.smgr_rel = RelationGetSmgr(rel);
 
 	PageInit((Page) pg.data, BLCKSZ, 0);
 
@@ -647,10 +651,8 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
 	/* Extend as needed. */
 	while (fsm_nblocks_now < fsm_nblocks)
 	{
-		PageSetChecksumInplace((Page) pg.data, fsm_nblocks_now);
-
-		smgrextend(reln, FSM_FORKNUM, fsm_nblocks_now,
-				   pg.data, false);
+		// TODO: why was it checksumming all zero pages?
+		unbuffered_extend(&ub_wstate, FSM_FORKNUM, fsm_nblocks_now, (Page) pg.data, false);
 		fsm_nblocks_now++;
 	}
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 5e2c94a05f..4a7b0d42de 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -314,6 +314,7 @@ extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern bool RedoRecPtrChanged(XLogRecPtr comparator_ptr);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool PromoteIsTriggered(void);
-- 
2.27.0

melanieplageman@gmail.com

about 4 years ago

In reply to: Melanie Plageman (#4)

2 attachment(s)

Re: Avoiding smgrimmedsync() during nbtree index builds

On Mon, May 3, 2021 at 5:24 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

So, I've written a patch which avoids doing the immediate fsync for
index builds either by using shared buffers or by queueing sync requests
for the checkpointer. If a checkpoint starts during the index build and
the backend is not using shared buffers for the index build, it will
need to do the fsync.

I've attached a rebased version of the patch (old patch doesn't apply).

With the patch applied (compiled at O2), creating twenty empty tables in
a transaction with a text column and an index on another column (like in
the attached SQL [make a test_idx schema first]) results in a fairly
consistent 15-30% speedup on my laptop (timings still in tens of ms -
avg 50 ms to avg 65 ms so run variation affects the % a lot).
Reducing the number of fsync calls from 40 to 1 was what likely causes
this difference.

- Melanie

Attachments:

v2-0001-Index-build-avoids-immed-fsync.patchapplication/octet-stream; name=v2-0001-Index-build-avoids-immed-fsync.patchDownload

From 2130175c5d794f60c5f15d6eb1b626c81fb7c39b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 15 Apr 2021 07:01:01 -0400
Subject: [PATCH v2] Index build avoids immed fsync

Avoid immediate fsync for just built indexes either by using shared
buffers or by leveraging checkpointer's SyncRequest queue. When a
checkpoint begins during the index build, if not using shared buffers,
the backend will have to do its own fsync.
---
 src/backend/access/nbtree/nbtree.c  |  39 +++---
 src/backend/access/nbtree/nbtsort.c | 186 +++++++++++++++++++++++-----
 src/backend/access/transam/xlog.c   |  14 +++
 src/include/access/xlog.h           |   1 +
 4 files changed, 188 insertions(+), 52 deletions(-)

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 40ad0956e0..a2e32f18e6 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -150,30 +150,29 @@ void
 btbuildempty(Relation index)
 {
 	Page		metapage;
+	Buffer metabuf;
 
-	/* Construct metapage. */
-	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
-
+	// TODO: test this.
 	/*
-	 * Write the page and log it.  It might seem that an immediate sync would
-	 * be sufficient to guarantee that the file exists on disk, but recovery
-	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
-	 * this even when wal_level=minimal.
+	 * Construct metapage.
+	 * Because we don't need to lock the relation for extension (since
+	 * noone knows about it yet) and we don't need to initialize the
+	 * new page, as it is done below by _bt_blnewpage(), _bt_getbuf()
+	 * (with P_NEW and BT_WRITE) is overkill. However, it might be worth
+	 * either modifying it or adding a new helper function instead of
+	 * calling ReadBufferExtended() directly. We pass mode RBM_ZERO_AND_LOCK
+	 * because we want to hold an exclusive lock on the buffer content
 	 */
-	PageSetChecksumInplace(metapage, BTREE_METAPAGE);
-	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BTREE_METAPAGE,
-			  (char *) metapage, true);
-	log_newpage(&RelationGetSmgr(index)->smgr_rnode.node, INIT_FORKNUM,
-				BTREE_METAPAGE, metapage, true);
+	metabuf = ReadBufferExtended(index, MAIN_FORKNUM, P_NEW, RBM_ZERO_AND_LOCK, NULL);
 
-	/*
-	 * An immediate sync is required even if we xlog'd the page, because the
-	 * write did not go through shared_buffers and therefore a concurrent
-	 * checkpoint may have moved the redo pointer past our xlog record.
-	 */
-	smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
+	metapage = BufferGetPage(metabuf);
+	_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
+
+	START_CRIT_SECTION();
+	MarkBufferDirty(metabuf);
+	log_newpage_buffer(metabuf, true);
+	END_CRIT_SECTION();
+	_bt_relbuf(index, metabuf);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 1e02be9746..43f91390eb 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -233,6 +233,7 @@ typedef struct BTPageState
 {
 	Page		btps_page;		/* workspace for page building */
 	BlockNumber btps_blkno;		/* block # to write this page at */
+	Buffer btps_buf; /* buffer to write this page to */
 	IndexTuple	btps_lowkey;	/* page's strict lower bound pivot tuple */
 	OffsetNumber btps_lastoff;	/* last item offset loaded */
 	Size		btps_lastextra; /* last item's extra posting list space */
@@ -250,9 +251,11 @@ typedef struct BTWriteState
 	Relation	index;
 	BTScanInsert inskey;		/* generic insertion scankey */
 	bool		btws_use_wal;	/* dump pages to WAL? */
-	BlockNumber btws_pages_alloced; /* # pages allocated */
-	BlockNumber btws_pages_written; /* # pages written out */
+	BlockNumber btws_pages_alloced; /* # pages allocated for index build outside SB */
+	BlockNumber btws_pages_written; /* # pages written out for index build outside SB */
 	Page		btws_zeropage;	/* workspace for filling zeroes */
+	XLogRecPtr redo; /* cached redo pointer to determine if backend fsync is required at end of index build */
+	bool use_shared_buffers;
 } BTWriteState;
 
 
@@ -261,10 +264,11 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2,
+						  bool use_shared_buffers);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
-static Page _bt_blnewpage(uint32 level);
+static Page _bt_blnewpage(uint32 level, Buffer buf);
 static BTPageState *_bt_pagestate(BTWriteState *wstate, uint32 level);
 static void _bt_slideleft(Page rightmostpage);
 static void _bt_sortaddtup(Page page, Size itemsize,
@@ -275,7 +279,8 @@ static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
 										  BTPageState *state,
 										  BTDedupState dstate);
-static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state,
+							  Buffer metabuf, Page metapage);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
 static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -323,13 +328,24 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
+	/*
+	 * Based on the number of tuples, either create a buffered or unbuffered
+	 * write state. if the number of tuples is small, make a buffered write
+	 * if the number of tuples is larger, then we make an unbuffered write state
+	 * and must ensure that we check the redo pointer to know whether or not we
+	 * need to fsync ourselves
+	 */
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2);
+	if (reltuples > 1000)
+		_bt_leafbuild(buildstate.spool, buildstate.spool2, false);
+	else
+		_bt_leafbuild(buildstate.spool, buildstate.spool2, true);
+
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
@@ -535,7 +551,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool use_shared_buffers)
 {
 	BTWriteState wstate;
 
@@ -566,9 +582,11 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 	wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
 
 	/* reserve the metapage */
-	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
+	wstate.btws_pages_alloced = 0;
 	wstate.btws_pages_written = 0;
 	wstate.btws_zeropage = NULL;	/* until needed */
+	wstate.redo = InvalidXLogRecPtr;
+	wstate.use_shared_buffers = use_shared_buffers;
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
@@ -606,14 +624,18 @@ _bt_build_callback(Relation index,
 
 /*
  * allocate workspace for a new, clean btree page, not linked to any siblings.
+ * If index is not built in shared buffers, buf should be InvalidBuffer
  */
 static Page
-_bt_blnewpage(uint32 level)
+_bt_blnewpage(uint32 level, Buffer buf)
 {
 	Page		page;
 	BTPageOpaque opaque;
 
-	page = (Page) palloc(BLCKSZ);
+	if (buf)
+		page = BufferGetPage(buf);
+	else
+		page = (Page) palloc(BLCKSZ);
 
 	/* Zero the page and set up standard page header info */
 	_bt_pageinit(page, BLCKSZ);
@@ -635,8 +657,20 @@ _bt_blnewpage(uint32 level)
  * emit a completed btree page, and release the working storage.
  */
 static void
-_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
+_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf)
 {
+	if (wstate->use_shared_buffers)
+	{
+		Assert(buf);
+		START_CRIT_SECTION();
+		MarkBufferDirty(buf);
+		if (wstate->btws_use_wal)
+			log_newpage_buffer(buf, true);
+		END_CRIT_SECTION();
+		_bt_relbuf(wstate->index, buf);
+		return;
+	}
+
 	/* XLOG stuff */
 	if (wstate->btws_use_wal)
 	{
@@ -659,7 +693,7 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
 				   wstate->btws_pages_written++,
 				   (char *) wstate->btws_zeropage,
-				   true);
+				   false);
 	}
 
 	PageSetChecksumInplace(page, blkno);
@@ -672,14 +706,14 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 	{
 		/* extending the file... */
 		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM, blkno,
-				   (char *) page, true);
+				   (char *) page, false);
 		wstate->btws_pages_written++;
 	}
 	else
 	{
 		/* overwriting a block we zero-filled before */
 		smgrwrite(RelationGetSmgr(wstate->index), MAIN_FORKNUM, blkno,
-				  (char *) page, true);
+				  (char *) page, false);
 	}
 
 	pfree(page);
@@ -692,13 +726,37 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 static BTPageState *
 _bt_pagestate(BTWriteState *wstate, uint32 level)
 {
+	Buffer       buf;
+	BlockNumber blkno;
+
 	BTPageState *state = (BTPageState *) palloc0(sizeof(BTPageState));
 
-	/* create initial page for level */
-	state->btps_page = _bt_blnewpage(level);
+	if (wstate->use_shared_buffers)
+	{
+		/*
+		 * Because we don't need to lock the relation for extension (since
+		 * noone knows about it yet) and we don't need to initialize the
+		 * new page, as it is done below by _bt_blnewpage(), _bt_getbuf()
+		 * (with P_NEW and BT_WRITE) is overkill. However, it might be worth
+		 * either modifying it or adding a new helper function instead of
+		 * calling ReadBufferExtended() directly. We pass mode RBM_ZERO_AND_LOCK
+		 * because we want to hold an exclusive lock on the buffer content
+		 */
+		buf = ReadBufferExtended(wstate->index, MAIN_FORKNUM, P_NEW, RBM_ZERO_AND_LOCK, NULL);
 
+		blkno = BufferGetBlockNumber(buf);
+	}
+	else
+	{
+		buf = InvalidBuffer;
+		blkno = wstate->btws_pages_alloced++;
+	}
+
+	/* create initial page for level */
+	state->btps_page = _bt_blnewpage(level, buf);
 	/* and assign it a page position */
-	state->btps_blkno = wstate->btws_pages_alloced++;
+	state->btps_blkno = blkno;
+	state->btps_buf = buf;
 
 	state->btps_lowkey = NULL;
 	/* initialize lastoff so first item goes into P_FIRSTKEY */
@@ -833,6 +891,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 {
 	Page		npage;
 	BlockNumber nblkno;
+	Buffer nbuf;
 	OffsetNumber last_off;
 	Size		last_truncextra;
 	Size		pgspc;
@@ -847,6 +906,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 
 	npage = state->btps_page;
 	nblkno = state->btps_blkno;
+	nbuf = state->btps_buf;
 	last_off = state->btps_lastoff;
 	last_truncextra = state->btps_lastextra;
 	state->btps_lastextra = truncextra;
@@ -903,16 +963,37 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 		 */
 		Page		opage = npage;
 		BlockNumber oblkno = nblkno;
+		Buffer obuf = nbuf;
 		ItemId		ii;
 		ItemId		hii;
 		IndexTuple	oitup;
 
-		/* Create new page of same level */
-		npage = _bt_blnewpage(state->btps_level);
+		if (wstate->use_shared_buffers)
+		{
+			/*
+			 * Get a new shared buffer.
+			 * Because we don't need to lock the relation for extension (since
+			 * noone knows about it yet) and we don't need to initialize the
+			 * new page, as it is done below by _bt_blnewpage(), _bt_getbuf()
+			 * (with P_NEW and BT_WRITE) is overkill. However, it might be worth
+			 * either modifying it or adding a new helper function instead of
+			 * calling ReadBufferExtended() directly. We pass mode RBM_ZERO_AND_LOCK
+			 * because we want to hold an exclusive lock on the buffer content
+			 */
+			nbuf = ReadBufferExtended(wstate->index, MAIN_FORKNUM, P_NEW, RBM_ZERO_AND_LOCK, NULL);
 
-		/* and assign it a page position */
-		nblkno = wstate->btws_pages_alloced++;
+			/* assign a page position */
+			nblkno = BufferGetBlockNumber(nbuf);
+		}
+		else
+		{
+			nbuf = InvalidBuffer;
+			/* assign a page position */
+			nblkno = wstate->btws_pages_alloced++;
+		}
 
+		/* Create new page of same level */
+		npage = _bt_blnewpage(state->btps_level, nbuf);
 		/*
 		 * We copy the last item on the page into the new page, and then
 		 * rearrange the old page so that the 'last item' becomes its high key
@@ -1021,10 +1102,10 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 
 		/*
 		 * Write out the old page.  We never need to touch it again, so we can
-		 * free the opage workspace too.
+		 * free the opage workspace too. obuf has been released and is no longer
+		 * valid.
 		 */
-		_bt_blwritepage(wstate, opage, oblkno);
-
+		 _bt_blwritepage(wstate, opage, oblkno, obuf);
 		/*
 		 * Reset last_off to point to new page
 		 */
@@ -1058,6 +1139,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 
 	state->btps_page = npage;
 	state->btps_blkno = nblkno;
+	state->btps_buf = nbuf;
 	state->btps_lastoff = last_off;
 }
 
@@ -1103,12 +1185,11 @@ _bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
  * Finish writing out the completed btree.
  */
 static void
-_bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
+_bt_uppershutdown(BTWriteState *wstate, BTPageState *state, Buffer metabuf, Page metapage)
 {
 	BTPageState *s;
 	BlockNumber rootblkno = P_NONE;
 	uint32		rootlevel = 0;
-	Page		metapage;
 
 	/*
 	 * Each iteration of this loop completes one more level of the tree.
@@ -1154,20 +1235,22 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		 * back one slot.  Then we can dump out the page.
 		 */
 		_bt_slideleft(s->btps_page);
-		_bt_blwritepage(wstate, s->btps_page, s->btps_blkno);
+		_bt_blwritepage(wstate, s->btps_page, s->btps_blkno, s->btps_buf);
+		s->btps_buf = InvalidBuffer;
 		s->btps_page = NULL;	/* writepage freed the workspace */
 	}
 
 	/*
-	 * As the last step in the process, construct the metapage and make it
+	 * As the last step in the process, initialize the metapage and make it
 	 * point to the new root (unless we had no data at all, in which case it's
 	 * set to point to "P_NONE").  This changes the index to the "valid" state
 	 * by filling in a valid magic number in the metapage.
+	 * After this, metapage will have been freed or invalid and metabuf, if ever
+	 * valid, will have been released.
 	 */
-	metapage = (Page) palloc(BLCKSZ);
 	_bt_initmetapage(metapage, rootblkno, rootlevel,
 					 wstate->inskey->allequalimage);
-	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
+	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE, metabuf);
 }
 
 /*
@@ -1188,10 +1271,46 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
 	bool		deduplicate;
+	Buffer metabuf;
+	Page metapage;
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
 
+	/*
+	 * Extend the index relation upfront to reserve the metapage
+	 */
+	if (wstate->use_shared_buffers)
+	{
+		/*
+		 * We should not need to LockRelationForExtension() as no one else knows
+		 * about this index yet?
+		 * Extend the index relation by one block for the metapage. _bt_getbuf()
+		 * is not used here as it does _bt_pageinit() which is one later by
+		 * _bt_initmetapage(). We will fill in the metapage and write it out at
+		 * the end of index build when we have all of the information required
+		 * for the metapage. However, we initially extend the relation for it to
+		 * occupy block 0 because it is much easier when using shared buffers to
+		 * extend the relation with a block number that is always increasing by
+		 * 1. Also, by passing RBM_ZERO_AND_LOCK, we have LW_EXCLUSIVE on the
+		 * buffer content and thus don't need _bt_lockbuf().
+		 */
+		metabuf = ReadBufferExtended(wstate->index, MAIN_FORKNUM, P_NEW, RBM_ZERO_AND_LOCK, NULL);
+		metapage = BufferGetPage(metabuf);
+	}
+	else
+	{
+		wstate->redo = GetRedoRecPtr();
+		metabuf = InvalidBuffer;
+		metapage = (Page) palloc(BLCKSZ);
+
+		/* extending the file... */
+		smgrextend(wstate->index->rd_smgr, MAIN_FORKNUM, BTREE_METAPAGE,
+		           (char *) metapage, false);
+		wstate->btws_pages_written++;
+		wstate->btws_pages_alloced++;
+	}
+
 	if (merge)
 	{
 		/*
@@ -1413,7 +1532,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	}
 
 	/* Close down final pages and write the metapage */
-	_bt_uppershutdown(wstate, state);
+	_bt_uppershutdown(wstate, state, metabuf, metapage);
 
 	/*
 	 * When we WAL-logged index pages, we must nonetheless fsync index files.
@@ -1425,7 +1544,10 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	 * still not be on disk when the crash occurs.
 	 */
 	if (wstate->btws_use_wal)
-		smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
+	{
+		if (!wstate->use_shared_buffers && RedoRecPtrChanged(wstate->redo))
+			smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
+	}
 }
 
 /*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1616448368..63fd212787 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8704,6 +8704,20 @@ GetRedoRecPtr(void)
 	return RedoRecPtr;
 }
 
+bool
+RedoRecPtrChanged(XLogRecPtr comparator_ptr)
+{
+	XLogRecPtr	ptr;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ptr = XLogCtl->RedoRecPtr;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	if (RedoRecPtr < ptr)
+		RedoRecPtr = ptr;
+	return RedoRecPtr != comparator_ptr;
+}
+
 /*
  * Return information needed to decide whether a modified block needs a
  * full-page image to be included in the WAL record.
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 898df2ee03..be384353cb 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -309,6 +309,7 @@ extern XLogRecPtr XLogRestorePoint(const char *rpName);
 extern void UpdateFullPageWrites(void);
 extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p);
 extern XLogRecPtr GetRedoRecPtr(void);
+extern bool RedoRecPtrChanged(XLogRecPtr comparator_ptr);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(TimeLineID *insertTLI);
 extern TimeLineID GetWALInsertionTimeLine(void);
-- 
2.32.0

index_create.sqlapplication/octet-stream; name=index_create.sqlDownload

melanieplageman@gmail.com

about 4 years ago

In reply to: Melanie Plageman (#6)

Re: Avoiding smgrimmedsync() during nbtree index builds

On Fri, Nov 19, 2021 at 3:11 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Mon, May 3, 2021 at 5:24 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

So, I've written a patch which avoids doing the immediate fsync for
index builds either by using shared buffers or by queueing sync requests
for the checkpointer. If a checkpoint starts during the index build and
the backend is not using shared buffers for the index build, it will
need to do the fsync.

I've attached a rebased version of the patch (old patch doesn't apply).

With the patch applied (compiled at O2), creating twenty empty tables in
a transaction with a text column and an index on another column (like in
the attached SQL [make a test_idx schema first]) results in a fairly
consistent 15-30% speedup on my laptop (timings still in tens of ms -
avg 50 ms to avg 65 ms so run variation affects the % a lot).
Reducing the number of fsync calls from 40 to 1 was what likely causes
this difference.

Correction for the above: I haven't worked on mac in a while and didn't
realize that wal_sync_method=fsync was not enough to ensure that all
buffered data would actually be flushed to disk on mac (which was
required for my test).

Setting wal_sync_method to fsync_writethrough with my small test I see
over a 5-6X improvement in time taken - from 1 second average to 0.2
seconds average. And running Andres' "createlots.sql" test, I see around
a 16x improvement - from around 11 minutes to around 40 seconds. I ran
it on a laptop running macos and other than wal_sync_method, I only
changed shared_buffers (to 1GB).

- Melanie

andres@anarazel.de

about 4 years ago

In reply to: Melanie Plageman (#6)

Re: Avoiding smgrimmedsync() during nbtree index builds

Hi,

On 2021-11-19 15:11:57 -0500, Melanie Plageman wrote:

From 2130175c5d794f60c5f15d6eb1b626c81fb7c39b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 15 Apr 2021 07:01:01 -0400
Subject: [PATCH v2] Index build avoids immed fsync

Avoid immediate fsync for just built indexes either by using shared
buffers or by leveraging checkpointer's SyncRequest queue. When a
checkpoint begins during the index build, if not using shared buffers,
the backend will have to do its own fsync.
---
src/backend/access/nbtree/nbtree.c | 39 +++---
src/backend/access/nbtree/nbtsort.c | 186 +++++++++++++++++++++++-----
src/backend/access/transam/xlog.c | 14 +++
src/include/access/xlog.h | 1 +
4 files changed, 188 insertions(+), 52 deletions(-)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 40ad0956e0..a2e32f18e6 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -150,30 +150,29 @@ void
btbuildempty(Relation index)
{
Page		metapage;
+	Buffer metabuf;
-	/* Construct metapage. */
-	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
-
+	// TODO: test this.

Shouldn't this path have plenty coverage?

/*
-	 * Write the page and log it.  It might seem that an immediate sync would
-	 * be sufficient to guarantee that the file exists on disk, but recovery
-	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
-	 * this even when wal_level=minimal.
+	 * Construct metapage.
+	 * Because we don't need to lock the relation for extension (since
+	 * noone knows about it yet) and we don't need to initialize the
+	 * new page, as it is done below by _bt_blnewpage(), _bt_getbuf()
+	 * (with P_NEW and BT_WRITE) is overkill.

Isn't the more relevant operation the log_newpage_buffer()?

However, it might be worth
+	 * either modifying it or adding a new helper function instead of
+	 * calling ReadBufferExtended() directly. We pass mode RBM_ZERO_AND_LOCK
+	 * because we want to hold an exclusive lock on the buffer content
*/

"modifying it" refers to what?

I don't see a problem using ReadBufferExtended() here. Why would you like to
replace it with something else?

+	/*
+	 * Based on the number of tuples, either create a buffered or unbuffered
+	 * write state. if the number of tuples is small, make a buffered write
+	 * if the number of tuples is larger, then we make an unbuffered write state
+	 * and must ensure that we check the redo pointer to know whether or not we
+	 * need to fsync ourselves
+	 */

/*
* Finish the build by (1) completing the sort of the spool file, (2)
* inserting the sorted tuples into btree pages and (3) building the upper
* levels.  Finally, it may also be necessary to end use of parallelism.
*/
-	_bt_leafbuild(buildstate.spool, buildstate.spool2);
+	if (reltuples > 1000)

I'm ok with some random magic constant here, but it seems worht putting it in
some constant / #define to make it more obvious.

+		_bt_leafbuild(buildstate.spool, buildstate.spool2, false);
+	else
+		_bt_leafbuild(buildstate.spool, buildstate.spool2, true);

Why duplicate the function call?

/*
* allocate workspace for a new, clean btree page, not linked to any siblings.
+ * If index is not built in shared buffers, buf should be InvalidBuffer
*/
static Page
-_bt_blnewpage(uint32 level)
+_bt_blnewpage(uint32 level, Buffer buf)
{
Page		page;
BTPageOpaque opaque;

-	page = (Page) palloc(BLCKSZ);
+	if (buf)
+		page = BufferGetPage(buf);
+	else
+		page = (Page) palloc(BLCKSZ);

/* Zero the page and set up standard page header info */
_bt_pageinit(page, BLCKSZ);

Ick, that seems pretty ugly API-wise and subsequently too likely to lead to
pfree()ing a page that's actually in shared buffers. I think it'd make sense
to separate the allocation from the initialization bits?

@@ -635,8 +657,20 @@ _bt_blnewpage(uint32 level)
* emit a completed btree page, and release the working storage.
*/
static void
-_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
+_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf)
{
+	if (wstate->use_shared_buffers)
+	{
+		Assert(buf);
+		START_CRIT_SECTION();
+		MarkBufferDirty(buf);
+		if (wstate->btws_use_wal)
+			log_newpage_buffer(buf, true);
+		END_CRIT_SECTION();
+		_bt_relbuf(wstate->index, buf);
+		return;
+	}
+
/* XLOG stuff */
if (wstate->btws_use_wal)
{
@@ -659,7 +693,7 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
wstate->btws_pages_written++,
(char *) wstate->btws_zeropage,
-				   true);
+				   false);
}

Is there a good place to document the way we ensure durability for this path?

+	/*
+	 * Extend the index relation upfront to reserve the metapage
+	 */
+	if (wstate->use_shared_buffers)
+	{
+		/*
+		 * We should not need to LockRelationForExtension() as no one else knows
+		 * about this index yet?
+		 * Extend the index relation by one block for the metapage. _bt_getbuf()
+		 * is not used here as it does _bt_pageinit() which is one later by

*done

+		 * _bt_initmetapage(). We will fill in the metapage and write it out at
+		 * the end of index build when we have all of the information required
+		 * for the metapage. However, we initially extend the relation for it to
+		 * occupy block 0 because it is much easier when using shared buffers to
+		 * extend the relation with a block number that is always increasing by
+		 * 1.

Not quite following what you're trying to get at here. There generally is no
way to extend a relation except by increasing block numbers?

@@ -1425,7 +1544,10 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
* still not be on disk when the crash occurs.
*/
if (wstate->btws_use_wal)
-		smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
+	{
+		if (!wstate->use_shared_buffers && RedoRecPtrChanged(wstate->redo))
+			smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
+	}
}

This needs documentation. The whole comment above isn't accurate anymore afaict?

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1616448368..63fd212787 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8704,6 +8704,20 @@ GetRedoRecPtr(void)
return RedoRecPtr;
}

+bool
+RedoRecPtrChanged(XLogRecPtr comparator_ptr)
+{
+	XLogRecPtr	ptr;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ptr = XLogCtl->RedoRecPtr;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	if (RedoRecPtr < ptr)
+		RedoRecPtr = ptr;
+	return RedoRecPtr != comparator_ptr;
+}

What's the deal with the < comparison?

Greetings,

Andres Freund

melanieplageman@gmail.com

about 4 years ago

In reply to: Andres Freund (#8)

2 attachment(s)

Re: Avoiding smgrimmedsync() during nbtree index builds

I have attached a v3 which includes two commits -- one of which
implements the directmgr API and uses it and the other which adds
functionality to use either directmgr or bufmgr API during index build.

Also registering for march commitfest.

On Thu, Dec 9, 2021 at 2:33 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2021-11-19 15:11:57 -0500, Melanie Plageman wrote:
From 2130175c5d794f60c5f15d6eb1b626c81fb7c39b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 15 Apr 2021 07:01:01 -0400
Subject: [PATCH v2] Index build avoids immed fsync

Avoid immediate fsync for just built indexes either by using shared
buffers or by leveraging checkpointer's SyncRequest queue. When a
checkpoint begins during the index build, if not using shared buffers,
the backend will have to do its own fsync.
---
src/backend/access/nbtree/nbtree.c | 39 +++---
src/backend/access/nbtree/nbtsort.c | 186 +++++++++++++++++++++++-----
src/backend/access/transam/xlog.c | 14 +++
src/include/access/xlog.h | 1 +
4 files changed, 188 insertions(+), 52 deletions(-)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 40ad0956e0..a2e32f18e6 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -150,30 +150,29 @@ void
btbuildempty(Relation index)
{
Page            metapage;
+     Buffer metabuf;
-     /* Construct metapage. */
-     metapage = (Page) palloc(BLCKSZ);
-     _bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
-
+     // TODO: test this.
Shouldn't this path have plenty coverage?

Yep. Sorry.

/*
-      * Write the page and log it.  It might seem that an immediate sync would
-      * be sufficient to guarantee that the file exists on disk, but recovery
-      * itself might remove it while replaying, for example, an
-      * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
-      * this even when wal_level=minimal.
+      * Construct metapage.
+      * Because we don't need to lock the relation for extension (since
+      * noone knows about it yet) and we don't need to initialize the
+      * new page, as it is done below by _bt_blnewpage(), _bt_getbuf()
+      * (with P_NEW and BT_WRITE) is overkill.

Isn't the more relevant operation the log_newpage_buffer()?

Returning to this after some time away, many of my comments no longer
make sense to me either. I can't actually tell which diff your question
applies to because this comment was copy-pasted in two different places
in my code. Also, I've removed this comment and added new ones. So,
given all that, is there still something about log_newpage_buffer() I
should be commenting on?

However, it might be worth
+      * either modifying it or adding a new helper function instead of
+      * calling ReadBufferExtended() directly. We pass mode RBM_ZERO_AND_LOCK
+      * because we want to hold an exclusive lock on the buffer content
*/
"modifying it" refers to what?

I don't see a problem using ReadBufferExtended() here. Why would you like to
replace it with something else?

I would just disregard these comments now.

+     /*
+      * Based on the number of tuples, either create a buffered or unbuffered
+      * write state. if the number of tuples is small, make a buffered write
+      * if the number of tuples is larger, then we make an unbuffered write state
+      * and must ensure that we check the redo pointer to know whether or not we
+      * need to fsync ourselves
+      */

/*
* Finish the build by (1) completing the sort of the spool file, (2)
* inserting the sorted tuples into btree pages and (3) building the upper
* levels.  Finally, it may also be necessary to end use of parallelism.
*/
-     _bt_leafbuild(buildstate.spool, buildstate.spool2);
+     if (reltuples > 1000)

I'm ok with some random magic constant here, but it seems worht putting it in
some constant / #define to make it more obvious.

Done.

+             _bt_leafbuild(buildstate.spool, buildstate.spool2, false);
+     else
+             _bt_leafbuild(buildstate.spool, buildstate.spool2, true);

Why duplicate the function call?

Fixed.

/*
* allocate workspace for a new, clean btree page, not linked to any siblings.
+ * If index is not built in shared buffers, buf should be InvalidBuffer
*/
static Page
-_bt_blnewpage(uint32 level)
+_bt_blnewpage(uint32 level, Buffer buf)
{
Page            page;
BTPageOpaque opaque;
-     page = (Page) palloc(BLCKSZ);
+     if (buf)
+             page = BufferGetPage(buf);
+     else
+             page = (Page) palloc(BLCKSZ);
/* Zero the page and set up standard page header info */
_bt_pageinit(page, BLCKSZ);
Ick, that seems pretty ugly API-wise and subsequently too likely to lead to
pfree()ing a page that's actually in shared buffers. I think it'd make sense
to separate the allocation from the initialization bits?

Fixed.

@@ -635,8 +657,20 @@ _bt_blnewpage(uint32 level)
* emit a completed btree page, and release the working storage.
*/
static void
-_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
+_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf)
{
+     if (wstate->use_shared_buffers)
+     {
+             Assert(buf);
+             START_CRIT_SECTION();
+             MarkBufferDirty(buf);
+             if (wstate->btws_use_wal)
+                     log_newpage_buffer(buf, true);
+             END_CRIT_SECTION();
+             _bt_relbuf(wstate->index, buf);
+             return;
+     }
+
/* XLOG stuff */
if (wstate->btws_use_wal)
{
@@ -659,7 +693,7 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
wstate->btws_pages_written++,
(char *) wstate->btws_zeropage,
-                                true);
+                                false);
}

Is there a good place to document the way we ensure durability for this path?

I added some new comments. Let me know if you think that I am still
missing this documentation.

+     /*
+      * Extend the index relation upfront to reserve the metapage
+      */
+     if (wstate->use_shared_buffers)
+     {
+             /*
+              * We should not need to LockRelationForExtension() as no one else knows
+              * about this index yet?
+              * Extend the index relation by one block for the metapage. _bt_getbuf()
+              * is not used here as it does _bt_pageinit() which is one later by

*done

+              * _bt_initmetapage(). We will fill in the metapage and write it out at
+              * the end of index build when we have all of the information required
+              * for the metapage. However, we initially extend the relation for it to
+              * occupy block 0 because it is much easier when using shared buffers to
+              * extend the relation with a block number that is always increasing by
+              * 1.

Not quite following what you're trying to get at here. There generally is no
way to extend a relation except by increasing block numbers?

I've updated this comment too. It should make more sense now.

@@ -1425,7 +1544,10 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
* still not be on disk when the crash occurs.
*/
if (wstate->btws_use_wal)
-             smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
+     {
+             if (!wstate->use_shared_buffers && RedoRecPtrChanged(wstate->redo))
+                     smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
+     }
}

This needs documentation. The whole comment above isn't accurate anymore afaict?

Should be correct now.

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1616448368..63fd212787 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8704,6 +8704,20 @@ GetRedoRecPtr(void)
return RedoRecPtr;
}

+bool
+RedoRecPtrChanged(XLogRecPtr comparator_ptr)
+{
+     XLogRecPtr      ptr;
+
+     SpinLockAcquire(&XLogCtl->info_lck);
+     ptr = XLogCtl->RedoRecPtr;
+     SpinLockRelease(&XLogCtl->info_lck);
+
+     if (RedoRecPtr < ptr)
+             RedoRecPtr = ptr;
+     return RedoRecPtr != comparator_ptr;
+}

What's the deal with the < comparison?

I saw that GetRedoRecPtr() does this and thought maybe I should do the
same in this function. I'm not quite sure where I should be getting the
redo pointer.

Maybe I should just call GetRedoRecPtr() and compare it to the one I
saved? I suppose I also thought that maybe someone else in the future
would like to have a function like RedoRecPtrChanged().

- Melanie

Attachments:

v3-0001-Add-unbuffered-IO-and-avoid-immed-fsync.patchtext/x-patch; charset=US-ASCII; name=v3-0001-Add-unbuffered-IO-and-avoid-immed-fsync.patchDownload

From 492d9d088df670a837efb2c107114457663378d8 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 28 Sep 2021 14:51:11 -0400
Subject: [PATCH v3 1/2] Add unbuffered IO and avoid immed fsync

Replace unbuffered extends and writes

When writing data outside of shared buffers, the backend must do a
series of steps to ensure the data is both durable and recoverable. When
writing or extending a page of data, the backend must log, checksum, and
write out the page before freeing it.

Additionally, though the data is durable once the WAL entries are on
permanent storage, the XLOG Redo pointer cannot be moved past this WAL
until the page data is safely on permanent storage. If a crash were to
occur before the data is fsync'd, the WAL wouldn't be replayed during
recovery, and the data would be lost. Thus, the backend must ensure that
either the Redo pointer has not moved or that the data is fsync'd before
freeing the page.

This is not a problem with pages written in shared buffers because the
checkpointer will block until all buffers that were dirtied before it
began finish before it moves the Redo pointer past their associated WAL
entries.

This commit makes two main changes:

1) It wraps smgrextend() and smgrwrite() in functions from a new API
   for writing data outside of shared buffers, directmgr.

2) It saves the XLOG Redo pointer location before doing the write or
   extend. It also adds an fsync request for the page to the
   checkpointer's pending-ops table. Then, after doing the write or
   extend, if the Redo pointer has moved (meaning a checkpoint has
   started since it saved it last), then the backend fsync's the page
   itself. Otherwise, it lets the checkpointer take care of fsync'ing
   the page the next time it processes the pending-ops table.
---
 src/backend/access/gist/gistbuild.c       | 17 +++----
 src/backend/access/hash/hashpage.c        |  9 ++--
 src/backend/access/heap/heapam_handler.c  | 16 ++++---
 src/backend/access/heap/rewriteheap.c     | 26 ++++-------
 src/backend/access/heap/visibilitymap.c   |  8 ++--
 src/backend/access/nbtree/nbtree.c        | 17 ++++---
 src/backend/access/nbtree/nbtsort.c       | 39 +++++-----------
 src/backend/access/spgist/spginsert.c     | 25 +++++-----
 src/backend/access/transam/xlog.c         | 13 ++++++
 src/backend/catalog/storage.c             | 25 +++-------
 src/backend/storage/Makefile              |  2 +-
 src/backend/storage/direct/Makefile       | 17 +++++++
 src/backend/storage/direct/directmgr.c    | 57 +++++++++++++++++++++++
 src/backend/storage/freespace/freespace.c | 10 ++--
 src/include/access/xlog.h                 |  1 +
 15 files changed, 175 insertions(+), 107 deletions(-)
 create mode 100644 src/backend/storage/direct/Makefile
 create mode 100644 src/backend/storage/direct/directmgr.c

diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 9854116fca..748ca65492 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -43,6 +43,7 @@
 #include "miscadmin.h"
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
+#include "storage/directmgr.h"
 #include "storage/smgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -91,6 +92,7 @@ typedef struct
 
 	int64		indtuples;		/* number of tuples indexed */
 
+	UnBufferedWriteState ub_wstate;
 	/*
 	 * Extra data structures used during a buffering build. 'gfbb' contains
 	 * information related to managing the build buffers. 'parentMap' is a
@@ -194,6 +196,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.heaprel = heap;
 	buildstate.sortstate = NULL;
 	buildstate.giststate = initGISTstate(index);
+	buildstate.ub_wstate.smgr_rel = RelationGetSmgr(index);
 
 	/*
 	 * Create a temporary memory context that is reset once for each tuple
@@ -403,14 +406,15 @@ gist_indexsortbuild(GISTBuildState *state)
 	state->pages_allocated = 0;
 	state->pages_written = 0;
 	state->ready_num_pages = 0;
+	unbuffered_prep(&state->ub_wstate);
 
 	/*
 	 * Write an empty page as a placeholder for the root page. It will be
 	 * replaced with the real root page at the end.
 	 */
 	page = palloc0(BLCKSZ);
-	smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
-			   page, true);
+	unbuffered_extend(&state->ub_wstate, MAIN_FORKNUM, GIST_ROOT_BLKNO, page,
+			true);
 	state->pages_allocated++;
 	state->pages_written++;
 
@@ -450,13 +454,12 @@ gist_indexsortbuild(GISTBuildState *state)
 
 	/* Write out the root */
 	PageSetLSN(pagestate->page, GistBuildLSN);
-	PageSetChecksumInplace(pagestate->page, GIST_ROOT_BLKNO);
-	smgrwrite(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
-			  pagestate->page, true);
+	unbuffered_write(&state->ub_wstate, MAIN_FORKNUM, GIST_ROOT_BLKNO, pagestate->page);
 	if (RelationNeedsWAL(state->indexrel))
 		log_newpage(&state->indexrel->rd_node, MAIN_FORKNUM, GIST_ROOT_BLKNO,
 					pagestate->page, true);
 
+	unbuffered_finish(&state->ub_wstate, MAIN_FORKNUM);
 	pfree(pagestate->page);
 	pfree(pagestate);
 }
@@ -570,9 +573,7 @@ gist_indexsortbuild_flush_ready_pages(GISTBuildState *state)
 			elog(ERROR, "unexpected block number to flush GiST sorting build");
 
 		PageSetLSN(page, GistBuildLSN);
-		PageSetChecksumInplace(page, blkno);
-		smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, blkno, page,
-				   true);
+		unbuffered_extend(&state->ub_wstate, MAIN_FORKNUM, blkno, page, false);
 
 		state->pages_written++;
 	}
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index ee351aea09..722420adf5 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -32,6 +32,7 @@
 #include "access/hash_xlog.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
+#include "storage/directmgr.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
 #include "storage/smgr.h"
@@ -990,8 +991,10 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 	PGAlignedBlock zerobuf;
 	Page		page;
 	HashPageOpaque ovflopaque;
+	UnBufferedWriteState ub_wstate;
 
 	lastblock = firstblock + nblocks - 1;
+	ub_wstate.smgr_rel = RelationGetSmgr(rel);
 
 	/*
 	 * Check for overflow in block number calculation; if so, we cannot extend
@@ -1000,6 +1003,8 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 	if (lastblock < firstblock || lastblock == InvalidBlockNumber)
 		return false;
 
+	unbuffered_prep(&ub_wstate);
+
 	page = (Page) zerobuf.data;
 
 	/*
@@ -1024,9 +1029,7 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 					zerobuf.data,
 					true);
 
-	PageSetChecksumInplace(page, lastblock);
-	smgrextend(RelationGetSmgr(rel), MAIN_FORKNUM, lastblock, zerobuf.data,
-			   false);
+	unbuffered_extend(&ub_wstate, MAIN_FORKNUM, lastblock, zerobuf.data, false);
 
 	return true;
 }
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 39ef8a0b77..8824e39a91 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -38,6 +38,7 @@
 #include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/bufpage.h"
+#include "storage/directmgr.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
 #include "storage/procarray.h"
@@ -575,6 +576,7 @@ heapam_relation_set_new_filenode(Relation rel,
 								 MultiXactId *minmulti)
 {
 	SMgrRelation srel;
+	UnBufferedWriteState ub_wstate;
 
 	/*
 	 * Initialize to the minimum XID that could put tuples in the table. We
@@ -594,15 +596,15 @@ heapam_relation_set_new_filenode(Relation rel,
 	*minmulti = GetOldestMultiXactId();
 
 	srel = RelationCreateStorage(*newrnode, persistence);
+	ub_wstate.smgr_rel = srel;
+	unbuffered_prep(&ub_wstate);
 
 	/*
 	 * If required, set up an init fork for an unlogged table so that it can
-	 * be correctly reinitialized on restart.  An immediate sync is required
-	 * even if the page has been logged, because the write did not go through
-	 * shared_buffers and therefore a concurrent checkpoint may have moved the
-	 * redo pointer past our xlog record.  Recovery may as well remove it
-	 * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
-	 * record. Therefore, logging is necessary even if wal_level=minimal.
+	 * be correctly reinitialized on restart.
+	 * Recovery may as well remove our xlog record while replaying, for
+	 * example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record. Therefore,
+	 * logging is necessary even if wal_level=minimal.
 	 */
 	if (persistence == RELPERSISTENCE_UNLOGGED)
 	{
@@ -611,7 +613,7 @@ heapam_relation_set_new_filenode(Relation rel,
 			   rel->rd_rel->relkind == RELKIND_TOASTVALUE);
 		smgrcreate(srel, INIT_FORKNUM, false);
 		log_smgrcreate(newrnode, INIT_FORKNUM);
-		smgrimmedsync(srel, INIT_FORKNUM);
+		unbuffered_finish(&ub_wstate, INIT_FORKNUM);
 	}
 
 	smgrclose(srel);
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 5bb5ebba23..5d93850ccc 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -119,6 +119,7 @@
 #include "replication/logical.h"
 #include "replication/slot.h"
 #include "storage/bufmgr.h"
+#include "storage/directmgr.h"
 #include "storage/fd.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
@@ -152,6 +153,7 @@ typedef struct RewriteStateData
 	HTAB	   *rs_old_new_tid_map; /* unmatched B tuples */
 	HTAB	   *rs_logical_mappings;	/* logical remapping files */
 	uint32		rs_num_rewrite_mappings;	/* # in memory mappings */
+	UnBufferedWriteState rs_unbuffered_wstate;
 }			RewriteStateData;
 
 /*
@@ -264,6 +266,9 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
 	state->rs_freeze_xid = freeze_xid;
 	state->rs_cutoff_multi = cutoff_multi;
 	state->rs_cxt = rw_cxt;
+	state->rs_unbuffered_wstate.smgr_rel = RelationGetSmgr(state->rs_new_rel);
+
+	unbuffered_prep(&state->rs_unbuffered_wstate);
 
 	/* Initialize hash tables used to track update chains */
 	hash_ctl.keysize = sizeof(TidHashKey);
@@ -324,21 +329,12 @@ end_heap_rewrite(RewriteState state)
 						state->rs_buffer,
 						true);
 
-		PageSetChecksumInplace(state->rs_buffer, state->rs_blockno);
-
-		smgrextend(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,
-				   state->rs_blockno, (char *) state->rs_buffer, true);
+		unbuffered_extend(&state->rs_unbuffered_wstate, MAIN_FORKNUM,
+				state->rs_blockno, state->rs_buffer, false);
 	}
 
-	/*
-	 * When we WAL-logged rel pages, we must nonetheless fsync them.  The
-	 * reason is the same as in storage.c's RelationCopyStorage(): we're
-	 * writing data that's not in shared buffers, and so a CHECKPOINT
-	 * occurring during the rewriteheap operation won't have fsync'd data we
-	 * wrote before the checkpoint.
-	 */
 	if (RelationNeedsWAL(state->rs_new_rel))
-		smgrimmedsync(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM);
+		unbuffered_finish(&state->rs_unbuffered_wstate, MAIN_FORKNUM);
 
 	logical_end_heap_rewrite(state);
 
@@ -690,10 +686,8 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
 			 * need for smgr to schedule an fsync for this write; we'll do it
 			 * ourselves in end_heap_rewrite.
 			 */
-			PageSetChecksumInplace(page, state->rs_blockno);
-
-			smgrextend(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,
-					   state->rs_blockno, (char *) page, true);
+			unbuffered_extend(&state->rs_unbuffered_wstate, MAIN_FORKNUM,
+					state->rs_blockno, page, false);
 
 			state->rs_blockno++;
 			state->rs_buffer_valid = false;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 9032d4758f..1a7482e267 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -92,6 +92,7 @@
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/bufmgr.h"
+#include "storage/directmgr.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
 #include "utils/inval.h"
@@ -616,7 +617,9 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
 	BlockNumber vm_nblocks_now;
 	PGAlignedBlock pg;
 	SMgrRelation reln;
+	UnBufferedWriteState ub_wstate;
 
+	ub_wstate.smgr_rel = RelationGetSmgr(rel);
 	PageInit((Page) pg.data, BLCKSZ, 0);
 
 	/*
@@ -654,9 +657,8 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
 	/* Now extend the file */
 	while (vm_nblocks_now < vm_nblocks)
 	{
-		PageSetChecksumInplace((Page) pg.data, vm_nblocks_now);
-
-		smgrextend(reln, VISIBILITYMAP_FORKNUM, vm_nblocks_now, pg.data, false);
+		// TODO: aren't these pages empty? why checksum them
+		unbuffered_extend(&ub_wstate, VISIBILITYMAP_FORKNUM, vm_nblocks_now, (Page) pg.data, false);
 		vm_nblocks_now++;
 	}
 
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 13024af2fa..6ab6651420 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -29,6 +29,7 @@
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "storage/condition_variable.h"
+#include "storage/directmgr.h"
 #include "storage/indexfsm.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
@@ -151,6 +152,11 @@ void
 btbuildempty(Relation index)
 {
 	Page		metapage;
+	UnBufferedWriteState wstate;
+
+	wstate.smgr_rel = RelationGetSmgr(index);
+
+	unbuffered_prep(&wstate);
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
@@ -163,18 +169,15 @@ btbuildempty(Relation index)
 	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
-	PageSetChecksumInplace(metapage, BTREE_METAPAGE);
-	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BTREE_METAPAGE,
-			  (char *) metapage, true);
+	unbuffered_write(&wstate, INIT_FORKNUM, BTREE_METAPAGE, metapage);
 	log_newpage(&RelationGetSmgr(index)->smgr_rnode.node, INIT_FORKNUM,
 				BTREE_METAPAGE, metapage, true);
 
 	/*
-	 * An immediate sync is required even if we xlog'd the page, because the
-	 * write did not go through shared_buffers and therefore a concurrent
-	 * checkpoint may have moved the redo pointer past our xlog record.
+	 * Even though we xlog'd the page, a concurrent checkpoint may have moved
+	 * the redo pointer past our xlog record, so we may still need to fsync.
 	 */
-	smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
+	unbuffered_finish(&wstate, INIT_FORKNUM);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index dc220146fd..5687acd99d 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -57,6 +57,7 @@
 #include "executor/instrument.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "storage/directmgr.h"
 #include "storage/smgr.h"
 #include "tcop/tcopprot.h"		/* pgrminclude ignore */
 #include "utils/rel.h"
@@ -253,6 +254,7 @@ typedef struct BTWriteState
 	BlockNumber btws_pages_alloced; /* # pages allocated */
 	BlockNumber btws_pages_written; /* # pages written out */
 	Page		btws_zeropage;	/* workspace for filling zeroes */
+	UnBufferedWriteState ub_wstate;
 } BTWriteState;
 
 
@@ -560,6 +562,8 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
+	wstate.ub_wstate.smgr_rel = RelationGetSmgr(btspool->index);
+	wstate.ub_wstate.redo = InvalidXLogRecPtr;
 	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 	/* _bt_mkscankey() won't set allequalimage without metapage */
 	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
@@ -656,31 +660,19 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 		if (!wstate->btws_zeropage)
 			wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
 		/* don't set checksum for all-zero page */
-		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
-				   wstate->btws_pages_written++,
-				   (char *) wstate->btws_zeropage,
-				   true);
+		unbuffered_extend(&wstate->ub_wstate, MAIN_FORKNUM, wstate->btws_pages_written++, wstate->btws_zeropage, true);
 	}
 
-	PageSetChecksumInplace(page, blkno);
 
-	/*
-	 * Now write the page.  There's no need for smgr to schedule an fsync for
-	 * this write; we'll do it ourselves before ending the build.
-	 */
+	/* Now write the page. Either we are extending the file... */
 	if (blkno == wstate->btws_pages_written)
 	{
-		/* extending the file... */
-		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM, blkno,
-				   (char *) page, true);
+		unbuffered_extend(&wstate->ub_wstate, MAIN_FORKNUM, blkno, page, false);
 		wstate->btws_pages_written++;
 	}
+	/* or we are overwriting a block we zero-filled before. */
 	else
-	{
-		/* overwriting a block we zero-filled before */
-		smgrwrite(RelationGetSmgr(wstate->index), MAIN_FORKNUM, blkno,
-				  (char *) page, true);
-	}
+		unbuffered_write(&wstate->ub_wstate, MAIN_FORKNUM, blkno, page);
 
 	pfree(page);
 }
@@ -1189,6 +1181,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	int64		tuples_done = 0;
 	bool		deduplicate;
 
+
+	unbuffered_prep(&wstate->ub_wstate);
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
 
@@ -1415,17 +1409,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	/* Close down final pages and write the metapage */
 	_bt_uppershutdown(wstate, state);
 
-	/*
-	 * When we WAL-logged index pages, we must nonetheless fsync index files.
-	 * Since we're building outside shared buffers, a CHECKPOINT occurring
-	 * during the build has no way to flush the previously written data to
-	 * disk (indeed it won't know the index even exists).  A crash later on
-	 * would replay WAL from the checkpoint, therefore it wouldn't replay our
-	 * earlier WAL entries. If we do not fsync those pages here, they might
-	 * still not be on disk when the crash occurs.
-	 */
 	if (wstate->btws_use_wal)
-		smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
+		unbuffered_finish(&wstate->ub_wstate, MAIN_FORKNUM);
 }
 
 /*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index bfb74049d0..9fd5686b8d 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -25,6 +25,7 @@
 #include "catalog/index.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/directmgr.h"
 #include "storage/smgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -156,6 +157,10 @@ void
 spgbuildempty(Relation index)
 {
 	Page		page;
+	UnBufferedWriteState wstate;
+
+	wstate.smgr_rel = RelationGetSmgr(index);
+	unbuffered_prep(&wstate);
 
 	/* Construct metapage. */
 	page = (Page) palloc(BLCKSZ);
@@ -168,36 +173,30 @@ spgbuildempty(Relation index)
 	 * of their existing content when the corresponding create records are
 	 * replayed.
 	 */
-	PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
-	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
-			  (char *) page, true);
+	unbuffered_write(&wstate, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO, page);
 	log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
 				SPGIST_METAPAGE_BLKNO, page, true);
 
 	/* Likewise for the root page. */
 	SpGistInitPage(page, SPGIST_LEAF);
 
-	PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO);
-	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, SPGIST_ROOT_BLKNO,
-			  (char *) page, true);
+	unbuffered_write(&wstate, INIT_FORKNUM, SPGIST_ROOT_BLKNO, page);
 	log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
 				SPGIST_ROOT_BLKNO, page, true);
 
 	/* Likewise for the null-tuples root page. */
 	SpGistInitPage(page, SPGIST_LEAF | SPGIST_NULLS);
 
-	PageSetChecksumInplace(page, SPGIST_NULL_BLKNO);
-	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, SPGIST_NULL_BLKNO,
-			  (char *) page, true);
+	unbuffered_write(&wstate, INIT_FORKNUM, SPGIST_NULL_BLKNO, page);
 	log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
 				SPGIST_NULL_BLKNO, page, true);
 
 	/*
-	 * An immediate sync is required even if we xlog'd the pages, because the
-	 * writes did not go through shared buffers and therefore a concurrent
-	 * checkpoint may have moved the redo pointer past our xlog record.
+	 * Because the writes did not go through shared buffers, if a concurrent
+	 * checkpoint moved the redo pointer past our xlog record, an immediate
+	 * sync is required even if we xlog'd the pages.
 	 */
-	smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
+	unbuffered_finish(&wstate, INIT_FORKNUM);
 }
 
 /*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c9d4cbf3ff..a06e8d005b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -13191,6 +13191,19 @@ CheckForStandbyTrigger(void)
 	return false;
 }
 
+bool RedoRecPtrChanged(XLogRecPtr comparator_ptr)
+{
+	XLogRecPtr ptr;
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ptr = XLogCtl->RedoRecPtr;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	if (RedoRecPtr < ptr)
+		RedoRecPtr = ptr;
+
+	return RedoRecPtr != comparator_ptr;
+}
+
 /*
  * Remove the files signaling a standby promotion request.
  */
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 9b8075536a..05e66d0434 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -28,6 +28,7 @@
 #include "catalog/storage.h"
 #include "catalog/storage_xlog.h"
 #include "miscadmin.h"
+#include "storage/directmgr.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
 #include "utils/hsearch.h"
@@ -420,6 +421,10 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 	bool		copying_initfork;
 	BlockNumber nblocks;
 	BlockNumber blkno;
+	UnBufferedWriteState wstate;
+
+	wstate.smgr_rel = dst;
+	unbuffered_prep(&wstate);
 
 	page = (Page) buf.data;
 
@@ -477,27 +482,11 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 		if (use_wal)
 			log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page, false);
 
-		PageSetChecksumInplace(page, blkno);
-
-		/*
-		 * Now write the page.  We say skipFsync = true because there's no
-		 * need for smgr to schedule an fsync for this write; we'll do it
-		 * ourselves below.
-		 */
-		smgrextend(dst, forkNum, blkno, buf.data, true);
+		unbuffered_extend(&wstate, forkNum, blkno, buf.data, false);
 	}
 
-	/*
-	 * When we WAL-logged rel pages, we must nonetheless fsync them.  The
-	 * reason is that since we're copying outside shared buffers, a CHECKPOINT
-	 * occurring during the copy has no way to flush the previously written
-	 * data to disk (indeed it won't know the new rel even exists).  A crash
-	 * later on would replay WAL from the checkpoint, therefore it wouldn't
-	 * replay our earlier WAL entries. If we do not fsync those pages here,
-	 * they might still not be on disk when the crash occurs.
-	 */
 	if (use_wal || copying_initfork)
-		smgrimmedsync(dst, forkNum);
+		unbuffered_finish(&wstate, forkNum);
 }
 
 /*
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca2..501fae5f9d 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = buffer direct file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/direct/Makefile b/src/backend/storage/direct/Makefile
new file mode 100644
index 0000000000..d82bbed48c
--- /dev/null
+++ b/src/backend/storage/direct/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for storage/direct
+#
+# IDENTIFICATION
+#    src/backend/storage/direct/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/storage/direct
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = directmgr.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/direct/directmgr.c b/src/backend/storage/direct/directmgr.c
new file mode 100644
index 0000000000..d33c2a7a5f
--- /dev/null
+++ b/src/backend/storage/direct/directmgr.c
@@ -0,0 +1,57 @@
+/*-------------------------------------------------------------------------
+ *
+ * directmgr.c
+ *	  routines for managing unbuffered IO
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/direct/directmgr.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+
+#include "access/xlog.h"
+#include "storage/directmgr.h"
+#include "utils/rel.h"
+
+void unbuffered_prep(UnBufferedWriteState *wstate)
+{
+	wstate->redo = GetRedoRecPtr();
+}
+
+/*
+ * When writing data outside shared buffers, a concurrent CHECKPOINT can move
+ * the redo pointer past our WAL entries and won't flush our data to disk. If
+ * the database crashes before the data makes it to disk, our WAL won't be
+ * replayed and the data will be lost.
+ * Thus, if a CHECKPOINT begins between unbuffered_prep() and
+ * unbuffered_finish(), the backend must fsync the data itself.
+ */
+void unbuffered_finish(UnBufferedWriteState *wstate, ForkNumber forknum)
+{
+	if (RedoRecPtrChanged(wstate->redo))
+		smgrimmedsync(wstate->smgr_rel, forknum);
+}
+
+void
+unbuffered_write(UnBufferedWriteState *wstate, ForkNumber forknum, BlockNumber
+		blocknum, Page page)
+{
+	PageSetChecksumInplace(page, blocknum);
+	smgrwrite(wstate->smgr_rel, forknum, blocknum, (char *) page, false);
+}
+
+void
+unbuffered_extend(UnBufferedWriteState *wstate, ForkNumber forknum, BlockNumber
+		blocknum, Page page, bool empty)
+{
+	if (!empty)
+		PageSetChecksumInplace(page, blocknum);
+	smgrextend(wstate->smgr_rel, forknum, blocknum, (char *) page, false);
+}
+
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index c88cb91f06..d31f75bd60 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -26,6 +26,7 @@
 #include "access/htup_details.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
+#include "storage/directmgr.h"
 #include "storage/freespace.h"
 #include "storage/fsm_internals.h"
 #include "storage/lmgr.h"
@@ -608,6 +609,9 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
 	BlockNumber fsm_nblocks_now;
 	PGAlignedBlock pg;
 	SMgrRelation reln;
+	UnBufferedWriteState ub_wstate;
+
+	ub_wstate.smgr_rel = RelationGetSmgr(rel);
 
 	PageInit((Page) pg.data, BLCKSZ, 0);
 
@@ -647,10 +651,8 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
 	/* Extend as needed. */
 	while (fsm_nblocks_now < fsm_nblocks)
 	{
-		PageSetChecksumInplace((Page) pg.data, fsm_nblocks_now);
-
-		smgrextend(reln, FSM_FORKNUM, fsm_nblocks_now,
-				   pg.data, false);
+		// TODO: why was it checksumming all zero pages?
+		unbuffered_extend(&ub_wstate, FSM_FORKNUM, fsm_nblocks_now, (Page) pg.data, false);
 		fsm_nblocks_now++;
 	}
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index bb0c52686a..4961067f81 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -312,6 +312,7 @@ extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(TimeLineID *insertTLI);
 extern TimeLineID GetWALInsertionTimeLine(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern bool RedoRecPtrChanged(XLogRecPtr comparator_ptr);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool PromoteIsTriggered(void);
-- 
2.30.2

v3-0002-Use-shared-buffers-when-possible-for-index-build.patchtext/x-patch; charset=US-ASCII; name=v3-0002-Use-shared-buffers-when-possible-for-index-build.patchDownload

From f1bd40b2880d9cffeb2e313719c3fd6b4c8e5f04 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 10 Jan 2022 17:34:01 -0500
Subject: [PATCH v3 2/2] Use shared buffers when possible for index build

When there are not too many tuples, building the index in shared buffers
makes sense. It allows the buffer manager to handle how best to do the
IO.
---
 src/backend/access/nbtree/nbtree.c  |  34 ++--
 src/backend/access/nbtree/nbtsort.c | 268 ++++++++++++++++++++++------
 2 files changed, 225 insertions(+), 77 deletions(-)

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 6ab6651420..0714c7fdd6 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -151,33 +151,27 @@ bthandler(PG_FUNCTION_ARGS)
 void
 btbuildempty(Relation index)
 {
+	/*
+	 * Since this only writes one page, use shared buffers.
+	 */
 	Page		metapage;
-	UnBufferedWriteState wstate;
-
-	wstate.smgr_rel = RelationGetSmgr(index);
-
-	unbuffered_prep(&wstate);
-
-	/* Construct metapage. */
-	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
+	Buffer metabuf;
 
 	/*
-	 * Write the page and log it.  It might seem that an immediate sync would
-	 * be sufficient to guarantee that the file exists on disk, but recovery
-	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
-	 * this even when wal_level=minimal.
+	 * Allocate a buffer for metapage and initialize metapage.
 	 */
-	unbuffered_write(&wstate, INIT_FORKNUM, BTREE_METAPAGE, metapage);
-	log_newpage(&RelationGetSmgr(index)->smgr_rnode.node, INIT_FORKNUM,
-				BTREE_METAPAGE, metapage, true);
+	metabuf = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_ZERO_AND_LOCK, NULL);
+	metapage = BufferGetPage(metabuf);
+	_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
 
 	/*
-	 * Even though we xlog'd the page, a concurrent checkpoint may have moved
-	 * the redo pointer past our xlog record, so we may still need to fsync.
+	 * Mark metapage buffer as dirty and XLOG it
 	 */
-	unbuffered_finish(&wstate, INIT_FORKNUM);
+	START_CRIT_SECTION();
+	MarkBufferDirty(metabuf);
+	log_newpage_buffer(metabuf, true);
+	END_CRIT_SECTION();
+	_bt_relbuf(index, metabuf);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 5687acd99d..68b53e6c04 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -234,6 +234,7 @@ typedef struct BTPageState
 {
 	Page		btps_page;		/* workspace for page building */
 	BlockNumber btps_blkno;		/* block # to write this page at */
+	Buffer btps_buf; /* buffer to write this page to */
 	IndexTuple	btps_lowkey;	/* page's strict lower bound pivot tuple */
 	OffsetNumber btps_lastoff;	/* last item offset loaded */
 	Size		btps_lastextra; /* last item's extra posting list space */
@@ -251,10 +252,25 @@ typedef struct BTWriteState
 	Relation	index;
 	BTScanInsert inskey;		/* generic insertion scankey */
 	bool		btws_use_wal;	/* dump pages to WAL? */
-	BlockNumber btws_pages_alloced; /* # pages allocated */
-	BlockNumber btws_pages_written; /* # pages written out */
+	BlockNumber btws_pages_alloced; /* # pages allocated for index builds outside SB */
+	BlockNumber btws_pages_written; /* # pages written out for index builds outside SB */
 	Page		btws_zeropage;	/* workspace for filling zeroes */
 	UnBufferedWriteState ub_wstate;
+	/*
+	 * Allocate a new btree page. This does not initialize the page.
+	 */
+	Page (*_bt_bl_alloc_page) (struct BTWriteState *wstate, BlockNumber
+			*blockno, Buffer *buf);
+	/*
+	 * Emit a completed btree page, and release the working storage.
+	 */
+	void (*_bt_blwritepage) (struct BTWriteState *wstate, Page page,
+			BlockNumber blkno, Buffer buf);
+
+	void (*_bt_bl_unbuffered_prep) (UnBufferedWriteState *wstate);
+
+	void (*_bt_bl_unbuffered_finish) (UnBufferedWriteState *wstate, ForkNumber
+			forknum);
 } BTWriteState;
 
 
@@ -263,10 +279,22 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
+static void _bt_leafbuild(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
-static Page _bt_blnewpage(uint32 level);
+
+static Page
+_bt_bl_alloc_page_direct(BTWriteState *wstate, BlockNumber *blockno, Buffer *buf);
+static Page
+_bt_bl_alloc_page_shared(BTWriteState *wstate, BlockNumber *blockno, Buffer *buf);
+
+static Page _bt_blnewpage(uint32 level, Page page);
+
+static void
+_bt_blwritepage_direct(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf);
+static void
+_bt_blwritepage_shared(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf);
+
 static BTPageState *_bt_pagestate(BTWriteState *wstate, uint32 level);
 static void _bt_slideleft(Page rightmostpage);
 static void _bt_sortaddtup(Page page, Size itemsize,
@@ -277,9 +305,10 @@ static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
 										  BTPageState *state,
 										  BTDedupState dstate);
-static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
+static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state, Buffer
+		metabuf, Page metapage);
 static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
 							   int request);
 static void _bt_end_parallel(BTLeader *btleader);
@@ -292,6 +321,21 @@ static void _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 									   Sharedsort *sharedsort2, int sortmem,
 									   bool progress);
 
+#define BT_BUILD_SB_THRESHOLD 1024
+
+static const BTWriteState wstate_shared = {
+	._bt_bl_alloc_page = _bt_bl_alloc_page_shared,
+	._bt_blwritepage = _bt_blwritepage_shared,
+	._bt_bl_unbuffered_prep = NULL,
+	._bt_bl_unbuffered_finish = NULL,
+};
+
+static const BTWriteState wstate_direct = {
+	._bt_bl_alloc_page = _bt_bl_alloc_page_direct,
+	._bt_blwritepage = _bt_blwritepage_direct,
+	._bt_bl_unbuffered_prep = unbuffered_prep,
+	._bt_bl_unbuffered_finish = unbuffered_finish,
+};
 
 /*
  *	btbuild() -- build a new btree index.
@@ -301,6 +345,7 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 {
 	IndexBuildResult *result;
 	BTBuildState buildstate;
+	BTWriteState writestate;
 	double		reltuples;
 
 #ifdef BTREE_BUILD_STATS
@@ -330,8 +375,12 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
+	 *
+	 * Don't use shared buffers if the number of tuples is too large.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2);
+	writestate = reltuples < BT_BUILD_SB_THRESHOLD ? wstate_shared : wstate_direct;
+
+	_bt_leafbuild(&writestate, buildstate.spool, buildstate.spool2);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
@@ -537,10 +586,8 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
+_bt_leafbuild(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 {
-	BTWriteState wstate;
-
 #ifdef BTREE_BUILD_STATS
 	if (log_btree_build_stats)
 	{
@@ -560,23 +607,38 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 		tuplesort_performsort(btspool2->sortstate);
 	}
 
-	wstate.heap = btspool->heap;
-	wstate.index = btspool->index;
-	wstate.ub_wstate.smgr_rel = RelationGetSmgr(btspool->index);
-	wstate.ub_wstate.redo = InvalidXLogRecPtr;
-	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
-	/* _bt_mkscankey() won't set allequalimage without metapage */
-	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
-	wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
+	wstate->heap = btspool->heap;
+	wstate->index = btspool->index;
+	wstate->ub_wstate.smgr_rel = RelationGetSmgr(btspool->index);
+	wstate->ub_wstate.redo = InvalidXLogRecPtr;
+	wstate->inskey = _bt_mkscankey(wstate->index, NULL);
+	/* _bt-mkscankey() won't set allequalimage without metapage */
+	wstate->inskey->allequalimage = _bt_allequalimage(wstate->index, true);
+	wstate->btws_use_wal = RelationNeedsWAL(wstate->index);
 
 	/* reserve the metapage */
-	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
-	wstate.btws_pages_written = 0;
-	wstate.btws_zeropage = NULL;	/* until needed */
+	wstate->btws_pages_alloced = 0;
+	wstate->btws_pages_written = 0;
+	wstate->btws_zeropage = NULL;
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
-	_bt_load(&wstate, btspool, btspool2);
+
+	/*
+	 * If not using shared buffers, save the redo pointer location in case a
+	 * checkpoint begins during the index build.
+	 */
+	if (wstate->_bt_bl_unbuffered_prep && wstate->btws_use_wal)
+		wstate->_bt_bl_unbuffered_prep(&wstate->ub_wstate);
+
+	_bt_load(wstate, btspool, btspool2);
+
+	/*
+	 * If not using shared buffers, check if backend must fsync the page
+	 * itself.
+	 */
+	if (wstate->_bt_bl_unbuffered_finish && wstate->btws_use_wal)
+		wstate->_bt_bl_unbuffered_finish(&wstate->ub_wstate, MAIN_FORKNUM);
 }
 
 /*
@@ -609,15 +671,15 @@ _bt_build_callback(Relation index,
 }
 
 /*
- * allocate workspace for a new, clean btree page, not linked to any siblings.
+ * Set up workspace for a new, clean btree page, not linked to any siblings.
+ * Caller must allocate the passed in page.
  */
 static Page
-_bt_blnewpage(uint32 level)
+_bt_blnewpage(uint32 level, Page page)
 {
-	Page		page;
 	BTPageOpaque opaque;
 
-	page = (Page) palloc(BLCKSZ);
+	Assert(page);
 
 	/* Zero the page and set up standard page header info */
 	_bt_pageinit(page, BLCKSZ);
@@ -635,18 +697,18 @@ _bt_blnewpage(uint32 level)
 	return page;
 }
 
-/*
- * emit a completed btree page, and release the working storage.
- */
 static void
-_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
+_bt_blwritepage_direct(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf)
 {
-	/* XLOG stuff */
+	/*
+	 * Indexes built outside shared buffers must XLOG the page, issue the page
+	 * write request, and take care of fsync'ing the page to the device if a
+	 * checkpoint began after the beginning of the index build.
+	 *
+	 * Use the XLOG_FPI record type for this.
+	 */
 	if (wstate->btws_use_wal)
-	{
-		/* We use the XLOG_FPI record type for this */
 		log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
-	}
 
 	/*
 	 * If we have to write pages nonsequentially, fill in the space with
@@ -660,7 +722,8 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 		if (!wstate->btws_zeropage)
 			wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
 		/* don't set checksum for all-zero page */
-		unbuffered_extend(&wstate->ub_wstate, MAIN_FORKNUM, wstate->btws_pages_written++, wstate->btws_zeropage, true);
+		unbuffered_extend(&wstate->ub_wstate, MAIN_FORKNUM,
+				wstate->btws_pages_written++, wstate->btws_zeropage, true);
 	}
 
 
@@ -677,6 +740,61 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 	pfree(page);
 }
 
+static void
+_bt_blwritepage_shared(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf)
+{
+	/*
+	 * Indexes built in shared buffers need only to mark the buffer as dirty
+	 * and XLOG it.
+	 */
+	Assert(buf);
+	START_CRIT_SECTION();
+	MarkBufferDirty(buf);
+	if (wstate->btws_use_wal)
+		log_newpage_buffer(buf, true);
+	END_CRIT_SECTION();
+	_bt_relbuf(wstate->index, buf);
+}
+
+static Page
+_bt_bl_alloc_page_direct(BTWriteState *wstate, BlockNumber *blockno, Buffer *buf)
+{
+	 /* buf is only used when using shared buffers, so set it to InvalidBuffer */
+	*buf = InvalidBuffer;
+
+	/*
+	 * Assign block number for the page.
+	 * This will be used to link to sibling page(s) later and, if this is the
+	 * initial page in the level, saved in the BTPageState
+	 */
+	*blockno = wstate->btws_pages_alloced++;
+
+	/* now allocate and set up the new page */
+	return palloc(BLCKSZ);
+}
+
+static Page
+_bt_bl_alloc_page_shared(BTWriteState *wstate, BlockNumber *blockno, Buffer *buf)
+{
+	/*
+	 * Find a shared buffer for the page. Pass mode RBM_ZERO_AND_LOCK to get an
+	 * exclusive lock on the buffer content. No lock on the relation as a whole
+	 * is needed (as in LockRelationForExtension()) because the initial index
+	 * build is not yet complete.
+	 */
+	*buf = ReadBufferExtended(wstate->index, MAIN_FORKNUM, P_NEW,
+			RBM_ZERO_AND_LOCK, NULL);
+
+	/*
+	 * bufmgr will assign a block number for the new page.
+	 * This will be used to link to sibling page(s) later and, if this is the
+	 * initial page in the level, saved in the BTPageState
+	 */
+	*blockno = BufferGetBlockNumber(*buf);
+
+	return BufferGetPage(*buf);
+}
+
 /*
  * allocate and initialize a new BTPageState.  the returned structure
  * is suitable for immediate use by _bt_buildadd.
@@ -684,13 +802,20 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 static BTPageState *
 _bt_pagestate(BTWriteState *wstate, uint32 level)
 {
+	Buffer buf;
+	BlockNumber blockno;
+
 	BTPageState *state = (BTPageState *) palloc0(sizeof(BTPageState));
 
-	/* create initial page for level */
-	state->btps_page = _bt_blnewpage(level);
+	/*
+	 * Allocate and initialize initial page for the level, and if using shared
+	 * buffers, extend the relation and allocate a shared buffer for the block.
+	 */
+	state->btps_page = _bt_blnewpage(level, wstate->_bt_bl_alloc_page(wstate,
+				&blockno, &buf));
 
-	/* and assign it a page position */
-	state->btps_blkno = wstate->btws_pages_alloced++;
+	state->btps_blkno = blockno;
+	state->btps_buf = buf;
 
 	state->btps_lowkey = NULL;
 	/* initialize lastoff so first item goes into P_FIRSTKEY */
@@ -825,6 +950,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 {
 	Page		npage;
 	BlockNumber nblkno;
+	Buffer nbuf;
 	OffsetNumber last_off;
 	Size		last_truncextra;
 	Size		pgspc;
@@ -839,6 +965,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 
 	npage = state->btps_page;
 	nblkno = state->btps_blkno;
+	nbuf = state->btps_buf;
 	last_off = state->btps_lastoff;
 	last_truncextra = state->btps_lastextra;
 	state->btps_lastextra = truncextra;
@@ -895,15 +1022,14 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 		 */
 		Page		opage = npage;
 		BlockNumber oblkno = nblkno;
+		Buffer obuf = nbuf;
 		ItemId		ii;
 		ItemId		hii;
 		IndexTuple	oitup;
 
-		/* Create new page of same level */
-		npage = _bt_blnewpage(state->btps_level);
-
-		/* and assign it a page position */
-		nblkno = wstate->btws_pages_alloced++;
+		/* Create and initialize a new page of same level */
+		npage = _bt_blnewpage(state->btps_level,
+				wstate->_bt_bl_alloc_page(wstate, &nblkno, &nbuf));
 
 		/*
 		 * We copy the last item on the page into the new page, and then
@@ -1013,9 +1139,12 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 
 		/*
 		 * Write out the old page.  We never need to touch it again, so we can
-		 * free the opage workspace too.
+		 * free the opage workspace too. obuf has been released and is no longer
+		 * valid.
 		 */
-		_bt_blwritepage(wstate, opage, oblkno);
+		 wstate->_bt_blwritepage(wstate, opage, oblkno, obuf);
+		 obuf = InvalidBuffer;
+		 opage = NULL;
 
 		/*
 		 * Reset last_off to point to new page
@@ -1050,6 +1179,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 
 	state->btps_page = npage;
 	state->btps_blkno = nblkno;
+	state->btps_buf = nbuf;
 	state->btps_lastoff = last_off;
 }
 
@@ -1095,12 +1225,12 @@ _bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
  * Finish writing out the completed btree.
  */
 static void
-_bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
+_bt_uppershutdown(BTWriteState *wstate, BTPageState *state, Buffer metabuf,
+		Page metapage)
 {
 	BTPageState *s;
 	BlockNumber rootblkno = P_NONE;
 	uint32		rootlevel = 0;
-	Page		metapage;
 
 	/*
 	 * Each iteration of this loop completes one more level of the tree.
@@ -1146,20 +1276,24 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		 * back one slot.  Then we can dump out the page.
 		 */
 		_bt_slideleft(s->btps_page);
-		_bt_blwritepage(wstate, s->btps_page, s->btps_blkno);
+		wstate->_bt_blwritepage(wstate, s->btps_page, s->btps_blkno, s->btps_buf);
+		s->btps_buf = InvalidBuffer;
 		s->btps_page = NULL;	/* writepage freed the workspace */
 	}
 
 	/*
-	 * As the last step in the process, construct the metapage and make it
+	 * As the last step in the process, initialize the metapage and make it
 	 * point to the new root (unless we had no data at all, in which case it's
 	 * set to point to "P_NONE").  This changes the index to the "valid" state
 	 * by filling in a valid magic number in the metapage.
+	 * After this, metapage will have been freed or invalid and metabuf, if ever
+	 * valid, will have been released.
 	 */
-	metapage = (Page) palloc(BLCKSZ);
 	_bt_initmetapage(metapage, rootblkno, rootlevel,
 					 wstate->inskey->allequalimage);
-	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
+	wstate->_bt_blwritepage(wstate, metapage, BTREE_METAPAGE, metabuf);
+	metabuf = InvalidBuffer;
+	metapage = NULL;
 }
 
 /*
@@ -1169,6 +1303,10 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 static void
 _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 {
+	Page metapage;
+	BlockNumber metablkno;
+	Buffer metabuf;
+
 	BTPageState *state = NULL;
 	bool		merge = (btspool2 != NULL);
 	IndexTuple	itup,
@@ -1181,11 +1319,30 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	int64		tuples_done = 0;
 	bool		deduplicate;
 
-
-	unbuffered_prep(&wstate->ub_wstate);
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
 
+	/*
+	 * Reserve block 0 for the metapage up front.
+	 *
+	 * When using the shared buffers API it is easier to allocate the buffer
+	 * for block 0 first instead of trying skip block 0 and allocate it at the
+	 * end of index build.
+	 *
+	 * When not using the shared buffers API, there is no harm in allocating
+	 * the metapage first. When block 1 is written, the direct writepage
+	 * function will zero-fill block 0. When writing out the metapage at the
+	 * end of index build, it will overwrite that block 0.
+	 *
+	 * The metapage will be initialized and written out at the end of the index
+	 * build when all of the information needed to do so is available.
+	 *
+	 * The block number will always be BTREE_METAPAGE, so the metablkno
+	 * variable is unused and only created to avoid a special case in the
+	 * direct alloc function.
+	 */
+	metapage = wstate->_bt_bl_alloc_page(wstate, &metablkno, &metabuf);
+
 	if (merge)
 	{
 		/*
@@ -1407,10 +1564,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	}
 
 	/* Close down final pages and write the metapage */
-	_bt_uppershutdown(wstate, state);
-
-	if (wstate->btws_use_wal)
-		unbuffered_finish(&wstate->ub_wstate, MAIN_FORKNUM);
+	_bt_uppershutdown(wstate, state, metabuf, metapage);
 }
 
 /*
-- 
2.30.2

#10

melanieplageman@gmail.com

about 4 years ago

In reply to: Melanie Plageman (#9)

2 attachment(s)

Re: Avoiding smgrimmedsync() during nbtree index builds

On Mon, Jan 10, 2022 at 5:50 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I have attached a v3 which includes two commits -- one of which
implements the directmgr API and uses it and the other which adds
functionality to use either directmgr or bufmgr API during index build.

Also registering for march commitfest.

Forgot directmgr.h. Attached v4

- Melanie

Attachments:

v4-0002-Use-shared-buffers-when-possible-for-index-build.patchtext/x-patch; charset=US-ASCII; name=v4-0002-Use-shared-buffers-when-possible-for-index-build.patchDownload

From 291d5ea92e52f5ac4ff637f383c9bc92973c9e2b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 10 Jan 2022 17:34:01 -0500
Subject: [PATCH v4 2/2] Use shared buffers when possible for index build

When there are not too many tuples, building the index in shared buffers
makes sense. It allows the buffer manager to handle how best to do the
IO.
---
 src/backend/access/nbtree/nbtree.c  |  34 ++--
 src/backend/access/nbtree/nbtsort.c | 268 ++++++++++++++++++++++------
 2 files changed, 225 insertions(+), 77 deletions(-)

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 6ab6651420..0714c7fdd6 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -151,33 +151,27 @@ bthandler(PG_FUNCTION_ARGS)
 void
 btbuildempty(Relation index)
 {
+	/*
+	 * Since this only writes one page, use shared buffers.
+	 */
 	Page		metapage;
-	UnBufferedWriteState wstate;
-
-	wstate.smgr_rel = RelationGetSmgr(index);
-
-	unbuffered_prep(&wstate);
-
-	/* Construct metapage. */
-	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
+	Buffer metabuf;
 
 	/*
-	 * Write the page and log it.  It might seem that an immediate sync would
-	 * be sufficient to guarantee that the file exists on disk, but recovery
-	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
-	 * this even when wal_level=minimal.
+	 * Allocate a buffer for metapage and initialize metapage.
 	 */
-	unbuffered_write(&wstate, INIT_FORKNUM, BTREE_METAPAGE, metapage);
-	log_newpage(&RelationGetSmgr(index)->smgr_rnode.node, INIT_FORKNUM,
-				BTREE_METAPAGE, metapage, true);
+	metabuf = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_ZERO_AND_LOCK, NULL);
+	metapage = BufferGetPage(metabuf);
+	_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
 
 	/*
-	 * Even though we xlog'd the page, a concurrent checkpoint may have moved
-	 * the redo pointer past our xlog record, so we may still need to fsync.
+	 * Mark metapage buffer as dirty and XLOG it
 	 */
-	unbuffered_finish(&wstate, INIT_FORKNUM);
+	START_CRIT_SECTION();
+	MarkBufferDirty(metabuf);
+	log_newpage_buffer(metabuf, true);
+	END_CRIT_SECTION();
+	_bt_relbuf(index, metabuf);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 5687acd99d..68b53e6c04 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -234,6 +234,7 @@ typedef struct BTPageState
 {
 	Page		btps_page;		/* workspace for page building */
 	BlockNumber btps_blkno;		/* block # to write this page at */
+	Buffer btps_buf; /* buffer to write this page to */
 	IndexTuple	btps_lowkey;	/* page's strict lower bound pivot tuple */
 	OffsetNumber btps_lastoff;	/* last item offset loaded */
 	Size		btps_lastextra; /* last item's extra posting list space */
@@ -251,10 +252,25 @@ typedef struct BTWriteState
 	Relation	index;
 	BTScanInsert inskey;		/* generic insertion scankey */
 	bool		btws_use_wal;	/* dump pages to WAL? */
-	BlockNumber btws_pages_alloced; /* # pages allocated */
-	BlockNumber btws_pages_written; /* # pages written out */
+	BlockNumber btws_pages_alloced; /* # pages allocated for index builds outside SB */
+	BlockNumber btws_pages_written; /* # pages written out for index builds outside SB */
 	Page		btws_zeropage;	/* workspace for filling zeroes */
 	UnBufferedWriteState ub_wstate;
+	/*
+	 * Allocate a new btree page. This does not initialize the page.
+	 */
+	Page (*_bt_bl_alloc_page) (struct BTWriteState *wstate, BlockNumber
+			*blockno, Buffer *buf);
+	/*
+	 * Emit a completed btree page, and release the working storage.
+	 */
+	void (*_bt_blwritepage) (struct BTWriteState *wstate, Page page,
+			BlockNumber blkno, Buffer buf);
+
+	void (*_bt_bl_unbuffered_prep) (UnBufferedWriteState *wstate);
+
+	void (*_bt_bl_unbuffered_finish) (UnBufferedWriteState *wstate, ForkNumber
+			forknum);
 } BTWriteState;
 
 
@@ -263,10 +279,22 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
+static void _bt_leafbuild(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
-static Page _bt_blnewpage(uint32 level);
+
+static Page
+_bt_bl_alloc_page_direct(BTWriteState *wstate, BlockNumber *blockno, Buffer *buf);
+static Page
+_bt_bl_alloc_page_shared(BTWriteState *wstate, BlockNumber *blockno, Buffer *buf);
+
+static Page _bt_blnewpage(uint32 level, Page page);
+
+static void
+_bt_blwritepage_direct(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf);
+static void
+_bt_blwritepage_shared(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf);
+
 static BTPageState *_bt_pagestate(BTWriteState *wstate, uint32 level);
 static void _bt_slideleft(Page rightmostpage);
 static void _bt_sortaddtup(Page page, Size itemsize,
@@ -277,9 +305,10 @@ static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
 										  BTPageState *state,
 										  BTDedupState dstate);
-static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
+static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state, Buffer
+		metabuf, Page metapage);
 static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
 							   int request);
 static void _bt_end_parallel(BTLeader *btleader);
@@ -292,6 +321,21 @@ static void _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 									   Sharedsort *sharedsort2, int sortmem,
 									   bool progress);
 
+#define BT_BUILD_SB_THRESHOLD 1024
+
+static const BTWriteState wstate_shared = {
+	._bt_bl_alloc_page = _bt_bl_alloc_page_shared,
+	._bt_blwritepage = _bt_blwritepage_shared,
+	._bt_bl_unbuffered_prep = NULL,
+	._bt_bl_unbuffered_finish = NULL,
+};
+
+static const BTWriteState wstate_direct = {
+	._bt_bl_alloc_page = _bt_bl_alloc_page_direct,
+	._bt_blwritepage = _bt_blwritepage_direct,
+	._bt_bl_unbuffered_prep = unbuffered_prep,
+	._bt_bl_unbuffered_finish = unbuffered_finish,
+};
 
 /*
  *	btbuild() -- build a new btree index.
@@ -301,6 +345,7 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 {
 	IndexBuildResult *result;
 	BTBuildState buildstate;
+	BTWriteState writestate;
 	double		reltuples;
 
 #ifdef BTREE_BUILD_STATS
@@ -330,8 +375,12 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
+	 *
+	 * Don't use shared buffers if the number of tuples is too large.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2);
+	writestate = reltuples < BT_BUILD_SB_THRESHOLD ? wstate_shared : wstate_direct;
+
+	_bt_leafbuild(&writestate, buildstate.spool, buildstate.spool2);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
@@ -537,10 +586,8 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
+_bt_leafbuild(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 {
-	BTWriteState wstate;
-
 #ifdef BTREE_BUILD_STATS
 	if (log_btree_build_stats)
 	{
@@ -560,23 +607,38 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 		tuplesort_performsort(btspool2->sortstate);
 	}
 
-	wstate.heap = btspool->heap;
-	wstate.index = btspool->index;
-	wstate.ub_wstate.smgr_rel = RelationGetSmgr(btspool->index);
-	wstate.ub_wstate.redo = InvalidXLogRecPtr;
-	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
-	/* _bt_mkscankey() won't set allequalimage without metapage */
-	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
-	wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
+	wstate->heap = btspool->heap;
+	wstate->index = btspool->index;
+	wstate->ub_wstate.smgr_rel = RelationGetSmgr(btspool->index);
+	wstate->ub_wstate.redo = InvalidXLogRecPtr;
+	wstate->inskey = _bt_mkscankey(wstate->index, NULL);
+	/* _bt-mkscankey() won't set allequalimage without metapage */
+	wstate->inskey->allequalimage = _bt_allequalimage(wstate->index, true);
+	wstate->btws_use_wal = RelationNeedsWAL(wstate->index);
 
 	/* reserve the metapage */
-	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
-	wstate.btws_pages_written = 0;
-	wstate.btws_zeropage = NULL;	/* until needed */
+	wstate->btws_pages_alloced = 0;
+	wstate->btws_pages_written = 0;
+	wstate->btws_zeropage = NULL;
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
-	_bt_load(&wstate, btspool, btspool2);
+
+	/*
+	 * If not using shared buffers, save the redo pointer location in case a
+	 * checkpoint begins during the index build.
+	 */
+	if (wstate->_bt_bl_unbuffered_prep && wstate->btws_use_wal)
+		wstate->_bt_bl_unbuffered_prep(&wstate->ub_wstate);
+
+	_bt_load(wstate, btspool, btspool2);
+
+	/*
+	 * If not using shared buffers, check if backend must fsync the page
+	 * itself.
+	 */
+	if (wstate->_bt_bl_unbuffered_finish && wstate->btws_use_wal)
+		wstate->_bt_bl_unbuffered_finish(&wstate->ub_wstate, MAIN_FORKNUM);
 }
 
 /*
@@ -609,15 +671,15 @@ _bt_build_callback(Relation index,
 }
 
 /*
- * allocate workspace for a new, clean btree page, not linked to any siblings.
+ * Set up workspace for a new, clean btree page, not linked to any siblings.
+ * Caller must allocate the passed in page.
  */
 static Page
-_bt_blnewpage(uint32 level)
+_bt_blnewpage(uint32 level, Page page)
 {
-	Page		page;
 	BTPageOpaque opaque;
 
-	page = (Page) palloc(BLCKSZ);
+	Assert(page);
 
 	/* Zero the page and set up standard page header info */
 	_bt_pageinit(page, BLCKSZ);
@@ -635,18 +697,18 @@ _bt_blnewpage(uint32 level)
 	return page;
 }
 
-/*
- * emit a completed btree page, and release the working storage.
- */
 static void
-_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
+_bt_blwritepage_direct(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf)
 {
-	/* XLOG stuff */
+	/*
+	 * Indexes built outside shared buffers must XLOG the page, issue the page
+	 * write request, and take care of fsync'ing the page to the device if a
+	 * checkpoint began after the beginning of the index build.
+	 *
+	 * Use the XLOG_FPI record type for this.
+	 */
 	if (wstate->btws_use_wal)
-	{
-		/* We use the XLOG_FPI record type for this */
 		log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
-	}
 
 	/*
 	 * If we have to write pages nonsequentially, fill in the space with
@@ -660,7 +722,8 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 		if (!wstate->btws_zeropage)
 			wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
 		/* don't set checksum for all-zero page */
-		unbuffered_extend(&wstate->ub_wstate, MAIN_FORKNUM, wstate->btws_pages_written++, wstate->btws_zeropage, true);
+		unbuffered_extend(&wstate->ub_wstate, MAIN_FORKNUM,
+				wstate->btws_pages_written++, wstate->btws_zeropage, true);
 	}
 
 
@@ -677,6 +740,61 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 	pfree(page);
 }
 
+static void
+_bt_blwritepage_shared(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf)
+{
+	/*
+	 * Indexes built in shared buffers need only to mark the buffer as dirty
+	 * and XLOG it.
+	 */
+	Assert(buf);
+	START_CRIT_SECTION();
+	MarkBufferDirty(buf);
+	if (wstate->btws_use_wal)
+		log_newpage_buffer(buf, true);
+	END_CRIT_SECTION();
+	_bt_relbuf(wstate->index, buf);
+}
+
+static Page
+_bt_bl_alloc_page_direct(BTWriteState *wstate, BlockNumber *blockno, Buffer *buf)
+{
+	 /* buf is only used when using shared buffers, so set it to InvalidBuffer */
+	*buf = InvalidBuffer;
+
+	/*
+	 * Assign block number for the page.
+	 * This will be used to link to sibling page(s) later and, if this is the
+	 * initial page in the level, saved in the BTPageState
+	 */
+	*blockno = wstate->btws_pages_alloced++;
+
+	/* now allocate and set up the new page */
+	return palloc(BLCKSZ);
+}
+
+static Page
+_bt_bl_alloc_page_shared(BTWriteState *wstate, BlockNumber *blockno, Buffer *buf)
+{
+	/*
+	 * Find a shared buffer for the page. Pass mode RBM_ZERO_AND_LOCK to get an
+	 * exclusive lock on the buffer content. No lock on the relation as a whole
+	 * is needed (as in LockRelationForExtension()) because the initial index
+	 * build is not yet complete.
+	 */
+	*buf = ReadBufferExtended(wstate->index, MAIN_FORKNUM, P_NEW,
+			RBM_ZERO_AND_LOCK, NULL);
+
+	/*
+	 * bufmgr will assign a block number for the new page.
+	 * This will be used to link to sibling page(s) later and, if this is the
+	 * initial page in the level, saved in the BTPageState
+	 */
+	*blockno = BufferGetBlockNumber(*buf);
+
+	return BufferGetPage(*buf);
+}
+
 /*
  * allocate and initialize a new BTPageState.  the returned structure
  * is suitable for immediate use by _bt_buildadd.
@@ -684,13 +802,20 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 static BTPageState *
 _bt_pagestate(BTWriteState *wstate, uint32 level)
 {
+	Buffer buf;
+	BlockNumber blockno;
+
 	BTPageState *state = (BTPageState *) palloc0(sizeof(BTPageState));
 
-	/* create initial page for level */
-	state->btps_page = _bt_blnewpage(level);
+	/*
+	 * Allocate and initialize initial page for the level, and if using shared
+	 * buffers, extend the relation and allocate a shared buffer for the block.
+	 */
+	state->btps_page = _bt_blnewpage(level, wstate->_bt_bl_alloc_page(wstate,
+				&blockno, &buf));
 
-	/* and assign it a page position */
-	state->btps_blkno = wstate->btws_pages_alloced++;
+	state->btps_blkno = blockno;
+	state->btps_buf = buf;
 
 	state->btps_lowkey = NULL;
 	/* initialize lastoff so first item goes into P_FIRSTKEY */
@@ -825,6 +950,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 {
 	Page		npage;
 	BlockNumber nblkno;
+	Buffer nbuf;
 	OffsetNumber last_off;
 	Size		last_truncextra;
 	Size		pgspc;
@@ -839,6 +965,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 
 	npage = state->btps_page;
 	nblkno = state->btps_blkno;
+	nbuf = state->btps_buf;
 	last_off = state->btps_lastoff;
 	last_truncextra = state->btps_lastextra;
 	state->btps_lastextra = truncextra;
@@ -895,15 +1022,14 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 		 */
 		Page		opage = npage;
 		BlockNumber oblkno = nblkno;
+		Buffer obuf = nbuf;
 		ItemId		ii;
 		ItemId		hii;
 		IndexTuple	oitup;
 
-		/* Create new page of same level */
-		npage = _bt_blnewpage(state->btps_level);
-
-		/* and assign it a page position */
-		nblkno = wstate->btws_pages_alloced++;
+		/* Create and initialize a new page of same level */
+		npage = _bt_blnewpage(state->btps_level,
+				wstate->_bt_bl_alloc_page(wstate, &nblkno, &nbuf));
 
 		/*
 		 * We copy the last item on the page into the new page, and then
@@ -1013,9 +1139,12 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 
 		/*
 		 * Write out the old page.  We never need to touch it again, so we can
-		 * free the opage workspace too.
+		 * free the opage workspace too. obuf has been released and is no longer
+		 * valid.
 		 */
-		_bt_blwritepage(wstate, opage, oblkno);
+		 wstate->_bt_blwritepage(wstate, opage, oblkno, obuf);
+		 obuf = InvalidBuffer;
+		 opage = NULL;
 
 		/*
 		 * Reset last_off to point to new page
@@ -1050,6 +1179,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 
 	state->btps_page = npage;
 	state->btps_blkno = nblkno;
+	state->btps_buf = nbuf;
 	state->btps_lastoff = last_off;
 }
 
@@ -1095,12 +1225,12 @@ _bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
  * Finish writing out the completed btree.
  */
 static void
-_bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
+_bt_uppershutdown(BTWriteState *wstate, BTPageState *state, Buffer metabuf,
+		Page metapage)
 {
 	BTPageState *s;
 	BlockNumber rootblkno = P_NONE;
 	uint32		rootlevel = 0;
-	Page		metapage;
 
 	/*
 	 * Each iteration of this loop completes one more level of the tree.
@@ -1146,20 +1276,24 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		 * back one slot.  Then we can dump out the page.
 		 */
 		_bt_slideleft(s->btps_page);
-		_bt_blwritepage(wstate, s->btps_page, s->btps_blkno);
+		wstate->_bt_blwritepage(wstate, s->btps_page, s->btps_blkno, s->btps_buf);
+		s->btps_buf = InvalidBuffer;
 		s->btps_page = NULL;	/* writepage freed the workspace */
 	}
 
 	/*
-	 * As the last step in the process, construct the metapage and make it
+	 * As the last step in the process, initialize the metapage and make it
 	 * point to the new root (unless we had no data at all, in which case it's
 	 * set to point to "P_NONE").  This changes the index to the "valid" state
 	 * by filling in a valid magic number in the metapage.
+	 * After this, metapage will have been freed or invalid and metabuf, if ever
+	 * valid, will have been released.
 	 */
-	metapage = (Page) palloc(BLCKSZ);
 	_bt_initmetapage(metapage, rootblkno, rootlevel,
 					 wstate->inskey->allequalimage);
-	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
+	wstate->_bt_blwritepage(wstate, metapage, BTREE_METAPAGE, metabuf);
+	metabuf = InvalidBuffer;
+	metapage = NULL;
 }
 
 /*
@@ -1169,6 +1303,10 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 static void
 _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 {
+	Page metapage;
+	BlockNumber metablkno;
+	Buffer metabuf;
+
 	BTPageState *state = NULL;
 	bool		merge = (btspool2 != NULL);
 	IndexTuple	itup,
@@ -1181,11 +1319,30 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	int64		tuples_done = 0;
 	bool		deduplicate;
 
-
-	unbuffered_prep(&wstate->ub_wstate);
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
 
+	/*
+	 * Reserve block 0 for the metapage up front.
+	 *
+	 * When using the shared buffers API it is easier to allocate the buffer
+	 * for block 0 first instead of trying skip block 0 and allocate it at the
+	 * end of index build.
+	 *
+	 * When not using the shared buffers API, there is no harm in allocating
+	 * the metapage first. When block 1 is written, the direct writepage
+	 * function will zero-fill block 0. When writing out the metapage at the
+	 * end of index build, it will overwrite that block 0.
+	 *
+	 * The metapage will be initialized and written out at the end of the index
+	 * build when all of the information needed to do so is available.
+	 *
+	 * The block number will always be BTREE_METAPAGE, so the metablkno
+	 * variable is unused and only created to avoid a special case in the
+	 * direct alloc function.
+	 */
+	metapage = wstate->_bt_bl_alloc_page(wstate, &metablkno, &metabuf);
+
 	if (merge)
 	{
 		/*
@@ -1407,10 +1564,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	}
 
 	/* Close down final pages and write the metapage */
-	_bt_uppershutdown(wstate, state);
-
-	if (wstate->btws_use_wal)
-		unbuffered_finish(&wstate->ub_wstate, MAIN_FORKNUM);
+	_bt_uppershutdown(wstate, state, metabuf, metapage);
 }
 
 /*
-- 
2.30.2

v4-0001-Add-unbuffered-IO-and-avoid-immed-fsync.patchtext/x-patch; charset=US-ASCII; name=v4-0001-Add-unbuffered-IO-and-avoid-immed-fsync.patchDownload

From b016c9ee2e47efe434bf805fb0b310e9a302b0f9 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 28 Sep 2021 14:51:11 -0400
Subject: [PATCH v4 1/2] Add unbuffered IO and avoid immed fsync

Replace unbuffered extends and writes

When writing data outside of shared buffers, the backend must do a
series of steps to ensure the data is both durable and recoverable. When
writing or extending a page of data, the backend must log, checksum, and
write out the page before freeing it.

Additionally, though the data is durable once the WAL entries are on
permanent storage, the XLOG Redo pointer cannot be moved past this WAL
until the page data is safely on permanent storage. If a crash were to
occur before the data is fsync'd, the WAL wouldn't be replayed during
recovery, and the data would be lost. Thus, the backend must ensure that
either the Redo pointer has not moved or that the data is fsync'd before
freeing the page.

This is not a problem with pages written in shared buffers because the
checkpointer will block until all buffers that were dirtied before it
began finish before it moves the Redo pointer past their associated WAL
entries.

This commit makes two main changes:

1) It wraps smgrextend() and smgrwrite() in functions from a new API
   for writing data outside of shared buffers, directmgr.

2) It saves the XLOG Redo pointer location before doing the write or
   extend. It also adds an fsync request for the page to the
   checkpointer's pending-ops table. Then, after doing the write or
   extend, if the Redo pointer has moved (meaning a checkpoint has
   started since it saved it last), then the backend fsync's the page
   itself. Otherwise, it lets the checkpointer take care of fsync'ing
   the page the next time it processes the pending-ops table.
---
 src/backend/access/gist/gistbuild.c       | 17 +++----
 src/backend/access/hash/hashpage.c        |  9 ++--
 src/backend/access/heap/heapam_handler.c  | 16 ++++---
 src/backend/access/heap/rewriteheap.c     | 26 ++++-------
 src/backend/access/heap/visibilitymap.c   |  8 ++--
 src/backend/access/nbtree/nbtree.c        | 17 ++++---
 src/backend/access/nbtree/nbtsort.c       | 39 +++++-----------
 src/backend/access/spgist/spginsert.c     | 25 +++++-----
 src/backend/access/transam/xlog.c         | 13 ++++++
 src/backend/catalog/storage.c             | 25 +++-------
 src/backend/storage/Makefile              |  2 +-
 src/backend/storage/direct/Makefile       | 17 +++++++
 src/backend/storage/direct/directmgr.c    | 57 +++++++++++++++++++++++
 src/backend/storage/freespace/freespace.c | 10 ++--
 src/include/access/xlog.h                 |  1 +
 src/include/storage/directmgr.h           | 44 +++++++++++++++++
 16 files changed, 219 insertions(+), 107 deletions(-)
 create mode 100644 src/backend/storage/direct/Makefile
 create mode 100644 src/backend/storage/direct/directmgr.c
 create mode 100644 src/include/storage/directmgr.h

diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 9854116fca..748ca65492 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -43,6 +43,7 @@
 #include "miscadmin.h"
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
+#include "storage/directmgr.h"
 #include "storage/smgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -91,6 +92,7 @@ typedef struct
 
 	int64		indtuples;		/* number of tuples indexed */
 
+	UnBufferedWriteState ub_wstate;
 	/*
 	 * Extra data structures used during a buffering build. 'gfbb' contains
 	 * information related to managing the build buffers. 'parentMap' is a
@@ -194,6 +196,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.heaprel = heap;
 	buildstate.sortstate = NULL;
 	buildstate.giststate = initGISTstate(index);
+	buildstate.ub_wstate.smgr_rel = RelationGetSmgr(index);
 
 	/*
 	 * Create a temporary memory context that is reset once for each tuple
@@ -403,14 +406,15 @@ gist_indexsortbuild(GISTBuildState *state)
 	state->pages_allocated = 0;
 	state->pages_written = 0;
 	state->ready_num_pages = 0;
+	unbuffered_prep(&state->ub_wstate);
 
 	/*
 	 * Write an empty page as a placeholder for the root page. It will be
 	 * replaced with the real root page at the end.
 	 */
 	page = palloc0(BLCKSZ);
-	smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
-			   page, true);
+	unbuffered_extend(&state->ub_wstate, MAIN_FORKNUM, GIST_ROOT_BLKNO, page,
+			true);
 	state->pages_allocated++;
 	state->pages_written++;
 
@@ -450,13 +454,12 @@ gist_indexsortbuild(GISTBuildState *state)
 
 	/* Write out the root */
 	PageSetLSN(pagestate->page, GistBuildLSN);
-	PageSetChecksumInplace(pagestate->page, GIST_ROOT_BLKNO);
-	smgrwrite(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
-			  pagestate->page, true);
+	unbuffered_write(&state->ub_wstate, MAIN_FORKNUM, GIST_ROOT_BLKNO, pagestate->page);
 	if (RelationNeedsWAL(state->indexrel))
 		log_newpage(&state->indexrel->rd_node, MAIN_FORKNUM, GIST_ROOT_BLKNO,
 					pagestate->page, true);
 
+	unbuffered_finish(&state->ub_wstate, MAIN_FORKNUM);
 	pfree(pagestate->page);
 	pfree(pagestate);
 }
@@ -570,9 +573,7 @@ gist_indexsortbuild_flush_ready_pages(GISTBuildState *state)
 			elog(ERROR, "unexpected block number to flush GiST sorting build");
 
 		PageSetLSN(page, GistBuildLSN);
-		PageSetChecksumInplace(page, blkno);
-		smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, blkno, page,
-				   true);
+		unbuffered_extend(&state->ub_wstate, MAIN_FORKNUM, blkno, page, false);
 
 		state->pages_written++;
 	}
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index ee351aea09..722420adf5 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -32,6 +32,7 @@
 #include "access/hash_xlog.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
+#include "storage/directmgr.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
 #include "storage/smgr.h"
@@ -990,8 +991,10 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 	PGAlignedBlock zerobuf;
 	Page		page;
 	HashPageOpaque ovflopaque;
+	UnBufferedWriteState ub_wstate;
 
 	lastblock = firstblock + nblocks - 1;
+	ub_wstate.smgr_rel = RelationGetSmgr(rel);
 
 	/*
 	 * Check for overflow in block number calculation; if so, we cannot extend
@@ -1000,6 +1003,8 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 	if (lastblock < firstblock || lastblock == InvalidBlockNumber)
 		return false;
 
+	unbuffered_prep(&ub_wstate);
+
 	page = (Page) zerobuf.data;
 
 	/*
@@ -1024,9 +1029,7 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 					zerobuf.data,
 					true);
 
-	PageSetChecksumInplace(page, lastblock);
-	smgrextend(RelationGetSmgr(rel), MAIN_FORKNUM, lastblock, zerobuf.data,
-			   false);
+	unbuffered_extend(&ub_wstate, MAIN_FORKNUM, lastblock, zerobuf.data, false);
 
 	return true;
 }
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 39ef8a0b77..8824e39a91 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -38,6 +38,7 @@
 #include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/bufpage.h"
+#include "storage/directmgr.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
 #include "storage/procarray.h"
@@ -575,6 +576,7 @@ heapam_relation_set_new_filenode(Relation rel,
 								 MultiXactId *minmulti)
 {
 	SMgrRelation srel;
+	UnBufferedWriteState ub_wstate;
 
 	/*
 	 * Initialize to the minimum XID that could put tuples in the table. We
@@ -594,15 +596,15 @@ heapam_relation_set_new_filenode(Relation rel,
 	*minmulti = GetOldestMultiXactId();
 
 	srel = RelationCreateStorage(*newrnode, persistence);
+	ub_wstate.smgr_rel = srel;
+	unbuffered_prep(&ub_wstate);
 
 	/*
 	 * If required, set up an init fork for an unlogged table so that it can
-	 * be correctly reinitialized on restart.  An immediate sync is required
-	 * even if the page has been logged, because the write did not go through
-	 * shared_buffers and therefore a concurrent checkpoint may have moved the
-	 * redo pointer past our xlog record.  Recovery may as well remove it
-	 * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
-	 * record. Therefore, logging is necessary even if wal_level=minimal.
+	 * be correctly reinitialized on restart.
+	 * Recovery may as well remove our xlog record while replaying, for
+	 * example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record. Therefore,
+	 * logging is necessary even if wal_level=minimal.
 	 */
 	if (persistence == RELPERSISTENCE_UNLOGGED)
 	{
@@ -611,7 +613,7 @@ heapam_relation_set_new_filenode(Relation rel,
 			   rel->rd_rel->relkind == RELKIND_TOASTVALUE);
 		smgrcreate(srel, INIT_FORKNUM, false);
 		log_smgrcreate(newrnode, INIT_FORKNUM);
-		smgrimmedsync(srel, INIT_FORKNUM);
+		unbuffered_finish(&ub_wstate, INIT_FORKNUM);
 	}
 
 	smgrclose(srel);
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index aa265edf60..0bb6eec9f9 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -119,6 +119,7 @@
 #include "replication/logical.h"
 #include "replication/slot.h"
 #include "storage/bufmgr.h"
+#include "storage/directmgr.h"
 #include "storage/fd.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
@@ -152,6 +153,7 @@ typedef struct RewriteStateData
 	HTAB	   *rs_old_new_tid_map; /* unmatched B tuples */
 	HTAB	   *rs_logical_mappings;	/* logical remapping files */
 	uint32		rs_num_rewrite_mappings;	/* # in memory mappings */
+	UnBufferedWriteState rs_unbuffered_wstate;
 }			RewriteStateData;
 
 /*
@@ -264,6 +266,9 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
 	state->rs_freeze_xid = freeze_xid;
 	state->rs_cutoff_multi = cutoff_multi;
 	state->rs_cxt = rw_cxt;
+	state->rs_unbuffered_wstate.smgr_rel = RelationGetSmgr(state->rs_new_rel);
+
+	unbuffered_prep(&state->rs_unbuffered_wstate);
 
 	/* Initialize hash tables used to track update chains */
 	hash_ctl.keysize = sizeof(TidHashKey);
@@ -324,21 +329,12 @@ end_heap_rewrite(RewriteState state)
 						state->rs_buffer,
 						true);
 
-		PageSetChecksumInplace(state->rs_buffer, state->rs_blockno);
-
-		smgrextend(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,
-				   state->rs_blockno, (char *) state->rs_buffer, true);
+		unbuffered_extend(&state->rs_unbuffered_wstate, MAIN_FORKNUM,
+				state->rs_blockno, state->rs_buffer, false);
 	}
 
-	/*
-	 * When we WAL-logged rel pages, we must nonetheless fsync them.  The
-	 * reason is the same as in storage.c's RelationCopyStorage(): we're
-	 * writing data that's not in shared buffers, and so a CHECKPOINT
-	 * occurring during the rewriteheap operation won't have fsync'd data we
-	 * wrote before the checkpoint.
-	 */
 	if (RelationNeedsWAL(state->rs_new_rel))
-		smgrimmedsync(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM);
+		unbuffered_finish(&state->rs_unbuffered_wstate, MAIN_FORKNUM);
 
 	logical_end_heap_rewrite(state);
 
@@ -690,10 +686,8 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
 			 * need for smgr to schedule an fsync for this write; we'll do it
 			 * ourselves in end_heap_rewrite.
 			 */
-			PageSetChecksumInplace(page, state->rs_blockno);
-
-			smgrextend(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,
-					   state->rs_blockno, (char *) page, true);
+			unbuffered_extend(&state->rs_unbuffered_wstate, MAIN_FORKNUM,
+					state->rs_blockno, page, false);
 
 			state->rs_blockno++;
 			state->rs_buffer_valid = false;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 9032d4758f..1a7482e267 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -92,6 +92,7 @@
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/bufmgr.h"
+#include "storage/directmgr.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
 #include "utils/inval.h"
@@ -616,7 +617,9 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
 	BlockNumber vm_nblocks_now;
 	PGAlignedBlock pg;
 	SMgrRelation reln;
+	UnBufferedWriteState ub_wstate;
 
+	ub_wstate.smgr_rel = RelationGetSmgr(rel);
 	PageInit((Page) pg.data, BLCKSZ, 0);
 
 	/*
@@ -654,9 +657,8 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
 	/* Now extend the file */
 	while (vm_nblocks_now < vm_nblocks)
 	{
-		PageSetChecksumInplace((Page) pg.data, vm_nblocks_now);
-
-		smgrextend(reln, VISIBILITYMAP_FORKNUM, vm_nblocks_now, pg.data, false);
+		// TODO: aren't these pages empty? why checksum them
+		unbuffered_extend(&ub_wstate, VISIBILITYMAP_FORKNUM, vm_nblocks_now, (Page) pg.data, false);
 		vm_nblocks_now++;
 	}
 
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 13024af2fa..6ab6651420 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -29,6 +29,7 @@
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "storage/condition_variable.h"
+#include "storage/directmgr.h"
 #include "storage/indexfsm.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
@@ -151,6 +152,11 @@ void
 btbuildempty(Relation index)
 {
 	Page		metapage;
+	UnBufferedWriteState wstate;
+
+	wstate.smgr_rel = RelationGetSmgr(index);
+
+	unbuffered_prep(&wstate);
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
@@ -163,18 +169,15 @@ btbuildempty(Relation index)
 	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
-	PageSetChecksumInplace(metapage, BTREE_METAPAGE);
-	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BTREE_METAPAGE,
-			  (char *) metapage, true);
+	unbuffered_write(&wstate, INIT_FORKNUM, BTREE_METAPAGE, metapage);
 	log_newpage(&RelationGetSmgr(index)->smgr_rnode.node, INIT_FORKNUM,
 				BTREE_METAPAGE, metapage, true);
 
 	/*
-	 * An immediate sync is required even if we xlog'd the page, because the
-	 * write did not go through shared_buffers and therefore a concurrent
-	 * checkpoint may have moved the redo pointer past our xlog record.
+	 * Even though we xlog'd the page, a concurrent checkpoint may have moved
+	 * the redo pointer past our xlog record, so we may still need to fsync.
 	 */
-	smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
+	unbuffered_finish(&wstate, INIT_FORKNUM);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index dc220146fd..5687acd99d 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -57,6 +57,7 @@
 #include "executor/instrument.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "storage/directmgr.h"
 #include "storage/smgr.h"
 #include "tcop/tcopprot.h"		/* pgrminclude ignore */
 #include "utils/rel.h"
@@ -253,6 +254,7 @@ typedef struct BTWriteState
 	BlockNumber btws_pages_alloced; /* # pages allocated */
 	BlockNumber btws_pages_written; /* # pages written out */
 	Page		btws_zeropage;	/* workspace for filling zeroes */
+	UnBufferedWriteState ub_wstate;
 } BTWriteState;
 
 
@@ -560,6 +562,8 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
+	wstate.ub_wstate.smgr_rel = RelationGetSmgr(btspool->index);
+	wstate.ub_wstate.redo = InvalidXLogRecPtr;
 	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 	/* _bt_mkscankey() won't set allequalimage without metapage */
 	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
@@ -656,31 +660,19 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 		if (!wstate->btws_zeropage)
 			wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
 		/* don't set checksum for all-zero page */
-		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
-				   wstate->btws_pages_written++,
-				   (char *) wstate->btws_zeropage,
-				   true);
+		unbuffered_extend(&wstate->ub_wstate, MAIN_FORKNUM, wstate->btws_pages_written++, wstate->btws_zeropage, true);
 	}
 
-	PageSetChecksumInplace(page, blkno);
 
-	/*
-	 * Now write the page.  There's no need for smgr to schedule an fsync for
-	 * this write; we'll do it ourselves before ending the build.
-	 */
+	/* Now write the page. Either we are extending the file... */
 	if (blkno == wstate->btws_pages_written)
 	{
-		/* extending the file... */
-		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM, blkno,
-				   (char *) page, true);
+		unbuffered_extend(&wstate->ub_wstate, MAIN_FORKNUM, blkno, page, false);
 		wstate->btws_pages_written++;
 	}
+	/* or we are overwriting a block we zero-filled before. */
 	else
-	{
-		/* overwriting a block we zero-filled before */
-		smgrwrite(RelationGetSmgr(wstate->index), MAIN_FORKNUM, blkno,
-				  (char *) page, true);
-	}
+		unbuffered_write(&wstate->ub_wstate, MAIN_FORKNUM, blkno, page);
 
 	pfree(page);
 }
@@ -1189,6 +1181,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	int64		tuples_done = 0;
 	bool		deduplicate;
 
+
+	unbuffered_prep(&wstate->ub_wstate);
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
 
@@ -1415,17 +1409,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	/* Close down final pages and write the metapage */
 	_bt_uppershutdown(wstate, state);
 
-	/*
-	 * When we WAL-logged index pages, we must nonetheless fsync index files.
-	 * Since we're building outside shared buffers, a CHECKPOINT occurring
-	 * during the build has no way to flush the previously written data to
-	 * disk (indeed it won't know the index even exists).  A crash later on
-	 * would replay WAL from the checkpoint, therefore it wouldn't replay our
-	 * earlier WAL entries. If we do not fsync those pages here, they might
-	 * still not be on disk when the crash occurs.
-	 */
 	if (wstate->btws_use_wal)
-		smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
+		unbuffered_finish(&wstate->ub_wstate, MAIN_FORKNUM);
 }
 
 /*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index bfb74049d0..9fd5686b8d 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -25,6 +25,7 @@
 #include "catalog/index.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/directmgr.h"
 #include "storage/smgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -156,6 +157,10 @@ void
 spgbuildempty(Relation index)
 {
 	Page		page;
+	UnBufferedWriteState wstate;
+
+	wstate.smgr_rel = RelationGetSmgr(index);
+	unbuffered_prep(&wstate);
 
 	/* Construct metapage. */
 	page = (Page) palloc(BLCKSZ);
@@ -168,36 +173,30 @@ spgbuildempty(Relation index)
 	 * of their existing content when the corresponding create records are
 	 * replayed.
 	 */
-	PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
-	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
-			  (char *) page, true);
+	unbuffered_write(&wstate, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO, page);
 	log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
 				SPGIST_METAPAGE_BLKNO, page, true);
 
 	/* Likewise for the root page. */
 	SpGistInitPage(page, SPGIST_LEAF);
 
-	PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO);
-	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, SPGIST_ROOT_BLKNO,
-			  (char *) page, true);
+	unbuffered_write(&wstate, INIT_FORKNUM, SPGIST_ROOT_BLKNO, page);
 	log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
 				SPGIST_ROOT_BLKNO, page, true);
 
 	/* Likewise for the null-tuples root page. */
 	SpGistInitPage(page, SPGIST_LEAF | SPGIST_NULLS);
 
-	PageSetChecksumInplace(page, SPGIST_NULL_BLKNO);
-	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, SPGIST_NULL_BLKNO,
-			  (char *) page, true);
+	unbuffered_write(&wstate, INIT_FORKNUM, SPGIST_NULL_BLKNO, page);
 	log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
 				SPGIST_NULL_BLKNO, page, true);
 
 	/*
-	 * An immediate sync is required even if we xlog'd the pages, because the
-	 * writes did not go through shared buffers and therefore a concurrent
-	 * checkpoint may have moved the redo pointer past our xlog record.
+	 * Because the writes did not go through shared buffers, if a concurrent
+	 * checkpoint moved the redo pointer past our xlog record, an immediate
+	 * sync is required even if we xlog'd the pages.
 	 */
-	smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
+	unbuffered_finish(&wstate, INIT_FORKNUM);
 }
 
 /*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c9d4cbf3ff..a06e8d005b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -13191,6 +13191,19 @@ CheckForStandbyTrigger(void)
 	return false;
 }
 
+bool RedoRecPtrChanged(XLogRecPtr comparator_ptr)
+{
+	XLogRecPtr ptr;
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ptr = XLogCtl->RedoRecPtr;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	if (RedoRecPtr < ptr)
+		RedoRecPtr = ptr;
+
+	return RedoRecPtr != comparator_ptr;
+}
+
 /*
  * Remove the files signaling a standby promotion request.
  */
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 9b8075536a..05e66d0434 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -28,6 +28,7 @@
 #include "catalog/storage.h"
 #include "catalog/storage_xlog.h"
 #include "miscadmin.h"
+#include "storage/directmgr.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
 #include "utils/hsearch.h"
@@ -420,6 +421,10 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 	bool		copying_initfork;
 	BlockNumber nblocks;
 	BlockNumber blkno;
+	UnBufferedWriteState wstate;
+
+	wstate.smgr_rel = dst;
+	unbuffered_prep(&wstate);
 
 	page = (Page) buf.data;
 
@@ -477,27 +482,11 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 		if (use_wal)
 			log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page, false);
 
-		PageSetChecksumInplace(page, blkno);
-
-		/*
-		 * Now write the page.  We say skipFsync = true because there's no
-		 * need for smgr to schedule an fsync for this write; we'll do it
-		 * ourselves below.
-		 */
-		smgrextend(dst, forkNum, blkno, buf.data, true);
+		unbuffered_extend(&wstate, forkNum, blkno, buf.data, false);
 	}
 
-	/*
-	 * When we WAL-logged rel pages, we must nonetheless fsync them.  The
-	 * reason is that since we're copying outside shared buffers, a CHECKPOINT
-	 * occurring during the copy has no way to flush the previously written
-	 * data to disk (indeed it won't know the new rel even exists).  A crash
-	 * later on would replay WAL from the checkpoint, therefore it wouldn't
-	 * replay our earlier WAL entries. If we do not fsync those pages here,
-	 * they might still not be on disk when the crash occurs.
-	 */
 	if (use_wal || copying_initfork)
-		smgrimmedsync(dst, forkNum);
+		unbuffered_finish(&wstate, forkNum);
 }
 
 /*
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca2..501fae5f9d 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = buffer direct file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/direct/Makefile b/src/backend/storage/direct/Makefile
new file mode 100644
index 0000000000..d82bbed48c
--- /dev/null
+++ b/src/backend/storage/direct/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for storage/direct
+#
+# IDENTIFICATION
+#    src/backend/storage/direct/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/storage/direct
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = directmgr.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/direct/directmgr.c b/src/backend/storage/direct/directmgr.c
new file mode 100644
index 0000000000..d33c2a7a5f
--- /dev/null
+++ b/src/backend/storage/direct/directmgr.c
@@ -0,0 +1,57 @@
+/*-------------------------------------------------------------------------
+ *
+ * directmgr.c
+ *	  routines for managing unbuffered IO
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/direct/directmgr.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+
+#include "access/xlog.h"
+#include "storage/directmgr.h"
+#include "utils/rel.h"
+
+void unbuffered_prep(UnBufferedWriteState *wstate)
+{
+	wstate->redo = GetRedoRecPtr();
+}
+
+/*
+ * When writing data outside shared buffers, a concurrent CHECKPOINT can move
+ * the redo pointer past our WAL entries and won't flush our data to disk. If
+ * the database crashes before the data makes it to disk, our WAL won't be
+ * replayed and the data will be lost.
+ * Thus, if a CHECKPOINT begins between unbuffered_prep() and
+ * unbuffered_finish(), the backend must fsync the data itself.
+ */
+void unbuffered_finish(UnBufferedWriteState *wstate, ForkNumber forknum)
+{
+	if (RedoRecPtrChanged(wstate->redo))
+		smgrimmedsync(wstate->smgr_rel, forknum);
+}
+
+void
+unbuffered_write(UnBufferedWriteState *wstate, ForkNumber forknum, BlockNumber
+		blocknum, Page page)
+{
+	PageSetChecksumInplace(page, blocknum);
+	smgrwrite(wstate->smgr_rel, forknum, blocknum, (char *) page, false);
+}
+
+void
+unbuffered_extend(UnBufferedWriteState *wstate, ForkNumber forknum, BlockNumber
+		blocknum, Page page, bool empty)
+{
+	if (!empty)
+		PageSetChecksumInplace(page, blocknum);
+	smgrextend(wstate->smgr_rel, forknum, blocknum, (char *) page, false);
+}
+
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index c88cb91f06..d31f75bd60 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -26,6 +26,7 @@
 #include "access/htup_details.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
+#include "storage/directmgr.h"
 #include "storage/freespace.h"
 #include "storage/fsm_internals.h"
 #include "storage/lmgr.h"
@@ -608,6 +609,9 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
 	BlockNumber fsm_nblocks_now;
 	PGAlignedBlock pg;
 	SMgrRelation reln;
+	UnBufferedWriteState ub_wstate;
+
+	ub_wstate.smgr_rel = RelationGetSmgr(rel);
 
 	PageInit((Page) pg.data, BLCKSZ, 0);
 
@@ -647,10 +651,8 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
 	/* Extend as needed. */
 	while (fsm_nblocks_now < fsm_nblocks)
 	{
-		PageSetChecksumInplace((Page) pg.data, fsm_nblocks_now);
-
-		smgrextend(reln, FSM_FORKNUM, fsm_nblocks_now,
-				   pg.data, false);
+		// TODO: why was it checksumming all zero pages?
+		unbuffered_extend(&ub_wstate, FSM_FORKNUM, fsm_nblocks_now, (Page) pg.data, false);
 		fsm_nblocks_now++;
 	}
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index bb0c52686a..4961067f81 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -312,6 +312,7 @@ extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(TimeLineID *insertTLI);
 extern TimeLineID GetWALInsertionTimeLine(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern bool RedoRecPtrChanged(XLogRecPtr comparator_ptr);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool PromoteIsTriggered(void);
diff --git a/src/include/storage/directmgr.h b/src/include/storage/directmgr.h
new file mode 100644
index 0000000000..5435ab1d08
--- /dev/null
+++ b/src/include/storage/directmgr.h
@@ -0,0 +1,44 @@
+/*-------------------------------------------------------------------------
+ *
+ * directmgr.h
+ *	  POSTGRES unbuffered IO manager definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/directmgr.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "access/xlogdefs.h"
+#include "common/relpath.h"
+#include "storage/bufpage.h"
+#include "storage/smgr.h"
+#include "utils/relcache.h"
+
+#ifndef DIRECTMGR_H
+#define DIRECTMGR_H
+
+/*
+ * After committing the pg_buffer_stats patch, this will contain a pointer to a
+ * PgBufferAccess struct to count the writes and extends done in this way.
+ */ 
+typedef struct UnBufferedWriteState
+{
+	SMgrRelation smgr_rel;
+	XLogRecPtr redo;
+} UnBufferedWriteState;
+/*
+ * prototypes for functions in directmgr.c
+ */
+extern void unbuffered_prep(UnBufferedWriteState *wstate);
+extern void unbuffered_finish(UnBufferedWriteState *wstate, ForkNumber forknum);
+extern void
+unbuffered_extend(UnBufferedWriteState *wstate, ForkNumber forknum, BlockNumber
+		blocknum, Page page, bool empty);
+extern void
+unbuffered_write(UnBufferedWriteState *wstate, ForkNumber forknum, BlockNumber
+		blocknum, Page page);
+
+#endif							/* DIRECTMGR_H */
-- 
2.30.2

#11

https://cirrus-ci.com/task/5612464120266752
https://api.cirrus-ci.com/v1/artifact/task/5612464120266752/regress_diffs/src/test/regress/regression.diffs

pryzby@telsasoft.com

almost 4 years ago

In reply to: Melanie Plageman (#10)

1 attachment(s)

Re: Avoiding smgrimmedsync() during nbtree index builds

On Tue, Jan 11, 2022 at 12:10:54PM -0500, Melanie Plageman wrote:

On Mon, Jan 10, 2022 at 5:50 PM Melanie Plageman <melanieplageman@gmail.com> wrote:

I have attached a v3 which includes two commits -- one of which
implements the directmgr API and uses it and the other which adds
functionality to use either directmgr or bufmgr API during index build.

Also registering for march commitfest.

Forgot directmgr.h. Attached v4

Thanks - I had reconstructed it first ;)

I think the ifndef should be outside the includes:

+++ b/src/include/storage/directmgr.h

+#include "access/xlogdefs.h"

+#ifndef DIRECTMGR_H
+#define DIRECTMGR_H

This is failing on windows CI when I use initdb --data-checksums, as attached.

+++ c:/cirrus/src/test/regress/results/bitmapops.out	2022-01-13 00:47:46.704621200 +0000
..
+ERROR:  could not read block 0 in file "base/16384/30310": read only 0 of 8192 bytes

--
Justin

Attachments:

0003-cirrus-run-initdb-with-data-checksums-for-windows.txttext/x-diff; charset=us-asciiDownload

From e1a88c25725bc5b34ca9deb1a0b785048c0c5c28 Mon Sep 17 00:00:00 2001
From: Justin Pryzby <pryzbyj@telsasoft.com>
Date: Wed, 12 Jan 2022 22:31:09 -0600
Subject: [PATCH 3/3] cirrus: run initdb with --data-checksums for windows

---
 .cirrus.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.cirrus.yml b/.cirrus.yml
index 19b3737fa11..efedd1f0873 100644
--- a/.cirrus.yml
+++ b/.cirrus.yml
@@ -390,7 +390,7 @@ task:
     - perl src/tools/msvc/vcregress.pl check parallel
   startcreate_script:
     # paths to binaries need backslashes
-    - tmp_install\bin\pg_ctl.exe initdb -D tmp_check/db -l tmp_check/initdb.log
+    - tmp_install\bin\pg_ctl.exe initdb -D tmp_check/db -l tmp_check/initdb.log --options=--data-checksums
     - echo include '%TEMP_CONFIG%' >> tmp_check/db/postgresql.conf
     - tmp_install\bin\pg_ctl.exe start -D tmp_check/db -l tmp_check/postmaster.log
   test_pl_script:
-- 
2.17.1

#12

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Melanie Plageman (#5)

Re: Avoiding smgrimmedsync() during nbtree index builds

On Wed, Sep 29, 2021 at 2:36 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

unbuffered_write() and unbuffered_extend() might be able to be used even
if unbuffered_prep() and unbuffered_finish() are not used -- for example
hash indexes do something I don't entirely understand in which they call
smgrextend() directly when allocating buckets but then initialize the
new bucket pages using the bufmgr machinery.

My first thought was that someone might do this to make sure that we
don't run out of disk space after initializing some but not all of the
buckets. Someone might have some reason for wanting to avoid that
corner case. However, in _hash_init() that explanation doesn't make
any sense, because an abort would destroy the entire relation. And in
_hash_alloc_buckets() the variable "zerobuf" points to a buffer that
is not, in fact, all zeroes. So my guess is this is just old, crufty
code - either whatever reasons somebody had for doing it that way are
no longer valid, or there wasn't any good reason even at the time.

--
Robert Haas
EDB: http://www.enterprisedb.com

#13

pryzby@telsasoft.com

almost 4 years ago

In reply to: Justin Pryzby (#11)

1 attachment(s)

Re: Avoiding smgrimmedsync() during nbtree index builds

On Thu, Jan 13, 2022 at 09:52:55AM -0600, Justin Pryzby wrote:

This is failing on windows CI when I use initdb --data-checksums, as attached.

https://cirrus-ci.com/task/5612464120266752
https://api.cirrus-ci.com/v1/artifact/task/5612464120266752/regress_diffs/src/test/regress/regression.diffs
+++ c:/cirrus/src/test/regress/results/bitmapops.out	2022-01-13 00:47:46.704621200 +0000
..
+ERROR:  could not read block 0 in file "base/16384/30310": read only 0 of 8192 bytes

The failure isn't consistent, so I double checked my report. I have some more
details:

The problem occurs maybe only ~25% of the time.

The issue is in the 0001 patch.

data-checksums isn't necessary to hit the issue.

errlocation says: LOCATION: mdread, md.c:686 (the only place the error
exists)

With Andres' windows crash patch, I obtained a backtrace - attached.
https://cirrus-ci.com/task/5978171861368832
https://api.cirrus-ci.com/v1/artifact/task/5978171861368832/crashlog/crashlog-postgres.exe_0fa8_2022-01-16_02-54-35-291.txt

Maybe its a race condition or synchronization problem that nowhere else tends
to hit.

Separate from this issue, I wonder if it'd be useful to write a DEBUG log
showing when btree uses shared_buffers vs fsync. And a regression test which
first SETs client_min_messages=debug to capture the debug log to demonstrate
when/that new code path is being hit. I'm not sure if that would be good to
merge, but it may be useful for now.

--
Justin

#14

pryzby@telsasoft.com

almost 4 years ago

In reply to: Justin Pryzby (#13)

Re: Avoiding smgrimmedsync() during nbtree index builds

On Sun, Jan 16, 2022 at 02:25:59PM -0600, Justin Pryzby wrote:

On Thu, Jan 13, 2022 at 09:52:55AM -0600, Justin Pryzby wrote:
This is failing on windows CI when I use initdb --data-checksums, as attached.

https://cirrus-ci.com/task/5612464120266752
https://api.cirrus-ci.com/v1/artifact/task/5612464120266752/regress_diffs/src/test/regress/regression.diffs
+++ c:/cirrus/src/test/regress/results/bitmapops.out	2022-01-13 00:47:46.704621200 +0000
..
+ERROR:  could not read block 0 in file "base/16384/30310": read only 0 of 8192 bytes
The failure isn't consistent, so I double checked my report. I have some more
details:

The problem occurs maybe only ~25% of the time.

The issue is in the 0001 patch.

data-checksums isn't necessary to hit the issue.

errlocation says: LOCATION: mdread, md.c:686 (the only place the error
exists)

With Andres' windows crash patch, I obtained a backtrace - attached.
https://cirrus-ci.com/task/5978171861368832
https://api.cirrus-ci.com/v1/artifact/task/5978171861368832/crashlog/crashlog-postgres.exe_0fa8_2022-01-16_02-54-35-291.txt

Maybe its a race condition or synchronization problem that nowhere else tends
to hit.

I meant to say that I had not seen this issue anywhere but windows.

But now, by chance, I still had the 0001 patch in my tree, and hit the same
issue on linux:

https://cirrus-ci.com/task/4550618281934848
+++ /tmp/cirrus-ci-build/src/bin/pg_upgrade/tmp_check/regress/results/tuplesort.out	2022-01-17 16:06:35.759108172 +0000
 EXPLAIN (COSTS OFF)
 SELECT id, noabort_increasing, noabort_decreasing FROM abbrev_abort_uuids ORDER BY noabort_increasing LIMIT 5;
+ERROR:  could not read block 0 in file "base/16387/t3_36794": read only 0 of 8192 bytes

#15

andres@anarazel.de

almost 4 years ago

In reply to: Melanie Plageman (#10)

Re: Avoiding smgrimmedsync() during nbtree index builds

Hi,

On 2022-01-11 12:10:54 -0500, Melanie Plageman wrote:

On Mon, Jan 10, 2022 at 5:50 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I have attached a v3 which includes two commits -- one of which
implements the directmgr API and uses it and the other which adds
functionality to use either directmgr or bufmgr API during index build.

Also registering for march commitfest.

Forgot directmgr.h. Attached v4

Are you looking at the failures Justin pointed out? Something isn't quite
right yet. See /messages/by-id/20220116202559.GW14051@telsasoft.com and
the subsequent mail about it also triggering on once on linux.

Thus, the backend must ensure that
either the Redo pointer has not moved or that the data is fsync'd before
freeing the page.

"freeing"?

This is not a problem with pages written in shared buffers because the
checkpointer will block until all buffers that were dirtied before it
began finish before it moves the Redo pointer past their associated WAL
entries.

This commit makes two main changes:

1) It wraps smgrextend() and smgrwrite() in functions from a new API
for writing data outside of shared buffers, directmgr.

2) It saves the XLOG Redo pointer location before doing the write or
extend. It also adds an fsync request for the page to the
checkpointer's pending-ops table. Then, after doing the write or
extend, if the Redo pointer has moved (meaning a checkpoint has
started since it saved it last), then the backend fsync's the page
itself. Otherwise, it lets the checkpointer take care of fsync'ing
the page the next time it processes the pending-ops table.

Why combine those two into one commit?

@@ -654,9 +657,8 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
/* Now extend the file */
while (vm_nblocks_now < vm_nblocks)
{
-		PageSetChecksumInplace((Page) pg.data, vm_nblocks_now);
-
-		smgrextend(reln, VISIBILITYMAP_FORKNUM, vm_nblocks_now, pg.data, false);
+		// TODO: aren't these pages empty? why checksum them
+		unbuffered_extend(&ub_wstate, VISIBILITYMAP_FORKNUM, vm_nblocks_now, (Page) pg.data, false);

Yea, it's a bit odd. PageSetChecksumInplace() will just return immediately:

/* If we don't need a checksum, just return */
if (PageIsNew(page) || !DataChecksumsEnabled())
return;

OTOH, it seems easier to have it there than to later forget it, when
e.g. adding some actual initial content to the pages during the smgrextend().

@@ -560,6 +562,8 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)

wstate.heap = btspool->heap;
wstate.index = btspool->index;
+	wstate.ub_wstate.smgr_rel = RelationGetSmgr(btspool->index);
+	wstate.ub_wstate.redo = InvalidXLogRecPtr;
wstate.inskey = _bt_mkscankey(wstate.index, NULL);
/* _bt_mkscankey() won't set allequalimage without metapage */
wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
@@ -656,31 +660,19 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
if (!wstate->btws_zeropage)
wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
/* don't set checksum for all-zero page */
-		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
-				   wstate->btws_pages_written++,
-				   (char *) wstate->btws_zeropage,
-				   true);
+		unbuffered_extend(&wstate->ub_wstate, MAIN_FORKNUM, wstate->btws_pages_written++, wstate->btws_zeropage, true);
}

There's a bunch of long lines in here...

-	/*
-	 * When we WAL-logged index pages, we must nonetheless fsync index files.
-	 * Since we're building outside shared buffers, a CHECKPOINT occurring
-	 * during the build has no way to flush the previously written data to
-	 * disk (indeed it won't know the index even exists).  A crash later on
-	 * would replay WAL from the checkpoint, therefore it wouldn't replay our
-	 * earlier WAL entries. If we do not fsync those pages here, they might
-	 * still not be on disk when the crash occurs.
-	 */
if (wstate->btws_use_wal)
-		smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
+		unbuffered_finish(&wstate->ub_wstate, MAIN_FORKNUM);
}

The API of unbuffered_finish() only sometimes getting called, but
unbuffered_prep() being unconditional, strikes me as prone to bugs. Perhaps
it'd make sense to pass in whether the relation needs to be synced or not instead?

spgbuildempty(Relation index)
{
Page		page;
+	UnBufferedWriteState wstate;
+
+	wstate.smgr_rel = RelationGetSmgr(index);
+	unbuffered_prep(&wstate);

I don't think that's actually safe, and one of the instances could be the
cause cause of the bug CI is seeing:

* Note: since a relcache flush can cause the file handle to be closed again,
* it's unwise to hold onto the pointer returned by this function for any
* long period. Recommended practice is to just re-execute RelationGetSmgr
* each time you need to access the SMgrRelation. It's quite cheap in
* comparison to whatever an smgr function is going to do.
*/
static inline SMgrRelation
RelationGetSmgr(Relation rel)

Greetings,

Andres Freund

#16

melanieplageman@gmail.com

almost 4 years ago

In reply to: Andres Freund (#15)

4 attachment(s)

Re: Avoiding smgrimmedsync() during nbtree index builds

Hi,
v5 attached and all email feedback addressed below

On Thu, Jan 13, 2022 at 12:18 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Sep 29, 2021 at 2:36 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

unbuffered_write() and unbuffered_extend() might be able to be used even
if unbuffered_prep() and unbuffered_finish() are not used -- for example
hash indexes do something I don't entirely understand in which they call
smgrextend() directly when allocating buckets but then initialize the
new bucket pages using the bufmgr machinery.

My first thought was that someone might do this to make sure that we
don't run out of disk space after initializing some but not all of the
buckets. Someone might have some reason for wanting to avoid that
corner case. However, in _hash_init() that explanation doesn't make
any sense, because an abort would destroy the entire relation. And in
_hash_alloc_buckets() the variable "zerobuf" points to a buffer that
is not, in fact, all zeroes. So my guess is this is just old, crufty
code - either whatever reasons somebody had for doing it that way are
no longer valid, or there wasn't any good reason even at the time.

I notice in the comment before _hash_alloc_buckets() is called, it says

/*
* We treat allocation of buckets as a separate WAL-logged action.
* Even if we fail after this operation, won't leak bucket pages;
* rather, the next split will consume this space. In any case, even
* without failure we don't use all the space in one split operation.
*/

Does this mean that it is okay that these pages are written outside of
shared buffers and, though skipFsync is passed as false, a checkpoint
starting and finishing between writing the WAL and
register_dirty_segment() followed by a crash could result in lost data?

On Thu, Jan 13, 2022 at 10:52 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

I think the ifndef should be outside the includes:

Thanks, fixed!

On Sun, Jan 16, 2022 at 3:26 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

Separate from this issue, I wonder if it'd be useful to write a DEBUG log
showing when btree uses shared_buffers vs fsync. And a regression test which
first SETs client_min_messages=debug to capture the debug log to demonstrate
when/that new code path is being hit. I'm not sure if that would be good to
merge, but it may be useful for now.

I will definitely think about doing this.

On Mon, Jan 17, 2022 at 12:22 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Sun, Jan 16, 2022 at 02:25:59PM -0600, Justin Pryzby wrote:
On Thu, Jan 13, 2022 at 09:52:55AM -0600, Justin Pryzby wrote:
This is failing on windows CI when I use initdb --data-checksums, as attached.

https://cirrus-ci.com/task/5612464120266752
https://api.cirrus-ci.com/v1/artifact/task/5612464120266752/regress_diffs/src/test/regress/regression.diffs
+++ c:/cirrus/src/test/regress/results/bitmapops.out        2022-01-13 00:47:46.704621200 +0000
..
+ERROR:  could not read block 0 in file "base/16384/30310": read only 0 of 8192 bytes
The failure isn't consistent, so I double checked my report. I have some more
details:

The problem occurs maybe only ~25% of the time.

The issue is in the 0001 patch.

data-checksums isn't necessary to hit the issue.

errlocation says: LOCATION: mdread, md.c:686 (the only place the error
exists)

With Andres' windows crash patch, I obtained a backtrace - attached.
https://cirrus-ci.com/task/5978171861368832
https://api.cirrus-ci.com/v1/artifact/task/5978171861368832/crashlog/crashlog-postgres.exe_0fa8_2022-01-16_02-54-35-291.txt

Maybe its a race condition or synchronization problem that nowhere else tends
to hit.
I meant to say that I had not seen this issue anywhere but windows.

But now, by chance, I still had the 0001 patch in my tree, and hit the same
issue on linux:
https://cirrus-ci.com/task/4550618281934848
+++ /tmp/cirrus-ci-build/src/bin/pg_upgrade/tmp_check/regress/results/tuplesort.out     2022-01-17 16:06:35.759108172 +0000
EXPLAIN (COSTS OFF)
SELECT id, noabort_increasing, noabort_decreasing FROM abbrev_abort_uuids ORDER BY noabort_increasing LIMIT 5;
+ERROR:  could not read block 0 in file "base/16387/t3_36794": read only 0 of 8192 bytes

Yes, I think this is due to the problem Andres mentioned with my saving
the SMgrRelation and then trying to use it after a relcache flush. The
new patch version addresses this by always re-executing
RelationGetSmgr() as recommended in the comments.

On Sun, Jan 23, 2022 at 4:55 PM Andres Freund <andres@anarazel.de> wrote:

On 2022-01-11 12:10:54 -0500, Melanie Plageman wrote:

On Mon, Jan 10, 2022 at 5:50 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
Thus, the backend must ensure that
either the Redo pointer has not moved or that the data is fsync'd before
freeing the page.

"freeing"?

Yes, I agree this wording was confusing/incorrect. I meant before it
moves on (I said freeing because it usually pfrees() the page in memory
that it was writing from). I've changed the commit message.

This is not a problem with pages written in shared buffers because the
checkpointer will block until all buffers that were dirtied before it
began finish before it moves the Redo pointer past their associated WAL
entries.

This commit makes two main changes:

1) It wraps smgrextend() and smgrwrite() in functions from a new API
for writing data outside of shared buffers, directmgr.

2) It saves the XLOG Redo pointer location before doing the write or
extend. It also adds an fsync request for the page to the
checkpointer's pending-ops table. Then, after doing the write or
extend, if the Redo pointer has moved (meaning a checkpoint has
started since it saved it last), then the backend fsync's the page
itself. Otherwise, it lets the checkpointer take care of fsync'ing
the page the next time it processes the pending-ops table.

Why combine those two into one commit?

I've separated it into three commits -- the above two + a separate
commit that actually has the btree index use the self-fsync
optimization.

@@ -654,9 +657,8 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
/* Now extend the file */
while (vm_nblocks_now < vm_nblocks)
{
-             PageSetChecksumInplace((Page) pg.data, vm_nblocks_now);
-
-             smgrextend(reln, VISIBILITYMAP_FORKNUM, vm_nblocks_now, pg.data, false);
+             // TODO: aren't these pages empty? why checksum them
+             unbuffered_extend(&ub_wstate, VISIBILITYMAP_FORKNUM, vm_nblocks_now, (Page) pg.data, false);
Yea, it's a bit odd. PageSetChecksumInplace() will just return immediately:

/* If we don't need a checksum, just return */
if (PageIsNew(page) || !DataChecksumsEnabled())
return;

OTOH, it seems easier to have it there than to later forget it, when
e.g. adding some actual initial content to the pages during the smgrextend().

I've left these as is and removed the comment.

@@ -560,6 +562,8 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)

wstate.heap = btspool->heap;
wstate.index = btspool->index;
+     wstate.ub_wstate.smgr_rel = RelationGetSmgr(btspool->index);
+     wstate.ub_wstate.redo = InvalidXLogRecPtr;
wstate.inskey = _bt_mkscankey(wstate.index, NULL);
/* _bt_mkscankey() won't set allequalimage without metapage */
wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
@@ -656,31 +660,19 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
if (!wstate->btws_zeropage)
wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
/* don't set checksum for all-zero page */
-             smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
-                                wstate->btws_pages_written++,
-                                (char *) wstate->btws_zeropage,
-                                true);
+             unbuffered_extend(&wstate->ub_wstate, MAIN_FORKNUM, wstate->btws_pages_written++, wstate->btws_zeropage, true);
}

There's a bunch of long lines in here...

Fixed.

-     /*
-      * When we WAL-logged index pages, we must nonetheless fsync index files.
-      * Since we're building outside shared buffers, a CHECKPOINT occurring
-      * during the build has no way to flush the previously written data to
-      * disk (indeed it won't know the index even exists).  A crash later on
-      * would replay WAL from the checkpoint, therefore it wouldn't replay our
-      * earlier WAL entries. If we do not fsync those pages here, they might
-      * still not be on disk when the crash occurs.
-      */
if (wstate->btws_use_wal)
-             smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
+             unbuffered_finish(&wstate->ub_wstate, MAIN_FORKNUM);
}

I've fixed this. Now unbuffered_prep() and unbuffered_finish() will
always be called. I've added a few options to unbuffered_prep() to
indicate whether or not the smgrimmedsync() should be called in the end
as well as whether or not skipFsync should be passed as true or false to
smgrextend() and smgrwrite() and whether or not the avoiding self-fsync
optimization should be used.

I found it best to do it this way because simply passing whether or not
to do the sync to unbuffered_finish() did not allow me to distinguish
between the case in which the sync should not be done ever (because the
caller did not call smgrimmedsync() or because the relation does not
require WAL) and when smgrimmedsync() should only be done if the redo
pointer has changed (in the case of the optimization).

I thought it actually made for a better API to specify up front (in
unbuffered_prep()) whether or not the caller should be prepared to do
the fsync itself or not and whether it not it wanted to do the
optimization. It feels less prone to error and omission.

spgbuildempty(Relation index)
{
Page            page;
+     UnBufferedWriteState wstate;
+
+     wstate.smgr_rel = RelationGetSmgr(index);
+     unbuffered_prep(&wstate);
I don't think that's actually safe, and one of the instances could be the
cause cause of the bug CI is seeing:

* Note: since a relcache flush can cause the file handle to be closed again,
* it's unwise to hold onto the pointer returned by this function for any
* long period. Recommended practice is to just re-execute RelationGetSmgr
* each time you need to access the SMgrRelation. It's quite cheap in
* comparison to whatever an smgr function is going to do.
*/
static inline SMgrRelation
RelationGetSmgr(Relation rel)

Yes, I've changed this in the attached v5.

One question I have is whether or not other callers than btree index
could benefit from the self-fsync avoidance optimization.

Also, after taking another look at gist index build, I notice that
smgrimmedsync() is not done anywhere and skipFsync is always passed as
true, so what happens if a full checkpoint and a crash happens between
WAL-logging and whenever the dirty pages make it to permanent storage?

- Melanie

Attachments:

v5-0002-Avoid-immediate-fsync-for-unbuffered-IO.patchtext/x-patch; charset=US-ASCII; name=v5-0002-Avoid-immediate-fsync-for-unbuffered-IO.patchDownload

From 72c528a913ed4cae1cd11789439bfe1208dd379a Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 8 Feb 2022 19:01:27 -0500
Subject: [PATCH v5 2/4] Avoid immediate fsync for unbuffered IO

Data written to WAL-logged table forks is durable once the WAL entries
are on permanent storage; however, the XLOG Redo pointer cannot be moved
past the associated WAL until the page data is safely on permanent
storage. If a crash were to occur before the data is fsync'd, the WAL
wouldn't be replayed during recovery, and the data would be lost.

This is not a problem with pages written in shared buffers because the
checkpointer will block until FlushBuffer() is complete for all buffers
that were dirtied before it began. Therefore it will not move the Redo
pointer past their associated WAL entries until it has fsync'd the data.

A backend writing data outside of shared buffers must ensure that the
data has reached permanent storage itself or that the Redo pointer has
not moved while it was writing the data.

In the common case, the backend should not have to do this fsync itself
and can instead request the checkpointer do it.

To ensure this is safe, the backend can save the XLOG Redo pointer
location before doing the write or extend. Then it can add an fsync
request for the page to the checkpointer's pending-ops table using the
existing mechanism. After doing the write or extend, if the Redo pointer
has moved (meaning a checkpoint has started since it saved it last),
then the backend can simply fsync the page itself. Otherwise, the
checkpointer takes care of fsync'ing the page the next time it processes
the pending-ops table.

This commit adds the optimization option to the directmgr API but does
not add any users, so there is no behavior change.
---
 src/backend/access/gist/gistbuild.c       |  4 ++--
 src/backend/access/hash/hashpage.c        |  2 +-
 src/backend/access/heap/heapam_handler.c  |  2 +-
 src/backend/access/heap/rewriteheap.c     |  2 +-
 src/backend/access/heap/visibilitymap.c   |  2 +-
 src/backend/access/nbtree/nbtree.c        |  2 +-
 src/backend/access/nbtree/nbtsort.c       |  5 ++++-
 src/backend/access/spgist/spginsert.c     |  2 +-
 src/backend/access/transam/xlog.c         | 13 +++++++++++++
 src/backend/catalog/storage.c             |  2 +-
 src/backend/storage/direct/directmgr.c    | 18 ++++++++++++++++--
 src/backend/storage/freespace/freespace.c |  2 +-
 src/include/access/xlog.h                 |  1 +
 src/include/storage/directmgr.h           | 13 +++++++++++--
 14 files changed, 55 insertions(+), 15 deletions(-)

diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 42a5e61f8e..53226e45bf 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -411,7 +411,7 @@ gist_indexsortbuild(GISTBuildState *state)
 	state->pages_allocated = 0;
 	state->pages_written = 0;
 	state->ready_num_pages = 0;
-	unbuffered_prep(&state->ub_wstate, false, false);
+	unbuffered_prep(&state->ub_wstate, false, false, false);
 
 	/*
 	 * Write an empty page as a placeholder for the root page. It will be
@@ -638,7 +638,7 @@ gist_indexsortbuild_flush_ready_pages(GISTBuildState *state)
 	if (state->ready_num_pages == 0)
 		return;
 
-	unbuffered_prep(&state->ub_wstate, false, false);
+	unbuffered_prep(&state->ub_wstate, false, false, false);
 
 	for (int i = 0; i < state->ready_num_pages; i++)
 	{
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 6096604438..0c5533e632 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -1003,7 +1003,7 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 	if (lastblock < firstblock || lastblock == InvalidBlockNumber)
 		return false;
 
-	unbuffered_prep(&ub_wstate, false, true);
+	unbuffered_prep(&ub_wstate, false, false, true);
 
 	page = (Page) zerobuf.data;
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 9fd6a6f447..f9f6527507 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -578,7 +578,7 @@ heapam_relation_set_new_filenode(Relation rel,
 	SMgrRelation srel;
 	UnBufferedWriteState ub_wstate;
 
-	unbuffered_prep(&ub_wstate, true, false);
+	unbuffered_prep(&ub_wstate, false, true, false);
 	/*
 	 * Initialize to the minimum XID that could put tuples in the table. We
 	 * know that no xacts older than RecentXmin are still running, so that
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 12bdd6ff60..b103a62135 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -267,7 +267,7 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
 	state->rs_cutoff_multi = cutoff_multi;
 	state->rs_cxt = rw_cxt;
 
-	unbuffered_prep(&state->rs_unbuffered_wstate,
+	unbuffered_prep(&state->rs_unbuffered_wstate, false,
 			RelationNeedsWAL(state->rs_new_rel), false);
 
 	/* Initialize hash tables used to track update chains */
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 897de5ec1f..d844767abc 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -633,7 +633,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
 	 * by the time we get the lock.
 	 */
 	LockRelationForExtension(rel, ExclusiveLock);
-	unbuffered_prep(&ub_wstate, false, true);
+	unbuffered_prep(&ub_wstate, false, false, true);
 
 	/*
 	 * Caution: re-using this smgr pointer could fail if the relcache entry
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 1ec7493ad3..843c9e2362 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -155,7 +155,7 @@ btbuildempty(Relation index)
 	Page		metapage;
 	UnBufferedWriteState wstate;
 
-	unbuffered_prep(&wstate, true, false);
+	unbuffered_prep(&wstate, false, true, false);
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index c7a65a9972..a67770f3fd 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1186,7 +1186,10 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	bool		deduplicate;
 
 
-	unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal, false);
+	/*
+	 * Only bother fsync'ing the data to permanent storage if WAL logging
+	 */
+	unbuffered_prep(&wstate->ub_wstate, false, wstate->btws_use_wal, false);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index e232ba4b86..e45f1f5db9 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -159,7 +159,7 @@ spgbuildempty(Relation index)
 	Page		page;
 	UnBufferedWriteState wstate;
 
-	unbuffered_prep(&wstate, true, false);
+	unbuffered_prep(&wstate, false, true, false);
 
 	/* Construct metapage. */
 	page = (Page) palloc(BLCKSZ);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 958220c495..c1434d8f85 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8774,6 +8774,19 @@ GetLastImportantRecPtr(void)
 	return res;
 }
 
+bool RedoRecPtrChanged(XLogRecPtr comparator_ptr)
+{
+	XLogRecPtr ptr;
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ptr = XLogCtl->RedoRecPtr;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	if (RedoRecPtr < ptr)
+		RedoRecPtr = ptr;
+
+	return RedoRecPtr != comparator_ptr;
+}
+
 /*
  * Get the time and LSN of the last xlog segment switch
  */
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 1ec90e00ab..307f32ab8c 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -443,7 +443,7 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 	use_wal = XLogIsNeeded() &&
 		(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
 
-	unbuffered_prep(&wstate, (use_wal || copying_initfork), false);
+	unbuffered_prep(&wstate, false, (use_wal || copying_initfork), false);
 
 	nblocks = smgrnblocks(src, forkNum);
 
diff --git a/src/backend/storage/direct/directmgr.c b/src/backend/storage/direct/directmgr.c
index 371ff5602f..120b7c06a7 100644
--- a/src/backend/storage/direct/directmgr.c
+++ b/src/backend/storage/direct/directmgr.c
@@ -15,15 +15,26 @@
 #include "postgres.h"
 
 
+#include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "storage/directmgr.h"
 
+// TODO: do_optimization can be derived from request_fsync and fsync_self, I
+// think. but is that true in all cases and also is it confusing?
 void
-unbuffered_prep(UnBufferedWriteState *wstate, bool fsync_self, bool
-		request_fsync)
+unbuffered_prep(UnBufferedWriteState *wstate, bool do_optimization, bool
+		fsync_self, bool request_fsync)
 {
+	/*
+	 * No reason to do optimization when not required to fsync self
+	 */
+	Assert(!do_optimization || (do_optimization && fsync_self));
+
+	wstate->do_optimization = do_optimization;
 	wstate->fsync_self = fsync_self;
 	wstate->request_fsync = request_fsync;
+
+	wstate->redo = do_optimization ? GetRedoRecPtr() : InvalidXLogRecPtr;
 }
 
 void
@@ -75,5 +86,8 @@ unbuffered_finish(UnBufferedWriteState *wstate, SMgrRelation smgrrel,
 	if (!wstate->fsync_self)
 		return;
 
+	if (wstate->do_optimization && !RedoRecPtrChanged(wstate->redo))
+		return;
+
 	smgrimmedsync(smgrrel, forknum);
 }
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 4326ea8f01..7c79983ff9 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -627,7 +627,7 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
 	 */
 	LockRelationForExtension(rel, ExclusiveLock);
 
-	unbuffered_prep(&ub_wstate, false, true);
+	unbuffered_prep(&ub_wstate, false, false, true);
 
 	/*
 	 * Caution: re-using this smgr pointer could fail if the relcache entry
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index a4b1c1286f..c2ae0ce304 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -310,6 +310,7 @@ extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(TimeLineID *insertTLI);
 extern TimeLineID GetWALInsertionTimeLine(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern bool RedoRecPtrChanged(XLogRecPtr comparator_ptr);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool PromoteIsTriggered(void);
diff --git a/src/include/storage/directmgr.h b/src/include/storage/directmgr.h
index 47653d0d1b..1ff0c9b0c8 100644
--- a/src/include/storage/directmgr.h
+++ b/src/include/storage/directmgr.h
@@ -14,6 +14,7 @@
 #ifndef DIRECTMGR_H
 #define DIRECTMGR_H
 
+#include "access/xlogdefs.h"
 #include "common/relpath.h"
 #include "storage/block.h"
 #include "storage/bufpage.h"
@@ -31,19 +32,27 @@ typedef struct UnBufferedWriteState
 	 * associated WAL entries. To avoid this, callers in this situation must
 	 * fsync the pages they have written themselves.
 	 *
+	 * These callers can optionally use the following optimization:
+	 * attempt to use the sync request queue and fall back to fsync'ing the
+	 * pages themselves if the redo pointer moves between the start and finish
+	 * of their write. In order to do this, they must set do_optimization to
+	 * true so that the redo pointer is saved before the write begins.
+	 *
 	 * Callers able to use the checkpointer's sync request queue when writing
 	 * data outside shared buffers (like fsm and vm) can set request_fsync to
 	 * true so that these fsync requests are added to the queue.
 	 */
+	bool do_optimization;
 	bool fsync_self;
 	bool request_fsync;
+	XLogRecPtr redo;
 } UnBufferedWriteState;
 /*
  * prototypes for functions in directmgr.c
  */
 extern void
-unbuffered_prep(UnBufferedWriteState *wstate, bool fsync_self, bool
-		request_fsync);
+unbuffered_prep(UnBufferedWriteState *wstate, bool do_optimization, bool
+		fsync_self, bool request_fsync);
 extern void
 unbuffered_write(UnBufferedWriteState *wstate, bool do_wal, SMgrRelation
 		smgrrel, ForkNumber forknum, BlockNumber blocknum, Page page);
-- 
2.30.2

v5-0004-Use-shared-buffers-when-possible-for-index-build.patchtext/x-patch; charset=US-ASCII; name=v5-0004-Use-shared-buffers-when-possible-for-index-build.patchDownload

From 973469e7a5217630542eb7cfd0a7acfc35c6a9fd Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 8 Feb 2022 19:01:36 -0500
Subject: [PATCH v5 4/4] Use shared buffers when possible for index build

When there are not too many tuples, building the index in shared buffers
makes sense. It allows the buffer manager to handle how best to do the
IO.
---
 src/backend/access/nbtree/nbtree.c  |  29 +--
 src/backend/access/nbtree/nbtsort.c | 267 ++++++++++++++++++++++------
 2 files changed, 223 insertions(+), 73 deletions(-)

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index a1efbe1e6a..4dbf7af9af 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -152,26 +152,27 @@ bthandler(PG_FUNCTION_ARGS)
 void
 btbuildempty(Relation index)
 {
+	/*
+	 * Since this only writes one page, use shared buffers.
+	 */
 	Page		metapage;
-	UnBufferedWriteState wstate;
-
-	unbuffered_prep(&wstate, true, true, true);
+	Buffer metabuf;
 
-	/* Construct metapage. */
-	metapage = (Page) palloc(BLCKSZ);
+	/*
+	 * Allocate a buffer for metapage and initialize metapage.
+	 */
+	metabuf = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_ZERO_AND_LOCK, NULL);
+	metapage = BufferGetPage(metabuf);
 	_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
 
 	/*
-	 * Write the page and log it.  It might seem that an immediate sync would
-	 * be sufficient to guarantee that the file exists on disk, but recovery
-	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
-	 * this even when wal_level=minimal.
+	 * Mark metapage buffer as dirty and XLOG it
 	 */
-	unbuffered_write(&wstate, true, RelationGetSmgr(index), INIT_FORKNUM,
-			BTREE_METAPAGE, metapage);
-
-	unbuffered_finish(&wstate, RelationGetSmgr(index), INIT_FORKNUM);
+	START_CRIT_SECTION();
+	MarkBufferDirty(metabuf);
+	log_newpage_buffer(metabuf, true);
+	END_CRIT_SECTION();
+	_bt_relbuf(index, metabuf);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 079832bb78..7e1dd93df0 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -237,6 +237,7 @@ typedef struct BTPageState
 {
 	Page		btps_page;		/* workspace for page building */
 	BlockNumber btps_blkno;		/* block # to write this page at */
+	Buffer btps_buf; /* buffer to write this page to */
 	IndexTuple	btps_lowkey;	/* page's strict lower bound pivot tuple */
 	OffsetNumber btps_lastoff;	/* last item offset loaded */
 	Size		btps_lastextra; /* last item's extra posting list space */
@@ -254,10 +255,26 @@ typedef struct BTWriteState
 	Relation	index;
 	BTScanInsert inskey;		/* generic insertion scankey */
 	bool		btws_use_wal;	/* dump pages to WAL? */
-	BlockNumber btws_pages_alloced; /* # pages allocated */
-	BlockNumber btws_pages_written; /* # pages written out */
+	BlockNumber btws_pages_alloced; /* # pages allocated for index builds outside SB */
+	BlockNumber btws_pages_written; /* # pages written out for index builds outside SB */
 	Page		btws_zeropage;	/* workspace for filling zeroes */
 	UnBufferedWriteState ub_wstate;
+	/*
+	 * Allocate a new btree page. This does not initialize the page.
+	 */
+	Page (*_bt_bl_alloc_page) (struct BTWriteState *wstate, BlockNumber
+			*blockno, Buffer *buf);
+	/*
+	 * Emit a completed btree page, and release the working storage.
+	 */
+	void (*_bt_blwritepage) (struct BTWriteState *wstate, Page page,
+			BlockNumber blkno, Buffer buf);
+
+	void (*_bt_bl_unbuffered_prep) (UnBufferedWriteState *wstate, bool
+			do_optimization, bool fsync_self, bool request_fsync);
+
+	void (*_bt_bl_unbuffered_finish) (UnBufferedWriteState *wstate,
+			SMgrRelation smgrrel, ForkNumber forknum);
 } BTWriteState;
 
 
@@ -266,10 +283,22 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
+static void _bt_leafbuild(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
-static Page _bt_blnewpage(uint32 level);
+
+static Page
+_bt_bl_alloc_page_direct(BTWriteState *wstate, BlockNumber *blockno, Buffer *buf);
+static Page
+_bt_bl_alloc_page_shared(BTWriteState *wstate, BlockNumber *blockno, Buffer *buf);
+
+static Page _bt_blnewpage(uint32 level, Page page);
+
+static void
+_bt_blwritepage_direct(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf);
+static void
+_bt_blwritepage_shared(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf);
+
 static BTPageState *_bt_pagestate(BTWriteState *wstate, uint32 level);
 static void _bt_slideleft(Page rightmostpage);
 static void _bt_sortaddtup(Page page, Size itemsize,
@@ -280,9 +309,10 @@ static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
 										  BTPageState *state,
 										  BTDedupState dstate);
-static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
+static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state, Buffer
+		metabuf, Page metapage);
 static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
 							   int request);
 static void _bt_end_parallel(BTLeader *btleader);
@@ -295,6 +325,21 @@ static void _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 									   Sharedsort *sharedsort2, int sortmem,
 									   bool progress);
 
+#define BT_BUILD_SB_THRESHOLD 1024
+
+static const BTWriteState wstate_shared = {
+	._bt_bl_alloc_page = _bt_bl_alloc_page_shared,
+	._bt_blwritepage = _bt_blwritepage_shared,
+	._bt_bl_unbuffered_prep = NULL,
+	._bt_bl_unbuffered_finish = NULL,
+};
+
+static const BTWriteState wstate_direct = {
+	._bt_bl_alloc_page = _bt_bl_alloc_page_direct,
+	._bt_blwritepage = _bt_blwritepage_direct,
+	._bt_bl_unbuffered_prep = unbuffered_prep,
+	._bt_bl_unbuffered_finish = unbuffered_finish,
+};
 
 /*
  *	btbuild() -- build a new btree index.
@@ -304,6 +349,7 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 {
 	IndexBuildResult *result;
 	BTBuildState buildstate;
+	BTWriteState writestate;
 	double		reltuples;
 
 #ifdef BTREE_BUILD_STATS
@@ -334,8 +380,12 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
+	 *
+	 * Don't use shared buffers if the number of tuples is too large.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2);
+	writestate = reltuples < BT_BUILD_SB_THRESHOLD ? wstate_shared : wstate_direct;
+
+	_bt_leafbuild(&writestate, buildstate.spool, buildstate.spool2);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
@@ -543,10 +593,8 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
+_bt_leafbuild(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 {
-	BTWriteState wstate;
-
 #ifdef BTREE_BUILD_STATS
 	if (log_btree_build_stats)
 	{
@@ -566,21 +614,45 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 		tuplesort_performsort(btspool2->sortstate);
 	}
 
-	wstate.heap = btspool->heap;
-	wstate.index = btspool->index;
-	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
-	/* _bt_mkscankey() won't set allequalimage without metapage */
-	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
-	wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
+	wstate->heap = btspool->heap;
+	wstate->index = btspool->index;
+	wstate->inskey = _bt_mkscankey(wstate->index, NULL);
+	/* _bt-mkscankey() won't set allequalimage without metapage */
+	wstate->inskey->allequalimage = _bt_allequalimage(wstate->index, true);
+	wstate->btws_use_wal = RelationNeedsWAL(wstate->index);
 
 	/* reserve the metapage */
-	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
-	wstate.btws_pages_written = 0;
-	wstate.btws_zeropage = NULL;	/* until needed */
+	wstate->btws_pages_alloced = 0;
+	wstate->btws_pages_written = 0;
+	wstate->btws_zeropage = NULL;
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
-	_bt_load(&wstate, btspool, btspool2);
+
+	/*
+	 * If not using shared buffers, for a WAL-logged relation, save the redo
+	 * pointer location in case a checkpoint begins during the index build.
+	 *
+	 * This optimization requires that the backend both add an fsync request to
+	 * the checkpointer's pending-ops table as well as be prepared to fsync the
+	 * page data itself. Because none of these are required if the relation is
+	 * not WAL-logged, pass btws_use_wal for all parameters of the prep
+	 * function.
+	 */
+	if (wstate->_bt_bl_unbuffered_prep)
+			wstate->_bt_bl_unbuffered_prep(&wstate->ub_wstate,
+					wstate->btws_use_wal, wstate->btws_use_wal,
+					wstate->btws_use_wal);
+
+	_bt_load(wstate, btspool, btspool2);
+
+	/*
+	 * If not using shared buffers, for a WAL-logged relation, check if backend
+	 * must fsync the page itself.
+	 */
+	if (wstate->_bt_bl_unbuffered_finish)
+		wstate->_bt_bl_unbuffered_finish(&wstate->ub_wstate,
+				RelationGetSmgr(wstate->index), MAIN_FORKNUM);
 }
 
 /*
@@ -613,15 +685,15 @@ _bt_build_callback(Relation index,
 }
 
 /*
- * allocate workspace for a new, clean btree page, not linked to any siblings.
+ * Set up workspace for a new, clean btree page, not linked to any siblings.
+ * Caller must allocate the passed in page.
  */
 static Page
-_bt_blnewpage(uint32 level)
+_bt_blnewpage(uint32 level, Page page)
 {
-	Page		page;
 	BTPageOpaque opaque;
 
-	page = (Page) palloc(BLCKSZ);
+	Assert(page);
 
 	/* Zero the page and set up standard page header info */
 	_bt_pageinit(page, BLCKSZ);
@@ -639,11 +711,8 @@ _bt_blnewpage(uint32 level)
 	return page;
 }
 
-/*
- * emit a completed btree page, and release the working storage.
- */
 static void
-_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
+_bt_blwritepage_direct(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf)
 {
 	/*
 	 * If we have to write pages nonsequentially, fill in the space with
@@ -681,6 +750,61 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 	pfree(page);
 }
 
+static void
+_bt_blwritepage_shared(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf)
+{
+	/*
+	 * Indexes built in shared buffers need only to mark the buffer as dirty
+	 * and XLOG it.
+	 */
+	Assert(buf);
+	START_CRIT_SECTION();
+	MarkBufferDirty(buf);
+	if (wstate->btws_use_wal)
+		log_newpage_buffer(buf, true);
+	END_CRIT_SECTION();
+	_bt_relbuf(wstate->index, buf);
+}
+
+static Page
+_bt_bl_alloc_page_direct(BTWriteState *wstate, BlockNumber *blockno, Buffer *buf)
+{
+	 /* buf is only used when using shared buffers, so set it to InvalidBuffer */
+	*buf = InvalidBuffer;
+
+	/*
+	 * Assign block number for the page.
+	 * This will be used to link to sibling page(s) later and, if this is the
+	 * initial page in the level, saved in the BTPageState
+	 */
+	*blockno = wstate->btws_pages_alloced++;
+
+	/* now allocate and set up the new page */
+	return palloc(BLCKSZ);
+}
+
+static Page
+_bt_bl_alloc_page_shared(BTWriteState *wstate, BlockNumber *blockno, Buffer *buf)
+{
+	/*
+	 * Find a shared buffer for the page. Pass mode RBM_ZERO_AND_LOCK to get an
+	 * exclusive lock on the buffer content. No lock on the relation as a whole
+	 * is needed (as in LockRelationForExtension()) because the initial index
+	 * build is not yet complete.
+	 */
+	*buf = ReadBufferExtended(wstate->index, MAIN_FORKNUM, P_NEW,
+			RBM_ZERO_AND_LOCK, NULL);
+
+	/*
+	 * bufmgr will assign a block number for the new page.
+	 * This will be used to link to sibling page(s) later and, if this is the
+	 * initial page in the level, saved in the BTPageState
+	 */
+	*blockno = BufferGetBlockNumber(*buf);
+
+	return BufferGetPage(*buf);
+}
+
 /*
  * allocate and initialize a new BTPageState.  the returned structure
  * is suitable for immediate use by _bt_buildadd.
@@ -688,13 +812,20 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 static BTPageState *
 _bt_pagestate(BTWriteState *wstate, uint32 level)
 {
+	Buffer buf;
+	BlockNumber blockno;
+
 	BTPageState *state = (BTPageState *) palloc0(sizeof(BTPageState));
 
-	/* create initial page for level */
-	state->btps_page = _bt_blnewpage(level);
+	/*
+	 * Allocate and initialize initial page for the level, and if using shared
+	 * buffers, extend the relation and allocate a shared buffer for the block.
+	 */
+	state->btps_page = _bt_blnewpage(level, wstate->_bt_bl_alloc_page(wstate,
+				&blockno, &buf));
 
-	/* and assign it a page position */
-	state->btps_blkno = wstate->btws_pages_alloced++;
+	state->btps_blkno = blockno;
+	state->btps_buf = buf;
 
 	state->btps_lowkey = NULL;
 	/* initialize lastoff so first item goes into P_FIRSTKEY */
@@ -829,6 +960,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 {
 	Page		npage;
 	BlockNumber nblkno;
+	Buffer nbuf;
 	OffsetNumber last_off;
 	Size		last_truncextra;
 	Size		pgspc;
@@ -843,6 +975,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 
 	npage = state->btps_page;
 	nblkno = state->btps_blkno;
+	nbuf = state->btps_buf;
 	last_off = state->btps_lastoff;
 	last_truncextra = state->btps_lastextra;
 	state->btps_lastextra = truncextra;
@@ -899,15 +1032,14 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 		 */
 		Page		opage = npage;
 		BlockNumber oblkno = nblkno;
+		Buffer obuf = nbuf;
 		ItemId		ii;
 		ItemId		hii;
 		IndexTuple	oitup;
 
-		/* Create new page of same level */
-		npage = _bt_blnewpage(state->btps_level);
-
-		/* and assign it a page position */
-		nblkno = wstate->btws_pages_alloced++;
+		/* Create and initialize a new page of same level */
+		npage = _bt_blnewpage(state->btps_level,
+				wstate->_bt_bl_alloc_page(wstate, &nblkno, &nbuf));
 
 		/*
 		 * We copy the last item on the page into the new page, and then
@@ -1017,9 +1149,12 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 
 		/*
 		 * Write out the old page.  We never need to touch it again, so we can
-		 * free the opage workspace too.
+		 * free the opage workspace too. obuf has been released and is no longer
+		 * valid.
 		 */
-		_bt_blwritepage(wstate, opage, oblkno);
+		 wstate->_bt_blwritepage(wstate, opage, oblkno, obuf);
+		 obuf = InvalidBuffer;
+		 opage = NULL;
 
 		/*
 		 * Reset last_off to point to new page
@@ -1054,6 +1189,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 
 	state->btps_page = npage;
 	state->btps_blkno = nblkno;
+	state->btps_buf = nbuf;
 	state->btps_lastoff = last_off;
 }
 
@@ -1099,12 +1235,12 @@ _bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
  * Finish writing out the completed btree.
  */
 static void
-_bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
+_bt_uppershutdown(BTWriteState *wstate, BTPageState *state, Buffer metabuf,
+		Page metapage)
 {
 	BTPageState *s;
 	BlockNumber rootblkno = P_NONE;
 	uint32		rootlevel = 0;
-	Page		metapage;
 
 	/*
 	 * Each iteration of this loop completes one more level of the tree.
@@ -1150,20 +1286,24 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		 * back one slot.  Then we can dump out the page.
 		 */
 		_bt_slideleft(s->btps_page);
-		_bt_blwritepage(wstate, s->btps_page, s->btps_blkno);
+		wstate->_bt_blwritepage(wstate, s->btps_page, s->btps_blkno, s->btps_buf);
+		s->btps_buf = InvalidBuffer;
 		s->btps_page = NULL;	/* writepage freed the workspace */
 	}
 
 	/*
-	 * As the last step in the process, construct the metapage and make it
+	 * As the last step in the process, initialize the metapage and make it
 	 * point to the new root (unless we had no data at all, in which case it's
 	 * set to point to "P_NONE").  This changes the index to the "valid" state
 	 * by filling in a valid magic number in the metapage.
+	 * After this, metapage will have been freed or invalid and metabuf, if ever
+	 * valid, will have been released.
 	 */
-	metapage = (Page) palloc(BLCKSZ);
 	_bt_initmetapage(metapage, rootblkno, rootlevel,
 					 wstate->inskey->allequalimage);
-	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
+	wstate->_bt_blwritepage(wstate, metapage, BTREE_METAPAGE, metabuf);
+	metabuf = InvalidBuffer;
+	metapage = NULL;
 }
 
 /*
@@ -1173,6 +1313,10 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 static void
 _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 {
+	Page metapage;
+	BlockNumber metablkno;
+	Buffer metabuf;
+
 	BTPageState *state = NULL;
 	bool		merge = (btspool2 != NULL);
 	IndexTuple	itup,
@@ -1185,21 +1329,29 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	int64		tuples_done = 0;
 	bool		deduplicate;
 
+	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
+		BTGetDeduplicateItems(wstate->index);
 
 	/*
-	 * Only bother fsync'ing the data to permanent storage if WAL logging
+	 * Reserve block 0 for the metapage up front.
+	 *
+	 * When using the shared buffers API it is easier to allocate the buffer
+	 * for block 0 first instead of trying skip block 0 and allocate it at the
+	 * end of index build.
+	 *
+	 * When not using the shared buffers API, there is no harm in allocating
+	 * the metapage first. When block 1 is written, the direct writepage
+	 * function will zero-fill block 0. When writing out the metapage at the
+	 * end of index build, it will overwrite that block 0.
+	 *
+	 * The metapage will be initialized and written out at the end of the index
+	 * build when all of the information needed to do so is available.
 	 *
-	 * The self-fsync optimization requires that the backend both add an fsync
-	 * request to the checkpointer's pending-ops table as well as be prepared
-	 * to fsync the page data itself. Because none of these are required if the
-	 * relation is not WAL-logged, pass btws_use_wal for all parameters of the
-	 * prep function.
+	 * The block number will always be BTREE_METAPAGE, so the metablkno
+	 * variable is unused and only created to avoid a special case in the
+	 * direct alloc function.
 	 */
-	unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal,
-			wstate->btws_use_wal, wstate->btws_use_wal);
-
-	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
-		BTGetDeduplicateItems(wstate->index);
+	metapage = wstate->_bt_bl_alloc_page(wstate, &metablkno, &metabuf);
 
 	if (merge)
 	{
@@ -1422,10 +1574,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	}
 
 	/* Close down final pages and write the metapage */
-	_bt_uppershutdown(wstate, state);
-
-	unbuffered_finish(&wstate->ub_wstate, RelationGetSmgr(wstate->index),
-			MAIN_FORKNUM);
+	_bt_uppershutdown(wstate, state, metabuf, metapage);
 }
 
 /*
-- 
2.30.2

v5-0001-Add-unbuffered-IO-API.patchtext/x-patch; charset=US-ASCII; name=v5-0001-Add-unbuffered-IO-API.patchDownload

From e714a6bd00db76f516ce99bac97ed7e8268eb645 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 8 Feb 2022 19:01:18 -0500
Subject: [PATCH v5 1/4] Add unbuffered IO API

Wrap unbuffered extends and writes in a new API, directmgr.

When writing data outside of shared buffers, the backend must do a
series of steps to ensure the data is both durable and recoverable.

When writing or extending a page of data for a WAL-logged table fork,
the backend must log, checksum (if page is not empty), and write out the
page before moving on.

Additionally, the backend must fsync the page data to ensure it reaches
permanent storage since checkpointer is unaware of the buffer and could
move the Redo pointer past the associated WAL for this write/extend
before it fsyncs the data.

This API is also used for non-WAL-logged and non-self-fsync'd table
forks but with the appropriate exceptions to the above steps.

This commit introduces no functional change. It replaces all current
callers of smgrimmedsync(), smgrextend(), and smgrwrite() with the
equivalent directmgr functions. Consolidating these steps makes IO
outside of shared buffers less error-prone.
---
 src/backend/access/gist/gistbuild.c       | 36 +++++++----
 src/backend/access/hash/hashpage.c        | 18 +++---
 src/backend/access/heap/heapam_handler.c  | 15 +++--
 src/backend/access/heap/rewriteheap.c     | 53 +++++----------
 src/backend/access/heap/visibilitymap.c   | 10 ++-
 src/backend/access/nbtree/nbtree.c        | 18 ++----
 src/backend/access/nbtree/nbtsort.c       | 56 ++++++----------
 src/backend/access/spgist/spginsert.c     | 39 ++++-------
 src/backend/catalog/storage.c             | 30 +++------
 src/backend/storage/Makefile              |  2 +-
 src/backend/storage/direct/Makefile       | 17 +++++
 src/backend/storage/direct/directmgr.c    | 79 +++++++++++++++++++++++
 src/backend/storage/freespace/freespace.c | 14 ++--
 src/include/catalog/storage.h             |  1 +
 src/include/storage/directmgr.h           | 57 ++++++++++++++++
 15 files changed, 276 insertions(+), 169 deletions(-)
 create mode 100644 src/backend/storage/direct/Makefile
 create mode 100644 src/backend/storage/direct/directmgr.c
 create mode 100644 src/include/storage/directmgr.h

diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 4db896a533..42a5e61f8e 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -43,6 +43,7 @@
 #include "miscadmin.h"
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
+#include "storage/directmgr.h"
 #include "storage/smgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -91,6 +92,7 @@ typedef struct
 
 	int64		indtuples;		/* number of tuples indexed */
 
+	UnBufferedWriteState ub_wstate;
 	/*
 	 * Extra data structures used during a buffering build. 'gfbb' contains
 	 * information related to managing the build buffers. 'parentMap' is a
@@ -409,14 +411,16 @@ gist_indexsortbuild(GISTBuildState *state)
 	state->pages_allocated = 0;
 	state->pages_written = 0;
 	state->ready_num_pages = 0;
+	unbuffered_prep(&state->ub_wstate, false, false);
 
 	/*
 	 * Write an empty page as a placeholder for the root page. It will be
 	 * replaced with the real root page at the end.
 	 */
 	page = palloc0(BLCKSZ);
-	smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
-			   page, true);
+	unbuffered_extend(&state->ub_wstate, false,
+			RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
+			page, true);
 	state->pages_allocated++;
 	state->pages_written++;
 
@@ -458,12 +462,13 @@ gist_indexsortbuild(GISTBuildState *state)
 
 	/* Write out the root */
 	PageSetLSN(levelstate->pages[0], GistBuildLSN);
-	PageSetChecksumInplace(levelstate->pages[0], GIST_ROOT_BLKNO);
-	smgrwrite(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
-			  levelstate->pages[0], true);
-	if (RelationNeedsWAL(state->indexrel))
-		log_newpage(&state->indexrel->rd_node, MAIN_FORKNUM, GIST_ROOT_BLKNO,
-					levelstate->pages[0], true);
+
+	unbuffered_write(&state->ub_wstate, RelationNeedsWAL(state->indexrel),
+			RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
+			  levelstate->pages[0]);
+
+	unbuffered_finish(&state->ub_wstate, RelationGetSmgr(state->indexrel),
+			MAIN_FORKNUM);
 
 	pfree(levelstate->pages[0]);
 	pfree(levelstate);
@@ -633,6 +638,8 @@ gist_indexsortbuild_flush_ready_pages(GISTBuildState *state)
 	if (state->ready_num_pages == 0)
 		return;
 
+	unbuffered_prep(&state->ub_wstate, false, false);
+
 	for (int i = 0; i < state->ready_num_pages; i++)
 	{
 		Page		page = state->ready_pages[i];
@@ -643,9 +650,13 @@ gist_indexsortbuild_flush_ready_pages(GISTBuildState *state)
 			elog(ERROR, "unexpected block number to flush GiST sorting build");
 
 		PageSetLSN(page, GistBuildLSN);
-		PageSetChecksumInplace(page, blkno);
-		smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, blkno, page,
-				   true);
+
+		/*
+		 * These will be WAL logged below
+		 */
+		unbuffered_extend(&state->ub_wstate, false,
+				RelationGetSmgr(state->indexrel), MAIN_FORKNUM, blkno, page,
+				false);
 
 		state->pages_written++;
 	}
@@ -654,6 +665,9 @@ gist_indexsortbuild_flush_ready_pages(GISTBuildState *state)
 		log_newpages(&state->indexrel->rd_node, MAIN_FORKNUM, state->ready_num_pages,
 					 state->ready_blknos, state->ready_pages, true);
 
+	unbuffered_finish(&state->ub_wstate, RelationGetSmgr(state->indexrel),
+			MAIN_FORKNUM);
+
 	for (int i = 0; i < state->ready_num_pages; i++)
 		pfree(state->ready_pages[i]);
 
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 28c5297a1d..6096604438 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -33,6 +33,7 @@
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
+#include "storage/directmgr.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
 #include "storage/smgr.h"
@@ -991,6 +992,7 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 	PGAlignedBlock zerobuf;
 	Page		page;
 	HashPageOpaque ovflopaque;
+	UnBufferedWriteState ub_wstate;
 
 	lastblock = firstblock + nblocks - 1;
 
@@ -1001,6 +1003,8 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 	if (lastblock < firstblock || lastblock == InvalidBlockNumber)
 		return false;
 
+	unbuffered_prep(&ub_wstate, false, true);
+
 	page = (Page) zerobuf.data;
 
 	/*
@@ -1018,16 +1022,10 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 	ovflopaque->hasho_flag = LH_UNUSED_PAGE;
 	ovflopaque->hasho_page_id = HASHO_PAGE_ID;
 
-	if (RelationNeedsWAL(rel))
-		log_newpage(&rel->rd_node,
-					MAIN_FORKNUM,
-					lastblock,
-					zerobuf.data,
-					true);
-
-	PageSetChecksumInplace(page, lastblock);
-	smgrextend(RelationGetSmgr(rel), MAIN_FORKNUM, lastblock, zerobuf.data,
-			   false);
+	unbuffered_extend(&ub_wstate, RelationNeedsWAL(rel), RelationGetSmgr(rel),
+			MAIN_FORKNUM, lastblock, zerobuf.data, false);
+
+	unbuffered_finish(&ub_wstate, RelationGetSmgr(rel), MAIN_FORKNUM);
 
 	return true;
 }
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 39ef8a0b77..9fd6a6f447 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -38,6 +38,7 @@
 #include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/bufpage.h"
+#include "storage/directmgr.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
 #include "storage/procarray.h"
@@ -575,7 +576,9 @@ heapam_relation_set_new_filenode(Relation rel,
 								 MultiXactId *minmulti)
 {
 	SMgrRelation srel;
+	UnBufferedWriteState ub_wstate;
 
+	unbuffered_prep(&ub_wstate, true, false);
 	/*
 	 * Initialize to the minimum XID that could put tuples in the table. We
 	 * know that no xacts older than RecentXmin are still running, so that
@@ -597,12 +600,10 @@ heapam_relation_set_new_filenode(Relation rel,
 
 	/*
 	 * If required, set up an init fork for an unlogged table so that it can
-	 * be correctly reinitialized on restart.  An immediate sync is required
-	 * even if the page has been logged, because the write did not go through
-	 * shared_buffers and therefore a concurrent checkpoint may have moved the
-	 * redo pointer past our xlog record.  Recovery may as well remove it
-	 * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
-	 * record. Therefore, logging is necessary even if wal_level=minimal.
+	 * be correctly reinitialized on restart.
+	 * Recovery may as well remove our xlog record while replaying, for
+	 * example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record. Therefore,
+	 * logging is necessary even if wal_level=minimal.
 	 */
 	if (persistence == RELPERSISTENCE_UNLOGGED)
 	{
@@ -611,7 +612,7 @@ heapam_relation_set_new_filenode(Relation rel,
 			   rel->rd_rel->relkind == RELKIND_TOASTVALUE);
 		smgrcreate(srel, INIT_FORKNUM, false);
 		log_smgrcreate(newrnode, INIT_FORKNUM);
-		smgrimmedsync(srel, INIT_FORKNUM);
+		unbuffered_finish(&ub_wstate, srel, INIT_FORKNUM);
 	}
 
 	smgrclose(srel);
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 2a53826736..12bdd6ff60 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -119,6 +119,7 @@
 #include "replication/logical.h"
 #include "replication/slot.h"
 #include "storage/bufmgr.h"
+#include "storage/directmgr.h"
 #include "storage/fd.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
@@ -152,6 +153,7 @@ typedef struct RewriteStateData
 	HTAB	   *rs_old_new_tid_map; /* unmatched B tuples */
 	HTAB	   *rs_logical_mappings;	/* logical remapping files */
 	uint32		rs_num_rewrite_mappings;	/* # in memory mappings */
+	UnBufferedWriteState rs_unbuffered_wstate;
 }			RewriteStateData;
 
 /*
@@ -265,6 +267,9 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
 	state->rs_cutoff_multi = cutoff_multi;
 	state->rs_cxt = rw_cxt;
 
+	unbuffered_prep(&state->rs_unbuffered_wstate,
+			RelationNeedsWAL(state->rs_new_rel), false);
+
 	/* Initialize hash tables used to track update chains */
 	hash_ctl.keysize = sizeof(TidHashKey);
 	hash_ctl.entrysize = sizeof(UnresolvedTupData);
@@ -317,28 +322,14 @@ end_heap_rewrite(RewriteState state)
 	/* Write the last page, if any */
 	if (state->rs_buffer_valid)
 	{
-		if (RelationNeedsWAL(state->rs_new_rel))
-			log_newpage(&state->rs_new_rel->rd_node,
-						MAIN_FORKNUM,
-						state->rs_blockno,
-						state->rs_buffer,
-						true);
-
-		PageSetChecksumInplace(state->rs_buffer, state->rs_blockno);
-
-		smgrextend(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,
-				   state->rs_blockno, (char *) state->rs_buffer, true);
+		unbuffered_extend(&state->rs_unbuffered_wstate,
+				RelationNeedsWAL(state->rs_new_rel),
+				RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,
+				state->rs_blockno, state->rs_buffer, false);
 	}
 
-	/*
-	 * When we WAL-logged rel pages, we must nonetheless fsync them.  The
-	 * reason is the same as in storage.c's RelationCopyStorage(): we're
-	 * writing data that's not in shared buffers, and so a CHECKPOINT
-	 * occurring during the rewriteheap operation won't have fsync'd data we
-	 * wrote before the checkpoint.
-	 */
-	if (RelationNeedsWAL(state->rs_new_rel))
-		smgrimmedsync(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM);
+	unbuffered_finish(&state->rs_unbuffered_wstate,
+			RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM);
 
 	logical_end_heap_rewrite(state);
 
@@ -676,24 +667,10 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
 			 * contains a tuple.  Hence, unlike RelationGetBufferForTuple(),
 			 * enforce saveFreeSpace unconditionally.
 			 */
-
-			/* XLOG stuff */
-			if (RelationNeedsWAL(state->rs_new_rel))
-				log_newpage(&state->rs_new_rel->rd_node,
-							MAIN_FORKNUM,
-							state->rs_blockno,
-							page,
-							true);
-
-			/*
-			 * Now write the page. We say skipFsync = true because there's no
-			 * need for smgr to schedule an fsync for this write; we'll do it
-			 * ourselves in end_heap_rewrite.
-			 */
-			PageSetChecksumInplace(page, state->rs_blockno);
-
-			smgrextend(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,
-					   state->rs_blockno, (char *) page, true);
+			unbuffered_extend(&state->rs_unbuffered_wstate,
+					RelationNeedsWAL(state->rs_new_rel),
+					RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,
+					state->rs_blockno, page, false);
 
 			state->rs_blockno++;
 			state->rs_buffer_valid = false;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e09f25a684..897de5ec1f 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -93,6 +93,7 @@
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/bufmgr.h"
+#include "storage/directmgr.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
 #include "utils/inval.h"
@@ -617,6 +618,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
 	BlockNumber vm_nblocks_now;
 	PGAlignedBlock pg;
 	SMgrRelation reln;
+	UnBufferedWriteState ub_wstate;
 
 	PageInit((Page) pg.data, BLCKSZ, 0);
 
@@ -631,6 +633,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
 	 * by the time we get the lock.
 	 */
 	LockRelationForExtension(rel, ExclusiveLock);
+	unbuffered_prep(&ub_wstate, false, true);
 
 	/*
 	 * Caution: re-using this smgr pointer could fail if the relcache entry
@@ -655,12 +658,13 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
 	/* Now extend the file */
 	while (vm_nblocks_now < vm_nblocks)
 	{
-		PageSetChecksumInplace((Page) pg.data, vm_nblocks_now);
-
-		smgrextend(reln, VISIBILITYMAP_FORKNUM, vm_nblocks_now, pg.data, false);
+		unbuffered_extend(&ub_wstate, false, reln, VISIBILITYMAP_FORKNUM,
+				vm_nblocks_now, (Page) pg.data, false);
 		vm_nblocks_now++;
 	}
 
+	unbuffered_finish(&ub_wstate, reln, VISIBILITYMAP_FORKNUM);
+
 	/*
 	 * Send a shared-inval message to force other backends to close any smgr
 	 * references they may have for this rel, which we are about to change.
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c9b4964c1e..1ec7493ad3 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -30,6 +30,7 @@
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "storage/condition_variable.h"
+#include "storage/directmgr.h"
 #include "storage/indexfsm.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
@@ -152,6 +153,9 @@ void
 btbuildempty(Relation index)
 {
 	Page		metapage;
+	UnBufferedWriteState wstate;
+
+	unbuffered_prep(&wstate, true, false);
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
@@ -164,18 +168,10 @@ btbuildempty(Relation index)
 	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
-	PageSetChecksumInplace(metapage, BTREE_METAPAGE);
-	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BTREE_METAPAGE,
-			  (char *) metapage, true);
-	log_newpage(&RelationGetSmgr(index)->smgr_rnode.node, INIT_FORKNUM,
-				BTREE_METAPAGE, metapage, true);
+	unbuffered_write(&wstate, true, RelationGetSmgr(index), INIT_FORKNUM,
+			BTREE_METAPAGE, metapage);
 
-	/*
-	 * An immediate sync is required even if we xlog'd the page, because the
-	 * write did not go through shared_buffers and therefore a concurrent
-	 * checkpoint may have moved the redo pointer past our xlog record.
-	 */
-	smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
+	unbuffered_finish(&wstate, RelationGetSmgr(index), INIT_FORKNUM);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 8a19de2f66..c7a65a9972 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -57,6 +57,7 @@
 #include "executor/instrument.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "storage/directmgr.h"
 #include "storage/smgr.h"
 #include "tcop/tcopprot.h"		/* pgrminclude ignore */
 #include "utils/rel.h"
@@ -256,6 +257,7 @@ typedef struct BTWriteState
 	BlockNumber btws_pages_alloced; /* # pages allocated */
 	BlockNumber btws_pages_written; /* # pages written out */
 	Page		btws_zeropage;	/* workspace for filling zeroes */
+	UnBufferedWriteState ub_wstate;
 } BTWriteState;
 
 
@@ -643,13 +645,6 @@ _bt_blnewpage(uint32 level)
 static void
 _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 {
-	/* XLOG stuff */
-	if (wstate->btws_use_wal)
-	{
-		/* We use the XLOG_FPI record type for this */
-		log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
-	}
-
 	/*
 	 * If we have to write pages nonsequentially, fill in the space with
 	 * zeroes until we come back and overwrite.  This is not logically
@@ -661,32 +656,27 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 	{
 		if (!wstate->btws_zeropage)
 			wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
-		/* don't set checksum for all-zero page */
-		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
-				   wstate->btws_pages_written++,
-				   (char *) wstate->btws_zeropage,
-				   true);
+
+		unbuffered_extend(&wstate->ub_wstate, false,
+				RelationGetSmgr(wstate->index), MAIN_FORKNUM,
+				wstate->btws_pages_written++, wstate->btws_zeropage, true);
 	}
 
-	PageSetChecksumInplace(page, blkno);
 
-	/*
-	 * Now write the page.  There's no need for smgr to schedule an fsync for
-	 * this write; we'll do it ourselves before ending the build.
-	 */
+	/* Now write the page. Either we are extending the file... */
 	if (blkno == wstate->btws_pages_written)
 	{
-		/* extending the file... */
-		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM, blkno,
-				   (char *) page, true);
+		unbuffered_extend(&wstate->ub_wstate, wstate->btws_use_wal,
+				RelationGetSmgr(wstate->index), MAIN_FORKNUM, blkno, page,
+				false);
+
 		wstate->btws_pages_written++;
 	}
+
+	/* or we are overwriting a block we zero-filled before. */
 	else
-	{
-		/* overwriting a block we zero-filled before */
-		smgrwrite(RelationGetSmgr(wstate->index), MAIN_FORKNUM, blkno,
-				  (char *) page, true);
-	}
+		unbuffered_write(&wstate->ub_wstate, wstate->btws_use_wal,
+				RelationGetSmgr(wstate->index), MAIN_FORKNUM, blkno, page);
 
 	pfree(page);
 }
@@ -1195,6 +1185,9 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	int64		tuples_done = 0;
 	bool		deduplicate;
 
+
+	unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal, false);
+
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
 
@@ -1421,17 +1414,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	/* Close down final pages and write the metapage */
 	_bt_uppershutdown(wstate, state);
 
-	/*
-	 * When we WAL-logged index pages, we must nonetheless fsync index files.
-	 * Since we're building outside shared buffers, a CHECKPOINT occurring
-	 * during the build has no way to flush the previously written data to
-	 * disk (indeed it won't know the index even exists).  A crash later on
-	 * would replay WAL from the checkpoint, therefore it wouldn't replay our
-	 * earlier WAL entries. If we do not fsync those pages here, they might
-	 * still not be on disk when the crash occurs.
-	 */
-	if (wstate->btws_use_wal)
-		smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
+	unbuffered_finish(&wstate->ub_wstate, RelationGetSmgr(wstate->index),
+			MAIN_FORKNUM);
 }
 
 /*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index bfb74049d0..e232ba4b86 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -25,6 +25,7 @@
 #include "catalog/index.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/directmgr.h"
 #include "storage/smgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -156,48 +157,30 @@ void
 spgbuildempty(Relation index)
 {
 	Page		page;
+	UnBufferedWriteState wstate;
+
+	unbuffered_prep(&wstate, true, false);
 
 	/* Construct metapage. */
 	page = (Page) palloc(BLCKSZ);
 	SpGistInitMetapage(page);
 
-	/*
-	 * Write the page and log it unconditionally.  This is important
-	 * particularly for indexes created on tablespaces and databases whose
-	 * creation happened after the last redo pointer as recovery removes any
-	 * of their existing content when the corresponding create records are
-	 * replayed.
-	 */
-	PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
-	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
-			  (char *) page, true);
-	log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
-				SPGIST_METAPAGE_BLKNO, page, true);
+	unbuffered_write(&wstate, true, RelationGetSmgr(index), INIT_FORKNUM,
+			SPGIST_METAPAGE_BLKNO, page);
 
 	/* Likewise for the root page. */
 	SpGistInitPage(page, SPGIST_LEAF);
 
-	PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO);
-	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, SPGIST_ROOT_BLKNO,
-			  (char *) page, true);
-	log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
-				SPGIST_ROOT_BLKNO, page, true);
+	unbuffered_write(&wstate, true, RelationGetSmgr(index), INIT_FORKNUM,
+			SPGIST_ROOT_BLKNO, page);
 
 	/* Likewise for the null-tuples root page. */
 	SpGistInitPage(page, SPGIST_LEAF | SPGIST_NULLS);
 
-	PageSetChecksumInplace(page, SPGIST_NULL_BLKNO);
-	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, SPGIST_NULL_BLKNO,
-			  (char *) page, true);
-	log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
-				SPGIST_NULL_BLKNO, page, true);
+	unbuffered_write(&wstate, true, RelationGetSmgr(index), INIT_FORKNUM,
+			SPGIST_NULL_BLKNO, page);
 
-	/*
-	 * An immediate sync is required even if we xlog'd the pages, because the
-	 * writes did not go through shared buffers and therefore a concurrent
-	 * checkpoint may have moved the redo pointer past our xlog record.
-	 */
-	smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
+	unbuffered_finish(&wstate, RelationGetSmgr(index), INIT_FORKNUM);
 }
 
 /*
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 9b8075536a..1ec90e00ab 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -28,6 +28,7 @@
 #include "catalog/storage.h"
 #include "catalog/storage_xlog.h"
 #include "miscadmin.h"
+#include "storage/directmgr.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
 #include "utils/hsearch.h"
@@ -420,6 +421,8 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 	bool		copying_initfork;
 	BlockNumber nblocks;
 	BlockNumber blkno;
+	UnBufferedWriteState wstate;
+
 
 	page = (Page) buf.data;
 
@@ -440,6 +443,8 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 	use_wal = XLogIsNeeded() &&
 		(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
 
+	unbuffered_prep(&wstate, (use_wal || copying_initfork), false);
+
 	nblocks = smgrnblocks(src, forkNum);
 
 	for (blkno = 0; blkno < nblocks; blkno++)
@@ -474,30 +479,15 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 		 * page this is, so we have to log the full page including any unused
 		 * space.
 		 */
-		if (use_wal)
-			log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page, false);
 
-		PageSetChecksumInplace(page, blkno);
+		// TODO: is it okay to pass the page here to unbuffered_extend so that
+		// it can be WAL-logged as a full page even though smgrextend used to
+		// take just buf.data?
+		unbuffered_extend(&wstate, use_wal, dst, forkNum, blkno, page, false);
 
-		/*
-		 * Now write the page.  We say skipFsync = true because there's no
-		 * need for smgr to schedule an fsync for this write; we'll do it
-		 * ourselves below.
-		 */
-		smgrextend(dst, forkNum, blkno, buf.data, true);
 	}
 
-	/*
-	 * When we WAL-logged rel pages, we must nonetheless fsync them.  The
-	 * reason is that since we're copying outside shared buffers, a CHECKPOINT
-	 * occurring during the copy has no way to flush the previously written
-	 * data to disk (indeed it won't know the new rel even exists).  A crash
-	 * later on would replay WAL from the checkpoint, therefore it wouldn't
-	 * replay our earlier WAL entries. If we do not fsync those pages here,
-	 * they might still not be on disk when the crash occurs.
-	 */
-	if (use_wal || copying_initfork)
-		smgrimmedsync(dst, forkNum);
+	unbuffered_finish(&wstate, dst, forkNum);
 }
 
 /*
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca2..501fae5f9d 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = buffer direct file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/direct/Makefile b/src/backend/storage/direct/Makefile
new file mode 100644
index 0000000000..d82bbed48c
--- /dev/null
+++ b/src/backend/storage/direct/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for storage/direct
+#
+# IDENTIFICATION
+#    src/backend/storage/direct/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/storage/direct
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = directmgr.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/direct/directmgr.c b/src/backend/storage/direct/directmgr.c
new file mode 100644
index 0000000000..371ff5602f
--- /dev/null
+++ b/src/backend/storage/direct/directmgr.c
@@ -0,0 +1,79 @@
+/*-------------------------------------------------------------------------
+ *
+ * directmgr.c
+ *	  routines for managing unbuffered IO
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/direct/directmgr.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+
+#include "access/xloginsert.h"
+#include "storage/directmgr.h"
+
+void
+unbuffered_prep(UnBufferedWriteState *wstate, bool fsync_self, bool
+		request_fsync)
+{
+	wstate->fsync_self = fsync_self;
+	wstate->request_fsync = request_fsync;
+}
+
+void
+unbuffered_write(UnBufferedWriteState *wstate, bool do_wal, SMgrRelation
+		smgrrel, ForkNumber forknum, BlockNumber blocknum, Page page)
+{
+	PageSetChecksumInplace(page, blocknum);
+
+	smgrwrite(smgrrel, forknum, blocknum, (char *) page,
+			!wstate->request_fsync);
+
+	if (do_wal)
+		log_newpage(&(smgrrel)->smgr_rnode.node, forknum,
+					blocknum, page, true);
+}
+
+void
+unbuffered_extend(UnBufferedWriteState *wstate, bool do_wal, SMgrRelation
+		smgrrel, ForkNumber forknum, BlockNumber blocknum, Page page, bool
+		empty)
+{
+	/*
+	 * Don't checksum empty pages
+	 */
+	if (!empty)
+		PageSetChecksumInplace(page, blocknum);
+
+	smgrextend(smgrrel, forknum, blocknum, (char *) page,
+			!wstate->request_fsync);
+
+	if (do_wal)
+		log_newpage(&(smgrrel)->smgr_rnode.node, forknum,
+					blocknum, page, true);
+
+}
+
+/*
+ * When writing data outside shared buffers, a concurrent CHECKPOINT can move
+ * the redo pointer past our WAL entries and won't flush our data to disk. If
+ * the database crashes before the data makes it to disk, our WAL won't be
+ * replayed and the data will be lost.
+ * Thus, if a CHECKPOINT begins between unbuffered_prep() and
+ * unbuffered_finish(), the backend must fsync the data itself.
+ */
+void
+unbuffered_finish(UnBufferedWriteState *wstate, SMgrRelation smgrrel,
+		ForkNumber forknum)
+{
+	if (!wstate->fsync_self)
+		return;
+
+	smgrimmedsync(smgrrel, forknum);
+}
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 78c073b7c9..4326ea8f01 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -27,6 +27,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
+#include "storage/directmgr.h"
 #include "storage/freespace.h"
 #include "storage/fsm_internals.h"
 #include "storage/lmgr.h"
@@ -609,9 +610,11 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
 	BlockNumber fsm_nblocks_now;
 	PGAlignedBlock pg;
 	SMgrRelation reln;
+	UnBufferedWriteState ub_wstate;
 
 	PageInit((Page) pg.data, BLCKSZ, 0);
 
+
 	/*
 	 * We use the relation extension lock to lock out other backends trying to
 	 * extend the FSM at the same time. It also locks out extension of the
@@ -624,6 +627,8 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
 	 */
 	LockRelationForExtension(rel, ExclusiveLock);
 
+	unbuffered_prep(&ub_wstate, false, true);
+
 	/*
 	 * Caution: re-using this smgr pointer could fail if the relcache entry
 	 * gets closed.  It's safe as long as we only do smgr-level operations
@@ -648,14 +653,15 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
 	/* Extend as needed. */
 	while (fsm_nblocks_now < fsm_nblocks)
 	{
-		PageSetChecksumInplace((Page) pg.data, fsm_nblocks_now);
-
-		smgrextend(reln, FSM_FORKNUM, fsm_nblocks_now,
-				   pg.data, false);
+		unbuffered_extend(&ub_wstate, false, reln, FSM_FORKNUM,
+				fsm_nblocks_now, (Page) pg.data, false);
 		fsm_nblocks_now++;
 	}
 
+	unbuffered_finish(&ub_wstate, reln, FSM_FORKNUM);
+
 	UnlockRelationForExtension(rel, ExclusiveLock);
+
 }
 
 /*
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 9ffc741913..b097d09c9f 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -15,6 +15,7 @@
 #define STORAGE_H
 
 #include "storage/block.h"
+#include "storage/directmgr.h"
 #include "storage/relfilenode.h"
 #include "storage/smgr.h"
 #include "utils/relcache.h"
diff --git a/src/include/storage/directmgr.h b/src/include/storage/directmgr.h
new file mode 100644
index 0000000000..47653d0d1b
--- /dev/null
+++ b/src/include/storage/directmgr.h
@@ -0,0 +1,57 @@
+/*-------------------------------------------------------------------------
+ *
+ * directmgr.h
+ *	  POSTGRES unbuffered IO manager definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/directmgr.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef DIRECTMGR_H
+#define DIRECTMGR_H
+
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "storage/bufpage.h"
+#include "storage/smgr.h"
+
+/*
+ * After committing the pg_buffer_stats patch, this will contain a pointer to a
+ * PgBufferAccess struct to count the writes and extends done in this way.
+ */
+typedef struct UnBufferedWriteState
+{
+	/*
+	 * When writing logged table data outside of shared buffers, there is a
+	 * risk of a concurrent CHECKPOINT moving the redo pointer past the data's
+	 * associated WAL entries. To avoid this, callers in this situation must
+	 * fsync the pages they have written themselves.
+	 *
+	 * Callers able to use the checkpointer's sync request queue when writing
+	 * data outside shared buffers (like fsm and vm) can set request_fsync to
+	 * true so that these fsync requests are added to the queue.
+	 */
+	bool fsync_self;
+	bool request_fsync;
+} UnBufferedWriteState;
+/*
+ * prototypes for functions in directmgr.c
+ */
+extern void
+unbuffered_prep(UnBufferedWriteState *wstate, bool fsync_self, bool
+		request_fsync);
+extern void
+unbuffered_write(UnBufferedWriteState *wstate, bool do_wal, SMgrRelation
+		smgrrel, ForkNumber forknum, BlockNumber blocknum, Page page);
+extern void
+unbuffered_extend(UnBufferedWriteState *wstate, bool do_wal, SMgrRelation smgrrel,
+		ForkNumber forknum, BlockNumber blocknum, Page page, bool empty);
+extern void
+unbuffered_finish(UnBufferedWriteState *wstate, SMgrRelation smgrrel,
+		ForkNumber forknum);
+
+#endif							/* DIRECTMGR_H */
-- 
2.30.2

v5-0003-BTree-index-use-unbuffered-IO-optimization.patchtext/x-patch; charset=US-ASCII; name=v5-0003-BTree-index-use-unbuffered-IO-optimization.patchDownload

From ed9200e6262e451936cfceb1b41ff079531a83f8 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 8 Feb 2022 19:01:32 -0500
Subject: [PATCH v5 3/4] BTree index use unbuffered IO optimization

While building a btree index, the backend can avoid fsync'ing all of the
pages if it uses the optimization introduced in a prior commit.

This can substantially improve performance when many indexes are being
built during DDL operations.
---
 src/backend/access/nbtree/nbtree.c  | 2 +-
 src/backend/access/nbtree/nbtsort.c | 9 ++++++++-
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 843c9e2362..a1efbe1e6a 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -155,7 +155,7 @@ btbuildempty(Relation index)
 	Page		metapage;
 	UnBufferedWriteState wstate;
 
-	unbuffered_prep(&wstate, false, true, false);
+	unbuffered_prep(&wstate, true, true, true);
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index a67770f3fd..079832bb78 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1188,8 +1188,15 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 
 	/*
 	 * Only bother fsync'ing the data to permanent storage if WAL logging
+	 *
+	 * The self-fsync optimization requires that the backend both add an fsync
+	 * request to the checkpointer's pending-ops table as well as be prepared
+	 * to fsync the page data itself. Because none of these are required if the
+	 * relation is not WAL-logged, pass btws_use_wal for all parameters of the
+	 * prep function.
 	 */
-	unbuffered_prep(&wstate->ub_wstate, false, wstate->btws_use_wal, false);
+	unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal,
+			wstate->btws_use_wal, wstate->btws_use_wal);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
-- 
2.30.2

#17

Dmitry Dolgov

9erthalion6@gmail.com

almost 4 years ago

In reply to: Melanie Plageman (#16)

Re: Avoiding smgrimmedsync() during nbtree index builds

On Wed, Feb 09, 2022 at 01:49:30PM -0500, Melanie Plageman wrote:
Hi,
v5 attached and all email feedback addressed below

Thanks for the patch, it looks quite good.

I don't see it in the discussion, so naturally curious -- why directmgr
is not used for bloom index, e.g. in blbuildempty?

On Sun, Jan 16, 2022 at 3:26 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

Separate from this issue, I wonder if it'd be useful to write a DEBUG log
showing when btree uses shared_buffers vs fsync. And a regression test which
first SETs client_min_messages=debug to capture the debug log to demonstrate
when/that new code path is being hit. I'm not sure if that would be good to
merge, but it may be useful for now.

I can't find the thread right away, but I vaguely remember a similar
situation where such approach, as a main way to test the patch, had
caused some disagreement. Of course for the development phase it would
be indeed convenient.

#18

pryzby@telsasoft.com

almost 4 years ago

In reply to: Melanie Plageman (#16)

4 attachment(s)

Re: Avoiding smgrimmedsync() during nbtree index builds

Rebased to appease cfbot.

I ran these paches under a branch which shows code coverage in cirrus. It
looks good to my eyes.
https://api.cirrus-ci.com/v1/artifact/task/5212346552418304/coverage/coverage/00-index.html

Are these patches being considered for v15 ?

Attachments:

0001-Add-unbuffered-IO-API.patchtext/x-diff; charset=us-asciiDownload

From 30707da3e5eb68d1efbc5594696da47ad7f72bab Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 8 Feb 2022 19:01:18 -0500
Subject: [PATCH 1/4] Add unbuffered IO API

Wrap unbuffered extends and writes in a new API, directmgr.

When writing data outside of shared buffers, the backend must do a
series of steps to ensure the data is both durable and recoverable.

When writing or extending a page of data for a WAL-logged table fork,
the backend must log, checksum (if page is not empty), and write out the
page before moving on.

Additionally, the backend must fsync the page data to ensure it reaches
permanent storage since checkpointer is unaware of the buffer and could
move the Redo pointer past the associated WAL for this write/extend
before it fsyncs the data.

This API is also used for non-WAL-logged and non-self-fsync'd table
forks but with the appropriate exceptions to the above steps.

This commit introduces no functional change. It replaces all current
callers of smgrimmedsync(), smgrextend(), and smgrwrite() with the
equivalent directmgr functions. Consolidating these steps makes IO
outside of shared buffers less error-prone.
---
 src/backend/access/gist/gistbuild.c       | 36 +++++++----
 src/backend/access/hash/hashpage.c        | 18 +++---
 src/backend/access/heap/heapam_handler.c  | 15 +++--
 src/backend/access/heap/rewriteheap.c     | 53 +++++----------
 src/backend/access/heap/visibilitymap.c   | 10 ++-
 src/backend/access/nbtree/nbtree.c        | 18 ++----
 src/backend/access/nbtree/nbtsort.c       | 56 ++++++----------
 src/backend/access/spgist/spginsert.c     | 39 ++++-------
 src/backend/catalog/storage.c             | 30 +++------
 src/backend/storage/Makefile              |  2 +-
 src/backend/storage/direct/Makefile       | 17 +++++
 src/backend/storage/direct/directmgr.c    | 79 +++++++++++++++++++++++
 src/backend/storage/freespace/freespace.c | 14 ++--
 src/include/catalog/storage.h             |  1 +
 src/include/storage/directmgr.h           | 57 ++++++++++++++++
 15 files changed, 276 insertions(+), 169 deletions(-)
 create mode 100644 src/backend/storage/direct/Makefile
 create mode 100644 src/backend/storage/direct/directmgr.c
 create mode 100644 src/include/storage/directmgr.h

diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index e081e6571a4..8fabc2a42d7 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -43,6 +43,7 @@
 #include "miscadmin.h"
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
+#include "storage/directmgr.h"
 #include "storage/smgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -91,6 +92,7 @@ typedef struct
 
 	int64		indtuples;		/* number of tuples indexed */
 
+	UnBufferedWriteState ub_wstate;
 	/*
 	 * Extra data structures used during a buffering build. 'gfbb' contains
 	 * information related to managing the build buffers. 'parentMap' is a
@@ -409,14 +411,16 @@ gist_indexsortbuild(GISTBuildState *state)
 	state->pages_allocated = 0;
 	state->pages_written = 0;
 	state->ready_num_pages = 0;
+	unbuffered_prep(&state->ub_wstate, false, false);
 
 	/*
 	 * Write an empty page as a placeholder for the root page. It will be
 	 * replaced with the real root page at the end.
 	 */
 	page = palloc0(BLCKSZ);
-	smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
-			   page, true);
+	unbuffered_extend(&state->ub_wstate, false,
+			RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
+			page, true);
 	state->pages_allocated++;
 	state->pages_written++;
 
@@ -458,12 +462,13 @@ gist_indexsortbuild(GISTBuildState *state)
 
 	/* Write out the root */
 	PageSetLSN(levelstate->pages[0], GistBuildLSN);
-	PageSetChecksumInplace(levelstate->pages[0], GIST_ROOT_BLKNO);
-	smgrwrite(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
-			  levelstate->pages[0], true);
-	if (RelationNeedsWAL(state->indexrel))
-		log_newpage(&state->indexrel->rd_node, MAIN_FORKNUM, GIST_ROOT_BLKNO,
-					levelstate->pages[0], true);
+
+	unbuffered_write(&state->ub_wstate, RelationNeedsWAL(state->indexrel),
+			RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
+			  levelstate->pages[0]);
+
+	unbuffered_finish(&state->ub_wstate, RelationGetSmgr(state->indexrel),
+			MAIN_FORKNUM);
 
 	pfree(levelstate->pages[0]);
 	pfree(levelstate);
@@ -645,6 +650,8 @@ gist_indexsortbuild_flush_ready_pages(GISTBuildState *state)
 	if (state->ready_num_pages == 0)
 		return;
 
+	unbuffered_prep(&state->ub_wstate, false, false);
+
 	for (int i = 0; i < state->ready_num_pages; i++)
 	{
 		Page		page = state->ready_pages[i];
@@ -655,9 +662,13 @@ gist_indexsortbuild_flush_ready_pages(GISTBuildState *state)
 			elog(ERROR, "unexpected block number to flush GiST sorting build");
 
 		PageSetLSN(page, GistBuildLSN);
-		PageSetChecksumInplace(page, blkno);
-		smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, blkno, page,
-				   true);
+
+		/*
+		 * These will be WAL logged below
+		 */
+		unbuffered_extend(&state->ub_wstate, false,
+				RelationGetSmgr(state->indexrel), MAIN_FORKNUM, blkno, page,
+				false);
 
 		state->pages_written++;
 	}
@@ -666,6 +677,9 @@ gist_indexsortbuild_flush_ready_pages(GISTBuildState *state)
 		log_newpages(&state->indexrel->rd_node, MAIN_FORKNUM, state->ready_num_pages,
 					 state->ready_blknos, state->ready_pages, true);
 
+	unbuffered_finish(&state->ub_wstate, RelationGetSmgr(state->indexrel),
+			MAIN_FORKNUM);
+
 	for (int i = 0; i < state->ready_num_pages; i++)
 		pfree(state->ready_pages[i]);
 
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 28c5297a1dc..6096604438e 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -33,6 +33,7 @@
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
+#include "storage/directmgr.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
 #include "storage/smgr.h"
@@ -991,6 +992,7 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 	PGAlignedBlock zerobuf;
 	Page		page;
 	HashPageOpaque ovflopaque;
+	UnBufferedWriteState ub_wstate;
 
 	lastblock = firstblock + nblocks - 1;
 
@@ -1001,6 +1003,8 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 	if (lastblock < firstblock || lastblock == InvalidBlockNumber)
 		return false;
 
+	unbuffered_prep(&ub_wstate, false, true);
+
 	page = (Page) zerobuf.data;
 
 	/*
@@ -1018,16 +1022,10 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 	ovflopaque->hasho_flag = LH_UNUSED_PAGE;
 	ovflopaque->hasho_page_id = HASHO_PAGE_ID;
 
-	if (RelationNeedsWAL(rel))
-		log_newpage(&rel->rd_node,
-					MAIN_FORKNUM,
-					lastblock,
-					zerobuf.data,
-					true);
-
-	PageSetChecksumInplace(page, lastblock);
-	smgrextend(RelationGetSmgr(rel), MAIN_FORKNUM, lastblock, zerobuf.data,
-			   false);
+	unbuffered_extend(&ub_wstate, RelationNeedsWAL(rel), RelationGetSmgr(rel),
+			MAIN_FORKNUM, lastblock, zerobuf.data, false);
+
+	unbuffered_finish(&ub_wstate, RelationGetSmgr(rel), MAIN_FORKNUM);
 
 	return true;
 }
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 39ef8a0b77d..9fd6a6f4474 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -38,6 +38,7 @@
 #include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/bufpage.h"
+#include "storage/directmgr.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
 #include "storage/procarray.h"
@@ -575,7 +576,9 @@ heapam_relation_set_new_filenode(Relation rel,
 								 MultiXactId *minmulti)
 {
 	SMgrRelation srel;
+	UnBufferedWriteState ub_wstate;
 
+	unbuffered_prep(&ub_wstate, true, false);
 	/*
 	 * Initialize to the minimum XID that could put tuples in the table. We
 	 * know that no xacts older than RecentXmin are still running, so that
@@ -597,12 +600,10 @@ heapam_relation_set_new_filenode(Relation rel,
 
 	/*
 	 * If required, set up an init fork for an unlogged table so that it can
-	 * be correctly reinitialized on restart.  An immediate sync is required
-	 * even if the page has been logged, because the write did not go through
-	 * shared_buffers and therefore a concurrent checkpoint may have moved the
-	 * redo pointer past our xlog record.  Recovery may as well remove it
-	 * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
-	 * record. Therefore, logging is necessary even if wal_level=minimal.
+	 * be correctly reinitialized on restart.
+	 * Recovery may as well remove our xlog record while replaying, for
+	 * example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record. Therefore,
+	 * logging is necessary even if wal_level=minimal.
 	 */
 	if (persistence == RELPERSISTENCE_UNLOGGED)
 	{
@@ -611,7 +612,7 @@ heapam_relation_set_new_filenode(Relation rel,
 			   rel->rd_rel->relkind == RELKIND_TOASTVALUE);
 		smgrcreate(srel, INIT_FORKNUM, false);
 		log_smgrcreate(newrnode, INIT_FORKNUM);
-		smgrimmedsync(srel, INIT_FORKNUM);
+		unbuffered_finish(&ub_wstate, srel, INIT_FORKNUM);
 	}
 
 	smgrclose(srel);
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 2a53826736e..12bdd6ff601 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -119,6 +119,7 @@
 #include "replication/logical.h"
 #include "replication/slot.h"
 #include "storage/bufmgr.h"
+#include "storage/directmgr.h"
 #include "storage/fd.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
@@ -152,6 +153,7 @@ typedef struct RewriteStateData
 	HTAB	   *rs_old_new_tid_map; /* unmatched B tuples */
 	HTAB	   *rs_logical_mappings;	/* logical remapping files */
 	uint32		rs_num_rewrite_mappings;	/* # in memory mappings */
+	UnBufferedWriteState rs_unbuffered_wstate;
 }			RewriteStateData;
 
 /*
@@ -265,6 +267,9 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
 	state->rs_cutoff_multi = cutoff_multi;
 	state->rs_cxt = rw_cxt;
 
+	unbuffered_prep(&state->rs_unbuffered_wstate,
+			RelationNeedsWAL(state->rs_new_rel), false);
+
 	/* Initialize hash tables used to track update chains */
 	hash_ctl.keysize = sizeof(TidHashKey);
 	hash_ctl.entrysize = sizeof(UnresolvedTupData);
@@ -317,28 +322,14 @@ end_heap_rewrite(RewriteState state)
 	/* Write the last page, if any */
 	if (state->rs_buffer_valid)
 	{
-		if (RelationNeedsWAL(state->rs_new_rel))
-			log_newpage(&state->rs_new_rel->rd_node,
-						MAIN_FORKNUM,
-						state->rs_blockno,
-						state->rs_buffer,
-						true);
-
-		PageSetChecksumInplace(state->rs_buffer, state->rs_blockno);
-
-		smgrextend(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,
-				   state->rs_blockno, (char *) state->rs_buffer, true);
+		unbuffered_extend(&state->rs_unbuffered_wstate,
+				RelationNeedsWAL(state->rs_new_rel),
+				RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,
+				state->rs_blockno, state->rs_buffer, false);
 	}
 
-	/*
-	 * When we WAL-logged rel pages, we must nonetheless fsync them.  The
-	 * reason is the same as in storage.c's RelationCopyStorage(): we're
-	 * writing data that's not in shared buffers, and so a CHECKPOINT
-	 * occurring during the rewriteheap operation won't have fsync'd data we
-	 * wrote before the checkpoint.
-	 */
-	if (RelationNeedsWAL(state->rs_new_rel))
-		smgrimmedsync(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM);
+	unbuffered_finish(&state->rs_unbuffered_wstate,
+			RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM);
 
 	logical_end_heap_rewrite(state);
 
@@ -676,24 +667,10 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
 			 * contains a tuple.  Hence, unlike RelationGetBufferForTuple(),
 			 * enforce saveFreeSpace unconditionally.
 			 */
-
-			/* XLOG stuff */
-			if (RelationNeedsWAL(state->rs_new_rel))
-				log_newpage(&state->rs_new_rel->rd_node,
-							MAIN_FORKNUM,
-							state->rs_blockno,
-							page,
-							true);
-
-			/*
-			 * Now write the page. We say skipFsync = true because there's no
-			 * need for smgr to schedule an fsync for this write; we'll do it
-			 * ourselves in end_heap_rewrite.
-			 */
-			PageSetChecksumInplace(page, state->rs_blockno);
-
-			smgrextend(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,
-					   state->rs_blockno, (char *) page, true);
+			unbuffered_extend(&state->rs_unbuffered_wstate,
+					RelationNeedsWAL(state->rs_new_rel),
+					RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,
+					state->rs_blockno, page, false);
 
 			state->rs_blockno++;
 			state->rs_buffer_valid = false;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e09f25a684c..897de5ec1fc 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -93,6 +93,7 @@
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/bufmgr.h"
+#include "storage/directmgr.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
 #include "utils/inval.h"
@@ -617,6 +618,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
 	BlockNumber vm_nblocks_now;
 	PGAlignedBlock pg;
 	SMgrRelation reln;
+	UnBufferedWriteState ub_wstate;
 
 	PageInit((Page) pg.data, BLCKSZ, 0);
 
@@ -631,6 +633,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
 	 * by the time we get the lock.
 	 */
 	LockRelationForExtension(rel, ExclusiveLock);
+	unbuffered_prep(&ub_wstate, false, true);
 
 	/*
 	 * Caution: re-using this smgr pointer could fail if the relcache entry
@@ -655,12 +658,13 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
 	/* Now extend the file */
 	while (vm_nblocks_now < vm_nblocks)
 	{
-		PageSetChecksumInplace((Page) pg.data, vm_nblocks_now);
-
-		smgrextend(reln, VISIBILITYMAP_FORKNUM, vm_nblocks_now, pg.data, false);
+		unbuffered_extend(&ub_wstate, false, reln, VISIBILITYMAP_FORKNUM,
+				vm_nblocks_now, (Page) pg.data, false);
 		vm_nblocks_now++;
 	}
 
+	unbuffered_finish(&ub_wstate, reln, VISIBILITYMAP_FORKNUM);
+
 	/*
 	 * Send a shared-inval message to force other backends to close any smgr
 	 * references they may have for this rel, which we are about to change.
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c9b4964c1e8..1ec7493ad39 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -30,6 +30,7 @@
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "storage/condition_variable.h"
+#include "storage/directmgr.h"
 #include "storage/indexfsm.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
@@ -152,6 +153,9 @@ void
 btbuildempty(Relation index)
 {
 	Page		metapage;
+	UnBufferedWriteState wstate;
+
+	unbuffered_prep(&wstate, true, false);
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
@@ -164,18 +168,10 @@ btbuildempty(Relation index)
 	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
-	PageSetChecksumInplace(metapage, BTREE_METAPAGE);
-	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BTREE_METAPAGE,
-			  (char *) metapage, true);
-	log_newpage(&RelationGetSmgr(index)->smgr_rnode.node, INIT_FORKNUM,
-				BTREE_METAPAGE, metapage, true);
+	unbuffered_write(&wstate, true, RelationGetSmgr(index), INIT_FORKNUM,
+			BTREE_METAPAGE, metapage);
 
-	/*
-	 * An immediate sync is required even if we xlog'd the page, because the
-	 * write did not go through shared_buffers and therefore a concurrent
-	 * checkpoint may have moved the redo pointer past our xlog record.
-	 */
-	smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
+	unbuffered_finish(&wstate, RelationGetSmgr(index), INIT_FORKNUM);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 8a19de2f66c..c7a65a99727 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -57,6 +57,7 @@
 #include "executor/instrument.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "storage/directmgr.h"
 #include "storage/smgr.h"
 #include "tcop/tcopprot.h"		/* pgrminclude ignore */
 #include "utils/rel.h"
@@ -256,6 +257,7 @@ typedef struct BTWriteState
 	BlockNumber btws_pages_alloced; /* # pages allocated */
 	BlockNumber btws_pages_written; /* # pages written out */
 	Page		btws_zeropage;	/* workspace for filling zeroes */
+	UnBufferedWriteState ub_wstate;
 } BTWriteState;
 
 
@@ -643,13 +645,6 @@ _bt_blnewpage(uint32 level)
 static void
 _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 {
-	/* XLOG stuff */
-	if (wstate->btws_use_wal)
-	{
-		/* We use the XLOG_FPI record type for this */
-		log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
-	}
-
 	/*
 	 * If we have to write pages nonsequentially, fill in the space with
 	 * zeroes until we come back and overwrite.  This is not logically
@@ -661,32 +656,27 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 	{
 		if (!wstate->btws_zeropage)
 			wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
-		/* don't set checksum for all-zero page */
-		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
-				   wstate->btws_pages_written++,
-				   (char *) wstate->btws_zeropage,
-				   true);
+
+		unbuffered_extend(&wstate->ub_wstate, false,
+				RelationGetSmgr(wstate->index), MAIN_FORKNUM,
+				wstate->btws_pages_written++, wstate->btws_zeropage, true);
 	}
 
-	PageSetChecksumInplace(page, blkno);
 
-	/*
-	 * Now write the page.  There's no need for smgr to schedule an fsync for
-	 * this write; we'll do it ourselves before ending the build.
-	 */
+	/* Now write the page. Either we are extending the file... */
 	if (blkno == wstate->btws_pages_written)
 	{
-		/* extending the file... */
-		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM, blkno,
-				   (char *) page, true);
+		unbuffered_extend(&wstate->ub_wstate, wstate->btws_use_wal,
+				RelationGetSmgr(wstate->index), MAIN_FORKNUM, blkno, page,
+				false);
+
 		wstate->btws_pages_written++;
 	}
+
+	/* or we are overwriting a block we zero-filled before. */
 	else
-	{
-		/* overwriting a block we zero-filled before */
-		smgrwrite(RelationGetSmgr(wstate->index), MAIN_FORKNUM, blkno,
-				  (char *) page, true);
-	}
+		unbuffered_write(&wstate->ub_wstate, wstate->btws_use_wal,
+				RelationGetSmgr(wstate->index), MAIN_FORKNUM, blkno, page);
 
 	pfree(page);
 }
@@ -1195,6 +1185,9 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	int64		tuples_done = 0;
 	bool		deduplicate;
 
+
+	unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal, false);
+
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
 
@@ -1421,17 +1414,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	/* Close down final pages and write the metapage */
 	_bt_uppershutdown(wstate, state);
 
-	/*
-	 * When we WAL-logged index pages, we must nonetheless fsync index files.
-	 * Since we're building outside shared buffers, a CHECKPOINT occurring
-	 * during the build has no way to flush the previously written data to
-	 * disk (indeed it won't know the index even exists).  A crash later on
-	 * would replay WAL from the checkpoint, therefore it wouldn't replay our
-	 * earlier WAL entries. If we do not fsync those pages here, they might
-	 * still not be on disk when the crash occurs.
-	 */
-	if (wstate->btws_use_wal)
-		smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
+	unbuffered_finish(&wstate->ub_wstate, RelationGetSmgr(wstate->index),
+			MAIN_FORKNUM);
 }
 
 /*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index bfb74049d0c..e232ba4b866 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -25,6 +25,7 @@
 #include "catalog/index.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/directmgr.h"
 #include "storage/smgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -156,48 +157,30 @@ void
 spgbuildempty(Relation index)
 {
 	Page		page;
+	UnBufferedWriteState wstate;
+
+	unbuffered_prep(&wstate, true, false);
 
 	/* Construct metapage. */
 	page = (Page) palloc(BLCKSZ);
 	SpGistInitMetapage(page);
 
-	/*
-	 * Write the page and log it unconditionally.  This is important
-	 * particularly for indexes created on tablespaces and databases whose
-	 * creation happened after the last redo pointer as recovery removes any
-	 * of their existing content when the corresponding create records are
-	 * replayed.
-	 */
-	PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
-	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
-			  (char *) page, true);
-	log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
-				SPGIST_METAPAGE_BLKNO, page, true);
+	unbuffered_write(&wstate, true, RelationGetSmgr(index), INIT_FORKNUM,
+			SPGIST_METAPAGE_BLKNO, page);
 
 	/* Likewise for the root page. */
 	SpGistInitPage(page, SPGIST_LEAF);
 
-	PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO);
-	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, SPGIST_ROOT_BLKNO,
-			  (char *) page, true);
-	log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
-				SPGIST_ROOT_BLKNO, page, true);
+	unbuffered_write(&wstate, true, RelationGetSmgr(index), INIT_FORKNUM,
+			SPGIST_ROOT_BLKNO, page);
 
 	/* Likewise for the null-tuples root page. */
 	SpGistInitPage(page, SPGIST_LEAF | SPGIST_NULLS);
 
-	PageSetChecksumInplace(page, SPGIST_NULL_BLKNO);
-	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, SPGIST_NULL_BLKNO,
-			  (char *) page, true);
-	log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
-				SPGIST_NULL_BLKNO, page, true);
+	unbuffered_write(&wstate, true, RelationGetSmgr(index), INIT_FORKNUM,
+			SPGIST_NULL_BLKNO, page);
 
-	/*
-	 * An immediate sync is required even if we xlog'd the pages, because the
-	 * writes did not go through shared buffers and therefore a concurrent
-	 * checkpoint may have moved the redo pointer past our xlog record.
-	 */
-	smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
+	unbuffered_finish(&wstate, RelationGetSmgr(index), INIT_FORKNUM);
 }
 
 /*
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 9b8075536a7..1ec90e00abf 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -28,6 +28,7 @@
 #include "catalog/storage.h"
 #include "catalog/storage_xlog.h"
 #include "miscadmin.h"
+#include "storage/directmgr.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
 #include "utils/hsearch.h"
@@ -420,6 +421,8 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 	bool		copying_initfork;
 	BlockNumber nblocks;
 	BlockNumber blkno;
+	UnBufferedWriteState wstate;
+
 
 	page = (Page) buf.data;
 
@@ -440,6 +443,8 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 	use_wal = XLogIsNeeded() &&
 		(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
 
+	unbuffered_prep(&wstate, (use_wal || copying_initfork), false);
+
 	nblocks = smgrnblocks(src, forkNum);
 
 	for (blkno = 0; blkno < nblocks; blkno++)
@@ -474,30 +479,15 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 		 * page this is, so we have to log the full page including any unused
 		 * space.
 		 */
-		if (use_wal)
-			log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page, false);
 
-		PageSetChecksumInplace(page, blkno);
+		// TODO: is it okay to pass the page here to unbuffered_extend so that
+		// it can be WAL-logged as a full page even though smgrextend used to
+		// take just buf.data?
+		unbuffered_extend(&wstate, use_wal, dst, forkNum, blkno, page, false);
 
-		/*
-		 * Now write the page.  We say skipFsync = true because there's no
-		 * need for smgr to schedule an fsync for this write; we'll do it
-		 * ourselves below.
-		 */
-		smgrextend(dst, forkNum, blkno, buf.data, true);
 	}
 
-	/*
-	 * When we WAL-logged rel pages, we must nonetheless fsync them.  The
-	 * reason is that since we're copying outside shared buffers, a CHECKPOINT
-	 * occurring during the copy has no way to flush the previously written
-	 * data to disk (indeed it won't know the new rel even exists).  A crash
-	 * later on would replay WAL from the checkpoint, therefore it wouldn't
-	 * replay our earlier WAL entries. If we do not fsync those pages here,
-	 * they might still not be on disk when the crash occurs.
-	 */
-	if (use_wal || copying_initfork)
-		smgrimmedsync(dst, forkNum);
+	unbuffered_finish(&wstate, dst, forkNum);
 }
 
 /*
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca20..501fae5f9d0 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = buffer direct file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/direct/Makefile b/src/backend/storage/direct/Makefile
new file mode 100644
index 00000000000..d82bbed48c2
--- /dev/null
+++ b/src/backend/storage/direct/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for storage/direct
+#
+# IDENTIFICATION
+#    src/backend/storage/direct/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/storage/direct
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = directmgr.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/direct/directmgr.c b/src/backend/storage/direct/directmgr.c
new file mode 100644
index 00000000000..371ff5602fe
--- /dev/null
+++ b/src/backend/storage/direct/directmgr.c
@@ -0,0 +1,79 @@
+/*-------------------------------------------------------------------------
+ *
+ * directmgr.c
+ *	  routines for managing unbuffered IO
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/direct/directmgr.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+
+#include "access/xloginsert.h"
+#include "storage/directmgr.h"
+
+void
+unbuffered_prep(UnBufferedWriteState *wstate, bool fsync_self, bool
+		request_fsync)
+{
+	wstate->fsync_self = fsync_self;
+	wstate->request_fsync = request_fsync;
+}
+
+void
+unbuffered_write(UnBufferedWriteState *wstate, bool do_wal, SMgrRelation
+		smgrrel, ForkNumber forknum, BlockNumber blocknum, Page page)
+{
+	PageSetChecksumInplace(page, blocknum);
+
+	smgrwrite(smgrrel, forknum, blocknum, (char *) page,
+			!wstate->request_fsync);
+
+	if (do_wal)
+		log_newpage(&(smgrrel)->smgr_rnode.node, forknum,
+					blocknum, page, true);
+}
+
+void
+unbuffered_extend(UnBufferedWriteState *wstate, bool do_wal, SMgrRelation
+		smgrrel, ForkNumber forknum, BlockNumber blocknum, Page page, bool
+		empty)
+{
+	/*
+	 * Don't checksum empty pages
+	 */
+	if (!empty)
+		PageSetChecksumInplace(page, blocknum);
+
+	smgrextend(smgrrel, forknum, blocknum, (char *) page,
+			!wstate->request_fsync);
+
+	if (do_wal)
+		log_newpage(&(smgrrel)->smgr_rnode.node, forknum,
+					blocknum, page, true);
+
+}
+
+/*
+ * When writing data outside shared buffers, a concurrent CHECKPOINT can move
+ * the redo pointer past our WAL entries and won't flush our data to disk. If
+ * the database crashes before the data makes it to disk, our WAL won't be
+ * replayed and the data will be lost.
+ * Thus, if a CHECKPOINT begins between unbuffered_prep() and
+ * unbuffered_finish(), the backend must fsync the data itself.
+ */
+void
+unbuffered_finish(UnBufferedWriteState *wstate, SMgrRelation smgrrel,
+		ForkNumber forknum)
+{
+	if (!wstate->fsync_self)
+		return;
+
+	smgrimmedsync(smgrrel, forknum);
+}
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 78c073b7c98..4326ea8f015 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -27,6 +27,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
+#include "storage/directmgr.h"
 #include "storage/freespace.h"
 #include "storage/fsm_internals.h"
 #include "storage/lmgr.h"
@@ -609,9 +610,11 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
 	BlockNumber fsm_nblocks_now;
 	PGAlignedBlock pg;
 	SMgrRelation reln;
+	UnBufferedWriteState ub_wstate;
 
 	PageInit((Page) pg.data, BLCKSZ, 0);
 
+
 	/*
 	 * We use the relation extension lock to lock out other backends trying to
 	 * extend the FSM at the same time. It also locks out extension of the
@@ -624,6 +627,8 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
 	 */
 	LockRelationForExtension(rel, ExclusiveLock);
 
+	unbuffered_prep(&ub_wstate, false, true);
+
 	/*
 	 * Caution: re-using this smgr pointer could fail if the relcache entry
 	 * gets closed.  It's safe as long as we only do smgr-level operations
@@ -648,14 +653,15 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
 	/* Extend as needed. */
 	while (fsm_nblocks_now < fsm_nblocks)
 	{
-		PageSetChecksumInplace((Page) pg.data, fsm_nblocks_now);
-
-		smgrextend(reln, FSM_FORKNUM, fsm_nblocks_now,
-				   pg.data, false);
+		unbuffered_extend(&ub_wstate, false, reln, FSM_FORKNUM,
+				fsm_nblocks_now, (Page) pg.data, false);
 		fsm_nblocks_now++;
 	}
 
+	unbuffered_finish(&ub_wstate, reln, FSM_FORKNUM);
+
 	UnlockRelationForExtension(rel, ExclusiveLock);
+
 }
 
 /*
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 9ffc7419131..b097d09c9fc 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -15,6 +15,7 @@
 #define STORAGE_H
 
 #include "storage/block.h"
+#include "storage/directmgr.h"
 #include "storage/relfilenode.h"
 #include "storage/smgr.h"
 #include "utils/relcache.h"
diff --git a/src/include/storage/directmgr.h b/src/include/storage/directmgr.h
new file mode 100644
index 00000000000..47653d0d1bb
--- /dev/null
+++ b/src/include/storage/directmgr.h
@@ -0,0 +1,57 @@
+/*-------------------------------------------------------------------------
+ *
+ * directmgr.h
+ *	  POSTGRES unbuffered IO manager definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/directmgr.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef DIRECTMGR_H
+#define DIRECTMGR_H
+
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "storage/bufpage.h"
+#include "storage/smgr.h"
+
+/*
+ * After committing the pg_buffer_stats patch, this will contain a pointer to a
+ * PgBufferAccess struct to count the writes and extends done in this way.
+ */
+typedef struct UnBufferedWriteState
+{
+	/*
+	 * When writing logged table data outside of shared buffers, there is a
+	 * risk of a concurrent CHECKPOINT moving the redo pointer past the data's
+	 * associated WAL entries. To avoid this, callers in this situation must
+	 * fsync the pages they have written themselves.
+	 *
+	 * Callers able to use the checkpointer's sync request queue when writing
+	 * data outside shared buffers (like fsm and vm) can set request_fsync to
+	 * true so that these fsync requests are added to the queue.
+	 */
+	bool fsync_self;
+	bool request_fsync;
+} UnBufferedWriteState;
+/*
+ * prototypes for functions in directmgr.c
+ */
+extern void
+unbuffered_prep(UnBufferedWriteState *wstate, bool fsync_self, bool
+		request_fsync);
+extern void
+unbuffered_write(UnBufferedWriteState *wstate, bool do_wal, SMgrRelation
+		smgrrel, ForkNumber forknum, BlockNumber blocknum, Page page);
+extern void
+unbuffered_extend(UnBufferedWriteState *wstate, bool do_wal, SMgrRelation smgrrel,
+		ForkNumber forknum, BlockNumber blocknum, Page page, bool empty);
+extern void
+unbuffered_finish(UnBufferedWriteState *wstate, SMgrRelation smgrrel,
+		ForkNumber forknum);
+
+#endif							/* DIRECTMGR_H */
-- 
2.17.1

0002-Avoid-immediate-fsync-for-unbuffered-IO.patchtext/x-diff; charset=us-asciiDownload

From 8985cd63425e79bd3be641770b61d17e9b9caa00 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 8 Feb 2022 19:01:27 -0500
Subject: [PATCH 2/4] Avoid immediate fsync for unbuffered IO

Data written to WAL-logged table forks is durable once the WAL entries
are on permanent storage; however, the XLOG Redo pointer cannot be moved
past the associated WAL until the page data is safely on permanent
storage. If a crash were to occur before the data is fsync'd, the WAL
wouldn't be replayed during recovery, and the data would be lost.

This is not a problem with pages written in shared buffers because the
checkpointer will block until FlushBuffer() is complete for all buffers
that were dirtied before it began. Therefore it will not move the Redo
pointer past their associated WAL entries until it has fsync'd the data.

A backend writing data outside of shared buffers must ensure that the
data has reached permanent storage itself or that the Redo pointer has
not moved while it was writing the data.

In the common case, the backend should not have to do this fsync itself
and can instead request the checkpointer do it.

To ensure this is safe, the backend can save the XLOG Redo pointer
location before doing the write or extend. Then it can add an fsync
request for the page to the checkpointer's pending-ops table using the
existing mechanism. After doing the write or extend, if the Redo pointer
has moved (meaning a checkpoint has started since it saved it last),
then the backend can simply fsync the page itself. Otherwise, the
checkpointer takes care of fsync'ing the page the next time it processes
the pending-ops table.

This commit adds the optimization option to the directmgr API but does
not add any users, so there is no behavior change.
---
 src/backend/access/gist/gistbuild.c       |  4 ++--
 src/backend/access/hash/hashpage.c        |  2 +-
 src/backend/access/heap/heapam_handler.c  |  2 +-
 src/backend/access/heap/rewriteheap.c     |  2 +-
 src/backend/access/heap/visibilitymap.c   |  2 +-
 src/backend/access/nbtree/nbtree.c        |  2 +-
 src/backend/access/nbtree/nbtsort.c       |  5 ++++-
 src/backend/access/spgist/spginsert.c     |  2 +-
 src/backend/access/transam/xlog.c         | 13 +++++++++++++
 src/backend/catalog/storage.c             |  2 +-
 src/backend/storage/direct/directmgr.c    | 18 ++++++++++++++++--
 src/backend/storage/freespace/freespace.c |  2 +-
 src/include/access/xlog.h                 |  1 +
 src/include/storage/directmgr.h           | 13 +++++++++++--
 14 files changed, 55 insertions(+), 15 deletions(-)

diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 8fabc2a42d7..3d23d5b74d6 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -411,7 +411,7 @@ gist_indexsortbuild(GISTBuildState *state)
 	state->pages_allocated = 0;
 	state->pages_written = 0;
 	state->ready_num_pages = 0;
-	unbuffered_prep(&state->ub_wstate, false, false);
+	unbuffered_prep(&state->ub_wstate, false, false, false);
 
 	/*
 	 * Write an empty page as a placeholder for the root page. It will be
@@ -650,7 +650,7 @@ gist_indexsortbuild_flush_ready_pages(GISTBuildState *state)
 	if (state->ready_num_pages == 0)
 		return;
 
-	unbuffered_prep(&state->ub_wstate, false, false);
+	unbuffered_prep(&state->ub_wstate, false, false, false);
 
 	for (int i = 0; i < state->ready_num_pages; i++)
 	{
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 6096604438e..0c5533e632e 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -1003,7 +1003,7 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 	if (lastblock < firstblock || lastblock == InvalidBlockNumber)
 		return false;
 
-	unbuffered_prep(&ub_wstate, false, true);
+	unbuffered_prep(&ub_wstate, false, false, true);
 
 	page = (Page) zerobuf.data;
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 9fd6a6f4474..f9f6527507c 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -578,7 +578,7 @@ heapam_relation_set_new_filenode(Relation rel,
 	SMgrRelation srel;
 	UnBufferedWriteState ub_wstate;
 
-	unbuffered_prep(&ub_wstate, true, false);
+	unbuffered_prep(&ub_wstate, false, true, false);
 	/*
 	 * Initialize to the minimum XID that could put tuples in the table. We
 	 * know that no xacts older than RecentXmin are still running, so that
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 12bdd6ff601..b103a62135e 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -267,7 +267,7 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
 	state->rs_cutoff_multi = cutoff_multi;
 	state->rs_cxt = rw_cxt;
 
-	unbuffered_prep(&state->rs_unbuffered_wstate,
+	unbuffered_prep(&state->rs_unbuffered_wstate, false,
 			RelationNeedsWAL(state->rs_new_rel), false);
 
 	/* Initialize hash tables used to track update chains */
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 897de5ec1fc..d844767abc9 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -633,7 +633,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
 	 * by the time we get the lock.
 	 */
 	LockRelationForExtension(rel, ExclusiveLock);
-	unbuffered_prep(&ub_wstate, false, true);
+	unbuffered_prep(&ub_wstate, false, false, true);
 
 	/*
 	 * Caution: re-using this smgr pointer could fail if the relcache entry
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 1ec7493ad39..843c9e23625 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -155,7 +155,7 @@ btbuildempty(Relation index)
 	Page		metapage;
 	UnBufferedWriteState wstate;
 
-	unbuffered_prep(&wstate, true, false);
+	unbuffered_prep(&wstate, false, true, false);
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index c7a65a99727..a67770f3fd3 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1186,7 +1186,10 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	bool		deduplicate;
 
 
-	unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal, false);
+	/*
+	 * Only bother fsync'ing the data to permanent storage if WAL logging
+	 */
+	unbuffered_prep(&wstate->ub_wstate, false, wstate->btws_use_wal, false);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index e232ba4b866..e45f1f5db9f 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -159,7 +159,7 @@ spgbuildempty(Relation index)
 	Page		page;
 	UnBufferedWriteState wstate;
 
-	unbuffered_prep(&wstate, true, false);
+	unbuffered_prep(&wstate, false, true, false);
 
 	/* Construct metapage. */
 	page = (Page) palloc(BLCKSZ);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0d2bd7a3576..db7b33ec673 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5971,6 +5971,19 @@ GetLastImportantRecPtr(void)
 	return res;
 }
 
+bool RedoRecPtrChanged(XLogRecPtr comparator_ptr)
+{
+	XLogRecPtr ptr;
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ptr = XLogCtl->RedoRecPtr;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	if (RedoRecPtr < ptr)
+		RedoRecPtr = ptr;
+
+	return RedoRecPtr != comparator_ptr;
+}
+
 /*
  * Get the time and LSN of the last xlog segment switch
  */
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 1ec90e00abf..307f32ab8cc 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -443,7 +443,7 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 	use_wal = XLogIsNeeded() &&
 		(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
 
-	unbuffered_prep(&wstate, (use_wal || copying_initfork), false);
+	unbuffered_prep(&wstate, false, (use_wal || copying_initfork), false);
 
 	nblocks = smgrnblocks(src, forkNum);
 
diff --git a/src/backend/storage/direct/directmgr.c b/src/backend/storage/direct/directmgr.c
index 371ff5602fe..120b7c06a74 100644
--- a/src/backend/storage/direct/directmgr.c
+++ b/src/backend/storage/direct/directmgr.c
@@ -15,15 +15,26 @@
 #include "postgres.h"
 
 
+#include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "storage/directmgr.h"
 
+// TODO: do_optimization can be derived from request_fsync and fsync_self, I
+// think. but is that true in all cases and also is it confusing?
 void
-unbuffered_prep(UnBufferedWriteState *wstate, bool fsync_self, bool
-		request_fsync)
+unbuffered_prep(UnBufferedWriteState *wstate, bool do_optimization, bool
+		fsync_self, bool request_fsync)
 {
+	/*
+	 * No reason to do optimization when not required to fsync self
+	 */
+	Assert(!do_optimization || (do_optimization && fsync_self));
+
+	wstate->do_optimization = do_optimization;
 	wstate->fsync_self = fsync_self;
 	wstate->request_fsync = request_fsync;
+
+	wstate->redo = do_optimization ? GetRedoRecPtr() : InvalidXLogRecPtr;
 }
 
 void
@@ -75,5 +86,8 @@ unbuffered_finish(UnBufferedWriteState *wstate, SMgrRelation smgrrel,
 	if (!wstate->fsync_self)
 		return;
 
+	if (wstate->do_optimization && !RedoRecPtrChanged(wstate->redo))
+		return;
+
 	smgrimmedsync(smgrrel, forknum);
 }
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 4326ea8f015..7c79983ff97 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -627,7 +627,7 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
 	 */
 	LockRelationForExtension(rel, ExclusiveLock);
 
-	unbuffered_prep(&ub_wstate, false, true);
+	unbuffered_prep(&ub_wstate, false, false, true);
 
 	/*
 	 * Caution: re-using this smgr pointer could fail if the relcache entry
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 4b45ac64db8..71fe99a28d9 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -241,6 +241,7 @@ extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(TimeLineID *insertTLI);
 extern TimeLineID GetWALInsertionTimeLine(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern bool RedoRecPtrChanged(XLogRecPtr comparator_ptr);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
diff --git a/src/include/storage/directmgr.h b/src/include/storage/directmgr.h
index 47653d0d1bb..1ff0c9b0c83 100644
--- a/src/include/storage/directmgr.h
+++ b/src/include/storage/directmgr.h
@@ -14,6 +14,7 @@
 #ifndef DIRECTMGR_H
 #define DIRECTMGR_H
 
+#include "access/xlogdefs.h"
 #include "common/relpath.h"
 #include "storage/block.h"
 #include "storage/bufpage.h"
@@ -31,19 +32,27 @@ typedef struct UnBufferedWriteState
 	 * associated WAL entries. To avoid this, callers in this situation must
 	 * fsync the pages they have written themselves.
 	 *
+	 * These callers can optionally use the following optimization:
+	 * attempt to use the sync request queue and fall back to fsync'ing the
+	 * pages themselves if the redo pointer moves between the start and finish
+	 * of their write. In order to do this, they must set do_optimization to
+	 * true so that the redo pointer is saved before the write begins.
+	 *
 	 * Callers able to use the checkpointer's sync request queue when writing
 	 * data outside shared buffers (like fsm and vm) can set request_fsync to
 	 * true so that these fsync requests are added to the queue.
 	 */
+	bool do_optimization;
 	bool fsync_self;
 	bool request_fsync;
+	XLogRecPtr redo;
 } UnBufferedWriteState;
 /*
  * prototypes for functions in directmgr.c
  */
 extern void
-unbuffered_prep(UnBufferedWriteState *wstate, bool fsync_self, bool
-		request_fsync);
+unbuffered_prep(UnBufferedWriteState *wstate, bool do_optimization, bool
+		fsync_self, bool request_fsync);
 extern void
 unbuffered_write(UnBufferedWriteState *wstate, bool do_wal, SMgrRelation
 		smgrrel, ForkNumber forknum, BlockNumber blocknum, Page page);
-- 
2.17.1

0003-BTree-index-use-unbuffered-IO-optimization.patchtext/x-diff; charset=us-asciiDownload

From a67011c2704bbf6927d6010aa1c2b19f9dcb8266 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 8 Feb 2022 19:01:32 -0500
Subject: [PATCH 3/4] BTree index use unbuffered IO optimization

While building a btree index, the backend can avoid fsync'ing all of the
pages if it uses the optimization introduced in a prior commit.

This can substantially improve performance when many indexes are being
built during DDL operations.
---
 src/backend/access/nbtree/nbtree.c  | 2 +-
 src/backend/access/nbtree/nbtsort.c | 9 ++++++++-
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 843c9e23625..a1efbe1e6ae 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -155,7 +155,7 @@ btbuildempty(Relation index)
 	Page		metapage;
 	UnBufferedWriteState wstate;
 
-	unbuffered_prep(&wstate, false, true, false);
+	unbuffered_prep(&wstate, true, true, true);
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index a67770f3fd3..079832bb783 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1188,8 +1188,15 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 
 	/*
 	 * Only bother fsync'ing the data to permanent storage if WAL logging
+	 *
+	 * The self-fsync optimization requires that the backend both add an fsync
+	 * request to the checkpointer's pending-ops table as well as be prepared
+	 * to fsync the page data itself. Because none of these are required if the
+	 * relation is not WAL-logged, pass btws_use_wal for all parameters of the
+	 * prep function.
 	 */
-	unbuffered_prep(&wstate->ub_wstate, false, wstate->btws_use_wal, false);
+	unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal,
+			wstate->btws_use_wal, wstate->btws_use_wal);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
-- 
2.17.1

0004-Use-shared-buffers-when-possible-for-index-build.patchtext/x-diff; charset=us-asciiDownload

From a1f5fef8d329ac6580623f38016e85314440da3f Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 8 Feb 2022 19:01:36 -0500
Subject: [PATCH 4/4] Use shared buffers when possible for index build

When there are not too many tuples, building the index in shared buffers
makes sense. It allows the buffer manager to handle how best to do the
IO.
---
 src/backend/access/nbtree/nbtree.c  |  29 +--
 src/backend/access/nbtree/nbtsort.c | 267 ++++++++++++++++++++++------
 2 files changed, 223 insertions(+), 73 deletions(-)

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index a1efbe1e6ae..4dbf7af9afe 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -152,26 +152,27 @@ bthandler(PG_FUNCTION_ARGS)
 void
 btbuildempty(Relation index)
 {
+	/*
+	 * Since this only writes one page, use shared buffers.
+	 */
 	Page		metapage;
-	UnBufferedWriteState wstate;
-
-	unbuffered_prep(&wstate, true, true, true);
+	Buffer metabuf;
 
-	/* Construct metapage. */
-	metapage = (Page) palloc(BLCKSZ);
+	/*
+	 * Allocate a buffer for metapage and initialize metapage.
+	 */
+	metabuf = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_ZERO_AND_LOCK, NULL);
+	metapage = BufferGetPage(metabuf);
 	_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
 
 	/*
-	 * Write the page and log it.  It might seem that an immediate sync would
-	 * be sufficient to guarantee that the file exists on disk, but recovery
-	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
-	 * this even when wal_level=minimal.
+	 * Mark metapage buffer as dirty and XLOG it
 	 */
-	unbuffered_write(&wstate, true, RelationGetSmgr(index), INIT_FORKNUM,
-			BTREE_METAPAGE, metapage);
-
-	unbuffered_finish(&wstate, RelationGetSmgr(index), INIT_FORKNUM);
+	START_CRIT_SECTION();
+	MarkBufferDirty(metabuf);
+	log_newpage_buffer(metabuf, true);
+	END_CRIT_SECTION();
+	_bt_relbuf(index, metabuf);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 079832bb783..7e1dd93df0e 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -237,6 +237,7 @@ typedef struct BTPageState
 {
 	Page		btps_page;		/* workspace for page building */
 	BlockNumber btps_blkno;		/* block # to write this page at */
+	Buffer btps_buf; /* buffer to write this page to */
 	IndexTuple	btps_lowkey;	/* page's strict lower bound pivot tuple */
 	OffsetNumber btps_lastoff;	/* last item offset loaded */
 	Size		btps_lastextra; /* last item's extra posting list space */
@@ -254,10 +255,26 @@ typedef struct BTWriteState
 	Relation	index;
 	BTScanInsert inskey;		/* generic insertion scankey */
 	bool		btws_use_wal;	/* dump pages to WAL? */
-	BlockNumber btws_pages_alloced; /* # pages allocated */
-	BlockNumber btws_pages_written; /* # pages written out */
+	BlockNumber btws_pages_alloced; /* # pages allocated for index builds outside SB */
+	BlockNumber btws_pages_written; /* # pages written out for index builds outside SB */
 	Page		btws_zeropage;	/* workspace for filling zeroes */
 	UnBufferedWriteState ub_wstate;
+	/*
+	 * Allocate a new btree page. This does not initialize the page.
+	 */
+	Page (*_bt_bl_alloc_page) (struct BTWriteState *wstate, BlockNumber
+			*blockno, Buffer *buf);
+	/*
+	 * Emit a completed btree page, and release the working storage.
+	 */
+	void (*_bt_blwritepage) (struct BTWriteState *wstate, Page page,
+			BlockNumber blkno, Buffer buf);
+
+	void (*_bt_bl_unbuffered_prep) (UnBufferedWriteState *wstate, bool
+			do_optimization, bool fsync_self, bool request_fsync);
+
+	void (*_bt_bl_unbuffered_finish) (UnBufferedWriteState *wstate,
+			SMgrRelation smgrrel, ForkNumber forknum);
 } BTWriteState;
 
 
@@ -266,10 +283,22 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
+static void _bt_leafbuild(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
-static Page _bt_blnewpage(uint32 level);
+
+static Page
+_bt_bl_alloc_page_direct(BTWriteState *wstate, BlockNumber *blockno, Buffer *buf);
+static Page
+_bt_bl_alloc_page_shared(BTWriteState *wstate, BlockNumber *blockno, Buffer *buf);
+
+static Page _bt_blnewpage(uint32 level, Page page);
+
+static void
+_bt_blwritepage_direct(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf);
+static void
+_bt_blwritepage_shared(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf);
+
 static BTPageState *_bt_pagestate(BTWriteState *wstate, uint32 level);
 static void _bt_slideleft(Page rightmostpage);
 static void _bt_sortaddtup(Page page, Size itemsize,
@@ -280,9 +309,10 @@ static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
 										  BTPageState *state,
 										  BTDedupState dstate);
-static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
+static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state, Buffer
+		metabuf, Page metapage);
 static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
 							   int request);
 static void _bt_end_parallel(BTLeader *btleader);
@@ -295,6 +325,21 @@ static void _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 									   Sharedsort *sharedsort2, int sortmem,
 									   bool progress);
 
+#define BT_BUILD_SB_THRESHOLD 1024
+
+static const BTWriteState wstate_shared = {
+	._bt_bl_alloc_page = _bt_bl_alloc_page_shared,
+	._bt_blwritepage = _bt_blwritepage_shared,
+	._bt_bl_unbuffered_prep = NULL,
+	._bt_bl_unbuffered_finish = NULL,
+};
+
+static const BTWriteState wstate_direct = {
+	._bt_bl_alloc_page = _bt_bl_alloc_page_direct,
+	._bt_blwritepage = _bt_blwritepage_direct,
+	._bt_bl_unbuffered_prep = unbuffered_prep,
+	._bt_bl_unbuffered_finish = unbuffered_finish,
+};
 
 /*
  *	btbuild() -- build a new btree index.
@@ -304,6 +349,7 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 {
 	IndexBuildResult *result;
 	BTBuildState buildstate;
+	BTWriteState writestate;
 	double		reltuples;
 
 #ifdef BTREE_BUILD_STATS
@@ -334,8 +380,12 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
+	 *
+	 * Don't use shared buffers if the number of tuples is too large.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2);
+	writestate = reltuples < BT_BUILD_SB_THRESHOLD ? wstate_shared : wstate_direct;
+
+	_bt_leafbuild(&writestate, buildstate.spool, buildstate.spool2);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
@@ -543,10 +593,8 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
+_bt_leafbuild(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 {
-	BTWriteState wstate;
-
 #ifdef BTREE_BUILD_STATS
 	if (log_btree_build_stats)
 	{
@@ -566,21 +614,45 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 		tuplesort_performsort(btspool2->sortstate);
 	}
 
-	wstate.heap = btspool->heap;
-	wstate.index = btspool->index;
-	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
-	/* _bt_mkscankey() won't set allequalimage without metapage */
-	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
-	wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
+	wstate->heap = btspool->heap;
+	wstate->index = btspool->index;
+	wstate->inskey = _bt_mkscankey(wstate->index, NULL);
+	/* _bt-mkscankey() won't set allequalimage without metapage */
+	wstate->inskey->allequalimage = _bt_allequalimage(wstate->index, true);
+	wstate->btws_use_wal = RelationNeedsWAL(wstate->index);
 
 	/* reserve the metapage */
-	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
-	wstate.btws_pages_written = 0;
-	wstate.btws_zeropage = NULL;	/* until needed */
+	wstate->btws_pages_alloced = 0;
+	wstate->btws_pages_written = 0;
+	wstate->btws_zeropage = NULL;
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
-	_bt_load(&wstate, btspool, btspool2);
+
+	/*
+	 * If not using shared buffers, for a WAL-logged relation, save the redo
+	 * pointer location in case a checkpoint begins during the index build.
+	 *
+	 * This optimization requires that the backend both add an fsync request to
+	 * the checkpointer's pending-ops table as well as be prepared to fsync the
+	 * page data itself. Because none of these are required if the relation is
+	 * not WAL-logged, pass btws_use_wal for all parameters of the prep
+	 * function.
+	 */
+	if (wstate->_bt_bl_unbuffered_prep)
+			wstate->_bt_bl_unbuffered_prep(&wstate->ub_wstate,
+					wstate->btws_use_wal, wstate->btws_use_wal,
+					wstate->btws_use_wal);
+
+	_bt_load(wstate, btspool, btspool2);
+
+	/*
+	 * If not using shared buffers, for a WAL-logged relation, check if backend
+	 * must fsync the page itself.
+	 */
+	if (wstate->_bt_bl_unbuffered_finish)
+		wstate->_bt_bl_unbuffered_finish(&wstate->ub_wstate,
+				RelationGetSmgr(wstate->index), MAIN_FORKNUM);
 }
 
 /*
@@ -613,15 +685,15 @@ _bt_build_callback(Relation index,
 }
 
 /*
- * allocate workspace for a new, clean btree page, not linked to any siblings.
+ * Set up workspace for a new, clean btree page, not linked to any siblings.
+ * Caller must allocate the passed in page.
  */
 static Page
-_bt_blnewpage(uint32 level)
+_bt_blnewpage(uint32 level, Page page)
 {
-	Page		page;
 	BTPageOpaque opaque;
 
-	page = (Page) palloc(BLCKSZ);
+	Assert(page);
 
 	/* Zero the page and set up standard page header info */
 	_bt_pageinit(page, BLCKSZ);
@@ -639,11 +711,8 @@ _bt_blnewpage(uint32 level)
 	return page;
 }
 
-/*
- * emit a completed btree page, and release the working storage.
- */
 static void
-_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
+_bt_blwritepage_direct(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf)
 {
 	/*
 	 * If we have to write pages nonsequentially, fill in the space with
@@ -681,6 +750,61 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 	pfree(page);
 }
 
+static void
+_bt_blwritepage_shared(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf)
+{
+	/*
+	 * Indexes built in shared buffers need only to mark the buffer as dirty
+	 * and XLOG it.
+	 */
+	Assert(buf);
+	START_CRIT_SECTION();
+	MarkBufferDirty(buf);
+	if (wstate->btws_use_wal)
+		log_newpage_buffer(buf, true);
+	END_CRIT_SECTION();
+	_bt_relbuf(wstate->index, buf);
+}
+
+static Page
+_bt_bl_alloc_page_direct(BTWriteState *wstate, BlockNumber *blockno, Buffer *buf)
+{
+	 /* buf is only used when using shared buffers, so set it to InvalidBuffer */
+	*buf = InvalidBuffer;
+
+	/*
+	 * Assign block number for the page.
+	 * This will be used to link to sibling page(s) later and, if this is the
+	 * initial page in the level, saved in the BTPageState
+	 */
+	*blockno = wstate->btws_pages_alloced++;
+
+	/* now allocate and set up the new page */
+	return palloc(BLCKSZ);
+}
+
+static Page
+_bt_bl_alloc_page_shared(BTWriteState *wstate, BlockNumber *blockno, Buffer *buf)
+{
+	/*
+	 * Find a shared buffer for the page. Pass mode RBM_ZERO_AND_LOCK to get an
+	 * exclusive lock on the buffer content. No lock on the relation as a whole
+	 * is needed (as in LockRelationForExtension()) because the initial index
+	 * build is not yet complete.
+	 */
+	*buf = ReadBufferExtended(wstate->index, MAIN_FORKNUM, P_NEW,
+			RBM_ZERO_AND_LOCK, NULL);
+
+	/*
+	 * bufmgr will assign a block number for the new page.
+	 * This will be used to link to sibling page(s) later and, if this is the
+	 * initial page in the level, saved in the BTPageState
+	 */
+	*blockno = BufferGetBlockNumber(*buf);
+
+	return BufferGetPage(*buf);
+}
+
 /*
  * allocate and initialize a new BTPageState.  the returned structure
  * is suitable for immediate use by _bt_buildadd.
@@ -688,13 +812,20 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 static BTPageState *
 _bt_pagestate(BTWriteState *wstate, uint32 level)
 {
+	Buffer buf;
+	BlockNumber blockno;
+
 	BTPageState *state = (BTPageState *) palloc0(sizeof(BTPageState));
 
-	/* create initial page for level */
-	state->btps_page = _bt_blnewpage(level);
+	/*
+	 * Allocate and initialize initial page for the level, and if using shared
+	 * buffers, extend the relation and allocate a shared buffer for the block.
+	 */
+	state->btps_page = _bt_blnewpage(level, wstate->_bt_bl_alloc_page(wstate,
+				&blockno, &buf));
 
-	/* and assign it a page position */
-	state->btps_blkno = wstate->btws_pages_alloced++;
+	state->btps_blkno = blockno;
+	state->btps_buf = buf;
 
 	state->btps_lowkey = NULL;
 	/* initialize lastoff so first item goes into P_FIRSTKEY */
@@ -829,6 +960,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 {
 	Page		npage;
 	BlockNumber nblkno;
+	Buffer nbuf;
 	OffsetNumber last_off;
 	Size		last_truncextra;
 	Size		pgspc;
@@ -843,6 +975,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 
 	npage = state->btps_page;
 	nblkno = state->btps_blkno;
+	nbuf = state->btps_buf;
 	last_off = state->btps_lastoff;
 	last_truncextra = state->btps_lastextra;
 	state->btps_lastextra = truncextra;
@@ -899,15 +1032,14 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 		 */
 		Page		opage = npage;
 		BlockNumber oblkno = nblkno;
+		Buffer obuf = nbuf;
 		ItemId		ii;
 		ItemId		hii;
 		IndexTuple	oitup;
 
-		/* Create new page of same level */
-		npage = _bt_blnewpage(state->btps_level);
-
-		/* and assign it a page position */
-		nblkno = wstate->btws_pages_alloced++;
+		/* Create and initialize a new page of same level */
+		npage = _bt_blnewpage(state->btps_level,
+				wstate->_bt_bl_alloc_page(wstate, &nblkno, &nbuf));
 
 		/*
 		 * We copy the last item on the page into the new page, and then
@@ -1017,9 +1149,12 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 
 		/*
 		 * Write out the old page.  We never need to touch it again, so we can
-		 * free the opage workspace too.
+		 * free the opage workspace too. obuf has been released and is no longer
+		 * valid.
 		 */
-		_bt_blwritepage(wstate, opage, oblkno);
+		 wstate->_bt_blwritepage(wstate, opage, oblkno, obuf);
+		 obuf = InvalidBuffer;
+		 opage = NULL;
 
 		/*
 		 * Reset last_off to point to new page
@@ -1054,6 +1189,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 
 	state->btps_page = npage;
 	state->btps_blkno = nblkno;
+	state->btps_buf = nbuf;
 	state->btps_lastoff = last_off;
 }
 
@@ -1099,12 +1235,12 @@ _bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
  * Finish writing out the completed btree.
  */
 static void
-_bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
+_bt_uppershutdown(BTWriteState *wstate, BTPageState *state, Buffer metabuf,
+		Page metapage)
 {
 	BTPageState *s;
 	BlockNumber rootblkno = P_NONE;
 	uint32		rootlevel = 0;
-	Page		metapage;
 
 	/*
 	 * Each iteration of this loop completes one more level of the tree.
@@ -1150,20 +1286,24 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		 * back one slot.  Then we can dump out the page.
 		 */
 		_bt_slideleft(s->btps_page);
-		_bt_blwritepage(wstate, s->btps_page, s->btps_blkno);
+		wstate->_bt_blwritepage(wstate, s->btps_page, s->btps_blkno, s->btps_buf);
+		s->btps_buf = InvalidBuffer;
 		s->btps_page = NULL;	/* writepage freed the workspace */
 	}
 
 	/*
-	 * As the last step in the process, construct the metapage and make it
+	 * As the last step in the process, initialize the metapage and make it
 	 * point to the new root (unless we had no data at all, in which case it's
 	 * set to point to "P_NONE").  This changes the index to the "valid" state
 	 * by filling in a valid magic number in the metapage.
+	 * After this, metapage will have been freed or invalid and metabuf, if ever
+	 * valid, will have been released.
 	 */
-	metapage = (Page) palloc(BLCKSZ);
 	_bt_initmetapage(metapage, rootblkno, rootlevel,
 					 wstate->inskey->allequalimage);
-	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
+	wstate->_bt_blwritepage(wstate, metapage, BTREE_METAPAGE, metabuf);
+	metabuf = InvalidBuffer;
+	metapage = NULL;
 }
 
 /*
@@ -1173,6 +1313,10 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 static void
 _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 {
+	Page metapage;
+	BlockNumber metablkno;
+	Buffer metabuf;
+
 	BTPageState *state = NULL;
 	bool		merge = (btspool2 != NULL);
 	IndexTuple	itup,
@@ -1185,21 +1329,29 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	int64		tuples_done = 0;
 	bool		deduplicate;
 
+	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
+		BTGetDeduplicateItems(wstate->index);
 
 	/*
-	 * Only bother fsync'ing the data to permanent storage if WAL logging
+	 * Reserve block 0 for the metapage up front.
+	 *
+	 * When using the shared buffers API it is easier to allocate the buffer
+	 * for block 0 first instead of trying skip block 0 and allocate it at the
+	 * end of index build.
+	 *
+	 * When not using the shared buffers API, there is no harm in allocating
+	 * the metapage first. When block 1 is written, the direct writepage
+	 * function will zero-fill block 0. When writing out the metapage at the
+	 * end of index build, it will overwrite that block 0.
+	 *
+	 * The metapage will be initialized and written out at the end of the index
+	 * build when all of the information needed to do so is available.
 	 *
-	 * The self-fsync optimization requires that the backend both add an fsync
-	 * request to the checkpointer's pending-ops table as well as be prepared
-	 * to fsync the page data itself. Because none of these are required if the
-	 * relation is not WAL-logged, pass btws_use_wal for all parameters of the
-	 * prep function.
+	 * The block number will always be BTREE_METAPAGE, so the metablkno
+	 * variable is unused and only created to avoid a special case in the
+	 * direct alloc function.
 	 */
-	unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal,
-			wstate->btws_use_wal, wstate->btws_use_wal);
-
-	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
-		BTGetDeduplicateItems(wstate->index);
+	metapage = wstate->_bt_bl_alloc_page(wstate, &metablkno, &metabuf);
 
 	if (merge)
 	{
@@ -1422,10 +1574,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	}
 
 	/* Close down final pages and write the metapage */
-	_bt_uppershutdown(wstate, state);
-
-	unbuffered_finish(&wstate->ub_wstate, RelationGetSmgr(wstate->index),
-			MAIN_FORKNUM);
+	_bt_uppershutdown(wstate, state, metabuf, metapage);
 }
 
 /*
-- 
2.17.1

#19

melanieplageman@gmail.com

almost 4 years ago

In reply to: Justin Pryzby (#18)

4 attachment(s)

Re: Avoiding smgrimmedsync() during nbtree index builds

On Wed, Mar 2, 2022 at 8:09 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

Rebased to appease cfbot.

I ran these paches under a branch which shows code coverage in cirrus. It
looks good to my eyes.
https://api.cirrus-ci.com/v1/artifact/task/5212346552418304/coverage/coverage/00-index.html

thanks for doing that and for the rebase! since I made updates, the
attached version 6 is also rebased.

To Dmitry's question:

On Sun, Feb 13, 2022 at 9:33 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

On Wed, Feb 09, 2022 at 01:49:30PM -0500, Melanie Plageman wrote:

I don't see it in the discussion, so naturally curious -- why directmgr
is not used for bloom index, e.g. in blbuildempty?

thanks for pointing this out. blbuildempty() is also included now. bloom
doesn't seem to use smgr* anywhere except blbuildempty(), so that is the
only place I made changes in bloom index build.

v6 has the following updates/changes:

- removed erroneous extra calls to unbuffered_prep() and
unbuffered_finish() in GiST and btree index builds

- removed unnecessary includes

- some comments were updated to accurately reflect use of directmgr

- smgrwrite doesn't WAL-log anymore. This one I'm not sure about. I
think it makes sense that unbuffered_extend() of non-empty pages of
WAL-logged relations (or the init fork of unlogged relations) do
log_newpage(), but I wasn't so sure about smgrwrite().

Currently all callers of smgrwrite() do log_newpage() and anyone using
mdwrite() will end up writing the whole page. However, it seems
possible that a caller of the directmgr API might wish to do a write
to a particular offset and, either because of that, or, for some other
reason, require different logging than that done in log_newpage().

I didn't want to make a separate wrapper function for WAL-logging in
directmgr because it felt like one more step to forget.

- heapam_relation_set_new_filenode doesn't use directmgr API anymore for
unlogged relations. It doesn't extend or write, so it seemed like a
special case better left alone.

Note that the ambuildempty() functions which also write to the init
fork of an unlogged relation still use the directmgr API. It is a bit
confusing because they pass do_wal=true to unbuffered_prep() even
though they are unlogged relations because they must log and fsync.

- interface changes to unbuffered_prep()
I removed the parameters to unbuffered_prep() which required the user
to specify if fsync should be added to pendingOps or done with
smgrimmedsync(). Understanding all of the combinations of these
parameters and when they were needed was confusing and the interface
felt like a foot gun. Special cases shouldn't use this interface.

I prefer the idea that users of this API expect that
1) empty pages won't be checksummed or WAL logged
2) for relations that are WAL-logged, when the build is done, the
relation will be fsync'd by the backend (unless the fsync optimization
is being used)
3) the only case in which fsync requests are added to the pendingOps
queue is when the fsync optimization is being used (which saves the
redo pointer and check it later to determine if it needs to fsync
itself)

I also added the parameter "do_wal" to unbuffered_prep() and the
UnBufferedWriteState struct. This is used when extending the file to
determine whether or not to log_newpage(). unbuffered_extend() and
unbuffered_write() no longer take do_wal as a parameter.

Note that callers need to pass do_wal=true/false to unbuffered_prep()
based on whether or not they want log_newpage() called during
unbuffered_extend()--not simply based on whether or not the relation
in question is WAL-logged.

do_wal is the only member of the UnBufferedWriteState struct in the
first patch in the set, but I think it makes sense to keep the struct
around since I anticipate that the patch containing the other members
needed for the fsync optimization will be committed at the same time.

One final note on unbuffered_prep() -- I am thinking of renaming
"do_optimization" to "try_optimization" or maybe
"request_fsync_optimization". The interface (of unbuffered_prep())
would be better if we always tried to do the optimization when
relevant (when the relation is WAL-logged).

- freespace map, visimap, and hash index don't use directmgr API anymore
For most use cases writing/extending outside shared buffers, it isn't
safe to rely on requesting fsync from checkpointer.

visimap, fsm, and hash index all request fsync from checkpointer for
the pages they write with smgrextend() and don't smgrimmedsync() when
finished adding pages (even when the hash index is WAL-logged).

Supporting these exceptions made the interface incoherent, so I cut
them.

- added unbuffered_extend_range()
This one is a bit unfortunate. Because GiST index build uses
log_newpages() to log a whole page range but calls smgrextend() for
each of those pages individually, I couldn't use the
unbuffered_extend() function easily.

So, I thought it might be useful in other contexts as well to have a
function which calls smgrextend() for a range of pages and then calls
log_newpages(). I've added this.

However, there are two parts of GiST index build flush ready pages
that didn't work with this either.

The first is that it does an error check on the block numbers one at a
time while looping through them to write the pages. To retain this
check, I loop through the ready pages an extra time before calling
unbuffered_extend(), which is probably not acceptable.

Also, GiST needs to use a custom LSN for the pages. To achieve this, I
added a "custom_lsn" parameter to unbuffered_extend_range(). This
isn't great either. If this was a more common case, I could pass the
custom LSN to unbuffered_prep().

- Melanie

Attachments:

v6-0003-BTree-index-use-unbuffered-IO-optimization.patchtext/x-patch; charset=US-ASCII; name=v6-0003-BTree-index-use-unbuffered-IO-optimization.patchDownload

From 17fb22142ade65fdbe8c90889e49d0be60ba45e4 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Fri, 4 Mar 2022 15:53:05 -0500
Subject: [PATCH v6 3/4] BTree index use unbuffered IO optimization

While building a btree index, the backend can avoid fsync'ing all of the
pages if it uses the optimization introduced in a prior commit.

This can substantially improve performance when many indexes are being
built during DDL operations.
---
 src/backend/access/nbtree/nbtree.c  | 2 +-
 src/backend/access/nbtree/nbtsort.c | 6 +++++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 6b78acefbe..fc5cce4603 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -161,7 +161,7 @@ btbuildempty(Relation index)
 	 * internally. However, were this to be replaced with unbuffered_extend(),
 	 * do_wal must be true to ensure the data is logged and fsync'd.
 	 */
-	unbuffered_prep(&wstate, true, false);
+	unbuffered_prep(&wstate, true, true);
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index d6d0d4b361..f1b9e2e24e 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1189,7 +1189,11 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	int64		tuples_done = 0;
 	bool		deduplicate;
 
-	unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal, false);
+	/*
+	 * The fsync optimization done by directmgr is only relevant if
+	 * WAL-logging, so pass btws_use_wal for this parameter.
+	 */
+	unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal, wstate->btws_use_wal);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
-- 
2.30.2

v6-0001-Add-unbuffered-IO-API.patchtext/x-patch; charset=US-ASCII; name=v6-0001-Add-unbuffered-IO-API.patchDownload

From a06407b19c8d168ea966e45c0e483b38d49ddc12 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Fri, 4 Mar 2022 14:48:39 -0500
Subject: [PATCH v6 1/4] Add unbuffered IO API

Wrap unbuffered extends and writes in a new API, directmgr.

When writing data outside of shared buffers, the backend must do a
series of steps to ensure the data is both durable and recoverable.

When writing or extending a page of data outside of shared buffers the
backend must log the write or extend (if table is WAL-logged), checksum
the page (if it is not empty), and write it out before moving on.

Additionally, the backend must fsync the page data to ensure it reaches
permanent storage since checkpointer is unaware of the buffer and could
move the Redo pointer past the associated WAL for this write/extend
before the data is safely on permanent storage.

This commit introduces no functional change. It replaces many current
callers of smgrimmedsync(), smgrextend(), and smgrwrite() with the
equivalent directmgr functions. Consolidating these steps makes IO
outside of shared buffers less error-prone.
---
 contrib/bloom/blinsert.c               | 25 ++++---
 src/backend/access/gist/gistbuild.c    | 50 +++++--------
 src/backend/access/heap/rewriteheap.c  | 71 ++++++-------------
 src/backend/access/nbtree/nbtree.c     | 26 ++++---
 src/backend/access/nbtree/nbtsort.c    | 73 ++++++++-----------
 src/backend/access/spgist/spginsert.c  | 41 +++++------
 src/backend/catalog/storage.c          | 29 ++------
 src/backend/storage/Makefile           |  2 +-
 src/backend/storage/direct/Makefile    | 17 +++++
 src/backend/storage/direct/directmgr.c | 98 ++++++++++++++++++++++++++
 src/include/storage/directmgr.h        | 53 ++++++++++++++
 11 files changed, 296 insertions(+), 189 deletions(-)
 create mode 100644 src/backend/storage/direct/Makefile
 create mode 100644 src/backend/storage/direct/directmgr.c
 create mode 100644 src/include/storage/directmgr.h

diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index c94cf34e69..7954a17e2d 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -19,8 +19,8 @@
 #include "catalog/index.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/directmgr.h"
 #include "storage/indexfsm.h"
-#include "storage/smgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -164,6 +164,16 @@ void
 blbuildempty(Relation index)
 {
 	Page		metapage;
+	UnBufferedWriteState wstate;
+
+	/*
+	 * Though this is an unlogged relation, pass do_wal=true since the init
+	 * fork of an unlogged index must be wal-logged and fsync'd. This currently
+	 * has no effect, as unbuffered_write() does not do log_newpage()
+	 * internally. However, were this to be replaced with unbuffered_extend(),
+	 * do_wal must be true to ensure the data is logged and fsync'd.
+	 */
+	unbuffered_prep(&wstate, true);
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
@@ -176,18 +186,13 @@ blbuildempty(Relation index)
 	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
-	PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
-	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
-			  (char *) metapage, true);
+	unbuffered_write(&wstate, RelationGetSmgr(index), INIT_FORKNUM,
+			BLOOM_METAPAGE_BLKNO, metapage);
+
 	log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
 				BLOOM_METAPAGE_BLKNO, metapage, true);
 
-	/*
-	 * An immediate sync is required even if we xlog'd the page, because the
-	 * write did not go through shared_buffers and therefore a concurrent
-	 * checkpoint may have moved the redo pointer past our xlog record.
-	 */
-	smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
+	unbuffered_finish(&wstate, RelationGetSmgr(index), INIT_FORKNUM);
 }
 
 /*
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index e081e6571a..fc09938f80 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -43,6 +43,7 @@
 #include "miscadmin.h"
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
+#include "storage/directmgr.h"
 #include "storage/smgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -91,6 +92,7 @@ typedef struct
 
 	int64		indtuples;		/* number of tuples indexed */
 
+	UnBufferedWriteState ub_wstate;
 	/*
 	 * Extra data structures used during a buffering build. 'gfbb' contains
 	 * information related to managing the build buffers. 'parentMap' is a
@@ -410,13 +412,15 @@ gist_indexsortbuild(GISTBuildState *state)
 	state->pages_written = 0;
 	state->ready_num_pages = 0;
 
+	unbuffered_prep(&state->ub_wstate, RelationNeedsWAL(state->indexrel));
+
 	/*
 	 * Write an empty page as a placeholder for the root page. It will be
 	 * replaced with the real root page at the end.
 	 */
 	page = palloc0(BLCKSZ);
-	smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
-			   page, true);
+	unbuffered_extend(&state->ub_wstate, RelationGetSmgr(state->indexrel),
+			MAIN_FORKNUM, GIST_ROOT_BLKNO, page, true);
 	state->pages_allocated++;
 	state->pages_written++;
 
@@ -458,27 +462,19 @@ gist_indexsortbuild(GISTBuildState *state)
 
 	/* Write out the root */
 	PageSetLSN(levelstate->pages[0], GistBuildLSN);
-	PageSetChecksumInplace(levelstate->pages[0], GIST_ROOT_BLKNO);
-	smgrwrite(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
-			  levelstate->pages[0], true);
+
+	unbuffered_write(&state->ub_wstate, RelationGetSmgr(state->indexrel),
+			MAIN_FORKNUM, GIST_ROOT_BLKNO, levelstate->pages[0]);
+
 	if (RelationNeedsWAL(state->indexrel))
 		log_newpage(&state->indexrel->rd_node, MAIN_FORKNUM, GIST_ROOT_BLKNO,
 					levelstate->pages[0], true);
 
+	unbuffered_finish(&state->ub_wstate, RelationGetSmgr(state->indexrel),
+			MAIN_FORKNUM);
+
 	pfree(levelstate->pages[0]);
 	pfree(levelstate);
-
-	/*
-	 * When we WAL-logged index pages, we must nonetheless fsync index files.
-	 * Since we're building outside shared buffers, a CHECKPOINT occurring
-	 * during the build has no way to flush the previously written data to
-	 * disk (indeed it won't know the index even exists).  A crash later on
-	 * would replay WAL from the checkpoint, therefore it wouldn't replay our
-	 * earlier WAL entries. If we do not fsync those pages here, they might
-	 * still not be on disk when the crash occurs.
-	 */
-	if (RelationNeedsWAL(state->indexrel))
-		smgrimmedsync(RelationGetSmgr(state->indexrel), MAIN_FORKNUM);
 }
 
 /*
@@ -645,26 +641,18 @@ gist_indexsortbuild_flush_ready_pages(GISTBuildState *state)
 	if (state->ready_num_pages == 0)
 		return;
 
+	/* Currently, the blocks must be buffered in order. */
 	for (int i = 0; i < state->ready_num_pages; i++)
 	{
-		Page		page = state->ready_pages[i];
-		BlockNumber blkno = state->ready_blknos[i];
-
-		/* Currently, the blocks must be buffered in order. */
-		if (blkno != state->pages_written)
+		if (state->ready_blknos[i] != state->pages_written)
 			elog(ERROR, "unexpected block number to flush GiST sorting build");
-
-		PageSetLSN(page, GistBuildLSN);
-		PageSetChecksumInplace(page, blkno);
-		smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, blkno, page,
-				   true);
-
 		state->pages_written++;
 	}
 
-	if (RelationNeedsWAL(state->indexrel))
-		log_newpages(&state->indexrel->rd_node, MAIN_FORKNUM, state->ready_num_pages,
-					 state->ready_blknos, state->ready_pages, true);
+	unbuffered_extend_range(&state->ub_wstate,
+			RelationGetSmgr(state->indexrel), MAIN_FORKNUM,
+			state->ready_num_pages, state->ready_blknos, state->ready_pages,
+			false, GistBuildLSN);
 
 	for (int i = 0; i < state->ready_num_pages; i++)
 		pfree(state->ready_pages[i]);
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 2a53826736..c8fa8bb27c 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -81,15 +81,15 @@
  * in the whole table.  Note that if we do fail halfway through a CLUSTER,
  * the old table is still valid, so failure is not catastrophic.
  *
- * We can't use the normal heap_insert function to insert into the new
- * heap, because heap_insert overwrites the visibility information.
- * We use a special-purpose raw_heap_insert function instead, which
- * is optimized for bulk inserting a lot of tuples, knowing that we have
- * exclusive access to the heap.  raw_heap_insert builds new pages in
- * local storage.  When a page is full, or at the end of the process,
- * we insert it to WAL as a single record and then write it to disk
- * directly through smgr.  Note, however, that any data sent to the new
- * heap's TOAST table will go through the normal bufmgr.
+ * We can't use the normal heap_insert function to insert into the new heap,
+ * because heap_insert overwrites the visibility information. We use a
+ * special-purpose raw_heap_insert function instead, which is optimized for
+ * bulk inserting a lot of tuples, knowing that we have exclusive access to the
+ * heap.  raw_heap_insert builds new pages in local storage.  When a page is
+ * full, or at the end of the process, we insert it to WAL as a single record
+ * and then write it to disk directly through directmgr.  Note, however, that
+ * any data sent to the new heap's TOAST table will go through the normal
+ * bufmgr.
  *
  *
  * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
@@ -119,9 +119,9 @@
 #include "replication/logical.h"
 #include "replication/slot.h"
 #include "storage/bufmgr.h"
+#include "storage/directmgr.h"
 #include "storage/fd.h"
 #include "storage/procarray.h"
-#include "storage/smgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -152,6 +152,7 @@ typedef struct RewriteStateData
 	HTAB	   *rs_old_new_tid_map; /* unmatched B tuples */
 	HTAB	   *rs_logical_mappings;	/* logical remapping files */
 	uint32		rs_num_rewrite_mappings;	/* # in memory mappings */
+	UnBufferedWriteState rs_unbuffered_wstate;
 }			RewriteStateData;
 
 /*
@@ -265,6 +266,10 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
 	state->rs_cutoff_multi = cutoff_multi;
 	state->rs_cxt = rw_cxt;
 
+	unbuffered_prep(&state->rs_unbuffered_wstate,
+			RelationNeedsWAL(state->rs_new_rel));
+
+
 	/* Initialize hash tables used to track update chains */
 	hash_ctl.keysize = sizeof(TidHashKey);
 	hash_ctl.entrysize = sizeof(UnresolvedTupData);
@@ -317,28 +322,13 @@ end_heap_rewrite(RewriteState state)
 	/* Write the last page, if any */
 	if (state->rs_buffer_valid)
 	{
-		if (RelationNeedsWAL(state->rs_new_rel))
-			log_newpage(&state->rs_new_rel->rd_node,
-						MAIN_FORKNUM,
-						state->rs_blockno,
-						state->rs_buffer,
-						true);
-
-		PageSetChecksumInplace(state->rs_buffer, state->rs_blockno);
-
-		smgrextend(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,
-				   state->rs_blockno, (char *) state->rs_buffer, true);
+		unbuffered_extend(&state->rs_unbuffered_wstate,
+				RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,
+				state->rs_blockno, state->rs_buffer, false);
 	}
 
-	/*
-	 * When we WAL-logged rel pages, we must nonetheless fsync them.  The
-	 * reason is the same as in storage.c's RelationCopyStorage(): we're
-	 * writing data that's not in shared buffers, and so a CHECKPOINT
-	 * occurring during the rewriteheap operation won't have fsync'd data we
-	 * wrote before the checkpoint.
-	 */
-	if (RelationNeedsWAL(state->rs_new_rel))
-		smgrimmedsync(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM);
+	unbuffered_finish(&state->rs_unbuffered_wstate,
+			RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM);
 
 	logical_end_heap_rewrite(state);
 
@@ -676,24 +666,9 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
 			 * contains a tuple.  Hence, unlike RelationGetBufferForTuple(),
 			 * enforce saveFreeSpace unconditionally.
 			 */
-
-			/* XLOG stuff */
-			if (RelationNeedsWAL(state->rs_new_rel))
-				log_newpage(&state->rs_new_rel->rd_node,
-							MAIN_FORKNUM,
-							state->rs_blockno,
-							page,
-							true);
-
-			/*
-			 * Now write the page. We say skipFsync = true because there's no
-			 * need for smgr to schedule an fsync for this write; we'll do it
-			 * ourselves in end_heap_rewrite.
-			 */
-			PageSetChecksumInplace(page, state->rs_blockno);
-
-			smgrextend(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,
-					   state->rs_blockno, (char *) page, true);
+			unbuffered_extend(&state->rs_unbuffered_wstate,
+					RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,
+					state->rs_blockno, page, false);
 
 			state->rs_blockno++;
 			state->rs_buffer_valid = false;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c9b4964c1e..c7bf971917 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -30,10 +30,10 @@
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "storage/condition_variable.h"
+#include "storage/directmgr.h"
 #include "storage/indexfsm.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
-#include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/index_selfuncs.h"
 #include "utils/memutils.h"
@@ -152,6 +152,16 @@ void
 btbuildempty(Relation index)
 {
 	Page		metapage;
+	UnBufferedWriteState wstate;
+
+	/*
+	 * Though this is an unlogged relation, pass do_wal=true since the init
+	 * fork of an unlogged index must be wal-logged and fsync'd. This currently
+	 * has no effect, as unbuffered_write() does not do log_newpage()
+	 * internally. However, were this to be replaced with unbuffered_extend(),
+	 * do_wal must be true to ensure the data is logged and fsync'd.
+	 */
+	unbuffered_prep(&wstate, true);
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
@@ -164,18 +174,12 @@ btbuildempty(Relation index)
 	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
-	PageSetChecksumInplace(metapage, BTREE_METAPAGE);
-	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BTREE_METAPAGE,
-			  (char *) metapage, true);
+	unbuffered_write(&wstate, RelationGetSmgr(index), INIT_FORKNUM,
+			BTREE_METAPAGE, metapage);
 	log_newpage(&RelationGetSmgr(index)->smgr_rnode.node, INIT_FORKNUM,
 				BTREE_METAPAGE, metapage, true);
 
-	/*
-	 * An immediate sync is required even if we xlog'd the page, because the
-	 * write did not go through shared_buffers and therefore a concurrent
-	 * checkpoint may have moved the redo pointer past our xlog record.
-	 */
-	smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
+	unbuffered_finish(&wstate, RelationGetSmgr(index), INIT_FORKNUM);
 }
 
 /*
@@ -959,7 +963,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 * delete some deletable tuples.  Hence, we must repeatedly check the
 	 * relation length.  We must acquire the relation-extension lock while
 	 * doing so to avoid a race condition: if someone else is extending the
-	 * relation, there is a window where bufmgr/smgr have created a new
+	 * relation, there is a window where bufmgr/directmgr have created a new
 	 * all-zero page but it hasn't yet been write-locked by _bt_getbuf(). If
 	 * we manage to scan such a page here, we'll improperly assume it can be
 	 * recycled.  Taking the lock synchronizes things enough to prevent a
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 8a19de2f66..e280253127 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -23,13 +23,13 @@
  * many upper pages if the keys are reasonable-size) without risking a lot of
  * cascading splits during early insertions.
  *
- * Formerly the index pages being built were kept in shared buffers, but
- * that is of no value (since other backends have no interest in them yet)
- * and it created locking problems for CHECKPOINT, because the upper-level
- * pages were held exclusive-locked for long periods.  Now we just build
- * the pages in local memory and smgrwrite or smgrextend them as we finish
- * them.  They will need to be re-read into shared buffers on first use after
- * the build finishes.
+ * Formerly the index pages being built were kept in shared buffers, but that
+ * is of no value (since other backends have no interest in them yet) and it
+ * created locking problems for CHECKPOINT, because the upper-level pages were
+ * held exclusive-locked for long periods.  Now we just build the pages in
+ * local memory and write or extend them with directmgr as we finish them.
+ * They will need to be re-read into shared buffers on first use after the
+ * build finishes.
  *
  * This code isn't concerned about the FSM at all. The caller is responsible
  * for initializing that.
@@ -57,7 +57,7 @@
 #include "executor/instrument.h"
 #include "miscadmin.h"
 #include "pgstat.h"
-#include "storage/smgr.h"
+#include "storage/directmgr.h"
 #include "tcop/tcopprot.h"		/* pgrminclude ignore */
 #include "utils/rel.h"
 #include "utils/sortsupport.h"
@@ -256,6 +256,7 @@ typedef struct BTWriteState
 	BlockNumber btws_pages_alloced; /* # pages allocated */
 	BlockNumber btws_pages_written; /* # pages written out */
 	Page		btws_zeropage;	/* workspace for filling zeroes */
+	UnBufferedWriteState ub_wstate;
 } BTWriteState;
 
 
@@ -643,13 +644,6 @@ _bt_blnewpage(uint32 level)
 static void
 _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 {
-	/* XLOG stuff */
-	if (wstate->btws_use_wal)
-	{
-		/* We use the XLOG_FPI record type for this */
-		log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
-	}
-
 	/*
 	 * If we have to write pages nonsequentially, fill in the space with
 	 * zeroes until we come back and overwrite.  This is not logically
@@ -661,33 +655,33 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 	{
 		if (!wstate->btws_zeropage)
 			wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
-		/* don't set checksum for all-zero page */
-		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
-				   wstate->btws_pages_written++,
-				   (char *) wstate->btws_zeropage,
-				   true);
+
+		unbuffered_extend(&wstate->ub_wstate, RelationGetSmgr(wstate->index),
+				MAIN_FORKNUM, wstate->btws_pages_written++,
+				wstate->btws_zeropage, true);
 	}
 
-	PageSetChecksumInplace(page, blkno);
 
-	/*
-	 * Now write the page.  There's no need for smgr to schedule an fsync for
-	 * this write; we'll do it ourselves before ending the build.
-	 */
+	/* Now write the page. Either we are extending the file... */
 	if (blkno == wstate->btws_pages_written)
 	{
-		/* extending the file... */
-		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM, blkno,
-				   (char *) page, true);
+		unbuffered_extend(&wstate->ub_wstate, RelationGetSmgr(wstate->index),
+				MAIN_FORKNUM, blkno, page, false);
+
 		wstate->btws_pages_written++;
 	}
+
+	/* or we are overwriting a block we zero-filled before. */
 	else
 	{
-		/* overwriting a block we zero-filled before */
-		smgrwrite(RelationGetSmgr(wstate->index), MAIN_FORKNUM, blkno,
-				  (char *) page, true);
-	}
+		unbuffered_write(&wstate->ub_wstate, RelationGetSmgr(wstate->index),
+				MAIN_FORKNUM, blkno, page);
+
+		/* We use the XLOG_FPI record type for this */
+		if (wstate->btws_use_wal)
+			log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
 
+	}
 	pfree(page);
 }
 
@@ -1195,6 +1189,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	int64		tuples_done = 0;
 	bool		deduplicate;
 
+	unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal);
+
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
 
@@ -1421,17 +1417,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	/* Close down final pages and write the metapage */
 	_bt_uppershutdown(wstate, state);
 
-	/*
-	 * When we WAL-logged index pages, we must nonetheless fsync index files.
-	 * Since we're building outside shared buffers, a CHECKPOINT occurring
-	 * during the build has no way to flush the previously written data to
-	 * disk (indeed it won't know the index even exists).  A crash later on
-	 * would replay WAL from the checkpoint, therefore it wouldn't replay our
-	 * earlier WAL entries. If we do not fsync those pages here, they might
-	 * still not be on disk when the crash occurs.
-	 */
-	if (wstate->btws_use_wal)
-		smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
+	unbuffered_finish(&wstate->ub_wstate, RelationGetSmgr(wstate->index),
+			MAIN_FORKNUM);
 }
 
 /*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index bfb74049d0..318dbee823 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -25,7 +25,7 @@
 #include "catalog/index.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
-#include "storage/smgr.h"
+#include "storage/directmgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -156,48 +156,43 @@ void
 spgbuildempty(Relation index)
 {
 	Page		page;
+	UnBufferedWriteState wstate;
+
+	/*
+	 * Though this is an unlogged relation, pass do_wal=true since the init
+	 * fork of an unlogged index must be wal-logged and fsync'd. This currently
+	 * has no effect, as unbuffered_write() does not do log_newpage()
+	 * internally. However, were this to be replaced with unbuffered_extend(),
+	 * do_wal must be true to ensure the data is logged and fsync'd.
+	 */
+	unbuffered_prep(&wstate, true);
 
 	/* Construct metapage. */
 	page = (Page) palloc(BLCKSZ);
 	SpGistInitMetapage(page);
 
-	/*
-	 * Write the page and log it unconditionally.  This is important
-	 * particularly for indexes created on tablespaces and databases whose
-	 * creation happened after the last redo pointer as recovery removes any
-	 * of their existing content when the corresponding create records are
-	 * replayed.
-	 */
-	PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
-	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
-			  (char *) page, true);
+	unbuffered_write(&wstate, RelationGetSmgr(index), INIT_FORKNUM,
+			SPGIST_METAPAGE_BLKNO, page);
 	log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
 				SPGIST_METAPAGE_BLKNO, page, true);
 
 	/* Likewise for the root page. */
 	SpGistInitPage(page, SPGIST_LEAF);
 
-	PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO);
-	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, SPGIST_ROOT_BLKNO,
-			  (char *) page, true);
+	unbuffered_write(&wstate, RelationGetSmgr(index), INIT_FORKNUM,
+			SPGIST_ROOT_BLKNO, page);
 	log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
 				SPGIST_ROOT_BLKNO, page, true);
 
 	/* Likewise for the null-tuples root page. */
 	SpGistInitPage(page, SPGIST_LEAF | SPGIST_NULLS);
 
-	PageSetChecksumInplace(page, SPGIST_NULL_BLKNO);
-	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, SPGIST_NULL_BLKNO,
-			  (char *) page, true);
+	unbuffered_write(&wstate, RelationGetSmgr(index), INIT_FORKNUM,
+			SPGIST_NULL_BLKNO, page);
 	log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
 				SPGIST_NULL_BLKNO, page, true);
 
-	/*
-	 * An immediate sync is required even if we xlog'd the pages, because the
-	 * writes did not go through shared buffers and therefore a concurrent
-	 * checkpoint may have moved the redo pointer past our xlog record.
-	 */
-	smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
+	unbuffered_finish(&wstate, RelationGetSmgr(index), INIT_FORKNUM);
 }
 
 /*
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 9b8075536a..0b211895c1 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -28,6 +28,7 @@
 #include "catalog/storage.h"
 #include "catalog/storage_xlog.h"
 #include "miscadmin.h"
+#include "storage/directmgr.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
 #include "utils/hsearch.h"
@@ -420,6 +421,8 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 	bool		copying_initfork;
 	BlockNumber nblocks;
 	BlockNumber blkno;
+	UnBufferedWriteState wstate;
+
 
 	page = (Page) buf.data;
 
@@ -440,6 +443,8 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 	use_wal = XLogIsNeeded() &&
 		(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
 
+	unbuffered_prep(&wstate, use_wal);
+
 	nblocks = smgrnblocks(src, forkNum);
 
 	for (blkno = 0; blkno < nblocks; blkno++)
@@ -474,30 +479,10 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 		 * page this is, so we have to log the full page including any unused
 		 * space.
 		 */
-		if (use_wal)
-			log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page, false);
-
-		PageSetChecksumInplace(page, blkno);
-
-		/*
-		 * Now write the page.  We say skipFsync = true because there's no
-		 * need for smgr to schedule an fsync for this write; we'll do it
-		 * ourselves below.
-		 */
-		smgrextend(dst, forkNum, blkno, buf.data, true);
+		unbuffered_extend(&wstate, dst, forkNum, blkno, page, false);
 	}
 
-	/*
-	 * When we WAL-logged rel pages, we must nonetheless fsync them.  The
-	 * reason is that since we're copying outside shared buffers, a CHECKPOINT
-	 * occurring during the copy has no way to flush the previously written
-	 * data to disk (indeed it won't know the new rel even exists).  A crash
-	 * later on would replay WAL from the checkpoint, therefore it wouldn't
-	 * replay our earlier WAL entries. If we do not fsync those pages here,
-	 * they might still not be on disk when the crash occurs.
-	 */
-	if (use_wal || copying_initfork)
-		smgrimmedsync(dst, forkNum);
+	unbuffered_finish(&wstate, dst, forkNum);
 }
 
 /*
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca2..501fae5f9d 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = buffer direct file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/direct/Makefile b/src/backend/storage/direct/Makefile
new file mode 100644
index 0000000000..d82bbed48c
--- /dev/null
+++ b/src/backend/storage/direct/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for storage/direct
+#
+# IDENTIFICATION
+#    src/backend/storage/direct/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/storage/direct
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = directmgr.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/direct/directmgr.c b/src/backend/storage/direct/directmgr.c
new file mode 100644
index 0000000000..42c37daa7a
--- /dev/null
+++ b/src/backend/storage/direct/directmgr.c
@@ -0,0 +1,98 @@
+/*-------------------------------------------------------------------------
+ *
+ * directmgr.c
+ *	  routines for managing unbuffered IO
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/direct/directmgr.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+
+#include "access/xlogdefs.h"
+#include "access/xloginsert.h"
+#include "storage/directmgr.h"
+
+void
+unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal)
+{
+	wstate->do_wal = do_wal;
+}
+
+void
+unbuffered_extend(UnBufferedWriteState *wstate, SMgrRelation
+		smgrrel, ForkNumber forknum, BlockNumber blocknum, Page page, bool
+		empty)
+{
+	/*
+	 * Don't checksum empty pages
+	 */
+	if (!empty)
+		PageSetChecksumInplace(page, blocknum);
+
+	smgrextend(smgrrel, forknum, blocknum, (char *) page, true);
+
+	/*
+	 * Don't WAL-log empty pages
+	 */
+	if (!empty && wstate->do_wal)
+		log_newpage(&(smgrrel)->smgr_rnode.node, forknum,
+					blocknum, page, true);
+}
+
+void
+unbuffered_extend_range(UnBufferedWriteState *wstate, SMgrRelation smgrrel,
+		ForkNumber forknum, int num_pages, BlockNumber *blocknums, Page *pages,
+		bool empty, XLogRecPtr custom_lsn)
+{
+	for (int i = 0; i < num_pages; i++)
+	{
+		Page		page = pages[i];
+		BlockNumber blkno = blocknums[i];
+
+		if (!XLogRecPtrIsInvalid(custom_lsn))
+			PageSetLSN(page, custom_lsn);
+
+		if (!empty)
+			PageSetChecksumInplace(page, blkno);
+
+		smgrextend(smgrrel, forknum, blkno, (char *) page, true);
+	}
+
+	if (!empty && wstate->do_wal)
+		log_newpages(&(smgrrel)->smgr_rnode.node, forknum, num_pages,
+				blocknums, pages, true);
+}
+
+void
+unbuffered_write(UnBufferedWriteState *wstate, SMgrRelation smgrrel, ForkNumber
+		forknum, BlockNumber blocknum, Page page)
+{
+	PageSetChecksumInplace(page, blocknum);
+
+	smgrwrite(smgrrel, forknum, blocknum, (char *) page, true);
+}
+
+/*
+ * When writing data outside shared buffers, a concurrent CHECKPOINT can move
+ * the redo pointer past our WAL entries and won't flush our data to disk. If
+ * the database crashes before the data makes it to disk, our WAL won't be
+ * replayed and the data will be lost.
+ * Thus, if a CHECKPOINT begins between unbuffered_prep() and
+ * unbuffered_finish(), the backend must fsync the data itself.
+ */
+void
+unbuffered_finish(UnBufferedWriteState *wstate, SMgrRelation smgrrel,
+		ForkNumber forknum)
+{
+	if (!wstate->do_wal)
+		return;
+
+	smgrimmedsync(smgrrel, forknum);
+}
diff --git a/src/include/storage/directmgr.h b/src/include/storage/directmgr.h
new file mode 100644
index 0000000000..db5e3b1cac
--- /dev/null
+++ b/src/include/storage/directmgr.h
@@ -0,0 +1,53 @@
+/*-------------------------------------------------------------------------
+ *
+ * directmgr.h
+ *	  POSTGRES unbuffered IO manager definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/directmgr.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef DIRECTMGR_H
+#define DIRECTMGR_H
+
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "storage/bufpage.h"
+#include "storage/smgr.h"
+
+typedef struct UnBufferedWriteState
+{
+	/*
+	 * When writing WAL-logged relation data outside of shared buffers, there
+	 * is a risk of a concurrent CHECKPOINT moving the redo pointer past the
+	 * data's associated WAL entries. To avoid this, callers in this situation
+	 * must fsync the pages they have written themselves. This is necessary
+	 * only if the relation is WAL-logged or in special cases such as the init
+	 * fork of an unlogged index.
+	 */
+	bool do_wal;
+} UnBufferedWriteState;
+/*
+ * prototypes for functions in directmgr.c
+ */
+extern void
+unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal);
+extern void
+unbuffered_extend(UnBufferedWriteState *wstate, SMgrRelation smgrrel,
+		ForkNumber forknum, BlockNumber blocknum, Page page, bool empty);
+extern void
+unbuffered_extend_range(UnBufferedWriteState *wstate, SMgrRelation smgrrel,
+		ForkNumber forknum, int num_pages, BlockNumber *blocknums, Page *pages,
+		bool empty, XLogRecPtr custom_lsn);
+extern void
+unbuffered_write(UnBufferedWriteState *wstate, SMgrRelation smgrrel, ForkNumber
+		forknum, BlockNumber blocknum, Page page);
+extern void
+unbuffered_finish(UnBufferedWriteState *wstate, SMgrRelation smgrrel,
+		ForkNumber forknum);
+
+#endif							/* DIRECTMGR_H */
-- 
2.30.2

v6-0004-Use-shared-buffers-when-possible-for-index-build.patchtext/x-patch; charset=US-ASCII; name=v6-0004-Use-shared-buffers-when-possible-for-index-build.patchDownload

From 377c195bccf2dd2529e64d0d453104485f7662b7 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Fri, 4 Mar 2022 15:52:45 -0500
Subject: [PATCH v6 4/4] Use shared buffers when possible for index build

When there are not too many tuples, building the index in shared buffers
makes sense. It allows the buffer manager to handle how best to do the
IO.
---
 src/backend/access/nbtree/nbtree.c  |  32 ++--
 src/backend/access/nbtree/nbtsort.c | 273 +++++++++++++++++++++-------
 2 files changed, 223 insertions(+), 82 deletions(-)

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index fc5cce4603..d3982b9388 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -152,34 +152,24 @@ void
 btbuildempty(Relation index)
 {
 	Page		metapage;
-	UnBufferedWriteState wstate;
+	Buffer metabuf;
 
 	/*
-	 * Though this is an unlogged relation, pass do_wal=true since the init
-	 * fork of an unlogged index must be wal-logged and fsync'd. This currently
-	 * has no effect, as unbuffered_write() does not do log_newpage()
-	 * internally. However, were this to be replaced with unbuffered_extend(),
-	 * do_wal must be true to ensure the data is logged and fsync'd.
+	 * Allocate a buffer for metapage and initialize metapage.
 	 */
-	unbuffered_prep(&wstate, true, true);
-
-	/* Construct metapage. */
-	metapage = (Page) palloc(BLCKSZ);
+	metabuf = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_ZERO_AND_LOCK,
+			NULL);
+	metapage = BufferGetPage(metabuf);
 	_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
 
 	/*
-	 * Write the page and log it.  It might seem that an immediate sync would
-	 * be sufficient to guarantee that the file exists on disk, but recovery
-	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
-	 * this even when wal_level=minimal.
+	 * Mark metapage buffer as dirty and XLOG it
 	 */
-	unbuffered_write(&wstate, RelationGetSmgr(index), INIT_FORKNUM,
-			BTREE_METAPAGE, metapage);
-	log_newpage(&RelationGetSmgr(index)->smgr_rnode.node, INIT_FORKNUM,
-				BTREE_METAPAGE, metapage, true);
-
-	unbuffered_finish(&wstate, RelationGetSmgr(index), INIT_FORKNUM);
+	START_CRIT_SECTION();
+	MarkBufferDirty(metabuf);
+	log_newpage_buffer(metabuf, true);
+	END_CRIT_SECTION();
+	_bt_relbuf(index, metabuf);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f1b9e2e24e..35ded6e4a1 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -23,13 +23,15 @@
  * many upper pages if the keys are reasonable-size) without risking a lot of
  * cascading splits during early insertions.
  *
- * Formerly the index pages being built were kept in shared buffers, but that
- * is of no value (since other backends have no interest in them yet) and it
- * created locking problems for CHECKPOINT, because the upper-level pages were
- * held exclusive-locked for long periods.  Now we just build the pages in
- * local memory and write or extend them with directmgr as we finish them.
- * They will need to be re-read into shared buffers on first use after the
- * build finishes.
+ * If indexing little enough data, it can be built efficiently in shared
+ * buffers, leaving checkpointer to deal with fsync'ing the data at its
+ * convenience. However, if indexing many pages, building the index in shared
+ * buffers could delay CHECKPOINTs--since the upper-level pages are
+ * exclusive-locked for long periods. In that case, build the pages in local
+ * memory and write or extend them with directmgr as they are finished. If a
+ * CHECKPOINT has begun since the build started, directmgr will fsync the
+ * relation file itself after finishing the build. The index will then need to
+ * be re-read into shared buffers on first use after the build finishes.
  *
  * This code isn't concerned about the FSM at all. The caller is responsible
  * for initializing that.
@@ -236,6 +238,7 @@ typedef struct BTPageState
 {
 	Page		btps_page;		/* workspace for page building */
 	BlockNumber btps_blkno;		/* block # to write this page at */
+	Buffer btps_buf; /* buffer to write this page to */
 	IndexTuple	btps_lowkey;	/* page's strict lower bound pivot tuple */
 	OffsetNumber btps_lastoff;	/* last item offset loaded */
 	Size		btps_lastextra; /* last item's extra posting list space */
@@ -253,10 +256,26 @@ typedef struct BTWriteState
 	Relation	index;
 	BTScanInsert inskey;		/* generic insertion scankey */
 	bool		btws_use_wal;	/* dump pages to WAL? */
-	BlockNumber btws_pages_alloced; /* # pages allocated */
-	BlockNumber btws_pages_written; /* # pages written out */
+	BlockNumber btws_pages_alloced; /* # pages allocated for index builds outside SB */
+	BlockNumber btws_pages_written; /* # pages written out for index builds outside SB */
 	Page		btws_zeropage;	/* workspace for filling zeroes */
 	UnBufferedWriteState ub_wstate;
+	/*
+	 * Allocate a new btree page. This does not initialize the page.
+	 */
+	Page (*_bt_bl_alloc_page) (struct BTWriteState *wstate, BlockNumber
+			*blockno, Buffer *buf);
+	/*
+	 * Emit a completed btree page, and release the working storage.
+	 */
+	void (*_bt_blwritepage) (struct BTWriteState *wstate, Page page,
+			BlockNumber blkno, Buffer buf);
+
+	void (*_bt_bl_unbuffered_prep) (UnBufferedWriteState *wstate, bool do_wal,
+			bool do_optimization);
+
+	void (*_bt_bl_unbuffered_finish) (UnBufferedWriteState *wstate,
+			SMgrRelation smgrrel, ForkNumber forknum);
 } BTWriteState;
 
 
@@ -265,10 +284,22 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
+static void _bt_leafbuild(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
-static Page _bt_blnewpage(uint32 level);
+
+static Page
+_bt_bl_alloc_page_direct(BTWriteState *wstate, BlockNumber *blockno, Buffer *buf);
+static Page
+_bt_bl_alloc_page_shared(BTWriteState *wstate, BlockNumber *blockno, Buffer *buf);
+
+static Page _bt_blnewpage(uint32 level, Page page);
+
+static void
+_bt_blwritepage_direct(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf);
+static void
+_bt_blwritepage_shared(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf);
+
 static BTPageState *_bt_pagestate(BTWriteState *wstate, uint32 level);
 static void _bt_slideleft(Page rightmostpage);
 static void _bt_sortaddtup(Page page, Size itemsize,
@@ -279,9 +310,10 @@ static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
 										  BTPageState *state,
 										  BTDedupState dstate);
-static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
+static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state, Buffer
+		metabuf, Page metapage);
 static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
 							   int request);
 static void _bt_end_parallel(BTLeader *btleader);
@@ -294,6 +326,21 @@ static void _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 									   Sharedsort *sharedsort2, int sortmem,
 									   bool progress);
 
+#define BT_BUILD_SB_THRESHOLD 1024
+
+static const BTWriteState wstate_shared = {
+	._bt_bl_alloc_page = _bt_bl_alloc_page_shared,
+	._bt_blwritepage = _bt_blwritepage_shared,
+	._bt_bl_unbuffered_prep = NULL,
+	._bt_bl_unbuffered_finish = NULL,
+};
+
+static const BTWriteState wstate_direct = {
+	._bt_bl_alloc_page = _bt_bl_alloc_page_direct,
+	._bt_blwritepage = _bt_blwritepage_direct,
+	._bt_bl_unbuffered_prep = unbuffered_prep,
+	._bt_bl_unbuffered_finish = unbuffered_finish,
+};
 
 /*
  *	btbuild() -- build a new btree index.
@@ -303,6 +350,7 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 {
 	IndexBuildResult *result;
 	BTBuildState buildstate;
+	BTWriteState writestate;
 	double		reltuples;
 
 #ifdef BTREE_BUILD_STATS
@@ -333,8 +381,12 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
+	 *
+	 * Don't use shared buffers if the number of tuples is too large.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2);
+	writestate = reltuples < BT_BUILD_SB_THRESHOLD ? wstate_shared : wstate_direct;
+
+	_bt_leafbuild(&writestate, buildstate.spool, buildstate.spool2);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
@@ -542,10 +594,8 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
+_bt_leafbuild(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 {
-	BTWriteState wstate;
-
 #ifdef BTREE_BUILD_STATS
 	if (log_btree_build_stats)
 	{
@@ -565,21 +615,38 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 		tuplesort_performsort(btspool2->sortstate);
 	}
 
-	wstate.heap = btspool->heap;
-	wstate.index = btspool->index;
-	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
-	/* _bt_mkscankey() won't set allequalimage without metapage */
-	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
-	wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
+	wstate->heap = btspool->heap;
+	wstate->index = btspool->index;
+	wstate->inskey = _bt_mkscankey(wstate->index, NULL);
+	/* _bt-mkscankey() won't set allequalimage without metapage */
+	wstate->inskey->allequalimage = _bt_allequalimage(wstate->index, true);
+	wstate->btws_use_wal = RelationNeedsWAL(wstate->index);
 
 	/* reserve the metapage */
-	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
-	wstate.btws_pages_written = 0;
-	wstate.btws_zeropage = NULL;	/* until needed */
+	wstate->btws_pages_alloced = 0;
+	wstate->btws_pages_written = 0;
+	wstate->btws_zeropage = NULL;
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
-	_bt_load(&wstate, btspool, btspool2);
+
+	/*
+	 * If not using shared buffers, for a WAL-logged relation, save the redo
+	 * pointer location in case a checkpoint begins during the index build.
+	 */
+	if (wstate->_bt_bl_unbuffered_prep)
+		wstate->_bt_bl_unbuffered_prep(&wstate->ub_wstate,
+				wstate->btws_use_wal, wstate->btws_use_wal);
+
+	_bt_load(wstate, btspool, btspool2);
+
+	/*
+	 * If not using shared buffers, for a WAL-logged relation, check if backend
+	 * must fsync the page itself.
+	 */
+	if (wstate->_bt_bl_unbuffered_finish)
+		wstate->_bt_bl_unbuffered_finish(&wstate->ub_wstate,
+				RelationGetSmgr(wstate->index), MAIN_FORKNUM);
 }
 
 /*
@@ -612,15 +679,15 @@ _bt_build_callback(Relation index,
 }
 
 /*
- * allocate workspace for a new, clean btree page, not linked to any siblings.
+ * Set up workspace for a new, clean btree page, not linked to any siblings.
+ * Caller must allocate the passed in page.
  */
 static Page
-_bt_blnewpage(uint32 level)
+_bt_blnewpage(uint32 level, Page page)
 {
-	Page		page;
 	BTPageOpaque opaque;
 
-	page = (Page) palloc(BLCKSZ);
+	Assert(page);
 
 	/* Zero the page and set up standard page header info */
 	_bt_pageinit(page, BLCKSZ);
@@ -638,11 +705,8 @@ _bt_blnewpage(uint32 level)
 	return page;
 }
 
-/*
- * emit a completed btree page, and release the working storage.
- */
 static void
-_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
+_bt_blwritepage_direct(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf)
 {
 	/*
 	 * If we have to write pages nonsequentially, fill in the space with
@@ -685,6 +749,61 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 	pfree(page);
 }
 
+static void
+_bt_blwritepage_shared(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf)
+{
+	/*
+	 * Indexes built in shared buffers need only to mark the buffer as dirty
+	 * and XLOG it.
+	 */
+	Assert(buf);
+	START_CRIT_SECTION();
+	MarkBufferDirty(buf);
+	if (wstate->btws_use_wal)
+		log_newpage_buffer(buf, true);
+	END_CRIT_SECTION();
+	_bt_relbuf(wstate->index, buf);
+}
+
+static Page
+_bt_bl_alloc_page_direct(BTWriteState *wstate, BlockNumber *blockno, Buffer *buf)
+{
+	 /* buf is only used when using shared buffers, so set it to InvalidBuffer */
+	*buf = InvalidBuffer;
+
+	/*
+	 * Assign block number for the page.
+	 * This will be used to link to sibling page(s) later and, if this is the
+	 * initial page in the level, saved in the BTPageState
+	 */
+	*blockno = wstate->btws_pages_alloced++;
+
+	/* now allocate and set up the new page */
+	return palloc(BLCKSZ);
+}
+
+static Page
+_bt_bl_alloc_page_shared(BTWriteState *wstate, BlockNumber *blockno, Buffer *buf)
+{
+	/*
+	 * Find a shared buffer for the page. Pass mode RBM_ZERO_AND_LOCK to get an
+	 * exclusive lock on the buffer content. No lock on the relation as a whole
+	 * is needed (as in LockRelationForExtension()) because the initial index
+	 * build is not yet complete.
+	 */
+	*buf = ReadBufferExtended(wstate->index, MAIN_FORKNUM, P_NEW,
+			RBM_ZERO_AND_LOCK, NULL);
+
+	/*
+	 * bufmgr will assign a block number for the new page.
+	 * This will be used to link to sibling page(s) later and, if this is the
+	 * initial page in the level, saved in the BTPageState
+	 */
+	*blockno = BufferGetBlockNumber(*buf);
+
+	return BufferGetPage(*buf);
+}
+
 /*
  * allocate and initialize a new BTPageState.  the returned structure
  * is suitable for immediate use by _bt_buildadd.
@@ -692,13 +811,20 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 static BTPageState *
 _bt_pagestate(BTWriteState *wstate, uint32 level)
 {
+	Buffer buf;
+	BlockNumber blockno;
+
 	BTPageState *state = (BTPageState *) palloc0(sizeof(BTPageState));
 
-	/* create initial page for level */
-	state->btps_page = _bt_blnewpage(level);
+	/*
+	 * Allocate and initialize initial page for the level, and if using shared
+	 * buffers, extend the relation and allocate a shared buffer for the block.
+	 */
+	state->btps_page = _bt_blnewpage(level, wstate->_bt_bl_alloc_page(wstate,
+				&blockno, &buf));
 
-	/* and assign it a page position */
-	state->btps_blkno = wstate->btws_pages_alloced++;
+	state->btps_blkno = blockno;
+	state->btps_buf = buf;
 
 	state->btps_lowkey = NULL;
 	/* initialize lastoff so first item goes into P_FIRSTKEY */
@@ -833,6 +959,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 {
 	Page		npage;
 	BlockNumber nblkno;
+	Buffer nbuf;
 	OffsetNumber last_off;
 	Size		last_truncextra;
 	Size		pgspc;
@@ -847,6 +974,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 
 	npage = state->btps_page;
 	nblkno = state->btps_blkno;
+	nbuf = state->btps_buf;
 	last_off = state->btps_lastoff;
 	last_truncextra = state->btps_lastextra;
 	state->btps_lastextra = truncextra;
@@ -903,15 +1031,14 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 		 */
 		Page		opage = npage;
 		BlockNumber oblkno = nblkno;
+		Buffer obuf = nbuf;
 		ItemId		ii;
 		ItemId		hii;
 		IndexTuple	oitup;
 
-		/* Create new page of same level */
-		npage = _bt_blnewpage(state->btps_level);
-
-		/* and assign it a page position */
-		nblkno = wstate->btws_pages_alloced++;
+		/* Create and initialize a new page of same level */
+		npage = _bt_blnewpage(state->btps_level,
+				wstate->_bt_bl_alloc_page(wstate, &nblkno, &nbuf));
 
 		/*
 		 * We copy the last item on the page into the new page, and then
@@ -1021,9 +1148,12 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 
 		/*
 		 * Write out the old page.  We never need to touch it again, so we can
-		 * free the opage workspace too.
+		 * free the opage workspace too. obuf has been released and is no longer
+		 * valid.
 		 */
-		_bt_blwritepage(wstate, opage, oblkno);
+		 wstate->_bt_blwritepage(wstate, opage, oblkno, obuf);
+		 obuf = InvalidBuffer;
+		 opage = NULL;
 
 		/*
 		 * Reset last_off to point to new page
@@ -1058,6 +1188,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 
 	state->btps_page = npage;
 	state->btps_blkno = nblkno;
+	state->btps_buf = nbuf;
 	state->btps_lastoff = last_off;
 }
 
@@ -1103,12 +1234,12 @@ _bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
  * Finish writing out the completed btree.
  */
 static void
-_bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
+_bt_uppershutdown(BTWriteState *wstate, BTPageState *state, Buffer metabuf,
+		Page metapage)
 {
 	BTPageState *s;
 	BlockNumber rootblkno = P_NONE;
 	uint32		rootlevel = 0;
-	Page		metapage;
 
 	/*
 	 * Each iteration of this loop completes one more level of the tree.
@@ -1154,20 +1285,24 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		 * back one slot.  Then we can dump out the page.
 		 */
 		_bt_slideleft(s->btps_page);
-		_bt_blwritepage(wstate, s->btps_page, s->btps_blkno);
+		wstate->_bt_blwritepage(wstate, s->btps_page, s->btps_blkno, s->btps_buf);
+		s->btps_buf = InvalidBuffer;
 		s->btps_page = NULL;	/* writepage freed the workspace */
 	}
 
 	/*
-	 * As the last step in the process, construct the metapage and make it
+	 * As the last step in the process, initialize the metapage and make it
 	 * point to the new root (unless we had no data at all, in which case it's
 	 * set to point to "P_NONE").  This changes the index to the "valid" state
 	 * by filling in a valid magic number in the metapage.
+	 * After this, metapage will have been freed or invalid and metabuf, if ever
+	 * valid, will have been released.
 	 */
-	metapage = (Page) palloc(BLCKSZ);
 	_bt_initmetapage(metapage, rootblkno, rootlevel,
 					 wstate->inskey->allequalimage);
-	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
+	wstate->_bt_blwritepage(wstate, metapage, BTREE_METAPAGE, metabuf);
+	metabuf = InvalidBuffer;
+	metapage = NULL;
 }
 
 /*
@@ -1177,6 +1312,10 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 static void
 _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 {
+	Page metapage;
+	BlockNumber metablkno;
+	Buffer metabuf;
+
 	BTPageState *state = NULL;
 	bool		merge = (btspool2 != NULL);
 	IndexTuple	itup,
@@ -1189,15 +1328,30 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	int64		tuples_done = 0;
 	bool		deduplicate;
 
-	/*
-	 * The fsync optimization done by directmgr is only relevant if
-	 * WAL-logging, so pass btws_use_wal for this parameter.
-	 */
-	unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal, wstate->btws_use_wal);
-
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
 
+	/*
+	 * Reserve block 0 for the metapage up front.
+	 *
+	 * When using the shared buffers API it is easier to allocate the buffer
+	 * for block 0 first instead of trying skip block 0 and allocate it at the
+	 * end of index build.
+	 *
+	 * When not using the shared buffers API, there is no harm in allocating
+	 * the metapage first. When block 1 is written, the direct writepage
+	 * function will zero-fill block 0. When writing out the metapage at the
+	 * end of index build, it will overwrite that block 0.
+	 *
+	 * The metapage will be initialized and written out at the end of the index
+	 * build when all of the information needed to do so is available.
+	 *
+	 * The block number will always be BTREE_METAPAGE, so the metablkno
+	 * variable is unused and only created to avoid a special case in the
+	 * direct alloc function.
+	 */
+	metapage = wstate->_bt_bl_alloc_page(wstate, &metablkno, &metabuf);
+
 	if (merge)
 	{
 		/*
@@ -1419,10 +1573,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	}
 
 	/* Close down final pages and write the metapage */
-	_bt_uppershutdown(wstate, state);
-
-	unbuffered_finish(&wstate->ub_wstate, RelationGetSmgr(wstate->index),
-			MAIN_FORKNUM);
+	_bt_uppershutdown(wstate, state, metabuf, metapage);
 }
 
 /*
-- 
2.30.2

v6-0002-Avoid-immediate-fsync-for-unbuffered-IO.patchtext/x-patch; charset=US-ASCII; name=v6-0002-Avoid-immediate-fsync-for-unbuffered-IO.patchDownload

From ac914bf0e227b3cb62918954a7d7567a3f09f3b7 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Fri, 4 Mar 2022 14:14:21 -0500
Subject: [PATCH v6 2/4] Avoid immediate fsync for unbuffered IO

Data written to WAL-logged relations is durable once the WAL entries are
on permanent storage; however, the XLOG Redo pointer cannot be moved
past the associated WAL until the page data is safely on permanent
storage. If a crash were to occur before the data is fsync'd, the WAL
wouldn't be replayed during recovery, and the data would be lost.

This is not a problem with pages written in shared buffers because the
checkpointer will block until FlushBuffer() is complete for all buffers
that were dirtied before it began. Therefore it will not move the Redo
pointer past their associated WAL entries until it has fsync'd the data.

A backend writing data outside of shared buffers must ensure that the
data has reached permanent storage itself or that the Redo pointer has
not moved while it was writing the data.

In the common case, the backend should not have to do this fsync itself
and can instead request the checkpointer do it.

To ensure this is safe, the backend can save the XLOG Redo pointer
location before doing the write or extend. Then it can add an fsync
request for the page to the checkpointer's pending-ops table using the
existing mechanism. After doing the write or extend, if the Redo pointer
has moved (meaning a checkpoint has started since it saved it last),
then the backend can simply fsync the page itself. Otherwise, the
checkpointer takes care of fsync'ing the page the next time it processes
the pending-ops table.

This commit adds the optimization option to the directmgr API but does
not add any users, so there is no behavior change.
---
 contrib/bloom/blinsert.c               |  2 +-
 src/backend/access/gist/gistbuild.c    |  2 +-
 src/backend/access/heap/rewriteheap.c  |  3 +--
 src/backend/access/nbtree/nbtree.c     |  2 +-
 src/backend/access/nbtree/nbtsort.c    |  2 +-
 src/backend/access/spgist/spginsert.c  |  2 +-
 src/backend/access/transam/xlog.c      | 13 +++++++++++++
 src/backend/catalog/storage.c          |  2 +-
 src/backend/storage/direct/directmgr.c | 14 +++++++++++++-
 src/include/access/xlog.h              |  1 +
 src/include/storage/directmgr.h        | 11 ++++++++++-
 11 files changed, 44 insertions(+), 10 deletions(-)

diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index 7954a17e2d..fdde2bd88a 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -173,7 +173,7 @@ blbuildempty(Relation index)
 	 * internally. However, were this to be replaced with unbuffered_extend(),
 	 * do_wal must be true to ensure the data is logged and fsync'd.
 	 */
-	unbuffered_prep(&wstate, true);
+	unbuffered_prep(&wstate, true, false);
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index fc09938f80..8de19199a6 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -412,7 +412,7 @@ gist_indexsortbuild(GISTBuildState *state)
 	state->pages_written = 0;
 	state->ready_num_pages = 0;
 
-	unbuffered_prep(&state->ub_wstate, RelationNeedsWAL(state->indexrel));
+	unbuffered_prep(&state->ub_wstate, RelationNeedsWAL(state->indexrel), false);
 
 	/*
 	 * Write an empty page as a placeholder for the root page. It will be
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index c8fa8bb27c..70b7a0f269 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -267,8 +267,7 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
 	state->rs_cxt = rw_cxt;
 
 	unbuffered_prep(&state->rs_unbuffered_wstate,
-			RelationNeedsWAL(state->rs_new_rel));
-
+			RelationNeedsWAL(state->rs_new_rel), false);
 
 	/* Initialize hash tables used to track update chains */
 	hash_ctl.keysize = sizeof(TidHashKey);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c7bf971917..6b78acefbe 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -161,7 +161,7 @@ btbuildempty(Relation index)
 	 * internally. However, were this to be replaced with unbuffered_extend(),
 	 * do_wal must be true to ensure the data is logged and fsync'd.
 	 */
-	unbuffered_prep(&wstate, true);
+	unbuffered_prep(&wstate, true, false);
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index e280253127..d6d0d4b361 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1189,7 +1189,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	int64		tuples_done = 0;
 	bool		deduplicate;
 
-	unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal);
+	unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal, false);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 318dbee823..e30f3623f5 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -165,7 +165,7 @@ spgbuildempty(Relation index)
 	 * internally. However, were this to be replaced with unbuffered_extend(),
 	 * do_wal must be true to ensure the data is logged and fsync'd.
 	 */
-	unbuffered_prep(&wstate, true);
+	unbuffered_prep(&wstate, true, false);
 
 	/* Construct metapage. */
 	page = (Page) palloc(BLCKSZ);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0d2bd7a357..db7b33ec67 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5971,6 +5971,19 @@ GetLastImportantRecPtr(void)
 	return res;
 }
 
+bool RedoRecPtrChanged(XLogRecPtr comparator_ptr)
+{
+	XLogRecPtr ptr;
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ptr = XLogCtl->RedoRecPtr;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	if (RedoRecPtr < ptr)
+		RedoRecPtr = ptr;
+
+	return RedoRecPtr != comparator_ptr;
+}
+
 /*
  * Get the time and LSN of the last xlog segment switch
  */
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 0b211895c1..2fd5a31ffd 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -443,7 +443,7 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 	use_wal = XLogIsNeeded() &&
 		(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
 
-	unbuffered_prep(&wstate, use_wal);
+	unbuffered_prep(&wstate, use_wal, false);
 
 	nblocks = smgrnblocks(src, forkNum);
 
diff --git a/src/backend/storage/direct/directmgr.c b/src/backend/storage/direct/directmgr.c
index 42c37daa7a..4c3a5a2f74 100644
--- a/src/backend/storage/direct/directmgr.c
+++ b/src/backend/storage/direct/directmgr.c
@@ -15,14 +15,23 @@
 #include "postgres.h"
 
 
+#include "access/xlog.h"
 #include "access/xlogdefs.h"
 #include "access/xloginsert.h"
 #include "storage/directmgr.h"
 
 void
-unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal)
+unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal, bool
+		do_optimization)
 {
+	/*
+	 * There is no valid fsync optimization if no WAL is being written anyway
+	 */
+	Assert(!do_optimization || (do_optimization && do_wal));
+
 	wstate->do_wal = do_wal;
+	wstate->do_optimization = do_optimization;
+	wstate->redo = do_optimization ? GetRedoRecPtr() : InvalidXLogRecPtr;
 }
 
 void
@@ -94,5 +103,8 @@ unbuffered_finish(UnBufferedWriteState *wstate, SMgrRelation smgrrel,
 	if (!wstate->do_wal)
 		return;
 
+	if (wstate->do_optimization && !RedoRecPtrChanged(wstate->redo))
+		return;
+
 	smgrimmedsync(smgrrel, forknum);
 }
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 4b45ac64db..71fe99a28d 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -241,6 +241,7 @@ extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(TimeLineID *insertTLI);
 extern TimeLineID GetWALInsertionTimeLine(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern bool RedoRecPtrChanged(XLogRecPtr comparator_ptr);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
diff --git a/src/include/storage/directmgr.h b/src/include/storage/directmgr.h
index db5e3b1cac..e5454a3296 100644
--- a/src/include/storage/directmgr.h
+++ b/src/include/storage/directmgr.h
@@ -14,6 +14,7 @@
 #ifndef DIRECTMGR_H
 #define DIRECTMGR_H
 
+#include "access/xlogdefs.h"
 #include "common/relpath.h"
 #include "storage/block.h"
 #include "storage/bufpage.h"
@@ -28,14 +29,22 @@ typedef struct UnBufferedWriteState
 	 * must fsync the pages they have written themselves. This is necessary
 	 * only if the relation is WAL-logged or in special cases such as the init
 	 * fork of an unlogged index.
+	 *
+	 * These callers can optionally use the following optimization: attempt to
+	 * use the sync request queue and fall back to fsync'ing the pages
+	 * themselves if the Redo pointer moves between the start and finish of
+	 * their write. In order to do this, they must set do_optimization to true
+	 * so that the redo pointer is saved before the write begins.
 	 */
 	bool do_wal;
+	bool do_optimization;
+	XLogRecPtr redo;
 } UnBufferedWriteState;
 /*
  * prototypes for functions in directmgr.c
  */
 extern void
-unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal);
+unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal, bool do_optimization);
 extern void
 unbuffered_extend(UnBufferedWriteState *wstate, SMgrRelation smgrrel,
 		ForkNumber forknum, BlockNumber blocknum, Page page, bool empty);
-- 
2.30.2

#20