Relation bulk write facility

Started by Heikki Linnakangas · over 2 years ago · 54 messages
#1 Heikki Linnakangas
hlinnaka@iki.fi
1 attachment(s)

Several places bypass the buffer manager and use direct smgrextend()
calls to populate a new relation: Index AM build methods, rewriteheap.c
and RelationCopyStorage(). There's a fair amount of duplicated code to
WAL-log the pages, calculate checksums, call smgrextend(), and finally
call smgrimmedsync() if needed. The duplication is tedious and
error-prone. For example, if we want to optimize by WAL-logging multiple
pages in one record, that needs to be implemented in each AM separately.
Currently only the sorted GiST index build does that, but it would be equally
beneficial in all of those places.
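
For reference, the pattern that each of those places repeats today looks
roughly like this (a simplified sketch paraphrased from the code that the
patch removes; the details vary a bit between nbtsort.c, rewriteheap.c and
storage.c):

/* for each completed page: */
if (RelationNeedsWAL(rel))
	log_newpage(&rel->rd_locator, MAIN_FORKNUM, blkno, page, true);

PageSetChecksumInplace(page, blkno);
smgrextend(RelationGetSmgr(rel), MAIN_FORKNUM, blkno, page, true);

/* and once at the end of the whole operation: */
if (RelationNeedsWAL(rel))
	smgrimmedsync(RelationGetSmgr(rel), MAIN_FORKNUM);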

And I believe we got the smgrimmedsync() logic slightly wrong in a
number of places [1]. And it's not great for latency; we could let the
checkpointer do the fsyncing lazily, like Robert mentioned in the same
thread.

The attached patch centralizes that pattern to a new bulk writing
facility, and changes all those AMs to use it. The facility buffers 32
pages, WAL-logs them in one record, and calculates checksums. You could
imagine a lot of further optimizations, like writing those 32 pages in
one vectored pwritev() call [2], and not skipping the buffer manager
when the relation is small. But the scope of this initial version is
mostly to refactor the existing code.
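
To give an idea of the interface, a converted caller now does roughly this
(a simplified sketch using the function names from the attached patch, not
lifted verbatim from any one AM):

BulkWriteState *bulkw;

bulkw = bulkw_start_rel(rel, MAIN_FORKNUM);

for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
{
	Page		page = bulkw_alloc_buf(bulkw);

	PageInit(page, BLCKSZ, 0);
	/* ... fill in the page contents ... */

	/* hand the completed page over; bulkw_write() takes ownership of it */
	bulkw_write(bulkw, blkno, page, true);
}

/* WAL-log and write out any remaining buffered pages, and handle the fsync */
bulkw_finish(bulkw);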

One new optimization included here is to let the checkpointer do the
fsyncing if possible. That gives a big speedup when e.g. restoring a
schema-only dump with lots of relations.
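
The way bulkw_finish() handles that is to remember the redo pointer when the
bulk operation starts and compare it at the end (condensed from the patch;
the real code also skips all of this for temporary relations):

/* at bulkw_start_rel() / bulkw_start_smgr() time */
bulkw->start_RedoRecPtr = GetRedoRecPtr();

/* at bulkw_finish() time */
MyProc->delayChkptFlags |= DELAY_CHKPT_START;
if (bulkw->start_RedoRecPtr != GetRedoRecPtr())
{
	/* a checkpoint started concurrently and missed our writes */
	MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
	smgrimmedsync(bulkw->smgr, bulkw->forknum);
}
else
{
	/* no checkpoint since we started; leave the fsync to the checkpointer */
	smgrregistersync(bulkw->smgr, bulkw->forknum);
	MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
}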

[1]: /messages/by-id/58effc10-c160-b4a6-4eb7-384e95e6f9e3@iki.fi

[2]: /messages/by-id/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachments:

v1-0001-Introduce-a-new-bulk-loading-facility.patch (text/x-patch)
From 33f5cafc1512e8a004df2d506da71dfbbef5c60d Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 19 Sep 2023 18:09:34 +0300
Subject: [PATCH v1 1/1] Introduce a new bulk loading facility.

The new facility makes it easier to optimize bulk loading, as the
logic for buffering, WAL-logging, and syncing the relation only needs
to be implemented once. It's also less error-prone: We have had a
number of bugs in how a relation is fsync'd - or not - at the end of a
bulk loading operation. By centralizing that logic to one place, we
only need to write it correctly once.

The new facility is faster for small relations: Instead of calling
smgrimmedsync(), we register the fsync to happen at next checkpoint,
which avoids the fsync latency. That can make a big difference if you
are e.g. restoring a schema-only dump with lots of relations.

It is also slightly more efficient with large relations, as the WAL
logging is performed multiple pages at a time. That avoids some WAL
header overhead. The sorted GiST index build did that already, this
moves the buffering to the new facility.

The changes to the pageinspect GiST test need an explanation: Before this
patch, the sorted GiST index build set the LSN on every page to the
special GistBuildLSN value, not the LSN of the WAL record, even though
they were WAL-logged. There was no particular need for it, it just
happened naturally when we wrote out the pages before WAL-logging
them. Now we WAL-log the pages first, like in B-tree build, so the
pages are stamped with the record's real LSN. When the build is not
WAL-logged, we still use GistBuildLSN. To make the test output
predictable, use an unlogged index.
---
 src/backend/access/gist/gistbuild.c   | 111 ++-------
 src/backend/access/heap/rewriteheap.c |  71 ++----
 src/backend/access/nbtree/nbtree.c    |  29 +--
 src/backend/access/nbtree/nbtsort.c   | 102 ++------
 src/backend/access/spgist/spginsert.c |  49 ++--
 src/backend/catalog/storage.c         |  38 +--
 src/backend/storage/smgr/Makefile     |   1 +
 src/backend/storage/smgr/bulk_write.c | 334 ++++++++++++++++++++++++++
 src/backend/storage/smgr/md.c         |  43 ++++
 src/backend/storage/smgr/meson.build  |   1 +
 src/backend/storage/smgr/smgr.c       |  31 +++
 src/include/storage/bulk_write.h      |  28 +++
 src/include/storage/md.h              |   1 +
 src/include/storage/smgr.h            |   1 +
 14 files changed, 535 insertions(+), 305 deletions(-)
 create mode 100644 src/backend/storage/smgr/bulk_write.c
 create mode 100644 src/include/storage/bulk_write.h

diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 5e0c1447f92..950b6046ac5 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -43,7 +43,8 @@
 #include "miscadmin.h"
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
-#include "storage/smgr.h"
+#include "storage/bulk_write.h"
+
 #include "utils/memutils.h"
 #include "utils/rel.h"
 #include "utils/tuplesort.h"
@@ -106,11 +107,8 @@ typedef struct
 	Tuplesortstate *sortstate;	/* state data for tuplesort.c */
 
 	BlockNumber pages_allocated;
-	BlockNumber pages_written;
 
-	int			ready_num_pages;
-	BlockNumber ready_blknos[XLR_MAX_BLOCK_ID];
-	Page		ready_pages[XLR_MAX_BLOCK_ID];
+	BulkWriteState *bulkw;
 } GISTBuildState;
 
 #define GIST_SORTED_BUILD_PAGE_NUM 4
@@ -142,7 +140,6 @@ static void gist_indexsortbuild_levelstate_add(GISTBuildState *state,
 											   IndexTuple itup);
 static void gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 												 GistSortedBuildLevelState *levelstate);
-static void gist_indexsortbuild_flush_ready_pages(GISTBuildState *state);
 
 static void gistInitBuffering(GISTBuildState *buildstate);
 static int	calculatePagesPerBuffer(GISTBuildState *buildstate, int levelStep);
@@ -407,21 +404,13 @@ gist_indexsortbuild(GISTBuildState *state)
 	GistSortedBuildLevelState *levelstate;
 	Page		page;
 
-	state->pages_allocated = 0;
-	state->pages_written = 0;
-	state->ready_num_pages = 0;
+	/* Reserve block 0 for the root page */
+	state->pages_allocated = 1;
 
-	/*
-	 * Write an empty page as a placeholder for the root page. It will be
-	 * replaced with the real root page at the end.
-	 */
-	page = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, MCXT_ALLOC_ZERO);
-	smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
-			   page, true);
-	state->pages_allocated++;
-	state->pages_written++;
+	state->bulkw = bulkw_start_rel(state->indexrel, MAIN_FORKNUM);
 
 	/* Allocate a temporary buffer for the first leaf page batch. */
+	page = bulkw_alloc_buf(state->bulkw);
 	levelstate = palloc0(sizeof(GistSortedBuildLevelState));
 	levelstate->pages[0] = page;
 	levelstate->parent = NULL;
@@ -455,31 +444,13 @@ gist_indexsortbuild(GISTBuildState *state)
 		levelstate = parent;
 	}
 
-	gist_indexsortbuild_flush_ready_pages(state);
-
 	/* Write out the root */
 	PageSetLSN(levelstate->pages[0], GistBuildLSN);
-	PageSetChecksumInplace(levelstate->pages[0], GIST_ROOT_BLKNO);
-	smgrwrite(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
-			  levelstate->pages[0], true);
-	if (RelationNeedsWAL(state->indexrel))
-		log_newpage(&state->indexrel->rd_locator, MAIN_FORKNUM, GIST_ROOT_BLKNO,
-					levelstate->pages[0], true);
-
-	pfree(levelstate->pages[0]);
+	bulkw_write(state->bulkw, GIST_ROOT_BLKNO, levelstate->pages[0], true);
+
 	pfree(levelstate);
 
-	/*
-	 * When we WAL-logged index pages, we must nonetheless fsync index files.
-	 * Since we're building outside shared buffers, a CHECKPOINT occurring
-	 * during the build has no way to flush the previously written data to
-	 * disk (indeed it won't know the index even exists).  A crash later on
-	 * would replay WAL from the checkpoint, therefore it wouldn't replay our
-	 * earlier WAL entries. If we do not fsync those pages here, they might
-	 * still not be on disk when the crash occurs.
-	 */
-	if (RelationNeedsWAL(state->indexrel))
-		smgrimmedsync(RelationGetSmgr(state->indexrel), MAIN_FORKNUM);
+	bulkw_finish(state->bulkw);
 }
 
 /*
@@ -510,7 +481,7 @@ gist_indexsortbuild_levelstate_add(GISTBuildState *state,
 
 		if (levelstate->pages[levelstate->current_page] == NULL)
 			levelstate->pages[levelstate->current_page] =
-				palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
+				bulkw_alloc_buf(state->bulkw);
 
 		newPage = levelstate->pages[levelstate->current_page];
 		gistinitpage(newPage, old_page_flags);
@@ -580,7 +551,7 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 
 		/* Create page and copy data */
 		data = (char *) (dist->list);
-		target = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, MCXT_ALLOC_ZERO);
+		target = bulkw_alloc_buf(state->bulkw);
 		gistinitpage(target, isleaf ? F_LEAF : 0);
 		for (int i = 0; i < dist->block.num; i++)
 		{
@@ -593,20 +564,6 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 		}
 		union_tuple = dist->itup;
 
-		if (state->ready_num_pages == XLR_MAX_BLOCK_ID)
-			gist_indexsortbuild_flush_ready_pages(state);
-
-		/*
-		 * The page is now complete. Assign a block number to it, and add it
-		 * to the list of finished pages. (We don't write it out immediately,
-		 * because we want to WAL-log the pages in batches.)
-		 */
-		blkno = state->pages_allocated++;
-		state->ready_blknos[state->ready_num_pages] = blkno;
-		state->ready_pages[state->ready_num_pages] = target;
-		state->ready_num_pages++;
-		ItemPointerSetBlockNumber(&(union_tuple->t_tid), blkno);
-
 		/*
 		 * Set the right link to point to the previous page. This is just for
 		 * debugging purposes: GiST only follows the right link if a page is
@@ -621,6 +578,15 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 		 */
 		if (levelstate->last_blkno)
 			GistPageGetOpaque(target)->rightlink = levelstate->last_blkno;
+
+		/*
+		 * The page is now complete. Assign a block number to it, and pass it
+		 * to the bulk writer.
+		 */
+		blkno = state->pages_allocated++;
+		PageSetLSN(target, GistBuildLSN);
+		bulkw_write(state->bulkw, blkno, target, true);
+		ItemPointerSetBlockNumber(&(union_tuple->t_tid), blkno);
 		levelstate->last_blkno = blkno;
 
 		/*
@@ -631,7 +597,7 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 		if (parent == NULL)
 		{
 			parent = palloc0(sizeof(GistSortedBuildLevelState));
-			parent->pages[0] = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
+			parent->pages[0] = bulkw_alloc_buf(state->bulkw);
 			parent->parent = NULL;
 			gistinitpage(parent->pages[0], 0);
 
@@ -641,39 +607,6 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 	}
 }
 
-static void
-gist_indexsortbuild_flush_ready_pages(GISTBuildState *state)
-{
-	if (state->ready_num_pages == 0)
-		return;
-
-	for (int i = 0; i < state->ready_num_pages; i++)
-	{
-		Page		page = state->ready_pages[i];
-		BlockNumber blkno = state->ready_blknos[i];
-
-		/* Currently, the blocks must be buffered in order. */
-		if (blkno != state->pages_written)
-			elog(ERROR, "unexpected block number to flush GiST sorting build");
-
-		PageSetLSN(page, GistBuildLSN);
-		PageSetChecksumInplace(page, blkno);
-		smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, blkno, page,
-				   true);
-
-		state->pages_written++;
-	}
-
-	if (RelationNeedsWAL(state->indexrel))
-		log_newpages(&state->indexrel->rd_locator, MAIN_FORKNUM, state->ready_num_pages,
-					 state->ready_blknos, state->ready_pages, true);
-
-	for (int i = 0; i < state->ready_num_pages; i++)
-		pfree(state->ready_pages[i]);
-
-	state->ready_num_pages = 0;
-}
-
 
 /*-------------------------------------------------------------------------
  * Routines for non-sorted build
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 424958912c7..9d7c7042e68 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -87,8 +87,8 @@
  * is optimized for bulk inserting a lot of tuples, knowing that we have
  * exclusive access to the heap.  raw_heap_insert builds new pages in
  * local storage.  When a page is full, or at the end of the process,
- * we insert it to WAL as a single record and then write it to disk
- * directly through smgr.  Note, however, that any data sent to the new
+ * we insert it to WAL as a single record and then write it to disk with
+ * the bulk smgr writer.  Note, however, that any data sent to the new
  * heap's TOAST table will go through the normal bufmgr.
  *
  *
@@ -119,9 +119,9 @@
 #include "replication/logical.h"
 #include "replication/slot.h"
 #include "storage/bufmgr.h"
+#include "storage/bulk_write.h"
 #include "storage/fd.h"
 #include "storage/procarray.h"
-#include "storage/smgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -133,9 +133,11 @@ typedef struct RewriteStateData
 {
 	Relation	rs_old_rel;		/* source heap */
 	Relation	rs_new_rel;		/* destination heap */
+
+	BulkWriteState *rs_bulkw;
+
 	Page		rs_buffer;		/* page currently being built */
 	BlockNumber rs_blockno;		/* block where page will go */
-	bool		rs_buffer_valid;	/* T if any tuples in buffer */
 	bool		rs_logical_rewrite; /* do we need to do logical rewriting */
 	TransactionId rs_oldest_xmin;	/* oldest xmin used by caller to determine
 									 * tuple visibility */
@@ -255,15 +257,16 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
 
 	state->rs_old_rel = old_heap;
 	state->rs_new_rel = new_heap;
-	state->rs_buffer = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
+	state->rs_buffer = NULL;
 	/* new_heap needn't be empty, just locked */
 	state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
-	state->rs_buffer_valid = false;
 	state->rs_oldest_xmin = oldest_xmin;
 	state->rs_freeze_xid = freeze_xid;
 	state->rs_cutoff_multi = cutoff_multi;
 	state->rs_cxt = rw_cxt;
 
+	state->rs_bulkw = bulkw_start_rel(new_heap, MAIN_FORKNUM);
+
 	/* Initialize hash tables used to track update chains */
 	hash_ctl.keysize = sizeof(TidHashKey);
 	hash_ctl.entrysize = sizeof(UnresolvedTupData);
@@ -314,30 +317,13 @@ end_heap_rewrite(RewriteState state)
 	}
 
 	/* Write the last page, if any */
-	if (state->rs_buffer_valid)
+	if (state->rs_buffer)
 	{
-		if (RelationNeedsWAL(state->rs_new_rel))
-			log_newpage(&state->rs_new_rel->rd_locator,
-						MAIN_FORKNUM,
-						state->rs_blockno,
-						state->rs_buffer,
-						true);
-
-		PageSetChecksumInplace(state->rs_buffer, state->rs_blockno);
-
-		smgrextend(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,
-				   state->rs_blockno, state->rs_buffer, true);
+		bulkw_write(state->rs_bulkw, state->rs_blockno, state->rs_buffer, true);
+		state->rs_buffer = NULL;
 	}
 
-	/*
-	 * When we WAL-logged rel pages, we must nonetheless fsync them.  The
-	 * reason is the same as in storage.c's RelationCopyStorage(): we're
-	 * writing data that's not in shared buffers, and so a CHECKPOINT
-	 * occurring during the rewriteheap operation won't have fsync'd data we
-	 * wrote before the checkpoint.
-	 */
-	if (RelationNeedsWAL(state->rs_new_rel))
-		smgrimmedsync(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM);
+	bulkw_finish(state->rs_bulkw);
 
 	logical_end_heap_rewrite(state);
 
@@ -611,7 +597,7 @@ rewrite_heap_dead_tuple(RewriteState state, HeapTuple old_tuple)
 static void
 raw_heap_insert(RewriteState state, HeapTuple tup)
 {
-	Page		page = state->rs_buffer;
+	Page		page;
 	Size		pageFreeSpace,
 				saveFreeSpace;
 	Size		len;
@@ -664,7 +650,8 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
 												   HEAP_DEFAULT_FILLFACTOR);
 
 	/* Now we can check to see if there's enough free space already. */
-	if (state->rs_buffer_valid)
+	page = state->rs_buffer;
+	if (page)
 	{
 		pageFreeSpace = PageGetHeapFreeSpace(page);
 
@@ -675,35 +662,17 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
 			 * contains a tuple.  Hence, unlike RelationGetBufferForTuple(),
 			 * enforce saveFreeSpace unconditionally.
 			 */
-
-			/* XLOG stuff */
-			if (RelationNeedsWAL(state->rs_new_rel))
-				log_newpage(&state->rs_new_rel->rd_locator,
-							MAIN_FORKNUM,
-							state->rs_blockno,
-							page,
-							true);
-
-			/*
-			 * Now write the page. We say skipFsync = true because there's no
-			 * need for smgr to schedule an fsync for this write; we'll do it
-			 * ourselves in end_heap_rewrite.
-			 */
-			PageSetChecksumInplace(page, state->rs_blockno);
-
-			smgrextend(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,
-					   state->rs_blockno, page, true);
-
+			bulkw_write(state->rs_bulkw, state->rs_blockno, state->rs_buffer, true);
+			page = state->rs_buffer = NULL;
 			state->rs_blockno++;
-			state->rs_buffer_valid = false;
 		}
 	}
 
-	if (!state->rs_buffer_valid)
+	if (!page)
 	{
 		/* Initialize a new empty page */
+		page = state->rs_buffer = bulkw_alloc_buf(state->rs_bulkw);
 		PageInit(page, BLCKSZ, 0);
-		state->rs_buffer_valid = true;
 	}
 
 	/* And now we can insert the tuple into the page */
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 62bc9917f13..0d738df6f0e 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -29,11 +29,11 @@
 #include "nodes/execnodes.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
+#include "storage/bulk_write.h"
 #include "storage/condition_variable.h"
 #include "storage/indexfsm.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
-#include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/index_selfuncs.h"
 #include "utils/memutils.h"
@@ -152,32 +152,17 @@ void
 btbuildempty(Relation index)
 {
 	bool		allequalimage = _bt_allequalimage(index, false);
-	Buffer		metabuf;
 	Page		metapage;
+	BulkWriteState *bulkw;
 
-	/*
-	 * Initalize the metapage.
-	 *
-	 * Regular index build bypasses the buffer manager and uses smgr functions
-	 * directly, with an smgrimmedsync() call at the end.  That makes sense
-	 * when the index is large, but for an empty index, it's better to use the
-	 * buffer cache to avoid the smgrimmedsync().
-	 */
-	metabuf = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
-	Assert(BufferGetBlockNumber(metabuf) == BTREE_METAPAGE);
-	_bt_lockbuf(index, metabuf, BT_WRITE);
+	bulkw = bulkw_start_rel(index, INIT_FORKNUM);
 
-	START_CRIT_SECTION();
-
-	metapage = BufferGetPage(metabuf);
+	/* Construct metapage. */
+	metapage = bulkw_alloc_buf(bulkw);
 	_bt_initmetapage(metapage, P_NONE, 0, allequalimage);
-	MarkBufferDirty(metabuf);
-	log_newpage_buffer(metabuf, true);
-
-	END_CRIT_SECTION();
+	bulkw_write(bulkw, BTREE_METAPAGE, metapage, true);
 
-	_bt_unlockbuf(index, metabuf);
-	ReleaseBuffer(metabuf);
+	bulkw_finish(bulkw);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index c2665fce411..1b0591f551c 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -23,13 +23,8 @@
  * many upper pages if the keys are reasonable-size) without risking a lot of
  * cascading splits during early insertions.
  *
- * Formerly the index pages being built were kept in shared buffers, but
- * that is of no value (since other backends have no interest in them yet)
- * and it created locking problems for CHECKPOINT, because the upper-level
- * pages were held exclusive-locked for long periods.  Now we just build
- * the pages in local memory and smgrwrite or smgrextend them as we finish
- * them.  They will need to be re-read into shared buffers on first use after
- * the build finishes.
+ * We use the bulk smgr loading facility to bypass the buffer cache and
+ * WAL log the pages efficiently.
  *
  * This code isn't concerned about the FSM at all. The caller is responsible
  * for initializing that.
@@ -57,7 +52,7 @@
 #include "executor/instrument.h"
 #include "miscadmin.h"
 #include "pgstat.h"
-#include "storage/smgr.h"
+#include "storage/bulk_write.h"
 #include "tcop/tcopprot.h"		/* pgrminclude ignore */
 #include "utils/rel.h"
 #include "utils/sortsupport.h"
@@ -251,11 +246,9 @@ typedef struct BTWriteState
 {
 	Relation	heap;
 	Relation	index;
+	BulkWriteState *bulkw;
 	BTScanInsert inskey;		/* generic insertion scankey */
-	bool		btws_use_wal;	/* dump pages to WAL? */
 	BlockNumber btws_pages_alloced; /* # pages allocated */
-	BlockNumber btws_pages_written; /* # pages written out */
-	Page		btws_zeropage;	/* workspace for filling zeroes */
 } BTWriteState;
 
 
@@ -267,7 +260,7 @@ static void _bt_spool(BTSpool *btspool, ItemPointer self,
 static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
-static Page _bt_blnewpage(uint32 level);
+static Page _bt_blnewpage(BTWriteState *wstate, uint32 level);
 static BTPageState *_bt_pagestate(BTWriteState *wstate, uint32 level);
 static void _bt_slideleft(Page rightmostpage);
 static void _bt_sortaddtup(Page page, Size itemsize,
@@ -569,16 +562,17 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 	/* _bt_mkscankey() won't set allequalimage without metapage */
 	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
-	wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
 
 	/* reserve the metapage */
 	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
-	wstate.btws_pages_written = 0;
-	wstate.btws_zeropage = NULL;	/* until needed */
+
+	wstate.bulkw = bulkw_start_rel(wstate.index, MAIN_FORKNUM);
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
 	_bt_load(&wstate, btspool, btspool2);
+
+	bulkw_finish(wstate.bulkw);
 }
 
 /*
@@ -614,12 +608,12 @@ _bt_build_callback(Relation index,
  * allocate workspace for a new, clean btree page, not linked to any siblings.
  */
 static Page
-_bt_blnewpage(uint32 level)
+_bt_blnewpage(BTWriteState *wstate, uint32 level)
 {
 	Page		page;
 	BTPageOpaque opaque;
 
-	page = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
+	page = bulkw_alloc_buf(wstate->bulkw);
 
 	/* Zero the page and set up standard page header info */
 	_bt_pageinit(page, BLCKSZ);
@@ -643,54 +637,8 @@ _bt_blnewpage(uint32 level)
 static void
 _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 {
-	/* XLOG stuff */
-	if (wstate->btws_use_wal)
-	{
-		/* We use the XLOG_FPI record type for this */
-		log_newpage(&wstate->index->rd_locator, MAIN_FORKNUM, blkno, page, true);
-	}
-
-	/*
-	 * If we have to write pages nonsequentially, fill in the space with
-	 * zeroes until we come back and overwrite.  This is not logically
-	 * necessary on standard Unix filesystems (unwritten space will read as
-	 * zeroes anyway), but it should help to avoid fragmentation. The dummy
-	 * pages aren't WAL-logged though.
-	 */
-	while (blkno > wstate->btws_pages_written)
-	{
-		if (!wstate->btws_zeropage)
-			wstate->btws_zeropage = (Page) palloc_aligned(BLCKSZ,
-														  PG_IO_ALIGN_SIZE,
-														  MCXT_ALLOC_ZERO);
-		/* don't set checksum for all-zero page */
-		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
-				   wstate->btws_pages_written++,
-				   wstate->btws_zeropage,
-				   true);
-	}
-
-	PageSetChecksumInplace(page, blkno);
-
-	/*
-	 * Now write the page.  There's no need for smgr to schedule an fsync for
-	 * this write; we'll do it ourselves before ending the build.
-	 */
-	if (blkno == wstate->btws_pages_written)
-	{
-		/* extending the file... */
-		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM, blkno,
-				   page, true);
-		wstate->btws_pages_written++;
-	}
-	else
-	{
-		/* overwriting a block we zero-filled before */
-		smgrwrite(RelationGetSmgr(wstate->index), MAIN_FORKNUM, blkno,
-				  page, true);
-	}
-
-	pfree(page);
+	bulkw_write(wstate->bulkw, blkno, page, true);
+	/* bulkw_write took ownership of 'page' */
 }
 
 /*
@@ -703,7 +651,7 @@ _bt_pagestate(BTWriteState *wstate, uint32 level)
 	BTPageState *state = (BTPageState *) palloc0(sizeof(BTPageState));
 
 	/* create initial page for level */
-	state->btps_page = _bt_blnewpage(level);
+	state->btps_page = _bt_blnewpage(wstate, level);
 
 	/* and assign it a page position */
 	state->btps_blkno = wstate->btws_pages_alloced++;
@@ -916,7 +864,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 		IndexTuple	oitup;
 
 		/* Create new page of same level */
-		npage = _bt_blnewpage(state->btps_level);
+		npage = _bt_blnewpage(wstate, state->btps_level);
 
 		/* and assign it a page position */
 		nblkno = wstate->btws_pages_alloced++;
@@ -1028,8 +976,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 		}
 
 		/*
-		 * Write out the old page.  We never need to touch it again, so we can
-		 * free the opage workspace too.
+		 * Write out the old page. _bt_blwritepage takes ownership of the
+		 * 'opage' buffer.
 		 */
 		_bt_blwritepage(wstate, opage, oblkno);
 
@@ -1163,7 +1111,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		 */
 		_bt_slideleft(s->btps_page);
 		_bt_blwritepage(wstate, s->btps_page, s->btps_blkno);
-		s->btps_page = NULL;	/* writepage freed the workspace */
+		s->btps_page = NULL;	/* writepage took ownership of the buffer */
 	}
 
 	/*
@@ -1172,7 +1120,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	 * set to point to "P_NONE").  This changes the index to the "valid" state
 	 * by filling in a valid magic number in the metapage.
 	 */
-	metapage = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
+	metapage = bulkw_alloc_buf(wstate->bulkw);
 	_bt_initmetapage(metapage, rootblkno, rootlevel,
 					 wstate->inskey->allequalimage);
 	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
@@ -1422,18 +1370,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 
 	/* Close down final pages and write the metapage */
 	_bt_uppershutdown(wstate, state);
-
-	/*
-	 * When we WAL-logged index pages, we must nonetheless fsync index files.
-	 * Since we're building outside shared buffers, a CHECKPOINT occurring
-	 * during the build has no way to flush the previously written data to
-	 * disk (indeed it won't know the index even exists).  A crash later on
-	 * would replay WAL from the checkpoint, therefore it wouldn't replay our
-	 * earlier WAL entries. If we do not fsync those pages here, they might
-	 * still not be on disk when the crash occurs.
-	 */
-	if (wstate->btws_use_wal)
-		smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
 }
 
 /*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 4443f1918df..11fe932c26d 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -25,7 +25,7 @@
 #include "catalog/index.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
-#include "storage/smgr.h"
+#include "storage/bulk_write.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -155,42 +155,27 @@ spgbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 void
 spgbuildempty(Relation index)
 {
-	Buffer		metabuffer,
-				rootbuffer,
-				nullbuffer;
-
-	/*
-	 * Initialize the meta page and root pages
-	 */
-	metabuffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
-	LockBuffer(metabuffer, BUFFER_LOCK_EXCLUSIVE);
-	rootbuffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
-	LockBuffer(rootbuffer, BUFFER_LOCK_EXCLUSIVE);
-	nullbuffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
-	LockBuffer(nullbuffer, BUFFER_LOCK_EXCLUSIVE);
-
-	Assert(BufferGetBlockNumber(metabuffer) == SPGIST_METAPAGE_BLKNO);
-	Assert(BufferGetBlockNumber(rootbuffer) == SPGIST_ROOT_BLKNO);
-	Assert(BufferGetBlockNumber(nullbuffer) == SPGIST_NULL_BLKNO);
+	Page		page;
+	BulkWriteState *bulkw;
 
-	START_CRIT_SECTION();
+	bulkw = bulkw_start_rel(index, INIT_FORKNUM);
 
-	SpGistInitMetapage(BufferGetPage(metabuffer));
-	MarkBufferDirty(metabuffer);
-	SpGistInitBuffer(rootbuffer, SPGIST_LEAF);
-	MarkBufferDirty(rootbuffer);
-	SpGistInitBuffer(nullbuffer, SPGIST_LEAF | SPGIST_NULLS);
-	MarkBufferDirty(nullbuffer);
+	/* Construct metapage. */
+	page = (Page) bulkw_alloc_buf(bulkw);
+	SpGistInitMetapage(page);
+	bulkw_write(bulkw, SPGIST_METAPAGE_BLKNO, page, true);
 
-	log_newpage_buffer(metabuffer, true);
-	log_newpage_buffer(rootbuffer, true);
-	log_newpage_buffer(nullbuffer, true);
+	/* Likewise for the root page. */
+	page = (Page) bulkw_alloc_buf(bulkw);
+	SpGistInitPage(page, SPGIST_LEAF);
+	bulkw_write(bulkw, SPGIST_ROOT_BLKNO, page, true);
 
-	END_CRIT_SECTION();
+	/* Likewise for the null-tuples root page. */
+	page = (Page) bulkw_alloc_buf(bulkw);
+	SpGistInitPage(page, SPGIST_LEAF | SPGIST_NULLS);
+	bulkw_write(bulkw, SPGIST_NULL_BLKNO, page, true);
 
-	UnlockReleaseBuffer(metabuffer);
-	UnlockReleaseBuffer(rootbuffer);
-	UnlockReleaseBuffer(nullbuffer);
+	bulkw_finish(bulkw);
 }
 
 /*
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 93f07e49b72..eb982b1a88d 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -28,6 +28,7 @@
 #include "catalog/storage.h"
 #include "catalog/storage_xlog.h"
 #include "miscadmin.h"
+#include "storage/bulk_write.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
 #include "utils/hsearch.h"
@@ -451,14 +452,11 @@ void
 RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 					ForkNumber forkNum, char relpersistence)
 {
-	PGIOAlignedBlock buf;
-	Page		page;
 	bool		use_wal;
 	bool		copying_initfork;
 	BlockNumber nblocks;
 	BlockNumber blkno;
-
-	page = (Page) buf.data;
+	BulkWriteState *bulkw;
 
 	/*
 	 * The init fork for an unlogged relation in many respects has to be
@@ -477,14 +475,19 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 	use_wal = XLogIsNeeded() &&
 		(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
 
+	bulkw = bulkw_start_smgr(dst, forkNum, use_wal);
+
 	nblocks = smgrnblocks(src, forkNum);
 
 	for (blkno = 0; blkno < nblocks; blkno++)
 	{
+		Page		page;
+
 		/* If we got a cancel signal during the copy of the data, quit */
 		CHECK_FOR_INTERRUPTS();
 
-		smgrread(src, forkNum, blkno, buf.data);
+		page = bulkw_alloc_buf(bulkw);
+		smgrread(src, forkNum, blkno, page);
 
 		if (!PageIsVerifiedExtended(page, blkno,
 									PIV_LOG_WARNING | PIV_REPORT_STAT))
@@ -511,30 +514,9 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 		 * page this is, so we have to log the full page including any unused
 		 * space.
 		 */
-		if (use_wal)
-			log_newpage(&dst->smgr_rlocator.locator, forkNum, blkno, page, false);
-
-		PageSetChecksumInplace(page, blkno);
-
-		/*
-		 * Now write the page.  We say skipFsync = true because there's no
-		 * need for smgr to schedule an fsync for this write; we'll do it
-		 * ourselves below.
-		 */
-		smgrextend(dst, forkNum, blkno, buf.data, true);
+		bulkw_write(bulkw, blkno, page, false);
 	}
-
-	/*
-	 * When we WAL-logged rel pages, we must nonetheless fsync them.  The
-	 * reason is that since we're copying outside shared buffers, a CHECKPOINT
-	 * occurring during the copy has no way to flush the previously written
-	 * data to disk (indeed it won't know the new rel even exists).  A crash
-	 * later on would replay WAL from the checkpoint, therefore it wouldn't
-	 * replay our earlier WAL entries. If we do not fsync those pages here,
-	 * they might still not be on disk when the crash occurs.
-	 */
-	if (use_wal || copying_initfork)
-		smgrimmedsync(dst, forkNum);
+	bulkw_finish(bulkw);
 }
 
 /*
diff --git a/src/backend/storage/smgr/Makefile b/src/backend/storage/smgr/Makefile
index 596b564656f..1d0b98764f9 100644
--- a/src/backend/storage/smgr/Makefile
+++ b/src/backend/storage/smgr/Makefile
@@ -13,6 +13,7 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = \
+	bulk_write.o \
 	md.o \
 	smgr.o
 
diff --git a/src/backend/storage/smgr/bulk_write.c b/src/backend/storage/smgr/bulk_write.c
new file mode 100644
index 00000000000..d9090979d65
--- /dev/null
+++ b/src/backend/storage/smgr/bulk_write.c
@@ -0,0 +1,334 @@
+/*-------------------------------------------------------------------------
+ *
+ * bulk_write.c
+ *	  Efficiently and reliably populate a new relation
+ *
+ * The assumption is that no other backends access the relation while we are
+ * loading it, so we can take some shortcuts.  Alternatively, you can use the
+ * buffer manager as usual, if performance is not critical, but you must not
+ * mix operations through the buffer manager and the bulk loading interface at
+ * the same time.
+ *
+ * We bypass the buffer manager to avoid the locking overhead, and call
+ * smgrextend() directly.  A downside is that the pages will need to be
+ * re-read into shared buffers on first use after the build finishes.  That's
+ * usually a good tradeoff for large relations, and for small relations, the
+ * overhead isn't very significant compared to creating the relation in the
+ * first place.
+ *
+ * The pages are WAL-logged if needed.  To save on WAL header overhead, we
+ * WAL-log several pages in one record.
+ *
+ * One tricky point is that because we bypass the buffer manager, we need to
+ * register the relation for fsyncing at the next checkpoint ourselves, and
+ * make sure that the relation is correctly fsync'd by us or the checkpointer
+ * even if a checkpoint happens concurrently.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/smgr/bulk_write.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xloginsert.h"
+#include "access/xlogrecord.h"
+#include "storage/bufmgr.h"
+#include "storage/bufpage.h"
+#include "storage/bulk_write.h"
+#include "storage/proc.h"
+#include "storage/smgr.h"
+#include "utils/rel.h"
+
+#define MAX_BUFFERED_PAGES XLR_MAX_BLOCK_ID
+
+typedef struct BulkWriteBuffer
+{
+	Page		page;
+	BlockNumber blkno;
+	bool		page_std;
+	int16		order;
+} BulkWriteBuffer;
+
+/*
+ * Bulk writer state for one relation fork.
+ */
+typedef struct BulkWriteState
+{
+	/* Information about the target relation we're writing */
+	SMgrRelation smgr;
+	ForkNumber	forknum;
+	bool		use_wal;
+
+	/* We keep several pages buffered, and WAL-log them in batches */
+	int			nbuffered;
+	BulkWriteBuffer buffers[MAX_BUFFERED_PAGES];
+
+	/* Current size of the relation */
+	BlockNumber pages_written;
+
+	/* The RedoRecPtr at the time that the bulk operation started */
+	XLogRecPtr	start_RedoRecPtr;
+
+	Page		zeropage;		/* workspace for filling zeroes */
+
+	MemoryContext memcxt;
+} BulkWriteState;
+
+static void bulkw_flush(BulkWriteState *bulkw);
+
+/*
+ * Start a bulk write operation on a relation fork.
+ */
+BulkWriteState *
+bulkw_start_rel(Relation rel, ForkNumber forknum)
+{
+	return bulkw_start_smgr(RelationGetSmgr(rel),
+							forknum,
+							RelationNeedsWAL(rel) || forknum == INIT_FORKNUM);
+}
+
+/*
+ * Start a bulk write operation on a relation fork.
+ *
+ * This is like bulkw_start_rel, but can be used without a relcache entry.
+ */
+BulkWriteState *
+bulkw_start_smgr(SMgrRelation smgr, ForkNumber forknum, bool use_wal)
+{
+	BulkWriteState *bulkw;
+
+	bulkw = palloc(sizeof(BulkWriteState));
+	bulkw->smgr = smgr;
+	bulkw->forknum = forknum;
+	bulkw->use_wal = use_wal;
+
+	bulkw->nbuffered = 0;
+	bulkw->pages_written = 0;
+
+	bulkw->start_RedoRecPtr = GetRedoRecPtr();
+
+	bulkw->zeropage = NULL;		/* until needed */
+
+	/*
+	 * Remember the memory context.  We will use it to allocate all the
+	 * buffers later.
+	 */
+	bulkw->memcxt = CurrentMemoryContext;
+
+	return bulkw;
+}
+
+/*
+ * Finish bulk write operation.
+ *
+ * This WAL-logs and flushes any remaining buffers to disk, and
+ * fsyncs the relation if needed.
+ */
+void
+bulkw_finish(BulkWriteState *bulkw)
+{
+	/* WAL-log and flush any remaining pages */
+	bulkw_flush(bulkw);
+
+	/*
+	 * When we wrote out the pages, we passed skipFsync=true to avoid the
+	 * overhead of registering all the writes with the checkpointer.  Register
+	 * the whole relation now.
+	 *
+	 * There is one hole in that idea: If a checkpoint occurred while we were
+	 * writing the pages, it already missed fsyncing the pages we had written
+	 * before the checkpoint started.  A crash later on would replay the WAL
+	 * starting from the checkpoint, therefore it wouldn't replay our earlier
+	 * WAL records.  So if a checkpoint started after the bulk write, fsync
+	 * the files now.
+	 */
+	if (!SmgrIsTemp(bulkw->smgr))
+	{
+		/*
+		 * Prevent a checkpoint from starting between the GetRedoRecPtr() and
+		 * smgrregistersync() calls.
+		 */
+		Assert((MyProc->delayChkptFlags & DELAY_CHKPT_START) == 0);
+		MyProc->delayChkptFlags |= DELAY_CHKPT_START;
+
+		if (bulkw->start_RedoRecPtr != GetRedoRecPtr())
+		{
+			/*
+			 * A checkpoint occurred and it didn't know about our writes, so
+			 * fsync() the relation ourselves.
+			 */
+			MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
+			smgrimmedsync(bulkw->smgr, bulkw->forknum);
+			elog(DEBUG1, "flushed relation because a checkpoint occurred concurrently");
+		}
+		else
+		{
+			smgrregistersync(bulkw->smgr, bulkw->forknum);
+			MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
+		}
+	}
+}
+
+static int
+buffer_cmp(const void *a, const void *b)
+{
+	const BulkWriteBuffer *bufa = (const BulkWriteBuffer *) a;
+	const BulkWriteBuffer *bufb = (const BulkWriteBuffer *) b;
+
+	if (bufa->blkno == bufb->blkno)
+	{
+		if (bufa->order > bufb->order)
+			return 1;
+		else
+			return -1;
+	}
+	else if (bufa->blkno > bufb->blkno)
+	{
+		return 1;
+	}
+	else
+		return -1;
+}
+
+/*
+ * Write all buffered pages to disk.
+ */
+static void
+bulkw_flush(BulkWriteState *bulkw)
+{
+	int			nbuffered = bulkw->nbuffered;
+	BulkWriteBuffer *buffers = bulkw->buffers;
+
+	if (nbuffered == 0)
+		return;
+
+	if (nbuffered > 1)
+	{
+		int			o;
+
+		qsort(buffers, nbuffered, sizeof(BulkWriteBuffer), buffer_cmp);
+
+		/*
+		 * Eliminate duplicates, keeping the last write of each block.
+		 * (buffer_cmp uses 'order' as the last sort key)
+		 */
+		o = 0;
+		for (int i = 0; i < nbuffered; i++)
+		{
+			if (i < nbuffered - 1 && buffers[i + 1].blkno == buffers[i].blkno)
+			{
+				/* there is a later write of the same page, skip this one */
+				pfree(buffers[i].page);
+				continue;
+			}
+
+			if (i != o)
+				buffers[o] = buffers[i];
+			o++;
+		}
+		nbuffered = o;
+	}
+
+	if (bulkw->use_wal)
+	{
+		BlockNumber blknos[MAX_BUFFERED_PAGES];
+		Page		pages[MAX_BUFFERED_PAGES];
+		bool		page_std = true;
+
+		for (int i = 0; i < nbuffered; i++)
+		{
+			blknos[i] = buffers[i].blkno;
+			pages[i] = buffers[i].page;
+
+			/*
+			 * If any of the pages use !page_std, we log them all as such.
+			 * That's a bit wasteful, but in practice, a mix of standard and
+			 * non-standard page layout is rare.  None of the built-in AMs do
+			 * that.
+			 */
+			if (!buffers[i].page_std)
+				page_std = false;
+		}
+		log_newpages(&bulkw->smgr->smgr_rlocator.locator, bulkw->forknum, nbuffered, blknos, pages,
+					 page_std);
+	}
+
+	for (int i = 0; i < nbuffered; i++)
+	{
+		BlockNumber blkno = buffers[i].blkno;
+		Page		page = buffers[i].page;
+
+		PageSetChecksumInplace(page, blkno);
+
+		if (blkno >= bulkw->pages_written)
+		{
+			/*
+			 * If we have to write pages nonsequentially, fill in the space
+			 * with zeroes until we come back and overwrite.  This is not
+			 * logically necessary on standard Unix filesystems (unwritten
+			 * space will read as zeroes anyway), but it should help to avoid
+			 * fragmentation.  The dummy pages aren't WAL-logged though.
+			 */
+			while (blkno > bulkw->pages_written)
+			{
+				if (!bulkw->zeropage)
+					bulkw->zeropage = (Page) palloc_aligned(BLCKSZ,
+															PG_IO_ALIGN_SIZE,
+															MCXT_ALLOC_ZERO);
+
+				/* don't set checksum for all-zero page */
+				smgrextend(bulkw->smgr, bulkw->forknum,
+						   bulkw->pages_written++,
+						   bulkw->zeropage,
+						   true);
+			}
+
+			smgrextend(bulkw->smgr, bulkw->forknum, blkno, page, true);
+			bulkw->pages_written = buffers[i].blkno + 1;
+		}
+		else
+			smgrwrite(bulkw->smgr, bulkw->forknum, blkno, page, true);
+		pfree(page);
+	}
+
+	bulkw->nbuffered = 0;
+}
+
+/*
+ * Write out or buffer write of 'page'.
+ *
+ * NB: this takes ownership of 'page'
+ */
+void
+bulkw_write(BulkWriteState *bulkw, BlockNumber blocknum, Page page, bool page_std)
+{
+	bulkw->buffers[bulkw->nbuffered].page = page;
+	bulkw->buffers[bulkw->nbuffered].blkno = blocknum;
+	bulkw->buffers[bulkw->nbuffered].page_std = page_std;
+
+	/*
+	 * If the same page is written multiple times, 'order' is used to remember
+	 * the order of the writes, so that the last write wins.
+	 */
+	bulkw->buffers[bulkw->nbuffered].order = (int16) bulkw->nbuffered;
+
+	bulkw->nbuffered++;
+
+	if (bulkw->nbuffered == MAX_BUFFERED_PAGES)
+		bulkw_flush(bulkw);
+}
+
+/*
+ * Allocate a new buffer which can later be written with bulkw_write()
+ */
+Page
+bulkw_alloc_buf(BulkWriteState *bulkw)
+{
+	return MemoryContextAllocAligned(bulkw->memcxt, BLCKSZ, PG_IO_ALIGN_SIZE, 0);
+}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index fdecbad1709..343ee51048e 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1082,6 +1082,49 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
 	}
 }
 
+/*
+ * mdregistersync() -- Mark whole relation as needing fsync
+ */
+void
+mdregistersync(SMgrRelation reln, ForkNumber forknum)
+{
+	int			segno;
+	int			min_inactive_seg;
+
+	/*
+	 * NOTE: mdnblocks makes sure we have opened all active segments, so that
+	 * the loop below will get them all!
+	 */
+	mdnblocks(reln, forknum);
+
+	min_inactive_seg = segno = reln->md_num_open_segs[forknum];
+
+	/*
+	 * Temporarily open inactive segments, then close them after sync.  There
+	 * may be some inactive segments left opened after error, but that is
+	 * harmless.  We don't bother to clean them up and take a risk of further
+	 * trouble.  The next mdclose() will soon close them.
+	 */
+	while (_mdfd_openseg(reln, forknum, segno, 0) != NULL)
+		segno++;
+
+	while (segno > 0)
+	{
+		MdfdVec    *v = &reln->md_seg_fds[forknum][segno - 1];
+
+		register_dirty_segment(reln, forknum, v);
+
+		/* Close inactive segments immediately */
+		if (segno > min_inactive_seg)
+		{
+			FileClose(v->mdfd_vfd);
+			_fdvec_resize(reln, forknum, segno - 1);
+		}
+
+		segno--;
+	}
+}
+
 /*
  * mdimmedsync() -- Immediately sync a relation to stable storage.
  *
diff --git a/src/backend/storage/smgr/meson.build b/src/backend/storage/smgr/meson.build
index e1ba6ed74b8..133622a6528 100644
--- a/src/backend/storage/smgr/meson.build
+++ b/src/backend/storage/smgr/meson.build
@@ -1,6 +1,7 @@
 # Copyright (c) 2022-2023, PostgreSQL Global Development Group
 
 backend_sources += files(
+  'bulk_write.c',
   'md.c',
   'smgr.c',
 )
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 5d0f3d515c3..9f7405e3c88 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -65,6 +65,7 @@ typedef struct f_smgr
 	void		(*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
 								  BlockNumber nblocks);
 	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+	void		(*smgr_registersync) (SMgrRelation reln, ForkNumber forknum);
 } f_smgr;
 
 static const f_smgr smgrsw[] = {
@@ -86,6 +87,7 @@ static const f_smgr smgrsw[] = {
 		.smgr_nblocks = mdnblocks,
 		.smgr_truncate = mdtruncate,
 		.smgr_immedsync = mdimmedsync,
+		.smgr_registersync = mdregistersync,
 	}
 };
 
@@ -576,6 +578,14 @@ smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
  * on disk at return, only dumped out to the kernel.  However,
  * provisions will be made to fsync the write before the next checkpoint.
  *
+ * NB: The mechanism to ensure fsync at next checkpoint assumes that there is
+ * something that prevents a concurrent checkpoint from "racing ahead" of the
+ * write.  One way to prevent that is by holding a lock on the buffer; the
+ * buffer manager's writes are protected by that.  The bulk writer facility in
+ * bulk_write.c checks the redo pointer and calls smgrimmedsync() if a
+ * checkpoint happened; that relies on the fact that no other backend can
+ * be concurrently modifying the page.
+ *
  * skipFsync indicates that the caller will make other provisions to
  * fsync the relation, so we needn't bother.  Temporary relations also
  * do not require fsync.
@@ -694,6 +704,24 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks, BlockNumber *nb
 	}
 }
 
+/*
+ * smgrregistersync() -- Request a relation to be sync'd at next checkpoint
+ *
+ * This can be used after calling smgrwrite() or smgrextend() with skipFsync =
+ * true, to register the fsyncs that were skipped earlier.
+ *
+ * Note: be mindful that a checkpoint could already have happened between the
+ * smgrwrite or smgrextend calls and this!  In that case, the checkpoint
+ * already missed fsyncing this relation, and you should use smgrimmedsync
+ * instead.  Most callers should use the bulk loading facility in bulk_write.c
+ * instead, which handles all that.
+ */
+void
+smgrregistersync(SMgrRelation reln, ForkNumber forknum)
+{
+	smgrsw[reln->smgr_which].smgr_registersync(reln, forknum);
+}
+
 /*
  * smgrimmedsync() -- Force the specified relation to stable storage.
  *
@@ -716,6 +744,9 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks, BlockNumber *nb
  * Note that you need to do FlushRelationBuffers() first if there is
  * any possibility that there are dirty buffers for the relation;
  * otherwise the sync is not very meaningful.
+ *
+ * Most callers should use the bulk loading facility in bulk_write.c
+ * instead of calling this directly.
  */
 void
 smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
diff --git a/src/include/storage/bulk_write.h b/src/include/storage/bulk_write.h
new file mode 100644
index 00000000000..c8eb7307178
--- /dev/null
+++ b/src/include/storage/bulk_write.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * bulk_write.h
+ *	  Efficiently and reliably populate a new relation
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/bulk_write.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BULK_WRITE_H
+#define BULK_WRITE_H
+
+typedef struct BulkWriteState BulkWriteState;
+
+/* forward declared from smgr.h */
+struct SMgrRelationData;
+
+extern BulkWriteState *bulkw_start_rel(Relation rel, ForkNumber forknum);
+extern BulkWriteState *bulkw_start_smgr(struct SMgrRelationData *smgr, ForkNumber forknum, bool use_wal);
+extern Page bulkw_alloc_buf(BulkWriteState *bulkw);
+extern void bulkw_write(BulkWriteState *bulkw, BlockNumber blocknum, Page page, bool page_std);
+extern void bulkw_finish(BulkWriteState *bulkw);
+
+#endif							/* BULK_WRITE_H */
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 941879ee6a8..225701271d2 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -42,6 +42,7 @@ extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
 extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
 					   BlockNumber nblocks);
 extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void mdregistersync(SMgrRelation reln, ForkNumber forknum);
 
 extern void ForgetDatabaseSyncRequests(Oid dbid);
 extern void DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a9a179aabac..cc5a91dc624 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -107,6 +107,7 @@ extern BlockNumber smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum);
 extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum,
 						 int nforks, BlockNumber *nblocks);
 extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void smgrregistersync(SMgrRelation reln, ForkNumber forknum);
 extern void AtEOXact_SMgr(void);
 extern bool ProcessBarrierSmgrRelease(void);
 
-- 
2.39.2

#2 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#1)
1 attachment(s)
Re: Relation bulk write facility

On 19/09/2023 17:13, Heikki Linnakangas wrote:

> The attached patch centralizes that pattern to a new bulk writing
> facility, and changes all those AMs to use it.

Here's a new rebased version of the patch.

This includes fixes to the pageinspect regression test. They were
explained in the commit message, but I forgot to include the actual test
changes. That should fix the cfbot failures.

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachments:

v2-0001-Introduce-a-new-bulk-loading-facility.patch (text/x-patch)
From 1ee5e2b5d4f87c1c70630814b81bafe2ac17a60d Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 19 Sep 2023 18:09:34 +0300
Subject: [PATCH v2 1/1] Introduce a new bulk loading facility.

The new facility makes it easier to optimize bulk loading, as the
logic for buffering, WAL-logging, and syncing the relation only needs
to be implemented once. It's also less error-prone: We have had a
number of bugs in how a relation is fsync'd - or not - at the end of a
bulk loading operation. By centralizing that logic to one place, we
only need to write it correctly once.

The new facility is faster for small relations: Instead of calling
smgrimmedsync(), we register the fsync to happen at next checkpoint,
which avoids the fsync latency. That can make a big difference if you
are e.g. restoring a schema-only dump with lots of relations.

It is also slightly more efficient with large relations, as the WAL
logging is performed multiple pages at a time. That avoids some WAL
header overhead. The sorted GiST index build did that already, this
moves the buffering to the new facility.

The changes to the pageinspect GiST test need an explanation: Before this
patch, the sorted GiST index build set the LSN on every page to the
special GistBuildLSN value, not the LSN of the WAL record, even though
they were WAL-logged. There was no particular need for it, it just
happened naturally when we wrote out the pages before WAL-logging
them. Now we WAL-log the pages first, like in B-tree build, so the
pages are stamped with the record's real LSN. When the build is not
WAL-logged, we still use GistBuildLSN. To make the test output
predictable, use an unlogged index.
---
 contrib/pageinspect/expected/gist.out |  14 +-
 contrib/pageinspect/sql/gist.sql      |  16 +-
 src/backend/access/gist/gistbuild.c   | 111 ++-------
 src/backend/access/heap/rewriteheap.c |  71 ++----
 src/backend/access/nbtree/nbtree.c    |  29 +--
 src/backend/access/nbtree/nbtsort.c   | 102 ++------
 src/backend/access/spgist/spginsert.c |  49 ++--
 src/backend/catalog/storage.c         |  38 +--
 src/backend/storage/smgr/Makefile     |   1 +
 src/backend/storage/smgr/bulk_write.c | 334 ++++++++++++++++++++++++++
 src/backend/storage/smgr/md.c         |  43 ++++
 src/backend/storage/smgr/meson.build  |   1 +
 src/backend/storage/smgr/smgr.c       |  31 +++
 src/include/storage/bulk_write.h      |  28 +++
 src/include/storage/md.h              |   1 +
 src/include/storage/smgr.h            |   1 +
 src/tools/pgindent/typedefs.list      |   2 +
 17 files changed, 543 insertions(+), 329 deletions(-)
 create mode 100644 src/backend/storage/smgr/bulk_write.c
 create mode 100644 src/include/storage/bulk_write.h

diff --git a/contrib/pageinspect/expected/gist.out b/contrib/pageinspect/expected/gist.out
index d1adbab8ae2..2b1d54a6279 100644
--- a/contrib/pageinspect/expected/gist.out
+++ b/contrib/pageinspect/expected/gist.out
@@ -1,13 +1,6 @@
--- The gist_page_opaque_info() function prints the page's LSN. Normally,
--- that's constant 1 (GistBuildLSN) on every page of a freshly built GiST
--- index. But with wal_level=minimal, the whole relation is dumped to WAL at
--- the end of the transaction if it's smaller than wal_skip_threshold, which
--- updates the LSNs. Wrap the tests on gist_page_opaque_info() in the
--- same transaction with the CREATE INDEX so that we see the LSNs before
--- they are possibly overwritten at end of transaction.
-BEGIN;
--- Create a test table and GiST index.
-CREATE TABLE test_gist AS SELECT point(i,i) p, i::text t FROM
+-- The gist_page_opaque_info() function prints the page's LSN.
+-- Use an unlogged index, so that the LSN is predictable.
+CREATE UNLOGGED TABLE test_gist AS SELECT point(i,i) p, i::text t FROM
     generate_series(1,1000) i;
 CREATE INDEX test_gist_idx ON test_gist USING gist (p);
 -- Page 0 is the root, the rest are leaf pages
@@ -29,7 +22,6 @@ SELECT * FROM gist_page_opaque_info(get_raw_page('test_gist_idx', 2));
  0/1 | 0/0 |         1 | {leaf}
 (1 row)
 
-COMMIT;
 SELECT * FROM gist_page_items(get_raw_page('test_gist_idx', 0), 'test_gist_idx');
  itemoffset |   ctid    | itemlen | dead |             keys              
 ------------+-----------+---------+------+-------------------------------
diff --git a/contrib/pageinspect/sql/gist.sql b/contrib/pageinspect/sql/gist.sql
index d263542ba15..85bc44b8000 100644
--- a/contrib/pageinspect/sql/gist.sql
+++ b/contrib/pageinspect/sql/gist.sql
@@ -1,14 +1,6 @@
--- The gist_page_opaque_info() function prints the page's LSN. Normally,
--- that's constant 1 (GistBuildLSN) on every page of a freshly built GiST
--- index. But with wal_level=minimal, the whole relation is dumped to WAL at
--- the end of the transaction if it's smaller than wal_skip_threshold, which
--- updates the LSNs. Wrap the tests on gist_page_opaque_info() in the
--- same transaction with the CREATE INDEX so that we see the LSNs before
--- they are possibly overwritten at end of transaction.
-BEGIN;
-
--- Create a test table and GiST index.
-CREATE TABLE test_gist AS SELECT point(i,i) p, i::text t FROM
+-- The gist_page_opaque_info() function prints the page's LSN.
+-- Use an unlogged index, so that the LSN is predictable.
+CREATE UNLOGGED TABLE test_gist AS SELECT point(i,i) p, i::text t FROM
     generate_series(1,1000) i;
 CREATE INDEX test_gist_idx ON test_gist USING gist (p);
 
@@ -17,8 +9,6 @@ SELECT * FROM gist_page_opaque_info(get_raw_page('test_gist_idx', 0));
 SELECT * FROM gist_page_opaque_info(get_raw_page('test_gist_idx', 1));
 SELECT * FROM gist_page_opaque_info(get_raw_page('test_gist_idx', 2));
 
-COMMIT;
-
 SELECT * FROM gist_page_items(get_raw_page('test_gist_idx', 0), 'test_gist_idx');
 SELECT * FROM gist_page_items(get_raw_page('test_gist_idx', 1), 'test_gist_idx') LIMIT 5;
 
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index a45e2fe3755..278c9e48676 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -43,7 +43,8 @@
 #include "miscadmin.h"
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
-#include "storage/smgr.h"
+#include "storage/bulk_write.h"
+
 #include "utils/memutils.h"
 #include "utils/rel.h"
 #include "utils/tuplesort.h"
@@ -106,11 +107,8 @@ typedef struct
 	Tuplesortstate *sortstate;	/* state data for tuplesort.c */
 
 	BlockNumber pages_allocated;
-	BlockNumber pages_written;
 
-	int			ready_num_pages;
-	BlockNumber ready_blknos[XLR_MAX_BLOCK_ID];
-	Page		ready_pages[XLR_MAX_BLOCK_ID];
+	BulkWriteState *bulkw;
 } GISTBuildState;
 
 #define GIST_SORTED_BUILD_PAGE_NUM 4
@@ -142,7 +140,6 @@ static void gist_indexsortbuild_levelstate_add(GISTBuildState *state,
 											   IndexTuple itup);
 static void gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 												 GistSortedBuildLevelState *levelstate);
-static void gist_indexsortbuild_flush_ready_pages(GISTBuildState *state);
 
 static void gistInitBuffering(GISTBuildState *buildstate);
 static int	calculatePagesPerBuffer(GISTBuildState *buildstate, int levelStep);
@@ -407,21 +404,13 @@ gist_indexsortbuild(GISTBuildState *state)
 	GistSortedBuildLevelState *levelstate;
 	Page		page;
 
-	state->pages_allocated = 0;
-	state->pages_written = 0;
-	state->ready_num_pages = 0;
+	/* Reserve block 0 for the root page */
+	state->pages_allocated = 1;
 
-	/*
-	 * Write an empty page as a placeholder for the root page. It will be
-	 * replaced with the real root page at the end.
-	 */
-	page = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, MCXT_ALLOC_ZERO);
-	smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
-			   page, true);
-	state->pages_allocated++;
-	state->pages_written++;
+	state->bulkw = bulkw_start_rel(state->indexrel, MAIN_FORKNUM);
 
 	/* Allocate a temporary buffer for the first leaf page batch. */
+	page = bulkw_alloc_buf(state->bulkw);
 	levelstate = palloc0(sizeof(GistSortedBuildLevelState));
 	levelstate->pages[0] = page;
 	levelstate->parent = NULL;
@@ -455,31 +444,13 @@ gist_indexsortbuild(GISTBuildState *state)
 		levelstate = parent;
 	}
 
-	gist_indexsortbuild_flush_ready_pages(state);
-
 	/* Write out the root */
 	PageSetLSN(levelstate->pages[0], GistBuildLSN);
-	PageSetChecksumInplace(levelstate->pages[0], GIST_ROOT_BLKNO);
-	smgrwrite(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
-			  levelstate->pages[0], true);
-	if (RelationNeedsWAL(state->indexrel))
-		log_newpage(&state->indexrel->rd_locator, MAIN_FORKNUM, GIST_ROOT_BLKNO,
-					levelstate->pages[0], true);
-
-	pfree(levelstate->pages[0]);
+	bulkw_write(state->bulkw, GIST_ROOT_BLKNO, levelstate->pages[0], true);
+
 	pfree(levelstate);
 
-	/*
-	 * When we WAL-logged index pages, we must nonetheless fsync index files.
-	 * Since we're building outside shared buffers, a CHECKPOINT occurring
-	 * during the build has no way to flush the previously written data to
-	 * disk (indeed it won't know the index even exists).  A crash later on
-	 * would replay WAL from the checkpoint, therefore it wouldn't replay our
-	 * earlier WAL entries. If we do not fsync those pages here, they might
-	 * still not be on disk when the crash occurs.
-	 */
-	if (RelationNeedsWAL(state->indexrel))
-		smgrimmedsync(RelationGetSmgr(state->indexrel), MAIN_FORKNUM);
+	bulkw_finish(state->bulkw);
 }
 
 /*
@@ -510,7 +481,7 @@ gist_indexsortbuild_levelstate_add(GISTBuildState *state,
 
 		if (levelstate->pages[levelstate->current_page] == NULL)
 			levelstate->pages[levelstate->current_page] =
-				palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
+				bulkw_alloc_buf(state->bulkw);
 
 		newPage = levelstate->pages[levelstate->current_page];
 		gistinitpage(newPage, old_page_flags);
@@ -580,7 +551,7 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 
 		/* Create page and copy data */
 		data = (char *) (dist->list);
-		target = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, MCXT_ALLOC_ZERO);
+		target = bulkw_alloc_buf(state->bulkw);
 		gistinitpage(target, isleaf ? F_LEAF : 0);
 		for (int i = 0; i < dist->block.num; i++)
 		{
@@ -593,20 +564,6 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 		}
 		union_tuple = dist->itup;
 
-		if (state->ready_num_pages == XLR_MAX_BLOCK_ID)
-			gist_indexsortbuild_flush_ready_pages(state);
-
-		/*
-		 * The page is now complete. Assign a block number to it, and add it
-		 * to the list of finished pages. (We don't write it out immediately,
-		 * because we want to WAL-log the pages in batches.)
-		 */
-		blkno = state->pages_allocated++;
-		state->ready_blknos[state->ready_num_pages] = blkno;
-		state->ready_pages[state->ready_num_pages] = target;
-		state->ready_num_pages++;
-		ItemPointerSetBlockNumber(&(union_tuple->t_tid), blkno);
-
 		/*
 		 * Set the right link to point to the previous page. This is just for
 		 * debugging purposes: GiST only follows the right link if a page is
@@ -621,6 +578,15 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 		 */
 		if (levelstate->last_blkno)
 			GistPageGetOpaque(target)->rightlink = levelstate->last_blkno;
+
+		/*
+		 * The page is now complete. Assign a block number to it, and pass it
+		 * to the bulk writer.
+		 */
+		blkno = state->pages_allocated++;
+		PageSetLSN(target, GistBuildLSN);
+		bulkw_write(state->bulkw, blkno, target, true);
+		ItemPointerSetBlockNumber(&(union_tuple->t_tid), blkno);
 		levelstate->last_blkno = blkno;
 
 		/*
@@ -631,7 +597,7 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 		if (parent == NULL)
 		{
 			parent = palloc0(sizeof(GistSortedBuildLevelState));
-			parent->pages[0] = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
+			parent->pages[0] = bulkw_alloc_buf(state->bulkw);
 			parent->parent = NULL;
 			gistinitpage(parent->pages[0], 0);
 
@@ -641,39 +607,6 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 	}
 }
 
-static void
-gist_indexsortbuild_flush_ready_pages(GISTBuildState *state)
-{
-	if (state->ready_num_pages == 0)
-		return;
-
-	for (int i = 0; i < state->ready_num_pages; i++)
-	{
-		Page		page = state->ready_pages[i];
-		BlockNumber blkno = state->ready_blknos[i];
-
-		/* Currently, the blocks must be buffered in order. */
-		if (blkno != state->pages_written)
-			elog(ERROR, "unexpected block number to flush GiST sorting build");
-
-		PageSetLSN(page, GistBuildLSN);
-		PageSetChecksumInplace(page, blkno);
-		smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, blkno, page,
-				   true);
-
-		state->pages_written++;
-	}
-
-	if (RelationNeedsWAL(state->indexrel))
-		log_newpages(&state->indexrel->rd_locator, MAIN_FORKNUM, state->ready_num_pages,
-					 state->ready_blknos, state->ready_pages, true);
-
-	for (int i = 0; i < state->ready_num_pages; i++)
-		pfree(state->ready_pages[i]);
-
-	state->ready_num_pages = 0;
-}
-
 
 /*-------------------------------------------------------------------------
  * Routines for non-sorted build
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 424958912c7..9d7c7042e68 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -87,8 +87,8 @@
  * is optimized for bulk inserting a lot of tuples, knowing that we have
  * exclusive access to the heap.  raw_heap_insert builds new pages in
  * local storage.  When a page is full, or at the end of the process,
- * we insert it to WAL as a single record and then write it to disk
- * directly through smgr.  Note, however, that any data sent to the new
+ * we insert it to WAL as a single record and then write it to disk with
+ * the bulk smgr writer.  Note, however, that any data sent to the new
  * heap's TOAST table will go through the normal bufmgr.
  *
  *
@@ -119,9 +119,9 @@
 #include "replication/logical.h"
 #include "replication/slot.h"
 #include "storage/bufmgr.h"
+#include "storage/bulk_write.h"
 #include "storage/fd.h"
 #include "storage/procarray.h"
-#include "storage/smgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -133,9 +133,11 @@ typedef struct RewriteStateData
 {
 	Relation	rs_old_rel;		/* source heap */
 	Relation	rs_new_rel;		/* destination heap */
+
+	BulkWriteState *rs_bulkw;
+
 	Page		rs_buffer;		/* page currently being built */
 	BlockNumber rs_blockno;		/* block where page will go */
-	bool		rs_buffer_valid;	/* T if any tuples in buffer */
 	bool		rs_logical_rewrite; /* do we need to do logical rewriting */
 	TransactionId rs_oldest_xmin;	/* oldest xmin used by caller to determine
 									 * tuple visibility */
@@ -255,15 +257,16 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
 
 	state->rs_old_rel = old_heap;
 	state->rs_new_rel = new_heap;
-	state->rs_buffer = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
+	state->rs_buffer = NULL;
 	/* new_heap needn't be empty, just locked */
 	state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
-	state->rs_buffer_valid = false;
 	state->rs_oldest_xmin = oldest_xmin;
 	state->rs_freeze_xid = freeze_xid;
 	state->rs_cutoff_multi = cutoff_multi;
 	state->rs_cxt = rw_cxt;
 
+	state->rs_bulkw = bulkw_start_rel(new_heap, MAIN_FORKNUM);
+
 	/* Initialize hash tables used to track update chains */
 	hash_ctl.keysize = sizeof(TidHashKey);
 	hash_ctl.entrysize = sizeof(UnresolvedTupData);
@@ -314,30 +317,13 @@ end_heap_rewrite(RewriteState state)
 	}
 
 	/* Write the last page, if any */
-	if (state->rs_buffer_valid)
+	if (state->rs_buffer)
 	{
-		if (RelationNeedsWAL(state->rs_new_rel))
-			log_newpage(&state->rs_new_rel->rd_locator,
-						MAIN_FORKNUM,
-						state->rs_blockno,
-						state->rs_buffer,
-						true);
-
-		PageSetChecksumInplace(state->rs_buffer, state->rs_blockno);
-
-		smgrextend(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,
-				   state->rs_blockno, state->rs_buffer, true);
+		bulkw_write(state->rs_bulkw, state->rs_blockno, state->rs_buffer, true);
+		state->rs_buffer = NULL;
 	}
 
-	/*
-	 * When we WAL-logged rel pages, we must nonetheless fsync them.  The
-	 * reason is the same as in storage.c's RelationCopyStorage(): we're
-	 * writing data that's not in shared buffers, and so a CHECKPOINT
-	 * occurring during the rewriteheap operation won't have fsync'd data we
-	 * wrote before the checkpoint.
-	 */
-	if (RelationNeedsWAL(state->rs_new_rel))
-		smgrimmedsync(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM);
+	bulkw_finish(state->rs_bulkw);
 
 	logical_end_heap_rewrite(state);
 
@@ -611,7 +597,7 @@ rewrite_heap_dead_tuple(RewriteState state, HeapTuple old_tuple)
 static void
 raw_heap_insert(RewriteState state, HeapTuple tup)
 {
-	Page		page = state->rs_buffer;
+	Page		page;
 	Size		pageFreeSpace,
 				saveFreeSpace;
 	Size		len;
@@ -664,7 +650,8 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
 												   HEAP_DEFAULT_FILLFACTOR);
 
 	/* Now we can check to see if there's enough free space already. */
-	if (state->rs_buffer_valid)
+	page = state->rs_buffer;
+	if (page)
 	{
 		pageFreeSpace = PageGetHeapFreeSpace(page);
 
@@ -675,35 +662,17 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
 			 * contains a tuple.  Hence, unlike RelationGetBufferForTuple(),
 			 * enforce saveFreeSpace unconditionally.
 			 */
-
-			/* XLOG stuff */
-			if (RelationNeedsWAL(state->rs_new_rel))
-				log_newpage(&state->rs_new_rel->rd_locator,
-							MAIN_FORKNUM,
-							state->rs_blockno,
-							page,
-							true);
-
-			/*
-			 * Now write the page. We say skipFsync = true because there's no
-			 * need for smgr to schedule an fsync for this write; we'll do it
-			 * ourselves in end_heap_rewrite.
-			 */
-			PageSetChecksumInplace(page, state->rs_blockno);
-
-			smgrextend(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,
-					   state->rs_blockno, page, true);
-
+			bulkw_write(state->rs_bulkw, state->rs_blockno, state->rs_buffer, true);
+			page = state->rs_buffer = NULL;
 			state->rs_blockno++;
-			state->rs_buffer_valid = false;
 		}
 	}
 
-	if (!state->rs_buffer_valid)
+	if (!page)
 	{
 		/* Initialize a new empty page */
+		page = state->rs_buffer = bulkw_alloc_buf(state->rs_bulkw);
 		PageInit(page, BLCKSZ, 0);
-		state->rs_buffer_valid = true;
 	}
 
 	/* And now we can insert the tuple into the page */
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index a88b36a589a..1eab85f61f6 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -29,11 +29,11 @@
 #include "nodes/execnodes.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
+#include "storage/bulk_write.h"
 #include "storage/condition_variable.h"
 #include "storage/indexfsm.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
-#include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/index_selfuncs.h"
 #include "utils/memutils.h"
@@ -152,32 +152,17 @@ void
 btbuildempty(Relation index)
 {
 	bool		allequalimage = _bt_allequalimage(index, false);
-	Buffer		metabuf;
 	Page		metapage;
+	BulkWriteState *bulkw;
 
-	/*
-	 * Initalize the metapage.
-	 *
-	 * Regular index build bypasses the buffer manager and uses smgr functions
-	 * directly, with an smgrimmedsync() call at the end.  That makes sense
-	 * when the index is large, but for an empty index, it's better to use the
-	 * buffer cache to avoid the smgrimmedsync().
-	 */
-	metabuf = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
-	Assert(BufferGetBlockNumber(metabuf) == BTREE_METAPAGE);
-	_bt_lockbuf(index, metabuf, BT_WRITE);
+	bulkw = bulkw_start_rel(index, INIT_FORKNUM);
 
-	START_CRIT_SECTION();
-
-	metapage = BufferGetPage(metabuf);
+	/* Construct metapage. */
+	metapage = bulkw_alloc_buf(bulkw);
 	_bt_initmetapage(metapage, P_NONE, 0, allequalimage);
-	MarkBufferDirty(metabuf);
-	log_newpage_buffer(metabuf, true);
-
-	END_CRIT_SECTION();
+	bulkw_write(bulkw, BTREE_METAPAGE, metapage, true);
 
-	_bt_unlockbuf(index, metabuf);
-	ReleaseBuffer(metabuf);
+	bulkw_finish(bulkw);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index c2665fce411..1b0591f551c 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -23,13 +23,8 @@
  * many upper pages if the keys are reasonable-size) without risking a lot of
  * cascading splits during early insertions.
  *
- * Formerly the index pages being built were kept in shared buffers, but
- * that is of no value (since other backends have no interest in them yet)
- * and it created locking problems for CHECKPOINT, because the upper-level
- * pages were held exclusive-locked for long periods.  Now we just build
- * the pages in local memory and smgrwrite or smgrextend them as we finish
- * them.  They will need to be re-read into shared buffers on first use after
- * the build finishes.
+ * We use the bulk smgr loading facility to bypass the buffer cache and
+ * WAL log the pages efficiently.
  *
  * This code isn't concerned about the FSM at all. The caller is responsible
  * for initializing that.
@@ -57,7 +52,7 @@
 #include "executor/instrument.h"
 #include "miscadmin.h"
 #include "pgstat.h"
-#include "storage/smgr.h"
+#include "storage/bulk_write.h"
 #include "tcop/tcopprot.h"		/* pgrminclude ignore */
 #include "utils/rel.h"
 #include "utils/sortsupport.h"
@@ -251,11 +246,9 @@ typedef struct BTWriteState
 {
 	Relation	heap;
 	Relation	index;
+	BulkWriteState *bulkw;
 	BTScanInsert inskey;		/* generic insertion scankey */
-	bool		btws_use_wal;	/* dump pages to WAL? */
 	BlockNumber btws_pages_alloced; /* # pages allocated */
-	BlockNumber btws_pages_written; /* # pages written out */
-	Page		btws_zeropage;	/* workspace for filling zeroes */
 } BTWriteState;
 
 
@@ -267,7 +260,7 @@ static void _bt_spool(BTSpool *btspool, ItemPointer self,
 static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
-static Page _bt_blnewpage(uint32 level);
+static Page _bt_blnewpage(BTWriteState *wstate, uint32 level);
 static BTPageState *_bt_pagestate(BTWriteState *wstate, uint32 level);
 static void _bt_slideleft(Page rightmostpage);
 static void _bt_sortaddtup(Page page, Size itemsize,
@@ -569,16 +562,17 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 	/* _bt_mkscankey() won't set allequalimage without metapage */
 	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
-	wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
 
 	/* reserve the metapage */
 	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
-	wstate.btws_pages_written = 0;
-	wstate.btws_zeropage = NULL;	/* until needed */
+
+	wstate.bulkw = bulkw_start_rel(wstate.index, MAIN_FORKNUM);
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
 	_bt_load(&wstate, btspool, btspool2);
+
+	bulkw_finish(wstate.bulkw);
 }
 
 /*
@@ -614,12 +608,12 @@ _bt_build_callback(Relation index,
  * allocate workspace for a new, clean btree page, not linked to any siblings.
  */
 static Page
-_bt_blnewpage(uint32 level)
+_bt_blnewpage(BTWriteState *wstate, uint32 level)
 {
 	Page		page;
 	BTPageOpaque opaque;
 
-	page = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
+	page = bulkw_alloc_buf(wstate->bulkw);
 
 	/* Zero the page and set up standard page header info */
 	_bt_pageinit(page, BLCKSZ);
@@ -643,54 +637,8 @@ _bt_blnewpage(uint32 level)
 static void
 _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 {
-	/* XLOG stuff */
-	if (wstate->btws_use_wal)
-	{
-		/* We use the XLOG_FPI record type for this */
-		log_newpage(&wstate->index->rd_locator, MAIN_FORKNUM, blkno, page, true);
-	}
-
-	/*
-	 * If we have to write pages nonsequentially, fill in the space with
-	 * zeroes until we come back and overwrite.  This is not logically
-	 * necessary on standard Unix filesystems (unwritten space will read as
-	 * zeroes anyway), but it should help to avoid fragmentation. The dummy
-	 * pages aren't WAL-logged though.
-	 */
-	while (blkno > wstate->btws_pages_written)
-	{
-		if (!wstate->btws_zeropage)
-			wstate->btws_zeropage = (Page) palloc_aligned(BLCKSZ,
-														  PG_IO_ALIGN_SIZE,
-														  MCXT_ALLOC_ZERO);
-		/* don't set checksum for all-zero page */
-		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
-				   wstate->btws_pages_written++,
-				   wstate->btws_zeropage,
-				   true);
-	}
-
-	PageSetChecksumInplace(page, blkno);
-
-	/*
-	 * Now write the page.  There's no need for smgr to schedule an fsync for
-	 * this write; we'll do it ourselves before ending the build.
-	 */
-	if (blkno == wstate->btws_pages_written)
-	{
-		/* extending the file... */
-		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM, blkno,
-				   page, true);
-		wstate->btws_pages_written++;
-	}
-	else
-	{
-		/* overwriting a block we zero-filled before */
-		smgrwrite(RelationGetSmgr(wstate->index), MAIN_FORKNUM, blkno,
-				  page, true);
-	}
-
-	pfree(page);
+	bulkw_write(wstate->bulkw, blkno, page, true);
+	/* bulkw_write took ownership of 'page' */
 }
 
 /*
@@ -703,7 +651,7 @@ _bt_pagestate(BTWriteState *wstate, uint32 level)
 	BTPageState *state = (BTPageState *) palloc0(sizeof(BTPageState));
 
 	/* create initial page for level */
-	state->btps_page = _bt_blnewpage(level);
+	state->btps_page = _bt_blnewpage(wstate, level);
 
 	/* and assign it a page position */
 	state->btps_blkno = wstate->btws_pages_alloced++;
@@ -916,7 +864,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 		IndexTuple	oitup;
 
 		/* Create new page of same level */
-		npage = _bt_blnewpage(state->btps_level);
+		npage = _bt_blnewpage(wstate, state->btps_level);
 
 		/* and assign it a page position */
 		nblkno = wstate->btws_pages_alloced++;
@@ -1028,8 +976,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 		}
 
 		/*
-		 * Write out the old page.  We never need to touch it again, so we can
-		 * free the opage workspace too.
+		 * Write out the old page. _bt_blwritepage takes ownership of the
+		 * 'opage' buffer.
 		 */
 		_bt_blwritepage(wstate, opage, oblkno);
 
@@ -1163,7 +1111,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		 */
 		_bt_slideleft(s->btps_page);
 		_bt_blwritepage(wstate, s->btps_page, s->btps_blkno);
-		s->btps_page = NULL;	/* writepage freed the workspace */
+		s->btps_page = NULL;	/* writepage took ownership of the buffer */
 	}
 
 	/*
@@ -1172,7 +1120,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	 * set to point to "P_NONE").  This changes the index to the "valid" state
 	 * by filling in a valid magic number in the metapage.
 	 */
-	metapage = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
+	metapage = bulkw_alloc_buf(wstate->bulkw);
 	_bt_initmetapage(metapage, rootblkno, rootlevel,
 					 wstate->inskey->allequalimage);
 	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
@@ -1422,18 +1370,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 
 	/* Close down final pages and write the metapage */
 	_bt_uppershutdown(wstate, state);
-
-	/*
-	 * When we WAL-logged index pages, we must nonetheless fsync index files.
-	 * Since we're building outside shared buffers, a CHECKPOINT occurring
-	 * during the build has no way to flush the previously written data to
-	 * disk (indeed it won't know the index even exists).  A crash later on
-	 * would replay WAL from the checkpoint, therefore it wouldn't replay our
-	 * earlier WAL entries. If we do not fsync those pages here, they might
-	 * still not be on disk when the crash occurs.
-	 */
-	if (wstate->btws_use_wal)
-		smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
 }
 
 /*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 4443f1918df..11fe932c26d 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -25,7 +25,7 @@
 #include "catalog/index.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
-#include "storage/smgr.h"
+#include "storage/bulk_write.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -155,42 +155,27 @@ spgbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 void
 spgbuildempty(Relation index)
 {
-	Buffer		metabuffer,
-				rootbuffer,
-				nullbuffer;
-
-	/*
-	 * Initialize the meta page and root pages
-	 */
-	metabuffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
-	LockBuffer(metabuffer, BUFFER_LOCK_EXCLUSIVE);
-	rootbuffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
-	LockBuffer(rootbuffer, BUFFER_LOCK_EXCLUSIVE);
-	nullbuffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
-	LockBuffer(nullbuffer, BUFFER_LOCK_EXCLUSIVE);
-
-	Assert(BufferGetBlockNumber(metabuffer) == SPGIST_METAPAGE_BLKNO);
-	Assert(BufferGetBlockNumber(rootbuffer) == SPGIST_ROOT_BLKNO);
-	Assert(BufferGetBlockNumber(nullbuffer) == SPGIST_NULL_BLKNO);
+	Page		page;
+	BulkWriteState *bulkw;
 
-	START_CRIT_SECTION();
+	bulkw = bulkw_start_rel(index, INIT_FORKNUM);
 
-	SpGistInitMetapage(BufferGetPage(metabuffer));
-	MarkBufferDirty(metabuffer);
-	SpGistInitBuffer(rootbuffer, SPGIST_LEAF);
-	MarkBufferDirty(rootbuffer);
-	SpGistInitBuffer(nullbuffer, SPGIST_LEAF | SPGIST_NULLS);
-	MarkBufferDirty(nullbuffer);
+	/* Construct metapage. */
+	page = (Page) bulkw_alloc_buf(bulkw);
+	SpGistInitMetapage(page);
+	bulkw_write(bulkw, SPGIST_METAPAGE_BLKNO, page, true);
 
-	log_newpage_buffer(metabuffer, true);
-	log_newpage_buffer(rootbuffer, true);
-	log_newpage_buffer(nullbuffer, true);
+	/* Likewise for the root page. */
+	page = (Page) bulkw_alloc_buf(bulkw);
+	SpGistInitPage(page, SPGIST_LEAF);
+	bulkw_write(bulkw, SPGIST_ROOT_BLKNO, page, true);
 
-	END_CRIT_SECTION();
+	/* Likewise for the null-tuples root page. */
+	page = (Page) bulkw_alloc_buf(bulkw);
+	SpGistInitPage(page, SPGIST_LEAF | SPGIST_NULLS);
+	bulkw_write(bulkw, SPGIST_NULL_BLKNO, page, true);
 
-	UnlockReleaseBuffer(metabuffer);
-	UnlockReleaseBuffer(rootbuffer);
-	UnlockReleaseBuffer(nullbuffer);
+	bulkw_finish(bulkw);
 }
 
 /*
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 93f07e49b72..eb982b1a88d 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -28,6 +28,7 @@
 #include "catalog/storage.h"
 #include "catalog/storage_xlog.h"
 #include "miscadmin.h"
+#include "storage/bulk_write.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
 #include "utils/hsearch.h"
@@ -451,14 +452,11 @@ void
 RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 					ForkNumber forkNum, char relpersistence)
 {
-	PGIOAlignedBlock buf;
-	Page		page;
 	bool		use_wal;
 	bool		copying_initfork;
 	BlockNumber nblocks;
 	BlockNumber blkno;
-
-	page = (Page) buf.data;
+	BulkWriteState *bulkw;
 
 	/*
 	 * The init fork for an unlogged relation in many respects has to be
@@ -477,14 +475,19 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 	use_wal = XLogIsNeeded() &&
 		(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
 
+	bulkw = bulkw_start_smgr(dst, forkNum, use_wal);
+
 	nblocks = smgrnblocks(src, forkNum);
 
 	for (blkno = 0; blkno < nblocks; blkno++)
 	{
+		Page		page;
+
 		/* If we got a cancel signal during the copy of the data, quit */
 		CHECK_FOR_INTERRUPTS();
 
-		smgrread(src, forkNum, blkno, buf.data);
+		page = bulkw_alloc_buf(bulkw);
+		smgrread(src, forkNum, blkno, page);
 
 		if (!PageIsVerifiedExtended(page, blkno,
 									PIV_LOG_WARNING | PIV_REPORT_STAT))
@@ -511,30 +514,9 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 		 * page this is, so we have to log the full page including any unused
 		 * space.
 		 */
-		if (use_wal)
-			log_newpage(&dst->smgr_rlocator.locator, forkNum, blkno, page, false);
-
-		PageSetChecksumInplace(page, blkno);
-
-		/*
-		 * Now write the page.  We say skipFsync = true because there's no
-		 * need for smgr to schedule an fsync for this write; we'll do it
-		 * ourselves below.
-		 */
-		smgrextend(dst, forkNum, blkno, buf.data, true);
+		bulkw_write(bulkw, blkno, page, false);
 	}
-
-	/*
-	 * When we WAL-logged rel pages, we must nonetheless fsync them.  The
-	 * reason is that since we're copying outside shared buffers, a CHECKPOINT
-	 * occurring during the copy has no way to flush the previously written
-	 * data to disk (indeed it won't know the new rel even exists).  A crash
-	 * later on would replay WAL from the checkpoint, therefore it wouldn't
-	 * replay our earlier WAL entries. If we do not fsync those pages here,
-	 * they might still not be on disk when the crash occurs.
-	 */
-	if (use_wal || copying_initfork)
-		smgrimmedsync(dst, forkNum);
+	bulkw_finish(bulkw);
 }
 
 /*
diff --git a/src/backend/storage/smgr/Makefile b/src/backend/storage/smgr/Makefile
index 596b564656f..1d0b98764f9 100644
--- a/src/backend/storage/smgr/Makefile
+++ b/src/backend/storage/smgr/Makefile
@@ -13,6 +13,7 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = \
+	bulk_write.o \
 	md.o \
 	smgr.o
 
diff --git a/src/backend/storage/smgr/bulk_write.c b/src/backend/storage/smgr/bulk_write.c
new file mode 100644
index 00000000000..d9090979d65
--- /dev/null
+++ b/src/backend/storage/smgr/bulk_write.c
@@ -0,0 +1,334 @@
+/*-------------------------------------------------------------------------
+ *
+ * bulk_write.c
+ *	  Efficiently and reliably populate a new relation
+ *
+ * The assumption is that no other backends access the relation while we are
+ * loading it, so we can take some shortcuts.  Alternatively, you can use the
+ * buffer manager as usual, if performance is not critical, but you must not
+ * mix operations through the buffer manager and the bulk loading interface at
+ * the same time.
+ *
+ * We bypass the buffer manager to avoid the locking overhead, and call
+ * smgrextend() directly.  A downside is that the pages will need to be
+ * re-read into shared buffers on first use after the build finishes.  That's
+ * usually a good tradeoff for large relations, and for small relations, the
+ * overhead isn't very significant compared to creating the relation in the
+ * first place.
+ *
+ * The pages are WAL-logged if needed.  To save on WAL header overhead, we
+ * WAL-log several pages in one record.
+ *
+ * One tricky point is that because we bypass the buffer manager, we need to
+ * register the relation for fsyncing at the next checkpoint ourselves, and
+ * make sure that the relation is correctly fsync by us or the checkpointer
+ * even if a checkpoint happens concurrently.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/smgr/bulk_write.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xloginsert.h"
+#include "access/xlogrecord.h"
+#include "storage/bufmgr.h"
+#include "storage/bufpage.h"
+#include "storage/bulk_write.h"
+#include "storage/proc.h"
+#include "storage/smgr.h"
+#include "utils/rel.h"
+
+#define MAX_BUFFERED_PAGES XLR_MAX_BLOCK_ID
+
+typedef struct BulkWriteBuffer
+{
+	Page		page;
+	BlockNumber blkno;
+	bool		page_std;
+	int16		order;
+} BulkWriteBuffer;
+
+/*
+ * Bulk writer state for one relation fork.
+ */
+typedef struct BulkWriteState
+{
+	/* Information about the target relation we're writing */
+	SMgrRelation smgr;
+	ForkNumber	forknum;
+	bool		use_wal;
+
+	/* We keep several pages buffered, and WAL-log them in batches */
+	int			nbuffered;
+	BulkWriteBuffer buffers[MAX_BUFFERED_PAGES];
+
+	/* Current size of the relation */
+	BlockNumber pages_written;
+
+	/* The RedoRecPtr at the time that the bulk operation started */
+	XLogRecPtr	start_RedoRecPtr;
+
+	Page		zeropage;		/* workspace for filling zeroes */
+
+	MemoryContext memcxt;
+} BulkWriteState;
+
+static void bulkw_flush(BulkWriteState *bulkw);
+
+/*
+ * Start a bulk write operation on a relation fork.
+ */
+BulkWriteState *
+bulkw_start_rel(Relation rel, ForkNumber forknum)
+{
+	return bulkw_start_smgr(RelationGetSmgr(rel),
+							forknum,
+							RelationNeedsWAL(rel) || forknum == INIT_FORKNUM);
+}
+
+/*
+ * Start a bulk write operation on a relation fork.
+ *
+ * This is like bulkw_start_rel, but can be used without a relcache entry.
+ */
+BulkWriteState *
+bulkw_start_smgr(SMgrRelation smgr, ForkNumber forknum, bool use_wal)
+{
+	BulkWriteState *bulkw;
+
+	bulkw = palloc(sizeof(BulkWriteState));
+	bulkw->smgr = smgr;
+	bulkw->forknum = forknum;
+	bulkw->use_wal = use_wal;
+
+	bulkw->nbuffered = 0;
+	bulkw->pages_written = 0;
+
+	bulkw->start_RedoRecPtr = GetRedoRecPtr();
+
+	bulkw->zeropage = NULL;		/* until needed */
+
+	/*
+	 * Remember the memory context.  We will use it to allocate all the
+	 * buffers later.
+	 */
+	bulkw->memcxt = CurrentMemoryContext;
+
+	return bulkw;
+}
+
+/*
+ * Finish bulk write operation.
+ *
+ * This WAL-logs and flushes any remaining buffers to disk, and
+ * fsyncs the relation if needed.
+ */
+void
+bulkw_finish(BulkWriteState *bulkw)
+{
+	/* WAL-log and flush any remaining pages */
+	bulkw_flush(bulkw);
+
+	/*
+	 * When we wrote out the pages, we passed skipFsync=true to avoid the
+	 * overhead of registering all the writes with the checkpointer.  Register
+	 * the whole relation now.
+	 *
+	 * There is one hole in that idea: If a checkpoint occurred while we were
+	 * writing the pages, it already missed fsyncing the pages we had written
+	 * before the checkpoint started.  A crash later on would replay the WAL
+	 * starting from the checkpoint, therefore it wouldn't replay our earlier
+	 * WAL records.  So if a checkpoint started after the bulk write, fsync
+	 * the files now.
+	 */
+	if (!SmgrIsTemp(bulkw->smgr))
+	{
+		/*
+		 * Prevent a checkpoint from starting between the GetRedoRecPtr() and
+		 * smgrregistersync() calls.
+		 */
+		Assert((MyProc->delayChkptFlags & DELAY_CHKPT_START) == 0);
+		MyProc->delayChkptFlags |= DELAY_CHKPT_START;
+
+		if (bulkw->start_RedoRecPtr != GetRedoRecPtr())
+		{
+			/*
+			 * A checkpoint occurred and it didn't know about our writes, so
+			 * fsync() the relation ourselves.
+			 */
+			MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
+			smgrimmedsync(bulkw->smgr, bulkw->forknum);
+			elog(DEBUG1, "flushed relation because a checkpoint occurred concurrently");
+		}
+		else
+		{
+			smgrregistersync(bulkw->smgr, bulkw->forknum);
+			MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
+		}
+	}
+}
+
+static int
+buffer_cmp(const void *a, const void *b)
+{
+	const BulkWriteBuffer *bufa = (const BulkWriteBuffer *) a;
+	const BulkWriteBuffer *bufb = (const BulkWriteBuffer *) b;
+
+	if (bufa->blkno == bufb->blkno)
+	{
+		if (bufa->order > bufb->order)
+			return 1;
+		else
+			return -1;
+	}
+	else if (bufa->blkno > bufb->blkno)
+	{
+		return 1;
+	}
+	else
+		return -1;
+}
+
+/*
+ * Write all buffered pages to disk.
+ */
+static void
+bulkw_flush(BulkWriteState *bulkw)
+{
+	int			nbuffered = bulkw->nbuffered;
+	BulkWriteBuffer *buffers = bulkw->buffers;
+
+	if (nbuffered == 0)
+		return;
+
+	if (nbuffered > 1)
+	{
+		int			o;
+
+		qsort(buffers, nbuffered, sizeof(BulkWriteBuffer), buffer_cmp);
+
+		/*
+		 * Eliminate duplicates, keeping the last write of each block.
+		 * (buffer_cmp uses 'order' as the last sort key)
+		 */
+		o = 0;
+		for (int i = 0; i < nbuffered; i++)
+		{
+			if (i < nbuffered - 1 && buffers[i + 1].blkno == buffers[i].blkno)
+			{
+				/* there is a later write of the same page, skip this one */
+				pfree(buffers[i].page);
+				continue;
+			}
+
+			if (i != o)
+				buffers[o] = buffers[i];
+			o++;
+		}
+		nbuffered = o;
+	}
+
+	if (bulkw->use_wal)
+	{
+		BlockNumber blknos[MAX_BUFFERED_PAGES];
+		Page		pages[MAX_BUFFERED_PAGES];
+		bool		page_std = true;
+
+		for (int i = 0; i < nbuffered; i++)
+		{
+			blknos[i] = buffers[i].blkno;
+			pages[i] = buffers[i].page;
+
+			/*
+			 * If any of the pages use !page_std, we log them all as such.
+			 * That's a bit wasteful, but in practice, a mix of standard and
+			 * non-standard page layout is rare.  None of the built-in AMs do
+			 * that.
+			 */
+			if (!buffers[i].page_std)
+				page_std = false;
+		}
+		log_newpages(&bulkw->smgr->smgr_rlocator.locator, bulkw->forknum, nbuffered, blknos, pages,
+					 page_std);
+	}
+
+	for (int i = 0; i < nbuffered; i++)
+	{
+		BlockNumber blkno = buffers[i].blkno;
+		Page		page = buffers[i].page;
+
+		PageSetChecksumInplace(page, blkno);
+
+		if (blkno >= bulkw->pages_written)
+		{
+			/*
+			 * If we have to write pages nonsequentially, fill in the space
+			 * with zeroes until we come back and overwrite.  This is not
+			 * logically necessary on standard Unix filesystems (unwritten
+			 * space will read as zeroes anyway), but it should help to avoid
+			 * fragmentation.  The dummy pages aren't WAL-logged though.
+			 */
+			while (blkno > bulkw->pages_written)
+			{
+				if (!bulkw->zeropage)
+					bulkw->zeropage = (Page) palloc_aligned(BLCKSZ,
+															PG_IO_ALIGN_SIZE,
+															MCXT_ALLOC_ZERO);
+
+				/* don't set checksum for all-zero page */
+				smgrextend(bulkw->smgr, bulkw->forknum,
+						   bulkw->pages_written++,
+						   bulkw->zeropage,
+						   true);
+			}
+
+			smgrextend(bulkw->smgr, bulkw->forknum, blkno, page, true);
+			bulkw->pages_written = buffers[i].blkno + 1;
+		}
+		else
+			smgrwrite(bulkw->smgr, bulkw->forknum, blkno, page, true);
+		pfree(page);
+	}
+
+	bulkw->nbuffered = 0;
+}
+
+/*
+ * Write out or buffer write of 'page'.
+ *
+ * NB: this takes ownership of 'page'
+ */
+void
+bulkw_write(BulkWriteState *bulkw, BlockNumber blocknum, Page page, bool page_std)
+{
+	bulkw->buffers[bulkw->nbuffered].page = page;
+	bulkw->buffers[bulkw->nbuffered].blkno = blocknum;
+	bulkw->buffers[bulkw->nbuffered].page_std = page_std;
+
+	/*
+	 * If the same page is written multiple times, 'order' is used to remember
+	 * the order of the writes, so that the last write wins.
+	 */
+	bulkw->buffers[bulkw->nbuffered].order = (int16) bulkw->nbuffered;
+
+	bulkw->nbuffered++;
+
+	if (bulkw->nbuffered == MAX_BUFFERED_PAGES)
+		bulkw_flush(bulkw);
+}
+
+/*
+ * Allocate a new buffer which can later be written with bulkw_write()
+ */
+Page
+bulkw_alloc_buf(BulkWriteState *bulkw)
+{
+	return MemoryContextAllocAligned(bulkw->memcxt, BLCKSZ, PG_IO_ALIGN_SIZE, 0);
+}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index fdecbad1709..343ee51048e 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1082,6 +1082,49 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
 	}
 }
 
+/*
+ * mdregistersync() -- Mark whole relation as needing fsync
+ */
+void
+mdregistersync(SMgrRelation reln, ForkNumber forknum)
+{
+	int			segno;
+	int			min_inactive_seg;
+
+	/*
+	 * NOTE: mdnblocks makes sure we have opened all active segments, so that
+	 * the loop below will get them all!
+	 */
+	mdnblocks(reln, forknum);
+
+	min_inactive_seg = segno = reln->md_num_open_segs[forknum];
+
+	/*
+	 * Temporarily open inactive segments, then close them after sync.  There
+	 * may be some inactive segments left opened after error, but that is
+	 * harmless.  We don't bother to clean them up and take a risk of further
+	 * trouble.  The next mdclose() will soon close them.
+	 */
+	while (_mdfd_openseg(reln, forknum, segno, 0) != NULL)
+		segno++;
+
+	while (segno > 0)
+	{
+		MdfdVec    *v = &reln->md_seg_fds[forknum][segno - 1];
+
+		register_dirty_segment(reln, forknum, v);
+
+		/* Close inactive segments immediately */
+		if (segno > min_inactive_seg)
+		{
+			FileClose(v->mdfd_vfd);
+			_fdvec_resize(reln, forknum, segno - 1);
+		}
+
+		segno--;
+	}
+}
+
 /*
  * mdimmedsync() -- Immediately sync a relation to stable storage.
  *
diff --git a/src/backend/storage/smgr/meson.build b/src/backend/storage/smgr/meson.build
index e1ba6ed74b8..133622a6528 100644
--- a/src/backend/storage/smgr/meson.build
+++ b/src/backend/storage/smgr/meson.build
@@ -1,6 +1,7 @@
 # Copyright (c) 2022-2023, PostgreSQL Global Development Group
 
 backend_sources += files(
+  'bulk_write.c',
   'md.c',
   'smgr.c',
 )
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 5d0f3d515c3..9f7405e3c88 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -65,6 +65,7 @@ typedef struct f_smgr
 	void		(*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
 								  BlockNumber nblocks);
 	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+	void		(*smgr_registersync) (SMgrRelation reln, ForkNumber forknum);
 } f_smgr;
 
 static const f_smgr smgrsw[] = {
@@ -86,6 +87,7 @@ static const f_smgr smgrsw[] = {
 		.smgr_nblocks = mdnblocks,
 		.smgr_truncate = mdtruncate,
 		.smgr_immedsync = mdimmedsync,
+		.smgr_registersync = mdregistersync,
 	}
 };
 
@@ -576,6 +578,14 @@ smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
  * on disk at return, only dumped out to the kernel.  However,
  * provisions will be made to fsync the write before the next checkpoint.
  *
+ * NB: The mechanism to ensure fsync at next checkpoint assumes that there is
+ * something that prevents a concurrent checkpoint from "racing ahead" of the
+ * write.  One way to prevent that is by holding a lock on the buffer; the
+ * buffer manager's writes are protected by that.  The bulk writer facility in
+ * bulk_write.c checks the redo pointer and calls smgrimmedsync() if a
+ * checkpoint happened; that relies on the fact that no other backend can
+ * concurrently modify the page.
+ *
  * skipFsync indicates that the caller will make other provisions to
  * fsync the relation, so we needn't bother.  Temporary relations also
  * do not require fsync.
@@ -694,6 +704,24 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks, BlockNumber *nb
 	}
 }
 
+/*
+ * smgrregistersync() -- Request a relation to be sync'd at next checkpoint
+ *
+ * This can be used after calling smgrwrite() or smgrextend() with skipFsync =
+ * true, to register the fsyncs that were skipped earlier.
+ *
+ * Note: be mindful that a checkpoint could already have happened between the
+ * smgrwrite or smgrextend calls and this!  In that case, the checkpoint
+ * already missed fsyncing this relation, and you should use smgrimmedsync
+ * instead.  Most callers should use the bulk loading facility in bulk_write.c
+ * instead, which handles all that.
+ */
+void
+smgrregistersync(SMgrRelation reln, ForkNumber forknum)
+{
+	smgrsw[reln->smgr_which].smgr_registersync(reln, forknum);
+}
+
 /*
  * smgrimmedsync() -- Force the specified relation to stable storage.
  *
@@ -716,6 +744,9 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks, BlockNumber *nb
  * Note that you need to do FlushRelationBuffers() first if there is
  * any possibility that there are dirty buffers for the relation;
  * otherwise the sync is not very meaningful.
+ *
+ * Most callers should use the bulk loading facility in bulk_write.c
+ * instead of calling this directly.
  */
 void
 smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
diff --git a/src/include/storage/bulk_write.h b/src/include/storage/bulk_write.h
new file mode 100644
index 00000000000..c8eb7307178
--- /dev/null
+++ b/src/include/storage/bulk_write.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * bulk_write.h
+ *	  Efficiently and reliably populate a new relation
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/bulk_write.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BULK_WRITE_H
+#define BULK_WRITE_H
+
+typedef struct BulkWriteState BulkWriteState;
+
+/* forward declared from smgr.h */
+struct SMgrRelationData;
+
+extern BulkWriteState *bulkw_start_rel(Relation rel, ForkNumber forknum);
+extern BulkWriteState *bulkw_start_smgr(struct SMgrRelationData *smgr, ForkNumber forknum, bool use_wal);
+extern Page bulkw_alloc_buf(BulkWriteState *bulkw);
+extern void bulkw_write(BulkWriteState *bulkw, BlockNumber blocknum, Page page, bool page_std);
+extern void bulkw_finish(BulkWriteState *bulkw);
+
+#endif							/* BULK_WRITE_H */
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 941879ee6a8..225701271d2 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -42,6 +42,7 @@ extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
 extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
 					   BlockNumber nblocks);
 extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void mdregistersync(SMgrRelation reln, ForkNumber forknum);
 
 extern void ForgetDatabaseSyncRequests(Oid dbid);
 extern void DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a9a179aabac..cc5a91dc624 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -107,6 +107,7 @@ extern BlockNumber smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum);
 extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum,
 						 int nforks, BlockNumber *nblocks);
 extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void smgrregistersync(SMgrRelation reln, ForkNumber forknum);
 extern void AtEOXact_SMgr(void);
 extern bool ProcessBarrierSmgrRelease(void);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index dba3498a13e..23508059d1b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -327,6 +327,8 @@ BuildAccumulator
 BuiltinScript
 BulkInsertState
 BulkInsertStateData
+BulkWriteBuffer
+BulkWriteState
 CACHESIGN
 CAC_state
 CCFastEqualFN
-- 
2.39.2

#3Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#2)
Re: Relation bulk write facility

Hi,

On 2023-11-17 11:37:21 +0100, Heikki Linnakangas wrote:

The new facility makes it easier to optimize bulk loading, as the
logic for buffering, WAL-logging, and syncing the relation only needs
to be implemented once. It's also less error-prone: We have had a
number of bugs in how a relation is fsync'd - or not - at the end of a
bulk loading operation. By centralizing that logic to one place, we
only need to write it correctly once.

One thing I'd like to use the centralized handling for is to track such
writes in pg_stat_io. I don't mean as part of the initial patch, just that
that's another reason I like the facility.

The new facility is faster for small relations: Instead of of calling
smgrimmedsync(), we register the fsync to happen at next checkpoint,
which avoids the fsync latency. That can make a big difference if you
are e.g. restoring a schema-only dump with lots of relations.

I think this would be a huge win for people running their application tests
against postgres.

+	bulkw = bulkw_start_smgr(dst, forkNum, use_wal);
+
nblocks = smgrnblocks(src, forkNum);

for (blkno = 0; blkno < nblocks; blkno++)
{
+ Page page;
+
/* If we got a cancel signal during the copy of the data, quit */
CHECK_FOR_INTERRUPTS();

-		smgrread(src, forkNum, blkno, buf.data);
+		page = bulkw_alloc_buf(bulkw);
+		smgrread(src, forkNum, blkno, page);
if (!PageIsVerifiedExtended(page, blkno,
PIV_LOG_WARNING | PIV_REPORT_STAT))
@@ -511,30 +514,9 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
* page this is, so we have to log the full page including any unused
* space.
*/
-		if (use_wal)
-			log_newpage(&dst->smgr_rlocator.locator, forkNum, blkno, page, false);
-
-		PageSetChecksumInplace(page, blkno);
-
-		/*
-		 * Now write the page.  We say skipFsync = true because there's no
-		 * need for smgr to schedule an fsync for this write; we'll do it
-		 * ourselves below.
-		 */
-		smgrextend(dst, forkNum, blkno, buf.data, true);
+		bulkw_write(bulkw, blkno, page, false);

I wonder if bulkw_alloc_buf() is a good name - if you naively read this
change, it looks like it'll just leak memory. It also might be taken to be
valid until freed, which I don't think is the case?

+/*-------------------------------------------------------------------------
+ *
+ * bulk_write.c
+ *	  Efficiently and reliably populate a new relation
+ *
+ * The assumption is that no other backends access the relation while we are
+ * loading it, so we can take some shortcuts.  Alternatively, you can use the
+ * buffer manager as usual, if performance is not critical, but you must not
+ * mix operations through the buffer manager and the bulk loading interface at
+ * the same time.

From "Alternatively" onward this is is somewhat confusing.

+ * We bypass the buffer manager to avoid the locking overhead, and call
+ * smgrextend() directly.  A downside is that the pages will need to be
+ * re-read into shared buffers on first use after the build finishes.  That's
+ * usually a good tradeoff for large relations, and for small relations, the
+ * overhead isn't very significant compared to creating the relation in the
+ * first place.

FWIW, I doubt the "isn't very significant" bit is really true.

+ * One tricky point is that because we bypass the buffer manager, we need to
+ * register the relation for fsyncing at the next checkpoint ourselves, and
+ * make sure that the relation is correctly fsync by us or the checkpointer
+ * even if a checkpoint happens concurrently.

"fsync'ed" or such? Otherwise this reads awkwardly for me.

+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/smgr/bulk_write.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xloginsert.h"
+#include "access/xlogrecord.h"
+#include "storage/bufmgr.h"
+#include "storage/bufpage.h"
+#include "storage/bulk_write.h"
+#include "storage/proc.h"
+#include "storage/smgr.h"
+#include "utils/rel.h"
+
+#define MAX_BUFFERED_PAGES XLR_MAX_BLOCK_ID
+
+typedef struct BulkWriteBuffer
+{
+	Page		page;
+	BlockNumber blkno;
+	bool		page_std;
+	int16		order;
+} BulkWriteBuffer;
+

The name makes it sound like this struct itself contains a buffer - but it's
just pointing to one. *BufferRef or such maybe?

I was wondering how you dealt with the alignment of buffers given the struct
definition, which is what lead me to look at this...

+/*
+ * Bulk writer state for one relation fork.
+ */
+typedef struct BulkWriteState
+{
+	/* Information about the target relation we're writing */
+	SMgrRelation smgr;

Isn't there a danger of this becoming a dangling pointer? At least until
/messages/by-id/CA+hUKGJ8NTvqLHz6dqbQnt2c8XCki4r2QvXjBQcXpVwxTY_pvA@mail.gmail.com
is merged?

+	ForkNumber	forknum;
+	bool		use_wal;
+
+	/* We keep several pages buffered, and WAL-log them in batches */
+	int			nbuffered;
+	BulkWriteBuffer buffers[MAX_BUFFERED_PAGES];
+
+	/* Current size of the relation */
+	BlockNumber pages_written;
+
+	/* The RedoRecPtr at the time that the bulk operation started */
+	XLogRecPtr	start_RedoRecPtr;
+
+	Page		zeropage;		/* workspace for filling zeroes */

We really should just have one such page in shared memory somewhere... For WAL
writes as well.

But until then - why do you allocate the page? Seems like we could just use a
static global variable?

+/*
+ * Write all buffered pages to disk.
+ */
+static void
+bulkw_flush(BulkWriteState *bulkw)
+{
+	int			nbuffered = bulkw->nbuffered;
+	BulkWriteBuffer *buffers = bulkw->buffers;
+
+	if (nbuffered == 0)
+		return;
+
+	if (nbuffered > 1)
+	{
+		int			o;
+
+		qsort(buffers, nbuffered, sizeof(BulkWriteBuffer), buffer_cmp);
+
+		/*
+		 * Eliminate duplicates, keeping the last write of each block.
+		 * (buffer_cmp uses 'order' as the last sort key)
+		 */

Huh - which use cases would actually cause duplicate writes?

Greetings,

Andres Freund

#4Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andres Freund (#3)
1 attachment(s)
Re: Relation bulk write facility

On 19/11/2023 02:04, Andres Freund wrote:

On 2023-11-17 11:37:21 +0100, Heikki Linnakangas wrote:

The new facility makes it easier to optimize bulk loading, as the
logic for buffering, WAL-logging, and syncing the relation only needs
to be implemented once. It's also less error-prone: We have had a
number of bugs in how a relation is fsync'd - or not - at the end of a
bulk loading operation. By centralizing that logic to one place, we
only need to write it correctly once.

One thing I'd like to use the centralized handling for is to track such
writes in pg_stat_io. I don't mean as part of the initial patch, just that
that's another reason I like the facility.

Oh I didn't realize they're not counted at the moment.

+	bulkw = bulkw_start_smgr(dst, forkNum, use_wal);
+
nblocks = smgrnblocks(src, forkNum);

for (blkno = 0; blkno < nblocks; blkno++)
{
+ Page page;
+
/* If we got a cancel signal during the copy of the data, quit */
CHECK_FOR_INTERRUPTS();

-		smgrread(src, forkNum, blkno, buf.data);
+		page = bulkw_alloc_buf(bulkw);
+		smgrread(src, forkNum, blkno, page);
if (!PageIsVerifiedExtended(page, blkno,
PIV_LOG_WARNING | PIV_REPORT_STAT))
@@ -511,30 +514,9 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
* page this is, so we have to log the full page including any unused
* space.
*/
-		if (use_wal)
-			log_newpage(&dst->smgr_rlocator.locator, forkNum, blkno, page, false);
-
-		PageSetChecksumInplace(page, blkno);
-
-		/*
-		 * Now write the page.  We say skipFsync = true because there's no
-		 * need for smgr to schedule an fsync for this write; we'll do it
-		 * ourselves below.
-		 */
-		smgrextend(dst, forkNum, blkno, buf.data, true);
+		bulkw_write(bulkw, blkno, page, false);

I wonder if bulkw_alloc_buf() is a good name - if you naively read this
change, it looks like it'll just leak memory. It also might be taken to be
valid until freed, which I don't think is the case?

Yeah, I'm not very happy with this interface. The model is that you get
a buffer to write to by calling bulkw_alloc_buf(). Later, you hand it
over to bulkw_write(), which takes ownership of it and frees it later.
There is no other function to free it, although currently the buffer is
just palloc'd so you could call pfree on it.

However, I'd like to not expose that detail to the callers. I'm
imagining that in the future we might optimize further, by having a
larger e.g. 1 MB buffer, and carve the 8kB blocks from that. Then
opportunistically, if you fill the buffers sequentially, bulk_write.c
could do one smgrextend() to write the whole 1 MB chunk.

I renamed it to bulkw_get_buf() now, and made it return a new
BulkWriteBuffer typedef instead of a plain Page. The point of the new
typedef is to distinguish a buffer returned by bulkw_get_buf() from a
Page or char[BLCKSZ] that you might palloc on your own. That indeed
revealed some latent bugs in gistbuild.c where I had mixed up buffers
returned by bulkw_alloc_buf() and palloc'd buffers.

(The previous version of this patch called a different struct
BulkWriteBuffer, but I renamed that to PendingWrite; see below. Don't be
confused!)

I think this helps a little, but I'm still not very happy with it. I'll
give it some more thought after sleeping over it, but in the meanwhile,
I'm all ears for suggestions.
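
To make the calling pattern concrete, here's a minimal sketch using the
v1 names from bulk_write.h (with v3, bulkw_alloc_buf() becomes
bulkw_get_buf() and returns the new BulkWriteBuffer type). 'rel' and the
block number are just placeholders for illustration:

    BulkWriteState *bulkw;
    Page        page;
    BlockNumber blkno = 0;      /* made-up block number */

    bulkw = bulkw_start_rel(rel, MAIN_FORKNUM);

    /* get a buffer from the bulk writer, fill it in, and hand it over */
    page = bulkw_alloc_buf(bulkw);
    PageInit(page, BLCKSZ, 0);
    /* ... add tuples to the page ... */

    /* bulkw_write() takes ownership of 'page'; don't touch it afterwards */
    bulkw_write(bulkw, blkno, page, true);

    /* WAL-log and write out any remaining buffers, and register the fsync */
    bulkw_finish(bulkw);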

+/*-------------------------------------------------------------------------
+ *
+ * bulk_write.c
+ *	  Efficiently and reliably populate a new relation
+ *
+ * The assumption is that no other backends access the relation while we are
+ * loading it, so we can take some shortcuts.  Alternatively, you can use the
+ * buffer manager as usual, if performance is not critical, but you must not
+ * mix operations through the buffer manager and the bulk loading interface at
+ * the same time.

From "Alternatively" onward this is is somewhat confusing.

Rewrote that to just "Do not mix operations through the regular buffer
manager and the bulk loading interface!"

+ * One tricky point is that because we bypass the buffer manager, we need to
+ * register the relation for fsyncing at the next checkpoint ourselves, and
+ * make sure that the relation is correctly fsync by us or the checkpointer
+ * even if a checkpoint happens concurrently.

"fsync'ed" or such? Otherwise this reads awkwardly for me.

Yep, fixed.

+typedef struct BulkWriteBuffer
+{
+	Page		page;
+	BlockNumber blkno;
+	bool		page_std;
+	int16		order;
+} BulkWriteBuffer;
+

The name makes it sound like this struct itself contains a buffer - but it's
just pointing to one. *BufferRef or such maybe?

I was wondering how you dealt with the alignment of buffers given the struct
definition, which is what lead me to look at this...

I renamed this to PendingWrite, and the field that holds these to
"pending_writes". Think of it as a queue of writes that haven't been
performed yet.

+/*
+ * Bulk writer state for one relation fork.
+ */
+typedef struct BulkWriteState
+{
+	/* Information about the target relation we're writing */
+	SMgrRelation smgr;

Isn't there a danger of this becoming a dangling pointer? At least until
/messages/by-id/CA+hUKGJ8NTvqLHz6dqbQnt2c8XCki4r2QvXjBQcXpVwxTY_pvA@mail.gmail.com
is merged?

Good point. I just added a FIXME comment to remind about that, hoping
that that patch gets merged soon. If not, I'll come up with a different fix.

+	ForkNumber	forknum;
+	bool		use_wal;
+
+	/* We keep several pages buffered, and WAL-log them in batches */
+	int			nbuffered;
+	BulkWriteBuffer buffers[MAX_BUFFERED_PAGES];
+
+	/* Current size of the relation */
+	BlockNumber pages_written;
+
+	/* The RedoRecPtr at the time that the bulk operation started */
+	XLogRecPtr	start_RedoRecPtr;
+
+	Page		zeropage;		/* workspace for filling zeroes */

We really should just have one such page in shared memory somewhere... For WAL
writes as well.

But until then - why do you allocate the page? Seems like we could just use a
static global variable?

I made it a static global variable for now. (The palloc way was copied
over from nbtsort.c)
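
Something along these lines (just a sketch; assumes smgrextend() is fine
with a const buffer and that PGIOAlignedBlock provides the alignment
needed for direct I/O):

    /* An all-zeros page, used to fill gaps when writing nonsequentially */
    static const PGIOAlignedBlock zero_buffer = {{0}};

    ...
    /* don't set checksum for all-zero page */
    smgrextend(bulkw->smgr, bulkw->forknum,
               bulkw->pages_written++,
               zero_buffer.data,
               true);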

+/*
+ * Write all buffered pages to disk.
+ */
+static void
+bulkw_flush(BulkWriteState *bulkw)
+{
+	int			nbuffered = bulkw->nbuffered;
+	BulkWriteBuffer *buffers = bulkw->buffers;
+
+	if (nbuffered == 0)
+		return;
+
+	if (nbuffered > 1)
+	{
+		int			o;
+
+		qsort(buffers, nbuffered, sizeof(BulkWriteBuffer), buffer_cmp);
+
+		/*
+		 * Eliminate duplicates, keeping the last write of each block.
+		 * (buffer_cmp uses 'order' as the last sort key)
+		 */

Huh - which use cases would actually cause duplicate writes?

Hmm, nothing anymore I guess. Many AMs used to write zero pages as a
placeholder and come back to fill them in later, but now that
bulk_write.c fills in any gaps with zero pages itself, that pattern is gone.

Removed that, and replaced it with an assertion in buffer_cmp() that
there are no duplicates.
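
Roughly like this, modulo the struct rename to PendingWrite (a sketch of
the v3 shape, not a verbatim excerpt):

    static int
    buffer_cmp(const void *a, const void *b)
    {
        const PendingWrite *bufa = (const PendingWrite *) a;
        const PendingWrite *bufb = (const PendingWrite *) b;

        /* We should not see multiple writes for the same block */
        Assert(bufa->blkno != bufb->blkno);
        if (bufa->blkno > bufb->blkno)
            return 1;
        else
            return -1;
    }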

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachments:

v3-0001-Introduce-a-new-bulk-loading-facility.patchtext/x-patch; charset=UTF-8; name=v3-0001-Introduce-a-new-bulk-loading-facility.patchDownload
From 50b19da1a6d4c96084fc8388e64c1c646e9b5378 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 19 Sep 2023 18:09:34 +0300
Subject: [PATCH v3 1/1] Introduce a new bulk loading facility.

The new facility makes it easier to optimize bulk loading, as the
logic for buffering, WAL-logging, and syncing the relation only needs
to be implemented once. It's also less error-prone: We have had a
number of bugs in how a relation is fsync'd - or not - at the end of a
bulk loading operation. By centralizing that logic to one place, we
only need to write it correctly once.

The new facility is faster for small relations: Instead of calling
smgrimmedsync(), we register the fsync to happen at next checkpoint,
which avoids the fsync latency. That can make a big difference if you
are e.g. restoring a schema-only dump with lots of relations.

It is also slightly more efficient with large relations, as the WAL
logging is performed multiple pages at a time. That avoids some WAL
header overhead. The sorted GiST index build did that already; this
moves the buffering to the new facility.

The changes to the pageinspect GiST test need an explanation: Before this
patch, the sorted GiST index build set the LSN on every page to the
special GistBuildLSN value, not the LSN of the WAL record, even though
they were WAL-logged. There was no particular need for it, it just
happened naturally when we wrote out the pages before WAL-logging
them. Now we WAL-log the pages first, like in B-tree build, so the
pages are stamped with the record's real LSN. When the build is not
WAL-logged, we still use GistBuildLSN. To make the test output
predictable, use an unlogged index.

Reviewed-by: Andres Freund
Discussion: https://www.postgresql.org/message-id/30e8f366-58b3-b239-c521-422122dd5150%40iki.fi
---
 contrib/pageinspect/expected/gist.out |  14 +-
 contrib/pageinspect/sql/gist.sql      |  16 +-
 src/backend/access/gist/gistbuild.c   | 122 +++--------
 src/backend/access/heap/rewriteheap.c |  75 ++-----
 src/backend/access/nbtree/nbtree.c    |  33 +--
 src/backend/access/nbtree/nbtsort.c   | 134 ++++--------
 src/backend/access/spgist/spginsert.c |  49 ++---
 src/backend/catalog/storage.c         |  46 ++--
 src/backend/storage/smgr/Makefile     |   1 +
 src/backend/storage/smgr/bulk_write.c | 301 ++++++++++++++++++++++++++
 src/backend/storage/smgr/md.c         |  43 ++++
 src/backend/storage/smgr/meson.build  |   1 +
 src/backend/storage/smgr/smgr.c       |  31 +++
 src/include/storage/bulk_write.h      |  38 ++++
 src/include/storage/md.h              |   1 +
 src/include/storage/smgr.h            |   1 +
 src/tools/pgindent/typedefs.list      |   3 +
 17 files changed, 556 insertions(+), 353 deletions(-)
 create mode 100644 src/backend/storage/smgr/bulk_write.c
 create mode 100644 src/include/storage/bulk_write.h

diff --git a/contrib/pageinspect/expected/gist.out b/contrib/pageinspect/expected/gist.out
index d1adbab8ae2..2b1d54a6279 100644
--- a/contrib/pageinspect/expected/gist.out
+++ b/contrib/pageinspect/expected/gist.out
@@ -1,13 +1,6 @@
--- The gist_page_opaque_info() function prints the page's LSN. Normally,
--- that's constant 1 (GistBuildLSN) on every page of a freshly built GiST
--- index. But with wal_level=minimal, the whole relation is dumped to WAL at
--- the end of the transaction if it's smaller than wal_skip_threshold, which
--- updates the LSNs. Wrap the tests on gist_page_opaque_info() in the
--- same transaction with the CREATE INDEX so that we see the LSNs before
--- they are possibly overwritten at end of transaction.
-BEGIN;
--- Create a test table and GiST index.
-CREATE TABLE test_gist AS SELECT point(i,i) p, i::text t FROM
+-- The gist_page_opaque_info() function prints the page's LSN.
+-- Use an unlogged index, so that the LSN is predictable.
+CREATE UNLOGGED TABLE test_gist AS SELECT point(i,i) p, i::text t FROM
     generate_series(1,1000) i;
 CREATE INDEX test_gist_idx ON test_gist USING gist (p);
 -- Page 0 is the root, the rest are leaf pages
@@ -29,7 +22,6 @@ SELECT * FROM gist_page_opaque_info(get_raw_page('test_gist_idx', 2));
  0/1 | 0/0 |         1 | {leaf}
 (1 row)
 
-COMMIT;
 SELECT * FROM gist_page_items(get_raw_page('test_gist_idx', 0), 'test_gist_idx');
  itemoffset |   ctid    | itemlen | dead |             keys              
 ------------+-----------+---------+------+-------------------------------
diff --git a/contrib/pageinspect/sql/gist.sql b/contrib/pageinspect/sql/gist.sql
index d263542ba15..85bc44b8000 100644
--- a/contrib/pageinspect/sql/gist.sql
+++ b/contrib/pageinspect/sql/gist.sql
@@ -1,14 +1,6 @@
--- The gist_page_opaque_info() function prints the page's LSN. Normally,
--- that's constant 1 (GistBuildLSN) on every page of a freshly built GiST
--- index. But with wal_level=minimal, the whole relation is dumped to WAL at
--- the end of the transaction if it's smaller than wal_skip_threshold, which
--- updates the LSNs. Wrap the tests on gist_page_opaque_info() in the
--- same transaction with the CREATE INDEX so that we see the LSNs before
--- they are possibly overwritten at end of transaction.
-BEGIN;
-
--- Create a test table and GiST index.
-CREATE TABLE test_gist AS SELECT point(i,i) p, i::text t FROM
+-- The gist_page_opaque_info() function prints the page's LSN.
+-- Use an unlogged index, so that the LSN is predictable.
+CREATE UNLOGGED TABLE test_gist AS SELECT point(i,i) p, i::text t FROM
     generate_series(1,1000) i;
 CREATE INDEX test_gist_idx ON test_gist USING gist (p);
 
@@ -17,8 +9,6 @@ SELECT * FROM gist_page_opaque_info(get_raw_page('test_gist_idx', 0));
 SELECT * FROM gist_page_opaque_info(get_raw_page('test_gist_idx', 1));
 SELECT * FROM gist_page_opaque_info(get_raw_page('test_gist_idx', 2));
 
-COMMIT;
-
 SELECT * FROM gist_page_items(get_raw_page('test_gist_idx', 0), 'test_gist_idx');
 SELECT * FROM gist_page_items(get_raw_page('test_gist_idx', 1), 'test_gist_idx') LIMIT 5;
 
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index a45e2fe3755..5b7cefdcaaf 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -43,7 +43,8 @@
 #include "miscadmin.h"
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
-#include "storage/smgr.h"
+#include "storage/bulk_write.h"
+
 #include "utils/memutils.h"
 #include "utils/rel.h"
 #include "utils/tuplesort.h"
@@ -106,11 +107,8 @@ typedef struct
 	Tuplesortstate *sortstate;	/* state data for tuplesort.c */
 
 	BlockNumber pages_allocated;
-	BlockNumber pages_written;
 
-	int			ready_num_pages;
-	BlockNumber ready_blknos[XLR_MAX_BLOCK_ID];
-	Page		ready_pages[XLR_MAX_BLOCK_ID];
+	BulkWriteState *bulkw;
 } GISTBuildState;
 
 #define GIST_SORTED_BUILD_PAGE_NUM 4
@@ -142,7 +140,6 @@ static void gist_indexsortbuild_levelstate_add(GISTBuildState *state,
 											   IndexTuple itup);
 static void gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 												 GistSortedBuildLevelState *levelstate);
-static void gist_indexsortbuild_flush_ready_pages(GISTBuildState *state);
 
 static void gistInitBuffering(GISTBuildState *buildstate);
 static int	calculatePagesPerBuffer(GISTBuildState *buildstate, int levelStep);
@@ -405,27 +402,18 @@ gist_indexsortbuild(GISTBuildState *state)
 {
 	IndexTuple	itup;
 	GistSortedBuildLevelState *levelstate;
-	Page		page;
+	BulkWriteBuffer rootbuf;
 
-	state->pages_allocated = 0;
-	state->pages_written = 0;
-	state->ready_num_pages = 0;
+	/* Reserve block 0 for the root page */
+	state->pages_allocated = 1;
 
-	/*
-	 * Write an empty page as a placeholder for the root page. It will be
-	 * replaced with the real root page at the end.
-	 */
-	page = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, MCXT_ALLOC_ZERO);
-	smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
-			   page, true);
-	state->pages_allocated++;
-	state->pages_written++;
+	state->bulkw = bulkw_start_rel(state->indexrel, MAIN_FORKNUM);
 
 	/* Allocate a temporary buffer for the first leaf page batch. */
 	levelstate = palloc0(sizeof(GistSortedBuildLevelState));
-	levelstate->pages[0] = page;
+	levelstate->pages[0] = palloc(BLCKSZ);
 	levelstate->parent = NULL;
-	gistinitpage(page, F_LEAF);
+	gistinitpage(levelstate->pages[0], F_LEAF);
 
 	/*
 	 * Fill index pages with tuples in the sorted order.
@@ -455,31 +443,16 @@ gist_indexsortbuild(GISTBuildState *state)
 		levelstate = parent;
 	}
 
-	gist_indexsortbuild_flush_ready_pages(state);
-
 	/* Write out the root */
 	PageSetLSN(levelstate->pages[0], GistBuildLSN);
-	PageSetChecksumInplace(levelstate->pages[0], GIST_ROOT_BLKNO);
-	smgrwrite(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
-			  levelstate->pages[0], true);
-	if (RelationNeedsWAL(state->indexrel))
-		log_newpage(&state->indexrel->rd_locator, MAIN_FORKNUM, GIST_ROOT_BLKNO,
-					levelstate->pages[0], true);
-
-	pfree(levelstate->pages[0]);
+
+	rootbuf = bulkw_get_buf(state->bulkw);
+	memcpy(rootbuf, levelstate->pages[0], BLCKSZ);
+	bulkw_write(state->bulkw, GIST_ROOT_BLKNO, rootbuf, true);
+
 	pfree(levelstate);
 
-	/*
-	 * When we WAL-logged index pages, we must nonetheless fsync index files.
-	 * Since we're building outside shared buffers, a CHECKPOINT occurring
-	 * during the build has no way to flush the previously written data to
-	 * disk (indeed it won't know the index even exists).  A crash later on
-	 * would replay WAL from the checkpoint, therefore it wouldn't replay our
-	 * earlier WAL entries. If we do not fsync those pages here, they might
-	 * still not be on disk when the crash occurs.
-	 */
-	if (RelationNeedsWAL(state->indexrel))
-		smgrimmedsync(RelationGetSmgr(state->indexrel), MAIN_FORKNUM);
+	bulkw_finish(state->bulkw);
 }
 
 /*
@@ -509,8 +482,7 @@ gist_indexsortbuild_levelstate_add(GISTBuildState *state,
 			levelstate->current_page++;
 
 		if (levelstate->pages[levelstate->current_page] == NULL)
-			levelstate->pages[levelstate->current_page] =
-				palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
+			levelstate->pages[levelstate->current_page] = palloc0(BLCKSZ);
 
 		newPage = levelstate->pages[levelstate->current_page];
 		gistinitpage(newPage, old_page_flags);
@@ -573,6 +545,7 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 	for (; dist != NULL; dist = dist->next)
 	{
 		char	   *data;
+		BulkWriteBuffer buf;
 		Page		target;
 
 		/* check once per page */
@@ -580,7 +553,8 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 
 		/* Create page and copy data */
 		data = (char *) (dist->list);
-		target = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, MCXT_ALLOC_ZERO);
+		buf = bulkw_get_buf(state->bulkw);
+		target = (Page) buf;
 		gistinitpage(target, isleaf ? F_LEAF : 0);
 		for (int i = 0; i < dist->block.num; i++)
 		{
@@ -593,20 +567,6 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 		}
 		union_tuple = dist->itup;
 
-		if (state->ready_num_pages == XLR_MAX_BLOCK_ID)
-			gist_indexsortbuild_flush_ready_pages(state);
-
-		/*
-		 * The page is now complete. Assign a block number to it, and add it
-		 * to the list of finished pages. (We don't write it out immediately,
-		 * because we want to WAL-log the pages in batches.)
-		 */
-		blkno = state->pages_allocated++;
-		state->ready_blknos[state->ready_num_pages] = blkno;
-		state->ready_pages[state->ready_num_pages] = target;
-		state->ready_num_pages++;
-		ItemPointerSetBlockNumber(&(union_tuple->t_tid), blkno);
-
 		/*
 		 * Set the right link to point to the previous page. This is just for
 		 * debugging purposes: GiST only follows the right link if a page is
@@ -621,6 +581,15 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 		 */
 		if (levelstate->last_blkno)
 			GistPageGetOpaque(target)->rightlink = levelstate->last_blkno;
+
+		/*
+		 * The page is now complete. Assign a block number to it, and pass it
+		 * to the bulk writer.
+		 */
+		blkno = state->pages_allocated++;
+		PageSetLSN(target, GistBuildLSN);
+		bulkw_write(state->bulkw, blkno, buf, true);
+		ItemPointerSetBlockNumber(&(union_tuple->t_tid), blkno);
 		levelstate->last_blkno = blkno;
 
 		/*
@@ -631,7 +600,7 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 		if (parent == NULL)
 		{
 			parent = palloc0(sizeof(GistSortedBuildLevelState));
-			parent->pages[0] = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
+			parent->pages[0] = palloc(BLCKSZ);
 			parent->parent = NULL;
 			gistinitpage(parent->pages[0], 0);
 
@@ -641,39 +610,6 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 	}
 }
 
-static void
-gist_indexsortbuild_flush_ready_pages(GISTBuildState *state)
-{
-	if (state->ready_num_pages == 0)
-		return;
-
-	for (int i = 0; i < state->ready_num_pages; i++)
-	{
-		Page		page = state->ready_pages[i];
-		BlockNumber blkno = state->ready_blknos[i];
-
-		/* Currently, the blocks must be buffered in order. */
-		if (blkno != state->pages_written)
-			elog(ERROR, "unexpected block number to flush GiST sorting build");
-
-		PageSetLSN(page, GistBuildLSN);
-		PageSetChecksumInplace(page, blkno);
-		smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, blkno, page,
-				   true);
-
-		state->pages_written++;
-	}
-
-	if (RelationNeedsWAL(state->indexrel))
-		log_newpages(&state->indexrel->rd_locator, MAIN_FORKNUM, state->ready_num_pages,
-					 state->ready_blknos, state->ready_pages, true);
-
-	for (int i = 0; i < state->ready_num_pages; i++)
-		pfree(state->ready_pages[i]);
-
-	state->ready_num_pages = 0;
-}
-
 
 /*-------------------------------------------------------------------------
  * Routines for non-sorted build
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 424958912c7..ad848e31b1d 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -87,8 +87,8 @@
  * is optimized for bulk inserting a lot of tuples, knowing that we have
  * exclusive access to the heap.  raw_heap_insert builds new pages in
  * local storage.  When a page is full, or at the end of the process,
- * we insert it to WAL as a single record and then write it to disk
- * directly through smgr.  Note, however, that any data sent to the new
+ * we insert it to WAL as a single record and then write it to disk with
+ * the bulk smgr writer.  Note, however, that any data sent to the new
  * heap's TOAST table will go through the normal bufmgr.
  *
  *
@@ -119,9 +119,9 @@
 #include "replication/logical.h"
 #include "replication/slot.h"
 #include "storage/bufmgr.h"
+#include "storage/bulk_write.h"
 #include "storage/fd.h"
 #include "storage/procarray.h"
-#include "storage/smgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -133,9 +133,11 @@ typedef struct RewriteStateData
 {
 	Relation	rs_old_rel;		/* source heap */
 	Relation	rs_new_rel;		/* destination heap */
-	Page		rs_buffer;		/* page currently being built */
+
+	BulkWriteState *rs_bulkw;
+
+	BulkWriteBuffer rs_buffer;	/* page currently being built */
 	BlockNumber rs_blockno;		/* block where page will go */
-	bool		rs_buffer_valid;	/* T if any tuples in buffer */
 	bool		rs_logical_rewrite; /* do we need to do logical rewriting */
 	TransactionId rs_oldest_xmin;	/* oldest xmin used by caller to determine
 									 * tuple visibility */
@@ -255,15 +257,16 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
 
 	state->rs_old_rel = old_heap;
 	state->rs_new_rel = new_heap;
-	state->rs_buffer = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
+	state->rs_buffer = NULL;
 	/* new_heap needn't be empty, just locked */
 	state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
-	state->rs_buffer_valid = false;
 	state->rs_oldest_xmin = oldest_xmin;
 	state->rs_freeze_xid = freeze_xid;
 	state->rs_cutoff_multi = cutoff_multi;
 	state->rs_cxt = rw_cxt;
 
+	state->rs_bulkw = bulkw_start_rel(new_heap, MAIN_FORKNUM);
+
 	/* Initialize hash tables used to track update chains */
 	hash_ctl.keysize = sizeof(TidHashKey);
 	hash_ctl.entrysize = sizeof(UnresolvedTupData);
@@ -314,30 +317,13 @@ end_heap_rewrite(RewriteState state)
 	}
 
 	/* Write the last page, if any */
-	if (state->rs_buffer_valid)
+	if (state->rs_buffer)
 	{
-		if (RelationNeedsWAL(state->rs_new_rel))
-			log_newpage(&state->rs_new_rel->rd_locator,
-						MAIN_FORKNUM,
-						state->rs_blockno,
-						state->rs_buffer,
-						true);
-
-		PageSetChecksumInplace(state->rs_buffer, state->rs_blockno);
-
-		smgrextend(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,
-				   state->rs_blockno, state->rs_buffer, true);
+		bulkw_write(state->rs_bulkw, state->rs_blockno, state->rs_buffer, true);
+		state->rs_buffer = NULL;
 	}
 
-	/*
-	 * When we WAL-logged rel pages, we must nonetheless fsync them.  The
-	 * reason is the same as in storage.c's RelationCopyStorage(): we're
-	 * writing data that's not in shared buffers, and so a CHECKPOINT
-	 * occurring during the rewriteheap operation won't have fsync'd data we
-	 * wrote before the checkpoint.
-	 */
-	if (RelationNeedsWAL(state->rs_new_rel))
-		smgrimmedsync(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM);
+	bulkw_finish(state->rs_bulkw);
 
 	logical_end_heap_rewrite(state);
 
@@ -611,7 +597,7 @@ rewrite_heap_dead_tuple(RewriteState state, HeapTuple old_tuple)
 static void
 raw_heap_insert(RewriteState state, HeapTuple tup)
 {
-	Page		page = state->rs_buffer;
+	Page		page;
 	Size		pageFreeSpace,
 				saveFreeSpace;
 	Size		len;
@@ -664,7 +650,8 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
 												   HEAP_DEFAULT_FILLFACTOR);
 
 	/* Now we can check to see if there's enough free space already. */
-	if (state->rs_buffer_valid)
+	page = (Page) state->rs_buffer;
+	if (page)
 	{
 		pageFreeSpace = PageGetHeapFreeSpace(page);
 
@@ -675,35 +662,19 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
 			 * contains a tuple.  Hence, unlike RelationGetBufferForTuple(),
 			 * enforce saveFreeSpace unconditionally.
 			 */
-
-			/* XLOG stuff */
-			if (RelationNeedsWAL(state->rs_new_rel))
-				log_newpage(&state->rs_new_rel->rd_locator,
-							MAIN_FORKNUM,
-							state->rs_blockno,
-							page,
-							true);
-
-			/*
-			 * Now write the page. We say skipFsync = true because there's no
-			 * need for smgr to schedule an fsync for this write; we'll do it
-			 * ourselves in end_heap_rewrite.
-			 */
-			PageSetChecksumInplace(page, state->rs_blockno);
-
-			smgrextend(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,
-					   state->rs_blockno, page, true);
-
+			bulkw_write(state->rs_bulkw, state->rs_blockno, state->rs_buffer, true);
+			state->rs_buffer = NULL;
+			page = NULL;
 			state->rs_blockno++;
-			state->rs_buffer_valid = false;
 		}
 	}
 
-	if (!state->rs_buffer_valid)
+	if (!page)
 	{
 		/* Initialize a new empty page */
+		state->rs_buffer = bulkw_get_buf(state->rs_bulkw);
+		page = (Page) state->rs_buffer;
 		PageInit(page, BLCKSZ, 0);
-		state->rs_buffer_valid = true;
 	}
 
 	/* And now we can insert the tuple into the page */
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index a88b36a589a..38f957e6f4c 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -29,11 +29,11 @@
 #include "nodes/execnodes.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
+#include "storage/bulk_write.h"
 #include "storage/condition_variable.h"
 #include "storage/indexfsm.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
-#include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/index_selfuncs.h"
 #include "utils/memutils.h"
@@ -152,32 +152,17 @@ void
 btbuildempty(Relation index)
 {
 	bool		allequalimage = _bt_allequalimage(index, false);
-	Buffer		metabuf;
-	Page		metapage;
+	BulkWriteState *bulkw;
+	BulkWriteBuffer metabuf;
 
-	/*
-	 * Initalize the metapage.
-	 *
-	 * Regular index build bypasses the buffer manager and uses smgr functions
-	 * directly, with an smgrimmedsync() call at the end.  That makes sense
-	 * when the index is large, but for an empty index, it's better to use the
-	 * buffer cache to avoid the smgrimmedsync().
-	 */
-	metabuf = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
-	Assert(BufferGetBlockNumber(metabuf) == BTREE_METAPAGE);
-	_bt_lockbuf(index, metabuf, BT_WRITE);
-
-	START_CRIT_SECTION();
-
-	metapage = BufferGetPage(metabuf);
-	_bt_initmetapage(metapage, P_NONE, 0, allequalimage);
-	MarkBufferDirty(metabuf);
-	log_newpage_buffer(metabuf, true);
+	bulkw = bulkw_start_rel(index, INIT_FORKNUM);
 
-	END_CRIT_SECTION();
+	/* Construct metapage. */
+	metabuf = bulkw_get_buf(bulkw);
+	_bt_initmetapage((Page) metabuf, P_NONE, 0, allequalimage);
+	bulkw_write(bulkw, BTREE_METAPAGE, metabuf, true);
 
-	_bt_unlockbuf(index, metabuf);
-	ReleaseBuffer(metabuf);
+	bulkw_finish(bulkw);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index c2665fce411..6034d7f61b2 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -23,13 +23,8 @@
  * many upper pages if the keys are reasonable-size) without risking a lot of
  * cascading splits during early insertions.
  *
- * Formerly the index pages being built were kept in shared buffers, but
- * that is of no value (since other backends have no interest in them yet)
- * and it created locking problems for CHECKPOINT, because the upper-level
- * pages were held exclusive-locked for long periods.  Now we just build
- * the pages in local memory and smgrwrite or smgrextend them as we finish
- * them.  They will need to be re-read into shared buffers on first use after
- * the build finishes.
+ * We use the bulk smgr loading facility to bypass the buffer cache and
+ * WAL-log the pages efficiently.
  *
  * This code isn't concerned about the FSM at all. The caller is responsible
  * for initializing that.
@@ -57,7 +52,7 @@
 #include "executor/instrument.h"
 #include "miscadmin.h"
 #include "pgstat.h"
-#include "storage/smgr.h"
+#include "storage/bulk_write.h"
 #include "tcop/tcopprot.h"		/* pgrminclude ignore */
 #include "utils/rel.h"
 #include "utils/sortsupport.h"
@@ -234,7 +229,7 @@ typedef struct BTBuildState
  */
 typedef struct BTPageState
 {
-	Page		btps_page;		/* workspace for page building */
+	BulkWriteBuffer btps_buf;	/* workspace for page building */
 	BlockNumber btps_blkno;		/* block # to write this page at */
 	IndexTuple	btps_lowkey;	/* page's strict lower bound pivot tuple */
 	OffsetNumber btps_lastoff;	/* last item offset loaded */
@@ -251,11 +246,9 @@ typedef struct BTWriteState
 {
 	Relation	heap;
 	Relation	index;
+	BulkWriteState *bulkw;
 	BTScanInsert inskey;		/* generic insertion scankey */
-	bool		btws_use_wal;	/* dump pages to WAL? */
 	BlockNumber btws_pages_alloced; /* # pages allocated */
-	BlockNumber btws_pages_written; /* # pages written out */
-	Page		btws_zeropage;	/* workspace for filling zeroes */
 } BTWriteState;
 
 
@@ -267,7 +260,7 @@ static void _bt_spool(BTSpool *btspool, ItemPointer self,
 static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
-static Page _bt_blnewpage(uint32 level);
+static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
 static BTPageState *_bt_pagestate(BTWriteState *wstate, uint32 level);
 static void _bt_slideleft(Page rightmostpage);
 static void _bt_sortaddtup(Page page, Size itemsize,
@@ -569,16 +562,17 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 	/* _bt_mkscankey() won't set allequalimage without metapage */
 	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
-	wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
 
 	/* reserve the metapage */
 	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
-	wstate.btws_pages_written = 0;
-	wstate.btws_zeropage = NULL;	/* until needed */
+
+	wstate.bulkw = bulkw_start_rel(wstate.index, MAIN_FORKNUM);
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
 	_bt_load(&wstate, btspool, btspool2);
+
+	bulkw_finish(wstate.bulkw);
 }
 
 /*
@@ -613,13 +607,15 @@ _bt_build_callback(Relation index,
 /*
  * allocate workspace for a new, clean btree page, not linked to any siblings.
  */
-static Page
-_bt_blnewpage(uint32 level)
+static BulkWriteBuffer
+_bt_blnewpage(BTWriteState *wstate, uint32 level)
 {
+	BulkWriteBuffer buf;
 	Page		page;
 	BTPageOpaque opaque;
 
-	page = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
+	buf = bulkw_get_buf(wstate->bulkw);
+	page = (Page) buf;
 
 	/* Zero the page and set up standard page header info */
 	_bt_pageinit(page, BLCKSZ);
@@ -634,63 +630,17 @@ _bt_blnewpage(uint32 level)
 	/* Make the P_HIKEY line pointer appear allocated */
 	((PageHeader) page)->pd_lower += sizeof(ItemIdData);
 
-	return page;
+	return buf;
 }
 
 /*
  * emit a completed btree page, and release the working storage.
  */
 static void
-_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
+_bt_blwritepage(BTWriteState *wstate, BulkWriteBuffer buf, BlockNumber blkno)
 {
-	/* XLOG stuff */
-	if (wstate->btws_use_wal)
-	{
-		/* We use the XLOG_FPI record type for this */
-		log_newpage(&wstate->index->rd_locator, MAIN_FORKNUM, blkno, page, true);
-	}
-
-	/*
-	 * If we have to write pages nonsequentially, fill in the space with
-	 * zeroes until we come back and overwrite.  This is not logically
-	 * necessary on standard Unix filesystems (unwritten space will read as
-	 * zeroes anyway), but it should help to avoid fragmentation. The dummy
-	 * pages aren't WAL-logged though.
-	 */
-	while (blkno > wstate->btws_pages_written)
-	{
-		if (!wstate->btws_zeropage)
-			wstate->btws_zeropage = (Page) palloc_aligned(BLCKSZ,
-														  PG_IO_ALIGN_SIZE,
-														  MCXT_ALLOC_ZERO);
-		/* don't set checksum for all-zero page */
-		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
-				   wstate->btws_pages_written++,
-				   wstate->btws_zeropage,
-				   true);
-	}
-
-	PageSetChecksumInplace(page, blkno);
-
-	/*
-	 * Now write the page.  There's no need for smgr to schedule an fsync for
-	 * this write; we'll do it ourselves before ending the build.
-	 */
-	if (blkno == wstate->btws_pages_written)
-	{
-		/* extending the file... */
-		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM, blkno,
-				   page, true);
-		wstate->btws_pages_written++;
-	}
-	else
-	{
-		/* overwriting a block we zero-filled before */
-		smgrwrite(RelationGetSmgr(wstate->index), MAIN_FORKNUM, blkno,
-				  page, true);
-	}
-
-	pfree(page);
+	bulkw_write(wstate->bulkw, blkno, buf, true);
+	/* bulkw_write took ownership of 'buf' */
 }
 
 /*
@@ -703,7 +653,7 @@ _bt_pagestate(BTWriteState *wstate, uint32 level)
 	BTPageState *state = (BTPageState *) palloc0(sizeof(BTPageState));
 
 	/* create initial page for level */
-	state->btps_page = _bt_blnewpage(level);
+	state->btps_buf = _bt_blnewpage(wstate, level);
 
 	/* and assign it a page position */
 	state->btps_blkno = wstate->btws_pages_alloced++;
@@ -839,6 +789,7 @@ static void
 _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 			 Size truncextra)
 {
+	BulkWriteBuffer nbuf;
 	Page		npage;
 	BlockNumber nblkno;
 	OffsetNumber last_off;
@@ -853,7 +804,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 	 */
 	CHECK_FOR_INTERRUPTS();
 
-	npage = state->btps_page;
+	nbuf = state->btps_buf;
+	npage = (Page) nbuf;
 	nblkno = state->btps_blkno;
 	last_off = state->btps_lastoff;
 	last_truncextra = state->btps_lastextra;
@@ -909,6 +861,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 		/*
 		 * Finish off the page and write it out.
 		 */
+		BulkWriteBuffer obuf = nbuf;
 		Page		opage = npage;
 		BlockNumber oblkno = nblkno;
 		ItemId		ii;
@@ -916,7 +869,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 		IndexTuple	oitup;
 
 		/* Create new page of same level */
-		npage = _bt_blnewpage(state->btps_level);
+		nbuf = _bt_blnewpage(wstate, state->btps_level);
+		npage = (Page) nbuf;
 
 		/* and assign it a page position */
 		nblkno = wstate->btws_pages_alloced++;
@@ -1028,10 +982,10 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 		}
 
 		/*
-		 * Write out the old page.  We never need to touch it again, so we can
-		 * free the opage workspace too.
+		 * Write out the old page. _bt_blwritepage takes ownership of the
+		 * 'opage' buffer.
 		 */
-		_bt_blwritepage(wstate, opage, oblkno);
+		_bt_blwritepage(wstate, obuf, oblkno);
 
 		/*
 		 * Reset last_off to point to new page
@@ -1064,7 +1018,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 	_bt_sortaddtup(npage, itupsz, itup, last_off,
 				   !isleaf && last_off == P_FIRSTKEY);
 
-	state->btps_page = npage;
+	state->btps_buf = nbuf;
 	state->btps_blkno = nblkno;
 	state->btps_lastoff = last_off;
 }
@@ -1116,7 +1070,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	BTPageState *s;
 	BlockNumber rootblkno = P_NONE;
 	uint32		rootlevel = 0;
-	Page		metapage;
+	BulkWriteBuffer metabuf;
 
 	/*
 	 * Each iteration of this loop completes one more level of the tree.
@@ -1127,7 +1081,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		BTPageOpaque opaque;
 
 		blkno = s->btps_blkno;
-		opaque = BTPageGetOpaque(s->btps_page);
+		opaque = BTPageGetOpaque((Page) s->btps_buf);
 
 		/*
 		 * We have to link the last page on this level to somewhere.
@@ -1161,9 +1115,9 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		 * This is the rightmost page, so the ItemId array needs to be slid
 		 * back one slot.  Then we can dump out the page.
 		 */
-		_bt_slideleft(s->btps_page);
-		_bt_blwritepage(wstate, s->btps_page, s->btps_blkno);
-		s->btps_page = NULL;	/* writepage freed the workspace */
+		_bt_slideleft((Page) s->btps_buf);
+		_bt_blwritepage(wstate, s->btps_buf, s->btps_blkno);
+		s->btps_buf = NULL;		/* writepage took ownership of the buffer */
 	}
 
 	/*
@@ -1172,10 +1126,10 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	 * set to point to "P_NONE").  This changes the index to the "valid" state
 	 * by filling in a valid magic number in the metapage.
 	 */
-	metapage = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
-	_bt_initmetapage(metapage, rootblkno, rootlevel,
+	metabuf = bulkw_get_buf(wstate->bulkw);
+	_bt_initmetapage((Page) metabuf, rootblkno, rootlevel,
 					 wstate->inskey->allequalimage);
-	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
+	_bt_blwritepage(wstate, metabuf, BTREE_METAPAGE);
 }
 
 /*
@@ -1422,18 +1376,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 
 	/* Close down final pages and write the metapage */
 	_bt_uppershutdown(wstate, state);
-
-	/*
-	 * When we WAL-logged index pages, we must nonetheless fsync index files.
-	 * Since we're building outside shared buffers, a CHECKPOINT occurring
-	 * during the build has no way to flush the previously written data to
-	 * disk (indeed it won't know the index even exists).  A crash later on
-	 * would replay WAL from the checkpoint, therefore it wouldn't replay our
-	 * earlier WAL entries. If we do not fsync those pages here, they might
-	 * still not be on disk when the crash occurs.
-	 */
-	if (wstate->btws_use_wal)
-		smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
 }
 
 /*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 4443f1918df..31b9d5244db 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -25,7 +25,7 @@
 #include "catalog/index.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
-#include "storage/smgr.h"
+#include "storage/bulk_write.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -155,42 +155,27 @@ spgbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 void
 spgbuildempty(Relation index)
 {
-	Buffer		metabuffer,
-				rootbuffer,
-				nullbuffer;
-
-	/*
-	 * Initialize the meta page and root pages
-	 */
-	metabuffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
-	LockBuffer(metabuffer, BUFFER_LOCK_EXCLUSIVE);
-	rootbuffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
-	LockBuffer(rootbuffer, BUFFER_LOCK_EXCLUSIVE);
-	nullbuffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
-	LockBuffer(nullbuffer, BUFFER_LOCK_EXCLUSIVE);
-
-	Assert(BufferGetBlockNumber(metabuffer) == SPGIST_METAPAGE_BLKNO);
-	Assert(BufferGetBlockNumber(rootbuffer) == SPGIST_ROOT_BLKNO);
-	Assert(BufferGetBlockNumber(nullbuffer) == SPGIST_NULL_BLKNO);
+	BulkWriteState *bulkw;
+	BulkWriteBuffer buf;
 
-	START_CRIT_SECTION();
+	bulkw = bulkw_start_rel(index, INIT_FORKNUM);
 
-	SpGistInitMetapage(BufferGetPage(metabuffer));
-	MarkBufferDirty(metabuffer);
-	SpGistInitBuffer(rootbuffer, SPGIST_LEAF);
-	MarkBufferDirty(rootbuffer);
-	SpGistInitBuffer(nullbuffer, SPGIST_LEAF | SPGIST_NULLS);
-	MarkBufferDirty(nullbuffer);
+	/* Construct metapage. */
+	buf = bulkw_get_buf(bulkw);
+	SpGistInitMetapage((Page) buf);
+	bulkw_write(bulkw, SPGIST_METAPAGE_BLKNO, buf, true);
 
-	log_newpage_buffer(metabuffer, true);
-	log_newpage_buffer(rootbuffer, true);
-	log_newpage_buffer(nullbuffer, true);
+	/* Likewise for the root page. */
+	buf = bulkw_get_buf(bulkw);
+	SpGistInitPage((Page) buf, SPGIST_LEAF);
+	bulkw_write(bulkw, SPGIST_ROOT_BLKNO, buf, true);
 
-	END_CRIT_SECTION();
+	/* Likewise for the null-tuples root page. */
+	buf = bulkw_get_buf(bulkw);
+	SpGistInitPage((Page) buf, SPGIST_LEAF | SPGIST_NULLS);
+	bulkw_write(bulkw, SPGIST_NULL_BLKNO, buf, true);
 
-	UnlockReleaseBuffer(metabuffer);
-	UnlockReleaseBuffer(rootbuffer);
-	UnlockReleaseBuffer(nullbuffer);
+	bulkw_finish(bulkw);
 }
 
 /*
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 93f07e49b72..c2d5b3ecb28 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -28,6 +28,7 @@
 #include "catalog/storage.h"
 #include "catalog/storage_xlog.h"
 #include "miscadmin.h"
+#include "storage/bulk_write.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
 #include "utils/hsearch.h"
@@ -451,14 +452,11 @@ void
 RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 					ForkNumber forkNum, char relpersistence)
 {
-	PGIOAlignedBlock buf;
-	Page		page;
 	bool		use_wal;
 	bool		copying_initfork;
 	BlockNumber nblocks;
 	BlockNumber blkno;
-
-	page = (Page) buf.data;
+	BulkWriteState *bulkw;
 
 	/*
 	 * The init fork for an unlogged relation in many respects has to be
@@ -477,16 +475,21 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 	use_wal = XLogIsNeeded() &&
 		(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
 
+	bulkw = bulkw_start_smgr(dst, forkNum, use_wal);
+
 	nblocks = smgrnblocks(src, forkNum);
 
 	for (blkno = 0; blkno < nblocks; blkno++)
 	{
+		BulkWriteBuffer buf;
+
 		/* If we got a cancel signal during the copy of the data, quit */
 		CHECK_FOR_INTERRUPTS();
 
-		smgrread(src, forkNum, blkno, buf.data);
+		buf = bulkw_get_buf(bulkw);
+		smgrread(src, forkNum, blkno, (Page) buf);
 
-		if (!PageIsVerifiedExtended(page, blkno,
+		if (!PageIsVerifiedExtended((Page) buf, blkno,
 									PIV_LOG_WARNING | PIV_REPORT_STAT))
 		{
 			/*
@@ -507,34 +510,13 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 		}
 
 		/*
-		 * WAL-log the copied page. Unfortunately we don't know what kind of a
-		 * page this is, so we have to log the full page including any unused
-		 * space.
-		 */
-		if (use_wal)
-			log_newpage(&dst->smgr_rlocator.locator, forkNum, blkno, page, false);
-
-		PageSetChecksumInplace(page, blkno);
-
-		/*
-		 * Now write the page.  We say skipFsync = true because there's no
-		 * need for smgr to schedule an fsync for this write; we'll do it
-		 * ourselves below.
+		 * Queue the page for WAL-logging and writing out.  Unfortunately we
+		 * don't know what kind of a page this is, so we have to log the full
+		 * page including any unused space.
 		 */
-		smgrextend(dst, forkNum, blkno, buf.data, true);
+		bulkw_write(bulkw, blkno, buf, false);
 	}
-
-	/*
-	 * When we WAL-logged rel pages, we must nonetheless fsync them.  The
-	 * reason is that since we're copying outside shared buffers, a CHECKPOINT
-	 * occurring during the copy has no way to flush the previously written
-	 * data to disk (indeed it won't know the new rel even exists).  A crash
-	 * later on would replay WAL from the checkpoint, therefore it wouldn't
-	 * replay our earlier WAL entries. If we do not fsync those pages here,
-	 * they might still not be on disk when the crash occurs.
-	 */
-	if (use_wal || copying_initfork)
-		smgrimmedsync(dst, forkNum);
+	bulkw_finish(bulkw);
 }
 
 /*
diff --git a/src/backend/storage/smgr/Makefile b/src/backend/storage/smgr/Makefile
index 596b564656f..1d0b98764f9 100644
--- a/src/backend/storage/smgr/Makefile
+++ b/src/backend/storage/smgr/Makefile
@@ -13,6 +13,7 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = \
+	bulk_write.o \
 	md.o \
 	smgr.o
 
diff --git a/src/backend/storage/smgr/bulk_write.c b/src/backend/storage/smgr/bulk_write.c
new file mode 100644
index 00000000000..1913c39bf21
--- /dev/null
+++ b/src/backend/storage/smgr/bulk_write.c
@@ -0,0 +1,301 @@
+/*-------------------------------------------------------------------------
+ *
+ * bulk_write.c
+ *	  Efficiently and reliably populate a new relation
+ *
+ * The assumption is that no other backends access the relation while we are
+ * loading it, so we can take some shortcuts.  Do not mix operations through
+ * the regular buffer manager and the bulk loading interface!
+ *
+ * We bypass the buffer manager to avoid the locking overhead, and call
+ * smgrextend() directly.  A downside is that the pages will need to be
+ * re-read into shared buffers on first use after the build finishes.  That's
+ * usually a good tradeoff for large relations, and for small relations, the
+ * overhead isn't very significant compared to creating the relation in the
+ * first place.
+ *
+ * The pages are WAL-logged if needed.  To save on WAL header overhead, we
+ * WAL-log several pages in one record.
+ *
+ * One tricky point is that because we bypass the buffer manager, we need to
+ * register the relation for fsyncing at the next checkpoint ourselves, and
+ * make sure that the relation is correctly fsync'd by us or the checkpointer
+ * even if a checkpoint happens concurrently.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/smgr/bulk_write.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xloginsert.h"
+#include "access/xlogrecord.h"
+#include "storage/bufmgr.h"
+#include "storage/bufpage.h"
+#include "storage/bulk_write.h"
+#include "storage/proc.h"
+#include "storage/smgr.h"
+#include "utils/rel.h"
+
+#define MAX_PENDING_WRITES XLR_MAX_BLOCK_ID
+
+static const PGIOAlignedBlock zero_buffer = {{0}};	/* worth BLCKSZ */
+
+typedef struct PendingWrite
+{
+	BulkWriteBuffer buf;
+	BlockNumber blkno;
+	bool		page_std;
+} PendingWrite;
+
+/*
+ * Bulk writer state for one relation fork.
+ */
+typedef struct BulkWriteState
+{
+	/* Information about the target relation we're writing */
+	/*
+	 * FIXME: 'smgr' might get invalidated. Hopefully
+	 * https://postgr.es/m/CA%2BhUKGJ8NTvqLHz6dqbQnt2c8XCki4r2QvXjBQcXpVwxTY_pvA%40mail.gmail.com
+	 * is merged before this.
+	 */
+	SMgrRelation smgr;
+	ForkNumber	forknum;
+	bool		use_wal;
+
+	/* We keep several writes queued, and WAL-log them in batches */
+	int			npending;
+	PendingWrite pending_writes[MAX_PENDING_WRITES];
+
+	/* Current size of the relation */
+	BlockNumber pages_written;
+
+	/* The RedoRecPtr at the time that the bulk operation started */
+	XLogRecPtr	start_RedoRecPtr;
+
+	MemoryContext memcxt;
+} BulkWriteState;
+
+static void bulkw_flush(BulkWriteState *bulkw);
+
+/*
+ * Start a bulk write operation on a relation fork.
+ */
+BulkWriteState *
+bulkw_start_rel(Relation rel, ForkNumber forknum)
+{
+	return bulkw_start_smgr(RelationGetSmgr(rel),
+							forknum,
+							RelationNeedsWAL(rel) || forknum == INIT_FORKNUM);
+}
+
+/*
+ * Start a bulk write operation on a relation fork.
+ *
+ * This is like bulkw_start_rel, but can be used without a relcache entry.
+ */
+BulkWriteState *
+bulkw_start_smgr(SMgrRelation smgr, ForkNumber forknum, bool use_wal)
+{
+	BulkWriteState *bulkw;
+
+	bulkw = palloc(sizeof(BulkWriteState));
+	bulkw->smgr = smgr;
+	bulkw->forknum = forknum;
+	bulkw->use_wal = use_wal;
+
+	bulkw->npending = 0;
+	bulkw->pages_written = 0;
+
+	bulkw->start_RedoRecPtr = GetRedoRecPtr();
+
+	/*
+	 * Remember the memory context.  We will use it to allocate all the
+	 * buffers later.
+	 */
+	bulkw->memcxt = CurrentMemoryContext;
+
+	return bulkw;
+}
+
+/*
+ * Finish bulk write operation.
+ *
+ * This WAL-logs and flushes any remaining pending writes to disk, and fsyncs
+ * the relation if needed.
+ */
+void
+bulkw_finish(BulkWriteState *bulkw)
+{
+	/* WAL-log and flush any remaining pages */
+	bulkw_flush(bulkw);
+
+	/*
+	 * When we wrote out the pages, we passed skipFsync=true to avoid the
+	 * overhead of registering all the writes with the checkpointer.  Register
+	 * the whole relation now.
+	 *
+	 * There is one hole in that idea: If a checkpoint occurred while we were
+	 * writing the pages, it already missed fsyncing the pages we had written
+	 * before the checkpoint started.  A crash later on would replay the WAL
+	 * starting from the checkpoint, therefore it wouldn't replay our earlier
+	 * WAL records.  So if a checkpoint started after the bulk write, fsync
+	 * the files now.
+	 */
+	if (!SmgrIsTemp(bulkw->smgr))
+	{
+		/*
+		 * Prevent a checkpoint from starting between the GetRedoRecPtr() and
+		 * smgrregistersync() calls.
+		 */
+		Assert((MyProc->delayChkptFlags & DELAY_CHKPT_START) == 0);
+		MyProc->delayChkptFlags |= DELAY_CHKPT_START;
+
+		if (bulkw->start_RedoRecPtr != GetRedoRecPtr())
+		{
+			/*
+			 * A checkpoint occurred and it didn't know about our writes, so
+			 * fsync() the relation ourselves.
+			 */
+			MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
+			smgrimmedsync(bulkw->smgr, bulkw->forknum);
+			elog(DEBUG1, "flushed relation because a checkpoint occurred concurrently");
+		}
+		else
+		{
+			smgrregistersync(bulkw->smgr, bulkw->forknum);
+			MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
+		}
+	}
+}
+
+static int
+buffer_cmp(const void *a, const void *b)
+{
+	const PendingWrite *bufa = (const PendingWrite *) a;
+	const PendingWrite *bufb = (const PendingWrite *) b;
+
+	/* We should not see duplicated writes for the same block */
+	Assert(bufa->blkno != bufb->blkno);
+	if (bufa->blkno > bufb->blkno)
+		return 1;
+	else
+		return -1;
+}
+
+/*
+ * Finish all the pending writes.
+ */
+static void
+bulkw_flush(BulkWriteState *bulkw)
+{
+	int			npending = bulkw->npending;
+	PendingWrite *pending_writes = bulkw->pending_writes;
+
+	if (npending == 0)
+		return;
+
+	if (npending > 1)
+		qsort(pending_writes, npending, sizeof(PendingWrite), buffer_cmp);
+
+	if (bulkw->use_wal)
+	{
+		BlockNumber blknos[MAX_PENDING_WRITES];
+		Page		pages[MAX_PENDING_WRITES];
+		bool		page_std = true;
+
+		for (int i = 0; i < npending; i++)
+		{
+			blknos[i] = pending_writes[i].blkno;
+			pages[i] = pending_writes[i].buf->data;
+
+			/*
+			 * If any of the pages use !page_std, we log them all as such.
+			 * That's a bit wasteful, but in practice, a mix of standard and
+			 * non-standard page layout is rare.  None of the built-in AMs do
+			 * that.
+			 */
+			if (!pending_writes[i].page_std)
+				page_std = false;
+		}
+		log_newpages(&bulkw->smgr->smgr_rlocator.locator, bulkw->forknum,
+					 npending, blknos, pages, page_std);
+	}
+
+	for (int i = 0; i < npending; i++)
+	{
+		BlockNumber blkno = pending_writes[i].blkno;
+		Page		page = pending_writes[i].buf->data;
+
+		PageSetChecksumInplace(page, blkno);
+
+		if (blkno >= bulkw->pages_written)
+		{
+			/*
+			 * If we have to write pages nonsequentially, fill in the space
+			 * with zeroes until we come back and overwrite.  This is not
+			 * logically necessary on standard Unix filesystems (unwritten
+			 * space will read as zeroes anyway), but it should help to avoid
+			 * fragmentation.  The dummy pages aren't WAL-logged though.
+			 */
+			while (blkno > bulkw->pages_written)
+			{
+				/* don't set checksum for all-zero page */
+				smgrextend(bulkw->smgr, bulkw->forknum,
+						   bulkw->pages_written++,
+						   &zero_buffer,
+						   true);
+			}
+
+			smgrextend(bulkw->smgr, bulkw->forknum, blkno, page, true);
+			bulkw->pages_written = pending_writes[i].blkno + 1;
+		}
+		else
+			smgrwrite(bulkw->smgr, bulkw->forknum, blkno, page, true);
+		pfree(page);
+	}
+
+	bulkw->npending = 0;
+}
+
+/*
+ * Queue write of 'buf'.
+ *
+ * NB: this takes ownership of 'buf'!
+ *
+ * You are only allowed to write a given block once as part of one bulk write
+ * operation.
+ */
+void
+bulkw_write(BulkWriteState *bulkw, BlockNumber blocknum, BulkWriteBuffer buf, bool page_std)
+{
+	bulkw->pending_writes[bulkw->npending].buf = buf;
+	bulkw->pending_writes[bulkw->npending].blkno = blocknum;
+	bulkw->pending_writes[bulkw->npending].page_std = page_std;
+
+	bulkw->npending++;
+
+	if (bulkw->npending == MAX_PENDING_WRITES)
+		bulkw_flush(bulkw);
+}
+
+/*
+ * Allocate a new buffer which can later be written with bulkw_write().
+ *
+ * There is no function to free a buffer.  When you pass it to bulkw_write(),
+ * it takes ownership and frees it when it's no longer needed.
+ *
+ * This is currently implemented as a simple palloc, but could be implemented
+ * using a ring buffer or larger chunks in the future, so don't rely on it.
+ */
+BulkWriteBuffer
+bulkw_get_buf(BulkWriteState *bulkw)
+{
+	return MemoryContextAllocAligned(bulkw->memcxt, BLCKSZ, PG_IO_ALIGN_SIZE, 0);
+}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index fdecbad1709..343ee51048e 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1082,6 +1082,49 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
 	}
 }
 
+/*
+ * mdregistersync() -- Mark whole relation as needing fsync
+ */
+void
+mdregistersync(SMgrRelation reln, ForkNumber forknum)
+{
+	int			segno;
+	int			min_inactive_seg;
+
+	/*
+	 * NOTE: mdnblocks makes sure we have opened all active segments, so that
+	 * the loop below will get them all!
+	 */
+	mdnblocks(reln, forknum);
+
+	min_inactive_seg = segno = reln->md_num_open_segs[forknum];
+
+	/*
+	 * Temporarily open inactive segments, then close them after sync.  There
+	 * may be some inactive segments left opened after error, but that is
+	 * harmless.  We don't bother to clean them up and take a risk of further
+	 * trouble.  The next mdclose() will soon close them.
+	 */
+	while (_mdfd_openseg(reln, forknum, segno, 0) != NULL)
+		segno++;
+
+	while (segno > 0)
+	{
+		MdfdVec    *v = &reln->md_seg_fds[forknum][segno - 1];
+
+		register_dirty_segment(reln, forknum, v);
+
+		/* Close inactive segments immediately */
+		if (segno > min_inactive_seg)
+		{
+			FileClose(v->mdfd_vfd);
+			_fdvec_resize(reln, forknum, segno - 1);
+		}
+
+		segno--;
+	}
+}
+
 /*
  * mdimmedsync() -- Immediately sync a relation to stable storage.
  *
diff --git a/src/backend/storage/smgr/meson.build b/src/backend/storage/smgr/meson.build
index e1ba6ed74b8..133622a6528 100644
--- a/src/backend/storage/smgr/meson.build
+++ b/src/backend/storage/smgr/meson.build
@@ -1,6 +1,7 @@
 # Copyright (c) 2022-2023, PostgreSQL Global Development Group
 
 backend_sources += files(
+  'bulk_write.c',
   'md.c',
   'smgr.c',
 )
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 5d0f3d515c3..9f7405e3c88 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -65,6 +65,7 @@ typedef struct f_smgr
 	void		(*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
 								  BlockNumber nblocks);
 	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+	void		(*smgr_registersync) (SMgrRelation reln, ForkNumber forknum);
 } f_smgr;
 
 static const f_smgr smgrsw[] = {
@@ -86,6 +87,7 @@ static const f_smgr smgrsw[] = {
 		.smgr_nblocks = mdnblocks,
 		.smgr_truncate = mdtruncate,
 		.smgr_immedsync = mdimmedsync,
+		.smgr_registersync = mdregistersync,
 	}
 };
 
@@ -576,6 +578,14 @@ smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
  * on disk at return, only dumped out to the kernel.  However,
  * provisions will be made to fsync the write before the next checkpoint.
  *
+ * NB: The mechanism to ensure fsync at next checkpoint assumes that there is
+ * something that prevents a concurrent checkpoint from "racing ahead" of the
+ * write.  One way to prevent that is by holding a lock on the buffer; the
+ * buffer manager's writes are protected by that.  The bulk writer facility in
+ * bulk_write.c checks the redo pointer and calls smgrimmedsync() if a
+ * checkpoint happened; that relies on the fact that no other backend can be
+ * concurrently modifying the page.
+ *
  * skipFsync indicates that the caller will make other provisions to
  * fsync the relation, so we needn't bother.  Temporary relations also
  * do not require fsync.
@@ -694,6 +704,24 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks, BlockNumber *nb
 	}
 }
 
+/*
+ * smgrregistersync() -- Request a relation to be sync'd at next checkpoint
+ *
+ * This can be used after calling smgrwrite() or smgrextend() with skipFsync =
+ * true, to register the fsyncs that were skipped earlier.
+ *
+ * Note: be mindful that a checkpoint could already have happened between the
+ * smgrwrite or smgrextend calls and this!  In that case, the checkpoint
+ * already missed fsyncing this relation, and you should use smgrimmedsync
+ * instead.  Most callers should use the bulk loading facility in bulk_write.c
+ * instead, which handles all that.
+ */
+void
+smgrregistersync(SMgrRelation reln, ForkNumber forknum)
+{
+	smgrsw[reln->smgr_which].smgr_registersync(reln, forknum);
+}
+
 /*
  * smgrimmedsync() -- Force the specified relation to stable storage.
  *
@@ -716,6 +744,9 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks, BlockNumber *nb
  * Note that you need to do FlushRelationBuffers() first if there is
  * any possibility that there are dirty buffers for the relation;
  * otherwise the sync is not very meaningful.
+ *
+ * Most callers should use the bulk loading facility in bulk_write.c
+ * instead of calling this directly.
  */
 void
 smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
diff --git a/src/include/storage/bulk_write.h b/src/include/storage/bulk_write.h
new file mode 100644
index 00000000000..4defaf20125
--- /dev/null
+++ b/src/include/storage/bulk_write.h
@@ -0,0 +1,38 @@
+/*-------------------------------------------------------------------------
+ *
+ * bulk_write.h
+ *	  Efficiently and reliably populate a new relation
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/bulk_write.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BULK_WRITE_H
+#define BULK_WRITE_H
+
+typedef struct BulkWriteState BulkWriteState;
+
+/*
+ * Temporary buffer to hold a page until it's written out. Use
+ * bulkw_get_buf() to reserve one of these.  This is a separate typedef to
+ * distinguish it from other block-sized buffers passed around in the
+ * system.
+ */
+typedef PGIOAlignedBlock *BulkWriteBuffer;
+
+/* forward declared from smgr.h */
+struct SMgrRelationData;
+
+extern BulkWriteState *bulkw_start_rel(Relation rel, ForkNumber forknum);
+extern BulkWriteState *bulkw_start_smgr(struct SMgrRelationData *smgr, ForkNumber forknum, bool use_wal);
+
+extern BulkWriteBuffer bulkw_get_buf(BulkWriteState *bulkw);
+extern void bulkw_write(BulkWriteState *bulkw, BlockNumber blocknum, BulkWriteBuffer buf, bool page_std);
+
+extern void bulkw_finish(BulkWriteState *bulkw);
+
+#endif							/* BULK_WRITE_H */
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 941879ee6a8..225701271d2 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -42,6 +42,7 @@ extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
 extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
 					   BlockNumber nblocks);
 extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void mdregistersync(SMgrRelation reln, ForkNumber forknum);
 
 extern void ForgetDatabaseSyncRequests(Oid dbid);
 extern void DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a9a179aabac..cc5a91dc624 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -107,6 +107,7 @@ extern BlockNumber smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum);
 extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum,
 						 int nforks, BlockNumber *nblocks);
 extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void smgrregistersync(SMgrRelation reln, ForkNumber forknum);
 extern void AtEOXact_SMgr(void);
 extern bool ProcessBarrierSmgrRelease(void);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index dba3498a13e..85b8491fd49 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -327,6 +327,8 @@ BuildAccumulator
 BuiltinScript
 BulkInsertState
 BulkInsertStateData
+BulkWriteBuffer
+BulkWriteState
 CACHESIGN
 CAC_state
 CCFastEqualFN
@@ -2005,6 +2007,7 @@ PendingFsyncEntry
 PendingRelDelete
 PendingRelSync
 PendingUnlinkEntry
+PendingWrite
 PendingWriteback
 PerLockTagEntry
 PerlInterpreter
-- 
2.39.2

#5Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#4)
Re: Relation bulk write facility

Melanie just reminded me about an older thread about this same thing:
/messages/by-id/CAAKRu_ZQEpk6Q1WtNLgfXBdCmdU5xN_w0boVO6faO_Ax+ckjig@mail.gmail.com.
I had completely forgotten about that.

Melanie's patches in that thread implemented the same optimization of
avoiding the fsync() if no checkpoint has happened during the index
build. My patch here also implements batching the WAL records of
multiple blocks, which was not part of those older patches. OTOH, those
patches included an additional optimization of not bypassing the shared
buffer cache if the index is small. That seems sensible too.

In this new patch, I subconsciously implemented an API close to what I
suggested at the end of that old thread.

So I'd like to continue this effort based on this new patch. We can add
the bypass-buffer-cache optimization later on top of this. With the new
API that this introduces, it should be an isolated change to the
implementation, with no changes required to the callers.

--
Heikki Linnakangas
Neon (https://neon.tech)

#6vignesh C
vignesh21@gmail.com
In reply to: Heikki Linnakangas (#4)
Re: Relation bulk write facility

On Sat, 25 Nov 2023 at 06:49, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 19/11/2023 02:04, Andres Freund wrote:

On 2023-11-17 11:37:21 +0100, Heikki Linnakangas wrote:

The new facility makes it easier to optimize bulk loading, as the
logic for buffering, WAL-logging, and syncing the relation only needs
to be implemented once. It's also less error-prone: We have had a
number of bugs in how a relation is fsync'd - or not - at the end of a
bulk loading operation. By centralizing that logic to one place, we
only need to write it correctly once.

One thing I'd like to use the centralized handling for is to track such
writes in pg_stat_io. I don't mean as part of the initial patch, just that
that's another reason I like the facility.

Oh I didn't realize they're not counted at the moment.

+    bulkw = bulkw_start_smgr(dst, forkNum, use_wal);
+
nblocks = smgrnblocks(src, forkNum);

for (blkno = 0; blkno < nblocks; blkno++)
{
+ Page page;
+
/* If we got a cancel signal during the copy of the data, quit */
CHECK_FOR_INTERRUPTS();

-            smgrread(src, forkNum, blkno, buf.data);
+            page = bulkw_alloc_buf(bulkw);
+            smgrread(src, forkNum, blkno, page);
if (!PageIsVerifiedExtended(page, blkno,
PIV_LOG_WARNING | PIV_REPORT_STAT))
@@ -511,30 +514,9 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
* page this is, so we have to log the full page including any unused
* space.
*/
-            if (use_wal)
-                    log_newpage(&dst->smgr_rlocator.locator, forkNum, blkno, page, false);
-
-            PageSetChecksumInplace(page, blkno);
-
-            /*
-             * Now write the page.  We say skipFsync = true because there's no
-             * need for smgr to schedule an fsync for this write; we'll do it
-             * ourselves below.
-             */
-            smgrextend(dst, forkNum, blkno, buf.data, true);
+            bulkw_write(bulkw, blkno, page, false);

I wonder if bulkw_alloc_buf() is a good name - if you naively read this
change, it looks like it'll just leak memory. It also might be taken to be
valid until freed, which I don't think is the case?

Yeah, I'm not very happy with this interface. The model is that you get
a buffer to write to by calling bulkw_alloc_buf(). Later, you hand it
over to bulkw_write(), which takes ownership of it and frees it later.
There is no other function to free it, although currently the buffer is
just palloc'd so you could call pfree on it.

However, I'd like to not expose that detail to the callers. I'm
imagining that in the future we might optimize further, by having a
larger buffer, e.g. 1 MB, and carving the 8kB blocks from that. Then
opportunistically, if you fill the buffers sequentially, bulk_write.c
could do one smgrextend() to write the whole 1 MB chunk.
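
(A purely hypothetical sketch of that direction, just to make the idea
concrete; none of these names exist in the patch. The allocator would hand
out BLCKSZ slices of a larger chunk, and a sequentially filled chunk could
then be written back with a single smgrextend() call.)

#define CHUNK_NPAGES	128			/* e.g. 1 MB worth of 8kB pages (assumed) */

typedef struct BulkChunk
{
	int			npages_used;		/* pages handed out so far */
	char	   *data;				/* CHUNK_NPAGES * BLCKSZ, I/O-aligned */
} BulkChunk;

static Page
chunk_get_page(BulkChunk *chunk)
{
	if (chunk->npages_used == CHUNK_NPAGES)
		return NULL;				/* caller flushes and starts a new chunk */
	return (Page) (chunk->data + chunk->npages_used++ * BLCKSZ);
}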

I renamed it to bulkw_get_buf() now, and made it return a new
BulkWriteBuffer typedef instead of a plain Page. The point of the new
typedef is to distinguish a buffer returned by bulkw_get_buf() from a
Page or char[BLCKSZ] that you might palloc on your own. That indeed
revealed some latent bugs in gistbuild.c where I had mixed up buffers
returned by bulkw_alloc_buf() and palloc'd buffers.

(The previous version of this patch called a different struct
BulkWriteBuffer, but I renamed that to PendingWrite; see below. Don't be
confused!)

I think this helps a little, but I'm still not very happy with it. I'll
give it some more thought after sleeping over it, but in the meanwhile,
I'm all ears for suggestions.
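
To make the ownership rule concrete, here is a minimal hypothetical caller
of the renamed API. The example function and the single-page scenario are
invented for illustration; the bulkw_* calls are the ones declared in the
attached patch, used the same way as in btbuildempty():

#include "postgres.h"

#include "storage/bufpage.h"
#include "utils/rel.h"
#include "storage/bulk_write.h"

/*
 * Hypothetical example, not part of the patch: write one initialized page
 * to block 0 of a new relation fork using the bulk write API.
 */
static void
example_bulk_write_one_page(Relation rel)
{
	BulkWriteState *bulkw = bulkw_start_rel(rel, MAIN_FORKNUM);
	BulkWriteBuffer buf = bulkw_get_buf(bulkw);

	PageInit((Page) buf, BLCKSZ, 0);	/* build the page contents as usual */
	bulkw_write(bulkw, 0, buf, true);	/* bulkw_write() now owns 'buf' */

	bulkw_finish(bulkw);				/* WAL-log, flush, register the fsync */
}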

+/*-------------------------------------------------------------------------
+ *
+ * bulk_write.c
+ *    Efficiently and reliably populate a new relation
+ *
+ * The assumption is that no other backends access the relation while we are
+ * loading it, so we can take some shortcuts.  Alternatively, you can use the
+ * buffer manager as usual, if performance is not critical, but you must not
+ * mix operations through the buffer manager and the bulk loading interface at
+ * the same time.

From "Alternatively" onward this is is somewhat confusing.

Rewrote that to just "Do not mix operations through the regular buffer
manager and the bulk loading interface!"

+ * One tricky point is that because we bypass the buffer manager, we need to
+ * register the relation for fsyncing at the next checkpoint ourselves, and
+ * make sure that the relation is correctly fsync by us or the checkpointer
+ * even if a checkpoint happens concurrently.

"fsync'ed" or such? Otherwise this reads awkwardly for me.

Yep, fixed.

+typedef struct BulkWriteBuffer
+{
+    Page            page;
+    BlockNumber blkno;
+    bool            page_std;
+    int16           order;
+} BulkWriteBuffer;
+

The name makes it sound like this struct itself contains a buffer - but it's
just pointing to one. *BufferRef or such maybe?

I was wondering how you dealt with the alignment of buffers given the struct
definition, which is what lead me to look at this...

I renamed this to PendingWrite, and the field that holds these to
"pending_writes". Think of it as a queue of writes that haven't been
performed yet.

+/*
+ * Bulk writer state for one relation fork.
+ */
+typedef struct BulkWriteState
+{
+    /* Information about the target relation we're writing */
+    SMgrRelation smgr;

Isn't there a danger of this becoming a dangling pointer? At least until
/messages/by-id/CA+hUKGJ8NTvqLHz6dqbQnt2c8XCki4r2QvXjBQcXpVwxTY_pvA@mail.gmail.com
is merged?

Good point. I just added a FIXME comment to remind about that, hoping
that that patch gets merged soon. If not, I'll come up with a different fix.

+    ForkNumber      forknum;
+    bool            use_wal;
+
+    /* We keep several pages buffered, and WAL-log them in batches */
+    int                     nbuffered;
+    BulkWriteBuffer buffers[MAX_BUFFERED_PAGES];
+
+    /* Current size of the relation */
+    BlockNumber pages_written;
+
+    /* The RedoRecPtr at the time that the bulk operation started */
+    XLogRecPtr      start_RedoRecPtr;
+
+    Page            zeropage;               /* workspace for filling zeroes */

We really should just have one such page in shared memory somewhere... For WAL
writes as well.

But until then - why do you allocate the page? Seems like we could just use a
static global variable?

I made it a static global variable for now. (The palloc way was copied
over from nbtsort.c)

+/*
+ * Write all buffered pages to disk.
+ */
+static void
+bulkw_flush(BulkWriteState *bulkw)
+{
+    int                     nbuffered = bulkw->nbuffered;
+    BulkWriteBuffer *buffers = bulkw->buffers;
+
+    if (nbuffered == 0)
+            return;
+
+    if (nbuffered > 1)
+    {
+            int                     o;
+
+            qsort(buffers, nbuffered, sizeof(BulkWriteBuffer), buffer_cmp);
+
+            /*
+             * Eliminate duplicates, keeping the last write of each block.
+             * (buffer_cmp uses 'order' as the last sort key)
+             */

Huh - which use cases would actually cause duplicate writes?

Hmm, nothing anymore I guess. Many AMs used to write zero pages as a
placeholder and come back to fill them in later, but now that
bulk_write.c handles the zero-filling itself, that's no longer needed.

Removed that, and replaced it with an assertion in buffer_cmp()
that there are no duplicates.

There are a few compilation errors reported by CFBot at [1]https://cirrus-ci.com/task/5299954164432896; the patch needs
to be rebased:
[02:38:12.675] In file included from ../../../../src/include/postgres.h:45,
[02:38:12.675] from nbtsort.c:41:
[02:38:12.675] nbtsort.c: In function ‘_bt_load’:
[02:38:12.675] nbtsort.c:1309:57: error: ‘BTPageState’ has no member
named ‘btps_page’
[02:38:12.675] 1309 | Assert(dstate->maxpostingsize <=
BTMaxItemSize(state->btps_page) &&
[02:38:12.675] | ^~
[02:38:12.675] ../../../../src/include/c.h:864:9: note: in definition
of macro ‘Assert’
[02:38:12.675] 864 | if (!(condition)) \
[02:38:12.675] | ^~~~~~~~~
[02:38:12.675] ../../../../src/include/c.h:812:29: note: in expansion
of macro ‘TYPEALIGN_DOWN’
[02:38:12.675] 812 | #define MAXALIGN_DOWN(LEN)
TYPEALIGN_DOWN(MAXIMUM_ALIGNOF, (LEN))
[02:38:12.675] | ^~~~~~~~~~~~~~
[02:38:12.675] ../../../../src/include/access/nbtree.h:165:3: note: in
expansion of macro ‘MAXALIGN_DOWN’
[02:38:12.675] 165 | (MAXALIGN_DOWN((PageGetPageSize(page) - \

[1]: https://cirrus-ci.com/task/5299954164432896

Regards,
Vignesh

#7Heikki Linnakangas
hlinnaka@iki.fi
In reply to: vignesh C (#6)
1 attachment(s)
Re: Relation bulk write facility

On 09/01/2024 08:50, vignesh C wrote:

There are a few compilation errors reported by CFBot at [1]; the patch needs
to be rebased:

Here you go.

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachments:

v4-0001-Introduce-a-new-bulk-loading-facility.patchtext/x-patch; charset=UTF-8; name=v4-0001-Introduce-a-new-bulk-loading-facility.patchDownload
From b1791303c54da762ecf3d63f1a60d5b93732af57 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 10 Jan 2024 11:24:20 +0200
Subject: [PATCH v4 1/1] Introduce a new bulk loading facility.

The new facility makes it easier to optimize bulk loading, as the
logic for buffering, WAL-logging, and syncing the relation only needs
to be implemented once. It's also less error-prone: We have had a
number of bugs in how a relation is fsync'd - or not - at the end of a
bulk loading operation. By centralizing that logic to one place, we
only need to write it correctly once.

The new facility is faster for small relations: Instead of calling
smgrimmedsync(), we register the fsync to happen at next checkpoint,
which avoids the fsync latency. That can make a big difference if you
are e.g. restoring a schema-only dump with lots of relations.

It is also slightly more efficient with large relations, as the WAL
logging is performed multiple pages at a time. That avoids some WAL
header overhead. The sorted GiST index build did that already, this
moves the buffering to the new facility.

The changes to the pageinspect GiST test need an explanation: Before this
patch, the sorted GiST index build set the LSN on every page to the
special GistBuildLSN value, not the LSN of the WAL record, even though
they were WAL-logged. There was no particular need for it, it just
happened naturally when we wrote out the pages before WAL-logging
them. Now we WAL-log the pages first, like in B-tree build, so the
pages are stamped with the record's real LSN. When the build is not
WAL-logged, we still use GistBuildLSN. To make the test output
predictable, use an unlogged index.

Reviewed-by: Andres Freund
Discussion: https://www.postgresql.org/message-id/30e8f366-58b3-b239-c521-422122dd5150%40iki.fi
---
 contrib/pageinspect/expected/gist.out |  14 +-
 contrib/pageinspect/sql/gist.sql      |  16 +-
 src/backend/access/gist/gistbuild.c   | 122 +++--------
 src/backend/access/heap/rewriteheap.c |  72 ++----
 src/backend/access/nbtree/nbtree.c    |  33 +--
 src/backend/access/nbtree/nbtsort.c   | 135 ++++--------
 src/backend/access/spgist/spginsert.c |  49 ++---
 src/backend/catalog/storage.c         |  46 ++--
 src/backend/storage/smgr/Makefile     |   1 +
 src/backend/storage/smgr/bulk_write.c | 301 ++++++++++++++++++++++++++
 src/backend/storage/smgr/md.c         |  45 +++-
 src/backend/storage/smgr/meson.build  |   1 +
 src/backend/storage/smgr/smgr.c       |  31 +++
 src/include/storage/bulk_write.h      |  37 ++++
 src/include/storage/md.h              |   1 +
 src/include/storage/smgr.h            |   1 +
 src/tools/pgindent/typedefs.list      |   3 +
 17 files changed, 553 insertions(+), 355 deletions(-)
 create mode 100644 src/backend/storage/smgr/bulk_write.c
 create mode 100644 src/include/storage/bulk_write.h

diff --git a/contrib/pageinspect/expected/gist.out b/contrib/pageinspect/expected/gist.out
index d1adbab8ae2..2b1d54a6279 100644
--- a/contrib/pageinspect/expected/gist.out
+++ b/contrib/pageinspect/expected/gist.out
@@ -1,13 +1,6 @@
--- The gist_page_opaque_info() function prints the page's LSN. Normally,
--- that's constant 1 (GistBuildLSN) on every page of a freshly built GiST
--- index. But with wal_level=minimal, the whole relation is dumped to WAL at
--- the end of the transaction if it's smaller than wal_skip_threshold, which
--- updates the LSNs. Wrap the tests on gist_page_opaque_info() in the
--- same transaction with the CREATE INDEX so that we see the LSNs before
--- they are possibly overwritten at end of transaction.
-BEGIN;
--- Create a test table and GiST index.
-CREATE TABLE test_gist AS SELECT point(i,i) p, i::text t FROM
+-- The gist_page_opaque_info() function prints the page's LSN.
+-- Use an unlogged index, so that the LSN is predictable.
+CREATE UNLOGGED TABLE test_gist AS SELECT point(i,i) p, i::text t FROM
     generate_series(1,1000) i;
 CREATE INDEX test_gist_idx ON test_gist USING gist (p);
 -- Page 0 is the root, the rest are leaf pages
@@ -29,7 +22,6 @@ SELECT * FROM gist_page_opaque_info(get_raw_page('test_gist_idx', 2));
  0/1 | 0/0 |         1 | {leaf}
 (1 row)
 
-COMMIT;
 SELECT * FROM gist_page_items(get_raw_page('test_gist_idx', 0), 'test_gist_idx');
  itemoffset |   ctid    | itemlen | dead |             keys              
 ------------+-----------+---------+------+-------------------------------
diff --git a/contrib/pageinspect/sql/gist.sql b/contrib/pageinspect/sql/gist.sql
index d263542ba15..85bc44b8000 100644
--- a/contrib/pageinspect/sql/gist.sql
+++ b/contrib/pageinspect/sql/gist.sql
@@ -1,14 +1,6 @@
--- The gist_page_opaque_info() function prints the page's LSN. Normally,
--- that's constant 1 (GistBuildLSN) on every page of a freshly built GiST
--- index. But with wal_level=minimal, the whole relation is dumped to WAL at
--- the end of the transaction if it's smaller than wal_skip_threshold, which
--- updates the LSNs. Wrap the tests on gist_page_opaque_info() in the
--- same transaction with the CREATE INDEX so that we see the LSNs before
--- they are possibly overwritten at end of transaction.
-BEGIN;
-
--- Create a test table and GiST index.
-CREATE TABLE test_gist AS SELECT point(i,i) p, i::text t FROM
+-- The gist_page_opaque_info() function prints the page's LSN.
+-- Use an unlogged index, so that the LSN is predictable.
+CREATE UNLOGGED TABLE test_gist AS SELECT point(i,i) p, i::text t FROM
     generate_series(1,1000) i;
 CREATE INDEX test_gist_idx ON test_gist USING gist (p);
 
@@ -17,8 +9,6 @@ SELECT * FROM gist_page_opaque_info(get_raw_page('test_gist_idx', 0));
 SELECT * FROM gist_page_opaque_info(get_raw_page('test_gist_idx', 1));
 SELECT * FROM gist_page_opaque_info(get_raw_page('test_gist_idx', 2));
 
-COMMIT;
-
 SELECT * FROM gist_page_items(get_raw_page('test_gist_idx', 0), 'test_gist_idx');
 SELECT * FROM gist_page_items(get_raw_page('test_gist_idx', 1), 'test_gist_idx') LIMIT 5;
 
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 08555b97f92..5f23a7614b0 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -43,7 +43,8 @@
 #include "miscadmin.h"
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
-#include "storage/smgr.h"
+#include "storage/bulk_write.h"
+
 #include "utils/memutils.h"
 #include "utils/rel.h"
 #include "utils/tuplesort.h"
@@ -106,11 +107,8 @@ typedef struct
 	Tuplesortstate *sortstate;	/* state data for tuplesort.c */
 
 	BlockNumber pages_allocated;
-	BlockNumber pages_written;
 
-	int			ready_num_pages;
-	BlockNumber ready_blknos[XLR_MAX_BLOCK_ID];
-	Page		ready_pages[XLR_MAX_BLOCK_ID];
+	BulkWriteState *bulkw;
 } GISTBuildState;
 
 #define GIST_SORTED_BUILD_PAGE_NUM 4
@@ -142,7 +140,6 @@ static void gist_indexsortbuild_levelstate_add(GISTBuildState *state,
 											   IndexTuple itup);
 static void gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 												 GistSortedBuildLevelState *levelstate);
-static void gist_indexsortbuild_flush_ready_pages(GISTBuildState *state);
 
 static void gistInitBuffering(GISTBuildState *buildstate);
 static int	calculatePagesPerBuffer(GISTBuildState *buildstate, int levelStep);
@@ -405,27 +402,18 @@ gist_indexsortbuild(GISTBuildState *state)
 {
 	IndexTuple	itup;
 	GistSortedBuildLevelState *levelstate;
-	Page		page;
+	BulkWriteBuffer rootbuf;
 
-	state->pages_allocated = 0;
-	state->pages_written = 0;
-	state->ready_num_pages = 0;
+	/* Reserve block 0 for the root page */
+	state->pages_allocated = 1;
 
-	/*
-	 * Write an empty page as a placeholder for the root page. It will be
-	 * replaced with the real root page at the end.
-	 */
-	page = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, MCXT_ALLOC_ZERO);
-	smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
-			   page, true);
-	state->pages_allocated++;
-	state->pages_written++;
+	state->bulkw = bulkw_start_rel(state->indexrel, MAIN_FORKNUM);
 
 	/* Allocate a temporary buffer for the first leaf page batch. */
 	levelstate = palloc0(sizeof(GistSortedBuildLevelState));
-	levelstate->pages[0] = page;
+	levelstate->pages[0] = palloc(BLCKSZ);
 	levelstate->parent = NULL;
-	gistinitpage(page, F_LEAF);
+	gistinitpage(levelstate->pages[0], F_LEAF);
 
 	/*
 	 * Fill index pages with tuples in the sorted order.
@@ -455,31 +443,16 @@ gist_indexsortbuild(GISTBuildState *state)
 		levelstate = parent;
 	}
 
-	gist_indexsortbuild_flush_ready_pages(state);
-
 	/* Write out the root */
 	PageSetLSN(levelstate->pages[0], GistBuildLSN);
-	PageSetChecksumInplace(levelstate->pages[0], GIST_ROOT_BLKNO);
-	smgrwrite(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
-			  levelstate->pages[0], true);
-	if (RelationNeedsWAL(state->indexrel))
-		log_newpage(&state->indexrel->rd_locator, MAIN_FORKNUM, GIST_ROOT_BLKNO,
-					levelstate->pages[0], true);
-
-	pfree(levelstate->pages[0]);
+
+	rootbuf = bulkw_get_buf(state->bulkw);
+	memcpy(rootbuf, levelstate->pages[0], BLCKSZ);
+	bulkw_write(state->bulkw, GIST_ROOT_BLKNO, rootbuf, true);
+
 	pfree(levelstate);
 
-	/*
-	 * When we WAL-logged index pages, we must nonetheless fsync index files.
-	 * Since we're building outside shared buffers, a CHECKPOINT occurring
-	 * during the build has no way to flush the previously written data to
-	 * disk (indeed it won't know the index even exists).  A crash later on
-	 * would replay WAL from the checkpoint, therefore it wouldn't replay our
-	 * earlier WAL entries. If we do not fsync those pages here, they might
-	 * still not be on disk when the crash occurs.
-	 */
-	if (RelationNeedsWAL(state->indexrel))
-		smgrimmedsync(RelationGetSmgr(state->indexrel), MAIN_FORKNUM);
+	bulkw_finish(state->bulkw);
 }
 
 /*
@@ -509,8 +482,7 @@ gist_indexsortbuild_levelstate_add(GISTBuildState *state,
 			levelstate->current_page++;
 
 		if (levelstate->pages[levelstate->current_page] == NULL)
-			levelstate->pages[levelstate->current_page] =
-				palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
+			levelstate->pages[levelstate->current_page] = palloc0(BLCKSZ);
 
 		newPage = levelstate->pages[levelstate->current_page];
 		gistinitpage(newPage, old_page_flags);
@@ -573,6 +545,7 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 	for (; dist != NULL; dist = dist->next)
 	{
 		char	   *data;
+		BulkWriteBuffer buf;
 		Page		target;
 
 		/* check once per page */
@@ -580,7 +553,8 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 
 		/* Create page and copy data */
 		data = (char *) (dist->list);
-		target = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, MCXT_ALLOC_ZERO);
+		buf = bulkw_get_buf(state->bulkw);
+		target = (Page) buf;
 		gistinitpage(target, isleaf ? F_LEAF : 0);
 		for (int i = 0; i < dist->block.num; i++)
 		{
@@ -593,20 +567,6 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 		}
 		union_tuple = dist->itup;
 
-		if (state->ready_num_pages == XLR_MAX_BLOCK_ID)
-			gist_indexsortbuild_flush_ready_pages(state);
-
-		/*
-		 * The page is now complete. Assign a block number to it, and add it
-		 * to the list of finished pages. (We don't write it out immediately,
-		 * because we want to WAL-log the pages in batches.)
-		 */
-		blkno = state->pages_allocated++;
-		state->ready_blknos[state->ready_num_pages] = blkno;
-		state->ready_pages[state->ready_num_pages] = target;
-		state->ready_num_pages++;
-		ItemPointerSetBlockNumber(&(union_tuple->t_tid), blkno);
-
 		/*
 		 * Set the right link to point to the previous page. This is just for
 		 * debugging purposes: GiST only follows the right link if a page is
@@ -621,6 +581,15 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 		 */
 		if (levelstate->last_blkno)
 			GistPageGetOpaque(target)->rightlink = levelstate->last_blkno;
+
+		/*
+		 * The page is now complete. Assign a block number to it, and pass it
+		 * to the bulk writer.
+		 */
+		blkno = state->pages_allocated++;
+		PageSetLSN(target, GistBuildLSN);
+		bulkw_write(state->bulkw, blkno, buf, true);
+		ItemPointerSetBlockNumber(&(union_tuple->t_tid), blkno);
 		levelstate->last_blkno = blkno;
 
 		/*
@@ -631,7 +600,7 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 		if (parent == NULL)
 		{
 			parent = palloc0(sizeof(GistSortedBuildLevelState));
-			parent->pages[0] = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
+			parent->pages[0] = palloc(BLCKSZ);
 			parent->parent = NULL;
 			gistinitpage(parent->pages[0], 0);
 
@@ -641,39 +610,6 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 	}
 }
 
-static void
-gist_indexsortbuild_flush_ready_pages(GISTBuildState *state)
-{
-	if (state->ready_num_pages == 0)
-		return;
-
-	for (int i = 0; i < state->ready_num_pages; i++)
-	{
-		Page		page = state->ready_pages[i];
-		BlockNumber blkno = state->ready_blknos[i];
-
-		/* Currently, the blocks must be buffered in order. */
-		if (blkno != state->pages_written)
-			elog(ERROR, "unexpected block number to flush GiST sorting build");
-
-		PageSetLSN(page, GistBuildLSN);
-		PageSetChecksumInplace(page, blkno);
-		smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, blkno, page,
-				   true);
-
-		state->pages_written++;
-	}
-
-	if (RelationNeedsWAL(state->indexrel))
-		log_newpages(&state->indexrel->rd_locator, MAIN_FORKNUM, state->ready_num_pages,
-					 state->ready_blknos, state->ready_pages, true);
-
-	for (int i = 0; i < state->ready_num_pages; i++)
-		pfree(state->ready_pages[i]);
-
-	state->ready_num_pages = 0;
-}
-
 
 /*-------------------------------------------------------------------------
  * Routines for non-sorted build
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 34107323ffe..ead5487f22a 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -87,8 +87,8 @@
  * is optimized for bulk inserting a lot of tuples, knowing that we have
  * exclusive access to the heap.  raw_heap_insert builds new pages in
  * local storage.  When a page is full, or at the end of the process,
- * we insert it to WAL as a single record and then write it to disk
- * directly through smgr.  Note, however, that any data sent to the new
+ * we insert it to WAL as a single record and then write it to disk with
+ * the bulk smgr writer.  Note, however, that any data sent to the new
  * heap's TOAST table will go through the normal bufmgr.
  *
  *
@@ -119,9 +119,9 @@
 #include "replication/logical.h"
 #include "replication/slot.h"
 #include "storage/bufmgr.h"
+#include "storage/bulk_write.h"
 #include "storage/fd.h"
 #include "storage/procarray.h"
-#include "storage/smgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -133,9 +133,9 @@ typedef struct RewriteStateData
 {
 	Relation	rs_old_rel;		/* source heap */
 	Relation	rs_new_rel;		/* destination heap */
-	Page		rs_buffer;		/* page currently being built */
+	BulkWriteState *rs_bulkw;	/* writer for the destination */
+	BulkWriteBuffer rs_buffer;	/* page currently being built */
 	BlockNumber rs_blockno;		/* block where page will go */
-	bool		rs_buffer_valid;	/* T if any tuples in buffer */
 	bool		rs_logical_rewrite; /* do we need to do logical rewriting */
 	TransactionId rs_oldest_xmin;	/* oldest xmin used by caller to determine
 									 * tuple visibility */
@@ -255,14 +255,14 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
 
 	state->rs_old_rel = old_heap;
 	state->rs_new_rel = new_heap;
-	state->rs_buffer = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
+	state->rs_buffer = NULL;
 	/* new_heap needn't be empty, just locked */
 	state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
-	state->rs_buffer_valid = false;
 	state->rs_oldest_xmin = oldest_xmin;
 	state->rs_freeze_xid = freeze_xid;
 	state->rs_cutoff_multi = cutoff_multi;
 	state->rs_cxt = rw_cxt;
+	state->rs_bulkw = bulkw_start_rel(new_heap, MAIN_FORKNUM);
 
 	/* Initialize hash tables used to track update chains */
 	hash_ctl.keysize = sizeof(TidHashKey);
@@ -314,30 +314,13 @@ end_heap_rewrite(RewriteState state)
 	}
 
 	/* Write the last page, if any */
-	if (state->rs_buffer_valid)
+	if (state->rs_buffer)
 	{
-		if (RelationNeedsWAL(state->rs_new_rel))
-			log_newpage(&state->rs_new_rel->rd_locator,
-						MAIN_FORKNUM,
-						state->rs_blockno,
-						state->rs_buffer,
-						true);
-
-		PageSetChecksumInplace(state->rs_buffer, state->rs_blockno);
-
-		smgrextend(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,
-				   state->rs_blockno, state->rs_buffer, true);
+		bulkw_write(state->rs_bulkw, state->rs_blockno, state->rs_buffer, true);
+		state->rs_buffer = NULL;
 	}
 
-	/*
-	 * When we WAL-logged rel pages, we must nonetheless fsync them.  The
-	 * reason is the same as in storage.c's RelationCopyStorage(): we're
-	 * writing data that's not in shared buffers, and so a CHECKPOINT
-	 * occurring during the rewriteheap operation won't have fsync'd data we
-	 * wrote before the checkpoint.
-	 */
-	if (RelationNeedsWAL(state->rs_new_rel))
-		smgrimmedsync(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM);
+	bulkw_finish(state->rs_bulkw);
 
 	logical_end_heap_rewrite(state);
 
@@ -611,7 +594,7 @@ rewrite_heap_dead_tuple(RewriteState state, HeapTuple old_tuple)
 static void
 raw_heap_insert(RewriteState state, HeapTuple tup)
 {
-	Page		page = state->rs_buffer;
+	Page		page;
 	Size		pageFreeSpace,
 				saveFreeSpace;
 	Size		len;
@@ -664,7 +647,8 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
 												   HEAP_DEFAULT_FILLFACTOR);
 
 	/* Now we can check to see if there's enough free space already. */
-	if (state->rs_buffer_valid)
+	page = (Page) state->rs_buffer;
+	if (page)
 	{
 		pageFreeSpace = PageGetHeapFreeSpace(page);
 
@@ -675,35 +659,19 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
 			 * contains a tuple.  Hence, unlike RelationGetBufferForTuple(),
 			 * enforce saveFreeSpace unconditionally.
 			 */
-
-			/* XLOG stuff */
-			if (RelationNeedsWAL(state->rs_new_rel))
-				log_newpage(&state->rs_new_rel->rd_locator,
-							MAIN_FORKNUM,
-							state->rs_blockno,
-							page,
-							true);
-
-			/*
-			 * Now write the page. We say skipFsync = true because there's no
-			 * need for smgr to schedule an fsync for this write; we'll do it
-			 * ourselves in end_heap_rewrite.
-			 */
-			PageSetChecksumInplace(page, state->rs_blockno);
-
-			smgrextend(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,
-					   state->rs_blockno, page, true);
-
+			bulkw_write(state->rs_bulkw, state->rs_blockno, state->rs_buffer, true);
+			state->rs_buffer = NULL;
+			page = NULL;
 			state->rs_blockno++;
-			state->rs_buffer_valid = false;
 		}
 	}
 
-	if (!state->rs_buffer_valid)
+	if (!page)
 	{
 		/* Initialize a new empty page */
+		state->rs_buffer = bulkw_get_buf(state->rs_bulkw);
+		page = (Page) state->rs_buffer;
 		PageInit(page, BLCKSZ, 0);
-		state->rs_buffer_valid = true;
 	}
 
 	/* And now we can insert the tuple into the page */
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 696d79c0852..da72966b567 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -29,11 +29,11 @@
 #include "nodes/execnodes.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
+#include "storage/bulk_write.h"
 #include "storage/condition_variable.h"
 #include "storage/indexfsm.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
-#include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/index_selfuncs.h"
 #include "utils/memutils.h"
@@ -154,32 +154,17 @@ void
 btbuildempty(Relation index)
 {
 	bool		allequalimage = _bt_allequalimage(index, false);
-	Buffer		metabuf;
-	Page		metapage;
+	BulkWriteState *bulkw;
+	BulkWriteBuffer metabuf;
 
-	/*
-	 * Initialize the metapage.
-	 *
-	 * Regular index build bypasses the buffer manager and uses smgr functions
-	 * directly, with an smgrimmedsync() call at the end.  That makes sense
-	 * when the index is large, but for an empty index, it's better to use the
-	 * buffer cache to avoid the smgrimmedsync().
-	 */
-	metabuf = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
-	Assert(BufferGetBlockNumber(metabuf) == BTREE_METAPAGE);
-	_bt_lockbuf(index, metabuf, BT_WRITE);
-
-	START_CRIT_SECTION();
-
-	metapage = BufferGetPage(metabuf);
-	_bt_initmetapage(metapage, P_NONE, 0, allequalimage);
-	MarkBufferDirty(metabuf);
-	log_newpage_buffer(metabuf, true);
+	bulkw = bulkw_start_rel(index, INIT_FORKNUM);
 
-	END_CRIT_SECTION();
+	/* Construct metapage. */
+	metabuf = bulkw_get_buf(bulkw);
+	_bt_initmetapage((Page) metabuf, P_NONE, 0, allequalimage);
+	bulkw_write(bulkw, BTREE_METAPAGE, metabuf, true);
 
-	_bt_unlockbuf(index, metabuf);
-	ReleaseBuffer(metabuf);
+	bulkw_finish(bulkw);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 20111965793..f3d10e3dc0a 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -23,13 +23,8 @@
  * many upper pages if the keys are reasonable-size) without risking a lot of
  * cascading splits during early insertions.
  *
- * Formerly the index pages being built were kept in shared buffers, but
- * that is of no value (since other backends have no interest in them yet)
- * and it created locking problems for CHECKPOINT, because the upper-level
- * pages were held exclusive-locked for long periods.  Now we just build
- * the pages in local memory and smgrwrite or smgrextend them as we finish
- * them.  They will need to be re-read into shared buffers on first use after
- * the build finishes.
+ * We use the bulk smgr loading facility to bypass the buffer cache and
+ * WAL log the pages efficiently.
  *
  * This code isn't concerned about the FSM at all. The caller is responsible
  * for initializing that.
@@ -57,7 +52,7 @@
 #include "executor/instrument.h"
 #include "miscadmin.h"
 #include "pgstat.h"
-#include "storage/smgr.h"
+#include "storage/bulk_write.h"
 #include "tcop/tcopprot.h"		/* pgrminclude ignore */
 #include "utils/rel.h"
 #include "utils/sortsupport.h"
@@ -234,7 +229,7 @@ typedef struct BTBuildState
  */
 typedef struct BTPageState
 {
-	Page		btps_page;		/* workspace for page building */
+	BulkWriteBuffer btps_buf;	/* workspace for page building */
 	BlockNumber btps_blkno;		/* block # to write this page at */
 	IndexTuple	btps_lowkey;	/* page's strict lower bound pivot tuple */
 	OffsetNumber btps_lastoff;	/* last item offset loaded */
@@ -251,11 +246,9 @@ typedef struct BTWriteState
 {
 	Relation	heap;
 	Relation	index;
+	BulkWriteState *bulkw;
 	BTScanInsert inskey;		/* generic insertion scankey */
-	bool		btws_use_wal;	/* dump pages to WAL? */
 	BlockNumber btws_pages_alloced; /* # pages allocated */
-	BlockNumber btws_pages_written; /* # pages written out */
-	Page		btws_zeropage;	/* workspace for filling zeroes */
 } BTWriteState;
 
 
@@ -267,7 +260,7 @@ static void _bt_spool(BTSpool *btspool, ItemPointer self,
 static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
-static Page _bt_blnewpage(uint32 level);
+static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
 static BTPageState *_bt_pagestate(BTWriteState *wstate, uint32 level);
 static void _bt_slideleft(Page rightmostpage);
 static void _bt_sortaddtup(Page page, Size itemsize,
@@ -569,12 +562,9 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 	/* _bt_mkscankey() won't set allequalimage without metapage */
 	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
-	wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
 
 	/* reserve the metapage */
 	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
-	wstate.btws_pages_written = 0;
-	wstate.btws_zeropage = NULL;	/* until needed */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
@@ -613,13 +603,15 @@ _bt_build_callback(Relation index,
 /*
  * allocate workspace for a new, clean btree page, not linked to any siblings.
  */
-static Page
-_bt_blnewpage(uint32 level)
+static BulkWriteBuffer
+_bt_blnewpage(BTWriteState *wstate, uint32 level)
 {
+	BulkWriteBuffer buf;
 	Page		page;
 	BTPageOpaque opaque;
 
-	page = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
+	buf = bulkw_get_buf(wstate->bulkw);
+	page = (Page) buf;
 
 	/* Zero the page and set up standard page header info */
 	_bt_pageinit(page, BLCKSZ);
@@ -634,63 +626,17 @@ _bt_blnewpage(uint32 level)
 	/* Make the P_HIKEY line pointer appear allocated */
 	((PageHeader) page)->pd_lower += sizeof(ItemIdData);
 
-	return page;
+	return buf;
 }
 
 /*
  * emit a completed btree page, and release the working storage.
  */
 static void
-_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
+_bt_blwritepage(BTWriteState *wstate, BulkWriteBuffer buf, BlockNumber blkno)
 {
-	/* XLOG stuff */
-	if (wstate->btws_use_wal)
-	{
-		/* We use the XLOG_FPI record type for this */
-		log_newpage(&wstate->index->rd_locator, MAIN_FORKNUM, blkno, page, true);
-	}
-
-	/*
-	 * If we have to write pages nonsequentially, fill in the space with
-	 * zeroes until we come back and overwrite.  This is not logically
-	 * necessary on standard Unix filesystems (unwritten space will read as
-	 * zeroes anyway), but it should help to avoid fragmentation. The dummy
-	 * pages aren't WAL-logged though.
-	 */
-	while (blkno > wstate->btws_pages_written)
-	{
-		if (!wstate->btws_zeropage)
-			wstate->btws_zeropage = (Page) palloc_aligned(BLCKSZ,
-														  PG_IO_ALIGN_SIZE,
-														  MCXT_ALLOC_ZERO);
-		/* don't set checksum for all-zero page */
-		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
-				   wstate->btws_pages_written++,
-				   wstate->btws_zeropage,
-				   true);
-	}
-
-	PageSetChecksumInplace(page, blkno);
-
-	/*
-	 * Now write the page.  There's no need for smgr to schedule an fsync for
-	 * this write; we'll do it ourselves before ending the build.
-	 */
-	if (blkno == wstate->btws_pages_written)
-	{
-		/* extending the file... */
-		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM, blkno,
-				   page, true);
-		wstate->btws_pages_written++;
-	}
-	else
-	{
-		/* overwriting a block we zero-filled before */
-		smgrwrite(RelationGetSmgr(wstate->index), MAIN_FORKNUM, blkno,
-				  page, true);
-	}
-
-	pfree(page);
+	bulkw_write(wstate->bulkw, blkno, buf, true);
+	/* bulkw_write took ownership of 'buf' */
 }
 
 /*
@@ -703,7 +649,7 @@ _bt_pagestate(BTWriteState *wstate, uint32 level)
 	BTPageState *state = (BTPageState *) palloc0(sizeof(BTPageState));
 
 	/* create initial page for level */
-	state->btps_page = _bt_blnewpage(level);
+	state->btps_buf = _bt_blnewpage(wstate, level);
 
 	/* and assign it a page position */
 	state->btps_blkno = wstate->btws_pages_alloced++;
@@ -839,6 +785,7 @@ static void
 _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 			 Size truncextra)
 {
+	BulkWriteBuffer nbuf;
 	Page		npage;
 	BlockNumber nblkno;
 	OffsetNumber last_off;
@@ -853,7 +800,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 	 */
 	CHECK_FOR_INTERRUPTS();
 
-	npage = state->btps_page;
+	nbuf = state->btps_buf;
+	npage = (Page) nbuf;
 	nblkno = state->btps_blkno;
 	last_off = state->btps_lastoff;
 	last_truncextra = state->btps_lastextra;
@@ -909,6 +857,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 		/*
 		 * Finish off the page and write it out.
 		 */
+		BulkWriteBuffer obuf = nbuf;
 		Page		opage = npage;
 		BlockNumber oblkno = nblkno;
 		ItemId		ii;
@@ -916,7 +865,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 		IndexTuple	oitup;
 
 		/* Create new page of same level */
-		npage = _bt_blnewpage(state->btps_level);
+		nbuf = _bt_blnewpage(wstate, state->btps_level);
+		npage = (Page) nbuf;
 
 		/* and assign it a page position */
 		nblkno = wstate->btws_pages_alloced++;
@@ -1028,10 +978,10 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 		}
 
 		/*
-		 * Write out the old page.  We never need to touch it again, so we can
-		 * free the opage workspace too.
+		 * Write out the old page. _bt_blwritepage takes ownership of the
+		 * 'opage' buffer.
 		 */
-		_bt_blwritepage(wstate, opage, oblkno);
+		_bt_blwritepage(wstate, obuf, oblkno);
 
 		/*
 		 * Reset last_off to point to new page
@@ -1064,7 +1014,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 	_bt_sortaddtup(npage, itupsz, itup, last_off,
 				   !isleaf && last_off == P_FIRSTKEY);
 
-	state->btps_page = npage;
+	state->btps_buf = nbuf;
 	state->btps_blkno = nblkno;
 	state->btps_lastoff = last_off;
 }
@@ -1116,7 +1066,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	BTPageState *s;
 	BlockNumber rootblkno = P_NONE;
 	uint32		rootlevel = 0;
-	Page		metapage;
+	BulkWriteBuffer metabuf;
 
 	/*
 	 * Each iteration of this loop completes one more level of the tree.
@@ -1127,7 +1077,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		BTPageOpaque opaque;
 
 		blkno = s->btps_blkno;
-		opaque = BTPageGetOpaque(s->btps_page);
+		opaque = BTPageGetOpaque((Page) s->btps_buf);
 
 		/*
 		 * We have to link the last page on this level to somewhere.
@@ -1161,9 +1111,9 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		 * This is the rightmost page, so the ItemId array needs to be slid
 		 * back one slot.  Then we can dump out the page.
 		 */
-		_bt_slideleft(s->btps_page);
-		_bt_blwritepage(wstate, s->btps_page, s->btps_blkno);
-		s->btps_page = NULL;	/* writepage freed the workspace */
+		_bt_slideleft((Page) s->btps_buf);
+		_bt_blwritepage(wstate, s->btps_buf, s->btps_blkno);
+		s->btps_buf = NULL;		/* writepage took ownership of the buffer */
 	}
 
 	/*
@@ -1172,10 +1122,10 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	 * set to point to "P_NONE").  This changes the index to the "valid" state
 	 * by filling in a valid magic number in the metapage.
 	 */
-	metapage = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
-	_bt_initmetapage(metapage, rootblkno, rootlevel,
+	metabuf = bulkw_get_buf(wstate->bulkw);
+	_bt_initmetapage((Page) metabuf, rootblkno, rootlevel,
 					 wstate->inskey->allequalimage);
-	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
+	_bt_blwritepage(wstate, metabuf, BTREE_METAPAGE);
 }
 
 /*
@@ -1197,6 +1147,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	int64		tuples_done = 0;
 	bool		deduplicate;
 
+	wstate->bulkw = bulkw_start_rel(wstate->index, MAIN_FORKNUM);
+
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
 
@@ -1352,7 +1304,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 				 */
 				dstate->maxpostingsize = MAXALIGN_DOWN((BLCKSZ * 10 / 100)) -
 					sizeof(ItemIdData);
-				Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
+				Assert(dstate->maxpostingsize <= BTMaxItemSize((Page) state->btps_buf) &&
 					   dstate->maxpostingsize <= INDEX_SIZE_MASK);
 				dstate->htids = palloc(dstate->maxpostingsize);
 
@@ -1422,18 +1374,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 
 	/* Close down final pages and write the metapage */
 	_bt_uppershutdown(wstate, state);
-
-	/*
-	 * When we WAL-logged index pages, we must nonetheless fsync index files.
-	 * Since we're building outside shared buffers, a CHECKPOINT occurring
-	 * during the build has no way to flush the previously written data to
-	 * disk (indeed it won't know the index even exists).  A crash later on
-	 * would replay WAL from the checkpoint, therefore it wouldn't replay our
-	 * earlier WAL entries. If we do not fsync those pages here, they might
-	 * still not be on disk when the crash occurs.
-	 */
-	if (wstate->btws_use_wal)
-		smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
+	bulkw_finish(wstate->bulkw);
 }
 
 /*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 98b1da20d58..e246985de3d 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -25,7 +25,7 @@
 #include "catalog/index.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
-#include "storage/smgr.h"
+#include "storage/bulk_write.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -155,42 +155,27 @@ spgbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 void
 spgbuildempty(Relation index)
 {
-	Buffer		metabuffer,
-				rootbuffer,
-				nullbuffer;
-
-	/*
-	 * Initialize the meta page and root pages
-	 */
-	metabuffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
-	LockBuffer(metabuffer, BUFFER_LOCK_EXCLUSIVE);
-	rootbuffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
-	LockBuffer(rootbuffer, BUFFER_LOCK_EXCLUSIVE);
-	nullbuffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
-	LockBuffer(nullbuffer, BUFFER_LOCK_EXCLUSIVE);
-
-	Assert(BufferGetBlockNumber(metabuffer) == SPGIST_METAPAGE_BLKNO);
-	Assert(BufferGetBlockNumber(rootbuffer) == SPGIST_ROOT_BLKNO);
-	Assert(BufferGetBlockNumber(nullbuffer) == SPGIST_NULL_BLKNO);
+	BulkWriteState *bulkw;
+	BulkWriteBuffer buf;
 
-	START_CRIT_SECTION();
+	bulkw = bulkw_start_rel(index, INIT_FORKNUM);
 
-	SpGistInitMetapage(BufferGetPage(metabuffer));
-	MarkBufferDirty(metabuffer);
-	SpGistInitBuffer(rootbuffer, SPGIST_LEAF);
-	MarkBufferDirty(rootbuffer);
-	SpGistInitBuffer(nullbuffer, SPGIST_LEAF | SPGIST_NULLS);
-	MarkBufferDirty(nullbuffer);
+	/* Construct metapage. */
+	buf = bulkw_get_buf(bulkw);
+	SpGistInitMetapage((Page) buf);
+	bulkw_write(bulkw, SPGIST_METAPAGE_BLKNO, buf, true);
 
-	log_newpage_buffer(metabuffer, true);
-	log_newpage_buffer(rootbuffer, true);
-	log_newpage_buffer(nullbuffer, true);
+	/* Likewise for the root page. */
+	buf = bulkw_get_buf(bulkw);
+	SpGistInitPage((Page) buf, SPGIST_LEAF);
+	bulkw_write(bulkw, SPGIST_ROOT_BLKNO, buf, true);
 
-	END_CRIT_SECTION();
+	/* Likewise for the null-tuples root page. */
+	buf = bulkw_get_buf(bulkw);
+	SpGistInitPage((Page) buf, SPGIST_LEAF | SPGIST_NULLS);
+	bulkw_write(bulkw, SPGIST_NULL_BLKNO, buf, true);
 
-	UnlockReleaseBuffer(metabuffer);
-	UnlockReleaseBuffer(rootbuffer);
-	UnlockReleaseBuffer(nullbuffer);
+	bulkw_finish(bulkw);
 }
 
 /*
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index b155c03386e..ccfcd8dbd51 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -28,6 +28,7 @@
 #include "catalog/storage.h"
 #include "catalog/storage_xlog.h"
 #include "miscadmin.h"
+#include "storage/bulk_write.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
 #include "utils/hsearch.h"
@@ -451,14 +452,11 @@ void
 RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 					ForkNumber forkNum, char relpersistence)
 {
-	PGIOAlignedBlock buf;
-	Page		page;
 	bool		use_wal;
 	bool		copying_initfork;
 	BlockNumber nblocks;
 	BlockNumber blkno;
-
-	page = (Page) buf.data;
+	BulkWriteState *bulkw;
 
 	/*
 	 * The init fork for an unlogged relation in many respects has to be
@@ -477,16 +475,21 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 	use_wal = XLogIsNeeded() &&
 		(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
 
+	bulkw = bulkw_start_smgr(dst, forkNum, use_wal);
+
 	nblocks = smgrnblocks(src, forkNum);
 
 	for (blkno = 0; blkno < nblocks; blkno++)
 	{
+		BulkWriteBuffer buf;
+
 		/* If we got a cancel signal during the copy of the data, quit */
 		CHECK_FOR_INTERRUPTS();
 
-		smgrread(src, forkNum, blkno, buf.data);
+		buf = bulkw_get_buf(bulkw);
+		smgrread(src, forkNum, blkno, (Page) buf);
 
-		if (!PageIsVerifiedExtended(page, blkno,
+		if (!PageIsVerifiedExtended((Page) buf, blkno,
 									PIV_LOG_WARNING | PIV_REPORT_STAT))
 		{
 			/*
@@ -507,34 +510,13 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 		}
 
 		/*
-		 * WAL-log the copied page. Unfortunately we don't know what kind of a
-		 * page this is, so we have to log the full page including any unused
-		 * space.
-		 */
-		if (use_wal)
-			log_newpage(&dst->smgr_rlocator.locator, forkNum, blkno, page, false);
-
-		PageSetChecksumInplace(page, blkno);
-
-		/*
-		 * Now write the page.  We say skipFsync = true because there's no
-		 * need for smgr to schedule an fsync for this write; we'll do it
-		 * ourselves below.
+		 * Queue the page for WAL-logging and writing out.  Unfortunately we
+		 * don't know what kind of a page this is, so we have to log the full
+		 * page including any unused space.
 		 */
-		smgrextend(dst, forkNum, blkno, buf.data, true);
+		bulkw_write(bulkw, blkno, buf, false);
 	}
-
-	/*
-	 * When we WAL-logged rel pages, we must nonetheless fsync them.  The
-	 * reason is that since we're copying outside shared buffers, a CHECKPOINT
-	 * occurring during the copy has no way to flush the previously written
-	 * data to disk (indeed it won't know the new rel even exists).  A crash
-	 * later on would replay WAL from the checkpoint, therefore it wouldn't
-	 * replay our earlier WAL entries. If we do not fsync those pages here,
-	 * they might still not be on disk when the crash occurs.
-	 */
-	if (use_wal || copying_initfork)
-		smgrimmedsync(dst, forkNum);
+	bulkw_finish(bulkw);
 }
 
 /*
diff --git a/src/backend/storage/smgr/Makefile b/src/backend/storage/smgr/Makefile
index 596b564656f..1d0b98764f9 100644
--- a/src/backend/storage/smgr/Makefile
+++ b/src/backend/storage/smgr/Makefile
@@ -13,6 +13,7 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = \
+	bulk_write.o \
 	md.o \
 	smgr.o
 
diff --git a/src/backend/storage/smgr/bulk_write.c b/src/backend/storage/smgr/bulk_write.c
new file mode 100644
index 00000000000..34d2b874e03
--- /dev/null
+++ b/src/backend/storage/smgr/bulk_write.c
@@ -0,0 +1,301 @@
+/*-------------------------------------------------------------------------
+ *
+ * bulk_write.c
+ *	  Efficiently and reliably populate a new relation
+ *
+ * The assumption is that no other backends access the relation while we are
+ * loading it, so we can take some shortcuts.  Do not mix operations through
+ * the regular buffer manager and the bulk loading interface!
+ *
+ * We bypass the buffer manager to avoid the locking overhead, and call
+ * smgrextend() directly.  A downside is that the pages will need to be
+ * re-read into shared buffers on first use after the build finishes.  That's
+ * usually a good tradeoff for large relations, and for small relations, the
+ * overhead isn't very significant compared to creating the relation in the
+ * first place.
+ *
+ * The pages are WAL-logged if needed.  To save on WAL header overhead, we
+ * WAL-log several pages in one record.
+ *
+ * One tricky point is that because we bypass the buffer manager, we need to
+ * register the relation for fsyncing at the next checkpoint ourselves, and
+ * make sure that the relation is correctly fsync'd by us or the checkpointer
+ * even if a checkpoint happens concurrently.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/smgr/bulk_write.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xloginsert.h"
+#include "access/xlogrecord.h"
+#include "storage/bufmgr.h"
+#include "storage/bufpage.h"
+#include "storage/bulk_write.h"
+#include "storage/proc.h"
+#include "storage/smgr.h"
+#include "utils/rel.h"
+
+#define MAX_PENDING_WRITES XLR_MAX_BLOCK_ID
+
+static const PGIOAlignedBlock zero_buffer = {{0}};	/* worth BLCKSZ */
+
+typedef struct PendingWrite
+{
+	BulkWriteBuffer buf;
+	BlockNumber blkno;
+	bool		page_std;
+} PendingWrite;
+
+/*
+ * Bulk writer state for one relation fork.
+ */
+typedef struct BulkWriteState
+{
+	/* Information about the target relation we're writing */
+	/*
+	 * FIXME: 'smgr' might get invalidated. Hopefully
+	 * https://postgr.es/m/CA%2BhUKGJ8NTvqLHz6dqbQnt2c8XCki4r2QvXjBQcXpVwxTY_pvA%40mail.gmail.com
+	 * is merged before this.
+	 */
+	SMgrRelation smgr;
+	ForkNumber	forknum;
+	bool		use_wal;
+
+	/* We keep several writes queued, and WAL-log them in batches */
+	int			npending;
+	PendingWrite pending_writes[MAX_PENDING_WRITES];
+
+	/* Current size of the relation */
+	BlockNumber pages_written;
+
+	/* The RedoRecPtr at the time that the bulk operation started */
+	XLogRecPtr	start_RedoRecPtr;
+
+	MemoryContext memcxt;
+} BulkWriteState;
+
+static void bulkw_flush(BulkWriteState *bulkw);
+
+/*
+ * Start a bulk write operation on a relation fork.
+ */
+BulkWriteState *
+bulkw_start_rel(Relation rel, ForkNumber forknum)
+{
+	return bulkw_start_smgr(RelationGetSmgr(rel),
+							forknum,
+							RelationNeedsWAL(rel) || forknum == INIT_FORKNUM);
+}
+
+/*
+ * Start a bulk write operation on a relation fork.
+ *
+ * This is like bulkw_start_rel, but can be used without a relcache entry.
+ */
+BulkWriteState *
+bulkw_start_smgr(SMgrRelation smgr, ForkNumber forknum, bool use_wal)
+{
+	BulkWriteState *bulkw;
+
+	bulkw = palloc(sizeof(BulkWriteState));
+	bulkw->smgr = smgr;
+	bulkw->forknum = forknum;
+	bulkw->use_wal = use_wal;
+
+	bulkw->npending = 0;
+	bulkw->pages_written = 0;
+
+	bulkw->start_RedoRecPtr = GetRedoRecPtr();
+
+	/*
+	 * Remember the memory context.  We will use it to allocate all the
+	 * buffers later.
+	 */
+	bulkw->memcxt = CurrentMemoryContext;
+
+	return bulkw;
+}
+
+/*
+ * Finish bulk write operation.
+ *
+ * This WAL-logs and flushes any remaining pending writes to disk, and fsyncs
+ * the relation if needed.
+ */
+void
+bulkw_finish(BulkWriteState *bulkw)
+{
+	/* WAL-log and flush any remaining pages */
+	bulkw_flush(bulkw);
+
+	/*
+	 * When we wrote out the pages, we passed skipFsync=true to avoid the
+	 * overhead of registering all the writes with the checkpointer.  Register
+	 * the whole relation now.
+	 *
+	 * There is one hole in that idea: If a checkpoint occurred while we were
+	 * writing the pages, it already missed fsyncing the pages we had written
+	 * before the checkpoint started.  A crash later on would replay the WAL
+	 * starting from the checkpoint, therefore it wouldn't replay our earlier
+	 * WAL records.  So if a checkpoint started after the bulk write, fsync
+	 * the files now.
+	 */
+	if (!SmgrIsTemp(bulkw->smgr))
+	{
+		/*
+		 * Prevent a checkpoint from starting between the GetRedoRecPtr() and
+		 * smgrregistersync() calls.
+		 */
+		Assert((MyProc->delayChkptFlags & DELAY_CHKPT_START) == 0);
+		MyProc->delayChkptFlags |= DELAY_CHKPT_START;
+
+		if (bulkw->start_RedoRecPtr != GetRedoRecPtr())
+		{
+			/*
+			 * A checkpoint occurred and it didn't know about our writes, so
+			 * fsync() the relation ourselves.
+			 */
+			MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
+			smgrimmedsync(bulkw->smgr, bulkw->forknum);
+			elog(DEBUG1, "flushed relation because a checkpoint occurred concurrently");
+		}
+		else
+		{
+			smgrregistersync(bulkw->smgr, bulkw->forknum);
+			MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
+		}
+	}
+}
+
+static int
+buffer_cmp(const void *a, const void *b)
+{
+	const PendingWrite *bufa = (const PendingWrite *) a;
+	const PendingWrite *bufb = (const PendingWrite *) b;
+
+	/* We should not see duplicated writes for the same block */
+	Assert(bufa->blkno != bufb->blkno);
+	if (bufa->blkno > bufb->blkno)
+		return 1;
+	else
+		return -1;
+}
+
+/*
+ * Finish all the pending writes.
+ */
+static void
+bulkw_flush(BulkWriteState *bulkw)
+{
+	int			npending = bulkw->npending;
+	PendingWrite *pending_writes = bulkw->pending_writes;
+
+	if (npending == 0)
+		return;
+
+	if (npending > 1)
+		qsort(pending_writes, npending, sizeof(PendingWrite), buffer_cmp);
+
+	if (bulkw->use_wal)
+	{
+		BlockNumber blknos[MAX_PENDING_WRITES];
+		Page		pages[MAX_PENDING_WRITES];
+		bool		page_std = true;
+
+		for (int i = 0; i < npending; i++)
+		{
+			blknos[i] = pending_writes[i].blkno;
+			pages[i] = pending_writes[i].buf->data;
+
+			/*
+			 * If any of the pages use !page_std, we log them all as such.
+			 * That's a bit wasteful, but in practice, a mix of standard and
+			 * non-standard page layout is rare.  None of the built-in AMs do
+			 * that.
+			 */
+			if (!pending_writes[i].page_std)
+				page_std = false;
+		}
+		log_newpages(&bulkw->smgr->smgr_rlocator.locator, bulkw->forknum,
+					 npending, blknos, pages, page_std);
+	}
+
+	for (int i = 0; i < npending; i++)
+	{
+		BlockNumber blkno = pending_writes[i].blkno;
+		Page		page = pending_writes[i].buf->data;
+
+		PageSetChecksumInplace(page, blkno);
+
+		if (blkno >= bulkw->pages_written)
+		{
+			/*
+			 * If we have to write pages nonsequentially, fill in the space
+			 * with zeroes until we come back and overwrite.  This is not
+			 * logically necessary on standard Unix filesystems (unwritten
+			 * space will read as zeroes anyway), but it should help to avoid
+			 * fragmentation.  The dummy pages aren't WAL-logged though.
+			 */
+			while (blkno > bulkw->pages_written)
+			{
+				/* don't set checksum for all-zero page */
+				smgrextend(bulkw->smgr, bulkw->forknum,
+						   bulkw->pages_written++,
+						   &zero_buffer,
+						   true);
+			}
+
+			smgrextend(bulkw->smgr, bulkw->forknum, blkno, page, true);
+			bulkw->pages_written = pending_writes[i].blkno + 1;
+		}
+		else
+			smgrwrite(bulkw->smgr, bulkw->forknum, blkno, page, true);
+		pfree(page);
+	}
+
+	bulkw->npending = 0;
+}
+
+/*
+ * Queue write of 'buf'.
+ *
+ * NB: this takes ownership of 'buf'!
+ *
+ * You are only allowed to write a given block once as part of one bulk write
+ * operation.
+ */
+void
+bulkw_write(BulkWriteState *bulkw, BlockNumber blocknum, BulkWriteBuffer buf, bool page_std)
+{
+	bulkw->pending_writes[bulkw->npending].buf = buf;
+	bulkw->pending_writes[bulkw->npending].blkno = blocknum;
+	bulkw->pending_writes[bulkw->npending].page_std = page_std;
+
+	bulkw->npending++;
+
+	if (bulkw->npending == MAX_PENDING_WRITES)
+		bulkw_flush(bulkw);
+}
+
+/*
+ * Allocate a new buffer which can later be written with bulkw_write().
+ *
+ * There is no function to free a buffer.  When you pass it to bulkw_write(),
+ * it takes ownership and frees it when it's no longer needed.
+ *
+ * This is currently implemented as a simple palloc, but could be implemented
+ * using a ring buffer or larger chunks in the future, so don't rely on it.
+ */
+BulkWriteBuffer
+bulkw_get_buf(BulkWriteState *bulkw)
+{
+	return MemoryContextAllocAligned(bulkw->memcxt, BLCKSZ, PG_IO_ALIGN_SIZE, 0);
+}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index b1e9932a291..233f6987f59 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1236,6 +1236,49 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
 	}
 }
 
+/*
+ * mdregistersync() -- Mark whole relation as needing fsync
+ */
+void
+mdregistersync(SMgrRelation reln, ForkNumber forknum)
+{
+	int			segno;
+	int			min_inactive_seg;
+
+	/*
+	 * NOTE: mdnblocks makes sure we have opened all active segments, so that
+	 * the loop below will get them all!
+	 */
+	mdnblocks(reln, forknum);
+
+	min_inactive_seg = segno = reln->md_num_open_segs[forknum];
+
+	/*
+	 * Temporarily open inactive segments, then close them after sync.  There
+	 * may be some inactive segments left opened after error, but that is
+	 * harmless.  We don't bother to clean them up and take a risk of further
+	 * trouble.  The next mdclose() will soon close them.
+	 */
+	while (_mdfd_openseg(reln, forknum, segno, 0) != NULL)
+		segno++;
+
+	while (segno > 0)
+	{
+		MdfdVec    *v = &reln->md_seg_fds[forknum][segno - 1];
+
+		register_dirty_segment(reln, forknum, v);
+
+		/* Close inactive segments immediately */
+		if (segno > min_inactive_seg)
+		{
+			FileClose(v->mdfd_vfd);
+			_fdvec_resize(reln, forknum, segno - 1);
+		}
+
+		segno--;
+	}
+}
+
 /*
  * mdimmedsync() -- Immediately sync a relation to stable storage.
  *
@@ -1255,7 +1298,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 
 	/*
 	 * NOTE: mdnblocks makes sure we have opened all active segments, so that
-	 * fsync loop will get them all!
+	 * the loop below will get them all!
 	 */
 	mdnblocks(reln, forknum);
 
diff --git a/src/backend/storage/smgr/meson.build b/src/backend/storage/smgr/meson.build
index 003d5e30dd4..6d91b18fe67 100644
--- a/src/backend/storage/smgr/meson.build
+++ b/src/backend/storage/smgr/meson.build
@@ -1,6 +1,7 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
 backend_sources += files(
+  'bulk_write.c',
   'md.c',
   'smgr.c',
 )
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 563a0be5c74..ea4aed63dab 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -68,6 +68,7 @@ typedef struct f_smgr
 	void		(*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
 								  BlockNumber nblocks);
 	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+	void		(*smgr_registersync) (SMgrRelation reln, ForkNumber forknum);
 } f_smgr;
 
 static const f_smgr smgrsw[] = {
@@ -89,6 +90,7 @@ static const f_smgr smgrsw[] = {
 		.smgr_nblocks = mdnblocks,
 		.smgr_truncate = mdtruncate,
 		.smgr_immedsync = mdimmedsync,
+		.smgr_registersync = mdregistersync,
 	}
 };
 
@@ -583,6 +585,14 @@ smgrreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
  * on disk at return, only dumped out to the kernel.  However,
  * provisions will be made to fsync the write before the next checkpoint.
  *
+ * NB: The mechanism to ensure fsync at next checkpoint assumes that there is
+ * something that prevents a concurrent checkpoint from "racing ahead" of the
+ * write.  One way to prevent that is by holding a lock on the buffer; the
+ * buffer manager's writes are protected by that.  The bulk writer facility in
+ * bulk_write.c checks the redo pointer and calls smgrimmedsync() if a
+ * checkpoint happened; that relies on the fact that no other backend can be
+ * concurrently modifying the page.
+ *
  * skipFsync indicates that the caller will make other provisions to
  * fsync the relation, so we needn't bother.  Temporary relations also
  * do not require fsync.
@@ -700,6 +710,24 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks, BlockNumber *nb
 	}
 }
 
+/*
+ * smgrregistersync() -- Request a relation to be sync'd at next checkpoint
+ *
+ * This can be used after calling smgrwrite() or smgrextend() with skipFsync =
+ * true, to register the fsyncs that were skipped earlier.
+ *
+ * Note: be mindful that a checkpoint could already have happened between the
+ * smgrwrite or smgrextend calls and this!  In that case, the checkpoint
+ * already missed fsyncing this relation, and you should use smgrimmedsync
+ * instead.  Most callers should use the bulk loading facility in bulk_write.c
+ * instead, which handles all that.
+ */
+void
+smgrregistersync(SMgrRelation reln, ForkNumber forknum)
+{
+	smgrsw[reln->smgr_which].smgr_registersync(reln, forknum);
+}
+
 /*
  * smgrimmedsync() -- Force the specified relation to stable storage.
  *
@@ -722,6 +750,9 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks, BlockNumber *nb
  * Note that you need to do FlushRelationBuffers() first if there is
  * any possibility that there are dirty buffers for the relation;
  * otherwise the sync is not very meaningful.
+ *
+ * Most callers should use the bulk loading facility in bulk_write.c
+ * instead of calling this directly.
  */
 void
 smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
diff --git a/src/include/storage/bulk_write.h b/src/include/storage/bulk_write.h
new file mode 100644
index 00000000000..76a0b05e310
--- /dev/null
+++ b/src/include/storage/bulk_write.h
@@ -0,0 +1,37 @@
+/*-------------------------------------------------------------------------
+ *
+ * bulk_write.h
+ *	  Efficiently and reliably populate a new relation
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/bulk_write.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BULK_WRITE_H
+#define BULK_WRITE_H
+
+typedef struct BulkWriteState BulkWriteState;
+
+/*
+ * Temporary buffer to hold a page to until it's written out. Use
+ * bulkw_get_buf() to reserve one of these.  This is a separate typedef to
+ * distinguish it from other block-sized buffers passed around in the system.
+ */
+typedef PGIOAlignedBlock *BulkWriteBuffer;
+
+/* forward declared from smgr.h */
+struct SMgrRelationData;
+
+extern BulkWriteState *bulkw_start_rel(Relation rel, ForkNumber forknum);
+extern BulkWriteState *bulkw_start_smgr(struct SMgrRelationData *smgr, ForkNumber forknum, bool use_wal);
+
+extern BulkWriteBuffer bulkw_get_buf(BulkWriteState *bulkw);
+extern void bulkw_write(BulkWriteState *bulkw, BlockNumber blocknum, BulkWriteBuffer buf, bool page_std);
+
+extern void bulkw_finish(BulkWriteState *bulkw);
+
+#endif							/* BULK_WRITE_H */
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 7c181e5a171..620f10abdeb 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -43,6 +43,7 @@ extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
 extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
 					   BlockNumber nblocks);
 extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void mdregistersync(SMgrRelation reln, ForkNumber forknum);
 
 extern void ForgetDatabaseSyncRequests(Oid dbid);
 extern void DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 527cd2a0568..d6d24487763 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -110,6 +110,7 @@ extern BlockNumber smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum);
 extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum,
 						 int nforks, BlockNumber *nblocks);
 extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void smgrregistersync(SMgrRelation reln, ForkNumber forknum);
 extern void AtEOXact_SMgr(void);
 extern bool ProcessBarrierSmgrRelease(void);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5fd46b7bd1f..ff833a87fe0 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -332,6 +332,8 @@ BuildAccumulator
 BuiltinScript
 BulkInsertState
 BulkInsertStateData
+BulkWriteBuffer
+BulkWriteState
 CACHESIGN
 CAC_state
 CCFastEqualFN
@@ -2013,6 +2015,7 @@ PendingFsyncEntry
 PendingRelDelete
 PendingRelSync
 PendingUnlinkEntry
+PendingWrite
 PendingWriteback
 PerLockTagEntry
 PerlInterpreter
-- 
2.39.2

#8Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#4)
Re: Relation bulk write facility

On Fri, Nov 24, 2023 at 10:22 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Yeah, I'm not very happy with this interface. The model is that you get
a buffer to write to by calling bulkw_alloc_buf(). Later, you hand it
over to bulkw_write(), which takes ownership of it and frees it later.
There is no other function to free it, although currently the buffer is
just palloc'd so you could call pfree on it.

I think we should try to pick prefixes that are one or more words
rather than using word fragments. bulkw is an awkward prefix even for
people whose first language is English, and probably more awkward for
others.

--
Robert Haas
EDB: http://www.enterprisedb.com

#9Peter Smith
smithpb2250@gmail.com
In reply to: Heikki Linnakangas (#7)
Re: Relation bulk write facility

2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1]https://commitfest.postgresql.org/46/4575/, but it seems
there was a CFbot test failure last time it was run [2]https://cirrus-ci.com/task/4990764426461184. Please have a
look and post an updated version if necessary.

======
[1]: https://commitfest.postgresql.org/46/4575/
[2]: https://cirrus-ci.com/task/4990764426461184

Kind Regards,
Peter Smith.

#10Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Smith (#9)
1 attachment(s)
Re: Relation bulk write facility

On 10/01/2024 18:17, Robert Haas wrote:

I think we should try to pick prefixes that are one or more words
rather than using word fragments. bulkw is an awkward prefix even for
people whose first language is English, and probably more awkward for
others.

Renamed the 'bulkw' variables to 'bulkstate', and the functions to have
smgr_bulk_* prefix.

I was tempted to use just bulk_* as the prefix, but I'm afraid e.g.
bulk_write() is too generic.
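
To make the new names concrete, here is a minimal usage sketch of the
call pattern (init_new_page() is just a placeholder for AM-specific
page initialization; the smgr_bulk_* functions are the ones in the
attached patch):

    static void
    build_new_fork(Relation rel, BlockNumber nblocks)
    {
        BulkWriteState *bulkstate = smgr_bulk_start_rel(rel, MAIN_FORKNUM);

        for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
        {
            /* get a block-sized, I/O-aligned scratch buffer */
            BulkWriteBuffer buf = smgr_bulk_get_buf(bulkstate);

            /* placeholder for AM-specific page initialization */
            init_new_page((Page) buf, blkno);

            /* queue the page; ownership of 'buf' passes to the bulk writer */
            smgr_bulk_write(bulkstate, blkno, buf, true);
        }

        /* WAL-log any remaining pages and arrange the fsync */
        smgr_bulk_finish(bulkstate);
    }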

On 22/01/2024 07:50, Peter Smith wrote:

Hi, This patch has a CF status of "Needs Review" [1], but it seems
there was a CFbot test failure last time it was run [2]. Please have a
look and post an updated version if necessary.

Fixed the headerscheck failure by adding appropriate #includes.

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachments:

v5-0001-Introduce-a-new-bulk-loading-facility.patchtext/x-patch; charset=UTF-8; name=v5-0001-Introduce-a-new-bulk-loading-facility.patchDownload
From c2e8cff9326fb874b2e1643f5c3c8a4952eaa3ac Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Thu, 25 Jan 2024 21:40:46 +0200
Subject: [PATCH v5 1/1] Introduce a new bulk loading facility.

The new facility makes it easier to optimize bulk loading, as the
logic for buffering, WAL-logging, and syncing the relation only needs
to be implemented once. It's also less error-prone: We have had a
number of bugs in how a relation is fsync'd - or not - at the end of a
bulk loading operation. By centralizing that logic to one place, we
only need to write it correctly once.

The new facility is faster for small relations: Instead of calling
smgrimmedsync(), we register the fsync to happen at next checkpoint,
which avoids the fsync latency. That can make a big difference if you
are e.g. restoring a schema-only dump with lots of relations.

It is also slightly more efficient with large relations, as the WAL
logging is performed multiple pages at a time. That avoids some WAL
header overhead. The sorted GiST index build did that already, this
moves the buffering to the new facility.

The changes to the pageinspect GiST test need an explanation: Before this
patch, the sorted GiST index build set the LSN on every page to the
special GistBuildLSN value, not the LSN of the WAL record, even though
they were WAL-logged. There was no particular need for it, it just
happened naturally when we wrote out the pages before WAL-logging
them. Now we WAL-log the pages first, like in B-tree build, so the
pages are stamped with the record's real LSN. When the build is not
WAL-logged, we still use GistBuildLSN. To make the test output
predictable, use an unlogged index.

Reviewed-by: Andres Freund
Discussion: https://www.postgresql.org/message-id/30e8f366-58b3-b239-c521-422122dd5150%40iki.fi
---
 contrib/pageinspect/expected/gist.out |  14 +-
 contrib/pageinspect/sql/gist.sql      |  16 +-
 src/backend/access/gist/gistbuild.c   | 122 +++--------
 src/backend/access/heap/rewriteheap.c |  72 ++----
 src/backend/access/nbtree/nbtree.c    |  33 +--
 src/backend/access/nbtree/nbtsort.c   | 135 ++++--------
 src/backend/access/spgist/spginsert.c |  49 ++---
 src/backend/catalog/storage.c         |  46 ++--
 src/backend/storage/smgr/Makefile     |   1 +
 src/backend/storage/smgr/bulk_write.c | 303 ++++++++++++++++++++++++++
 src/backend/storage/smgr/md.c         |  45 +++-
 src/backend/storage/smgr/meson.build  |   1 +
 src/backend/storage/smgr/smgr.c       |  31 +++
 src/include/storage/bulk_write.h      |  40 ++++
 src/include/storage/md.h              |   1 +
 src/include/storage/smgr.h            |   1 +
 src/tools/pgindent/typedefs.list      |   3 +
 17 files changed, 558 insertions(+), 355 deletions(-)
 create mode 100644 src/backend/storage/smgr/bulk_write.c
 create mode 100644 src/include/storage/bulk_write.h

diff --git a/contrib/pageinspect/expected/gist.out b/contrib/pageinspect/expected/gist.out
index d1adbab8ae2..2b1d54a6279 100644
--- a/contrib/pageinspect/expected/gist.out
+++ b/contrib/pageinspect/expected/gist.out
@@ -1,13 +1,6 @@
--- The gist_page_opaque_info() function prints the page's LSN. Normally,
--- that's constant 1 (GistBuildLSN) on every page of a freshly built GiST
--- index. But with wal_level=minimal, the whole relation is dumped to WAL at
--- the end of the transaction if it's smaller than wal_skip_threshold, which
--- updates the LSNs. Wrap the tests on gist_page_opaque_info() in the
--- same transaction with the CREATE INDEX so that we see the LSNs before
--- they are possibly overwritten at end of transaction.
-BEGIN;
--- Create a test table and GiST index.
-CREATE TABLE test_gist AS SELECT point(i,i) p, i::text t FROM
+-- The gist_page_opaque_info() function prints the page's LSN.
+-- Use an unlogged index, so that the LSN is predictable.
+CREATE UNLOGGED TABLE test_gist AS SELECT point(i,i) p, i::text t FROM
     generate_series(1,1000) i;
 CREATE INDEX test_gist_idx ON test_gist USING gist (p);
 -- Page 0 is the root, the rest are leaf pages
@@ -29,7 +22,6 @@ SELECT * FROM gist_page_opaque_info(get_raw_page('test_gist_idx', 2));
  0/1 | 0/0 |         1 | {leaf}
 (1 row)
 
-COMMIT;
 SELECT * FROM gist_page_items(get_raw_page('test_gist_idx', 0), 'test_gist_idx');
  itemoffset |   ctid    | itemlen | dead |             keys              
 ------------+-----------+---------+------+-------------------------------
diff --git a/contrib/pageinspect/sql/gist.sql b/contrib/pageinspect/sql/gist.sql
index d263542ba15..85bc44b8000 100644
--- a/contrib/pageinspect/sql/gist.sql
+++ b/contrib/pageinspect/sql/gist.sql
@@ -1,14 +1,6 @@
--- The gist_page_opaque_info() function prints the page's LSN. Normally,
--- that's constant 1 (GistBuildLSN) on every page of a freshly built GiST
--- index. But with wal_level=minimal, the whole relation is dumped to WAL at
--- the end of the transaction if it's smaller than wal_skip_threshold, which
--- updates the LSNs. Wrap the tests on gist_page_opaque_info() in the
--- same transaction with the CREATE INDEX so that we see the LSNs before
--- they are possibly overwritten at end of transaction.
-BEGIN;
-
--- Create a test table and GiST index.
-CREATE TABLE test_gist AS SELECT point(i,i) p, i::text t FROM
+-- The gist_page_opaque_info() function prints the page's LSN.
+-- Use an unlogged index, so that the LSN is predictable.
+CREATE UNLOGGED TABLE test_gist AS SELECT point(i,i) p, i::text t FROM
     generate_series(1,1000) i;
 CREATE INDEX test_gist_idx ON test_gist USING gist (p);
 
@@ -17,8 +9,6 @@ SELECT * FROM gist_page_opaque_info(get_raw_page('test_gist_idx', 0));
 SELECT * FROM gist_page_opaque_info(get_raw_page('test_gist_idx', 1));
 SELECT * FROM gist_page_opaque_info(get_raw_page('test_gist_idx', 2));
 
-COMMIT;
-
 SELECT * FROM gist_page_items(get_raw_page('test_gist_idx', 0), 'test_gist_idx');
 SELECT * FROM gist_page_items(get_raw_page('test_gist_idx', 1), 'test_gist_idx') LIMIT 5;
 
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 08555b97f92..b35d945d3d4 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -43,7 +43,8 @@
 #include "miscadmin.h"
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
-#include "storage/smgr.h"
+#include "storage/bulk_write.h"
+
 #include "utils/memutils.h"
 #include "utils/rel.h"
 #include "utils/tuplesort.h"
@@ -106,11 +107,8 @@ typedef struct
 	Tuplesortstate *sortstate;	/* state data for tuplesort.c */
 
 	BlockNumber pages_allocated;
-	BlockNumber pages_written;
 
-	int			ready_num_pages;
-	BlockNumber ready_blknos[XLR_MAX_BLOCK_ID];
-	Page		ready_pages[XLR_MAX_BLOCK_ID];
+	BulkWriteState *bulkstate;
 } GISTBuildState;
 
 #define GIST_SORTED_BUILD_PAGE_NUM 4
@@ -142,7 +140,6 @@ static void gist_indexsortbuild_levelstate_add(GISTBuildState *state,
 											   IndexTuple itup);
 static void gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 												 GistSortedBuildLevelState *levelstate);
-static void gist_indexsortbuild_flush_ready_pages(GISTBuildState *state);
 
 static void gistInitBuffering(GISTBuildState *buildstate);
 static int	calculatePagesPerBuffer(GISTBuildState *buildstate, int levelStep);
@@ -405,27 +402,18 @@ gist_indexsortbuild(GISTBuildState *state)
 {
 	IndexTuple	itup;
 	GistSortedBuildLevelState *levelstate;
-	Page		page;
+	BulkWriteBuffer rootbuf;
 
-	state->pages_allocated = 0;
-	state->pages_written = 0;
-	state->ready_num_pages = 0;
+	/* Reserve block 0 for the root page */
+	state->pages_allocated = 1;
 
-	/*
-	 * Write an empty page as a placeholder for the root page. It will be
-	 * replaced with the real root page at the end.
-	 */
-	page = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, MCXT_ALLOC_ZERO);
-	smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
-			   page, true);
-	state->pages_allocated++;
-	state->pages_written++;
+	state->bulkstate = smgr_bulk_start_rel(state->indexrel, MAIN_FORKNUM);
 
 	/* Allocate a temporary buffer for the first leaf page batch. */
 	levelstate = palloc0(sizeof(GistSortedBuildLevelState));
-	levelstate->pages[0] = page;
+	levelstate->pages[0] = palloc(BLCKSZ);
 	levelstate->parent = NULL;
-	gistinitpage(page, F_LEAF);
+	gistinitpage(levelstate->pages[0], F_LEAF);
 
 	/*
 	 * Fill index pages with tuples in the sorted order.
@@ -455,31 +443,16 @@ gist_indexsortbuild(GISTBuildState *state)
 		levelstate = parent;
 	}
 
-	gist_indexsortbuild_flush_ready_pages(state);
-
 	/* Write out the root */
 	PageSetLSN(levelstate->pages[0], GistBuildLSN);
-	PageSetChecksumInplace(levelstate->pages[0], GIST_ROOT_BLKNO);
-	smgrwrite(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
-			  levelstate->pages[0], true);
-	if (RelationNeedsWAL(state->indexrel))
-		log_newpage(&state->indexrel->rd_locator, MAIN_FORKNUM, GIST_ROOT_BLKNO,
-					levelstate->pages[0], true);
-
-	pfree(levelstate->pages[0]);
+
+	rootbuf = smgr_bulk_get_buf(state->bulkstate);
+	memcpy(rootbuf, levelstate->pages[0], BLCKSZ);
+	smgr_bulk_write(state->bulkstate, GIST_ROOT_BLKNO, rootbuf, true);
+
 	pfree(levelstate);
 
-	/*
-	 * When we WAL-logged index pages, we must nonetheless fsync index files.
-	 * Since we're building outside shared buffers, a CHECKPOINT occurring
-	 * during the build has no way to flush the previously written data to
-	 * disk (indeed it won't know the index even exists).  A crash later on
-	 * would replay WAL from the checkpoint, therefore it wouldn't replay our
-	 * earlier WAL entries. If we do not fsync those pages here, they might
-	 * still not be on disk when the crash occurs.
-	 */
-	if (RelationNeedsWAL(state->indexrel))
-		smgrimmedsync(RelationGetSmgr(state->indexrel), MAIN_FORKNUM);
+	smgr_bulk_finish(state->bulkstate);
 }
 
 /*
@@ -509,8 +482,7 @@ gist_indexsortbuild_levelstate_add(GISTBuildState *state,
 			levelstate->current_page++;
 
 		if (levelstate->pages[levelstate->current_page] == NULL)
-			levelstate->pages[levelstate->current_page] =
-				palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
+			levelstate->pages[levelstate->current_page] = palloc0(BLCKSZ);
 
 		newPage = levelstate->pages[levelstate->current_page];
 		gistinitpage(newPage, old_page_flags);
@@ -573,6 +545,7 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 	for (; dist != NULL; dist = dist->next)
 	{
 		char	   *data;
+		BulkWriteBuffer buf;
 		Page		target;
 
 		/* check once per page */
@@ -580,7 +553,8 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 
 		/* Create page and copy data */
 		data = (char *) (dist->list);
-		target = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, MCXT_ALLOC_ZERO);
+		buf = smgr_bulk_get_buf(state->bulkstate);
+		target = (Page) buf;
 		gistinitpage(target, isleaf ? F_LEAF : 0);
 		for (int i = 0; i < dist->block.num; i++)
 		{
@@ -593,20 +567,6 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 		}
 		union_tuple = dist->itup;
 
-		if (state->ready_num_pages == XLR_MAX_BLOCK_ID)
-			gist_indexsortbuild_flush_ready_pages(state);
-
-		/*
-		 * The page is now complete. Assign a block number to it, and add it
-		 * to the list of finished pages. (We don't write it out immediately,
-		 * because we want to WAL-log the pages in batches.)
-		 */
-		blkno = state->pages_allocated++;
-		state->ready_blknos[state->ready_num_pages] = blkno;
-		state->ready_pages[state->ready_num_pages] = target;
-		state->ready_num_pages++;
-		ItemPointerSetBlockNumber(&(union_tuple->t_tid), blkno);
-
 		/*
 		 * Set the right link to point to the previous page. This is just for
 		 * debugging purposes: GiST only follows the right link if a page is
@@ -621,6 +581,15 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 		 */
 		if (levelstate->last_blkno)
 			GistPageGetOpaque(target)->rightlink = levelstate->last_blkno;
+
+		/*
+		 * The page is now complete. Assign a block number to it, and pass it
+		 * to the bulk writer.
+		 */
+		blkno = state->pages_allocated++;
+		PageSetLSN(target, GistBuildLSN);
+		smgr_bulk_write(state->bulkstate, blkno, buf, true);
+		ItemPointerSetBlockNumber(&(union_tuple->t_tid), blkno);
 		levelstate->last_blkno = blkno;
 
 		/*
@@ -631,7 +600,7 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 		if (parent == NULL)
 		{
 			parent = palloc0(sizeof(GistSortedBuildLevelState));
-			parent->pages[0] = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
+			parent->pages[0] = palloc(BLCKSZ);
 			parent->parent = NULL;
 			gistinitpage(parent->pages[0], 0);
 
@@ -641,39 +610,6 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
 	}
 }
 
-static void
-gist_indexsortbuild_flush_ready_pages(GISTBuildState *state)
-{
-	if (state->ready_num_pages == 0)
-		return;
-
-	for (int i = 0; i < state->ready_num_pages; i++)
-	{
-		Page		page = state->ready_pages[i];
-		BlockNumber blkno = state->ready_blknos[i];
-
-		/* Currently, the blocks must be buffered in order. */
-		if (blkno != state->pages_written)
-			elog(ERROR, "unexpected block number to flush GiST sorting build");
-
-		PageSetLSN(page, GistBuildLSN);
-		PageSetChecksumInplace(page, blkno);
-		smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, blkno, page,
-				   true);
-
-		state->pages_written++;
-	}
-
-	if (RelationNeedsWAL(state->indexrel))
-		log_newpages(&state->indexrel->rd_locator, MAIN_FORKNUM, state->ready_num_pages,
-					 state->ready_blknos, state->ready_pages, true);
-
-	for (int i = 0; i < state->ready_num_pages; i++)
-		pfree(state->ready_pages[i]);
-
-	state->ready_num_pages = 0;
-}
-
 
 /*-------------------------------------------------------------------------
  * Routines for non-sorted build
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 34107323ffe..a578b876174 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -87,8 +87,8 @@
  * is optimized for bulk inserting a lot of tuples, knowing that we have
  * exclusive access to the heap.  raw_heap_insert builds new pages in
  * local storage.  When a page is full, or at the end of the process,
- * we insert it to WAL as a single record and then write it to disk
- * directly through smgr.  Note, however, that any data sent to the new
+ * we insert it to WAL as a single record and then write it to disk with
+ * the bulk smgr writer.  Note, however, that any data sent to the new
  * heap's TOAST table will go through the normal bufmgr.
  *
  *
@@ -119,9 +119,9 @@
 #include "replication/logical.h"
 #include "replication/slot.h"
 #include "storage/bufmgr.h"
+#include "storage/bulk_write.h"
 #include "storage/fd.h"
 #include "storage/procarray.h"
-#include "storage/smgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -133,9 +133,9 @@ typedef struct RewriteStateData
 {
 	Relation	rs_old_rel;		/* source heap */
 	Relation	rs_new_rel;		/* destination heap */
-	Page		rs_buffer;		/* page currently being built */
+	BulkWriteState *rs_bulkstate;	/* writer for the destination */
+	BulkWriteBuffer rs_buffer;	/* page currently being built */
 	BlockNumber rs_blockno;		/* block where page will go */
-	bool		rs_buffer_valid;	/* T if any tuples in buffer */
 	bool		rs_logical_rewrite; /* do we need to do logical rewriting */
 	TransactionId rs_oldest_xmin;	/* oldest xmin used by caller to determine
 									 * tuple visibility */
@@ -255,14 +255,14 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
 
 	state->rs_old_rel = old_heap;
 	state->rs_new_rel = new_heap;
-	state->rs_buffer = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
+	state->rs_buffer = NULL;
 	/* new_heap needn't be empty, just locked */
 	state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
-	state->rs_buffer_valid = false;
 	state->rs_oldest_xmin = oldest_xmin;
 	state->rs_freeze_xid = freeze_xid;
 	state->rs_cutoff_multi = cutoff_multi;
 	state->rs_cxt = rw_cxt;
+	state->rs_bulkstate = smgr_bulk_start_rel(new_heap, MAIN_FORKNUM);
 
 	/* Initialize hash tables used to track update chains */
 	hash_ctl.keysize = sizeof(TidHashKey);
@@ -314,30 +314,13 @@ end_heap_rewrite(RewriteState state)
 	}
 
 	/* Write the last page, if any */
-	if (state->rs_buffer_valid)
+	if (state->rs_buffer)
 	{
-		if (RelationNeedsWAL(state->rs_new_rel))
-			log_newpage(&state->rs_new_rel->rd_locator,
-						MAIN_FORKNUM,
-						state->rs_blockno,
-						state->rs_buffer,
-						true);
-
-		PageSetChecksumInplace(state->rs_buffer, state->rs_blockno);
-
-		smgrextend(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,
-				   state->rs_blockno, state->rs_buffer, true);
+		smgr_bulk_write(state->rs_bulkstate, state->rs_blockno, state->rs_buffer, true);
+		state->rs_buffer = NULL;
 	}
 
-	/*
-	 * When we WAL-logged rel pages, we must nonetheless fsync them.  The
-	 * reason is the same as in storage.c's RelationCopyStorage(): we're
-	 * writing data that's not in shared buffers, and so a CHECKPOINT
-	 * occurring during the rewriteheap operation won't have fsync'd data we
-	 * wrote before the checkpoint.
-	 */
-	if (RelationNeedsWAL(state->rs_new_rel))
-		smgrimmedsync(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM);
+	smgr_bulk_finish(state->rs_bulkstate);
 
 	logical_end_heap_rewrite(state);
 
@@ -611,7 +594,7 @@ rewrite_heap_dead_tuple(RewriteState state, HeapTuple old_tuple)
 static void
 raw_heap_insert(RewriteState state, HeapTuple tup)
 {
-	Page		page = state->rs_buffer;
+	Page		page;
 	Size		pageFreeSpace,
 				saveFreeSpace;
 	Size		len;
@@ -664,7 +647,8 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
 												   HEAP_DEFAULT_FILLFACTOR);
 
 	/* Now we can check to see if there's enough free space already. */
-	if (state->rs_buffer_valid)
+	page = (Page) state->rs_buffer;
+	if (page)
 	{
 		pageFreeSpace = PageGetHeapFreeSpace(page);
 
@@ -675,35 +659,19 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
 			 * contains a tuple.  Hence, unlike RelationGetBufferForTuple(),
 			 * enforce saveFreeSpace unconditionally.
 			 */
-
-			/* XLOG stuff */
-			if (RelationNeedsWAL(state->rs_new_rel))
-				log_newpage(&state->rs_new_rel->rd_locator,
-							MAIN_FORKNUM,
-							state->rs_blockno,
-							page,
-							true);
-
-			/*
-			 * Now write the page. We say skipFsync = true because there's no
-			 * need for smgr to schedule an fsync for this write; we'll do it
-			 * ourselves in end_heap_rewrite.
-			 */
-			PageSetChecksumInplace(page, state->rs_blockno);
-
-			smgrextend(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,
-					   state->rs_blockno, page, true);
-
+			smgr_bulk_write(state->rs_bulkstate, state->rs_blockno, state->rs_buffer, true);
+			state->rs_buffer = NULL;
+			page = NULL;
 			state->rs_blockno++;
-			state->rs_buffer_valid = false;
 		}
 	}
 
-	if (!state->rs_buffer_valid)
+	if (!page)
 	{
 		/* Initialize a new empty page */
+		state->rs_buffer = smgr_bulk_get_buf(state->rs_bulkstate);
+		page = (Page) state->rs_buffer;
 		PageInit(page, BLCKSZ, 0);
-		state->rs_buffer_valid = true;
 	}
 
 	/* And now we can insert the tuple into the page */
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 696d79c0852..21d879a3bdf 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -29,11 +29,11 @@
 #include "nodes/execnodes.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
+#include "storage/bulk_write.h"
 #include "storage/condition_variable.h"
 #include "storage/indexfsm.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
-#include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/index_selfuncs.h"
 #include "utils/memutils.h"
@@ -154,32 +154,17 @@ void
 btbuildempty(Relation index)
 {
 	bool		allequalimage = _bt_allequalimage(index, false);
-	Buffer		metabuf;
-	Page		metapage;
+	BulkWriteState *bulkstate;
+	BulkWriteBuffer metabuf;
 
-	/*
-	 * Initialize the metapage.
-	 *
-	 * Regular index build bypasses the buffer manager and uses smgr functions
-	 * directly, with an smgrimmedsync() call at the end.  That makes sense
-	 * when the index is large, but for an empty index, it's better to use the
-	 * buffer cache to avoid the smgrimmedsync().
-	 */
-	metabuf = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
-	Assert(BufferGetBlockNumber(metabuf) == BTREE_METAPAGE);
-	_bt_lockbuf(index, metabuf, BT_WRITE);
-
-	START_CRIT_SECTION();
-
-	metapage = BufferGetPage(metabuf);
-	_bt_initmetapage(metapage, P_NONE, 0, allequalimage);
-	MarkBufferDirty(metabuf);
-	log_newpage_buffer(metabuf, true);
+	bulkstate = smgr_bulk_start_rel(index, INIT_FORKNUM);
 
-	END_CRIT_SECTION();
+	/* Construct metapage. */
+	metabuf = smgr_bulk_get_buf(bulkstate);
+	_bt_initmetapage((Page) metabuf, P_NONE, 0, allequalimage);
+	smgr_bulk_write(bulkstate, BTREE_METAPAGE, metabuf, true);
 
-	_bt_unlockbuf(index, metabuf);
-	ReleaseBuffer(metabuf);
+	smgr_bulk_finish(bulkstate);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 20111965793..7c9ba609194 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -23,13 +23,8 @@
  * many upper pages if the keys are reasonable-size) without risking a lot of
  * cascading splits during early insertions.
  *
- * Formerly the index pages being built were kept in shared buffers, but
- * that is of no value (since other backends have no interest in them yet)
- * and it created locking problems for CHECKPOINT, because the upper-level
- * pages were held exclusive-locked for long periods.  Now we just build
- * the pages in local memory and smgrwrite or smgrextend them as we finish
- * them.  They will need to be re-read into shared buffers on first use after
- * the build finishes.
+ * We use the bulk smgr loading facility to bypass the buffer cache and
+ * WAL log the pages efficiently.
  *
  * This code isn't concerned about the FSM at all. The caller is responsible
  * for initializing that.
@@ -57,7 +52,7 @@
 #include "executor/instrument.h"
 #include "miscadmin.h"
 #include "pgstat.h"
-#include "storage/smgr.h"
+#include "storage/bulk_write.h"
 #include "tcop/tcopprot.h"		/* pgrminclude ignore */
 #include "utils/rel.h"
 #include "utils/sortsupport.h"
@@ -234,7 +229,7 @@ typedef struct BTBuildState
  */
 typedef struct BTPageState
 {
-	Page		btps_page;		/* workspace for page building */
+	BulkWriteBuffer btps_buf;	/* workspace for page building */
 	BlockNumber btps_blkno;		/* block # to write this page at */
 	IndexTuple	btps_lowkey;	/* page's strict lower bound pivot tuple */
 	OffsetNumber btps_lastoff;	/* last item offset loaded */
@@ -251,11 +246,9 @@ typedef struct BTWriteState
 {
 	Relation	heap;
 	Relation	index;
+	BulkWriteState *bulkstate;
 	BTScanInsert inskey;		/* generic insertion scankey */
-	bool		btws_use_wal;	/* dump pages to WAL? */
 	BlockNumber btws_pages_alloced; /* # pages allocated */
-	BlockNumber btws_pages_written; /* # pages written out */
-	Page		btws_zeropage;	/* workspace for filling zeroes */
 } BTWriteState;
 
 
@@ -267,7 +260,7 @@ static void _bt_spool(BTSpool *btspool, ItemPointer self,
 static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
-static Page _bt_blnewpage(uint32 level);
+static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
 static BTPageState *_bt_pagestate(BTWriteState *wstate, uint32 level);
 static void _bt_slideleft(Page rightmostpage);
 static void _bt_sortaddtup(Page page, Size itemsize,
@@ -569,12 +562,9 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 	/* _bt_mkscankey() won't set allequalimage without metapage */
 	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
-	wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
 
 	/* reserve the metapage */
 	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
-	wstate.btws_pages_written = 0;
-	wstate.btws_zeropage = NULL;	/* until needed */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
@@ -613,13 +603,15 @@ _bt_build_callback(Relation index,
 /*
  * allocate workspace for a new, clean btree page, not linked to any siblings.
  */
-static Page
-_bt_blnewpage(uint32 level)
+static BulkWriteBuffer
+_bt_blnewpage(BTWriteState *wstate, uint32 level)
 {
+	BulkWriteBuffer buf;
 	Page		page;
 	BTPageOpaque opaque;
 
-	page = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
+	buf = smgr_bulk_get_buf(wstate->bulkstate);
+	page = (Page) buf;
 
 	/* Zero the page and set up standard page header info */
 	_bt_pageinit(page, BLCKSZ);
@@ -634,63 +626,17 @@ _bt_blnewpage(uint32 level)
 	/* Make the P_HIKEY line pointer appear allocated */
 	((PageHeader) page)->pd_lower += sizeof(ItemIdData);
 
-	return page;
+	return buf;
 }
 
 /*
  * emit a completed btree page, and release the working storage.
  */
 static void
-_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
+_bt_blwritepage(BTWriteState *wstate, BulkWriteBuffer buf, BlockNumber blkno)
 {
-	/* XLOG stuff */
-	if (wstate->btws_use_wal)
-	{
-		/* We use the XLOG_FPI record type for this */
-		log_newpage(&wstate->index->rd_locator, MAIN_FORKNUM, blkno, page, true);
-	}
-
-	/*
-	 * If we have to write pages nonsequentially, fill in the space with
-	 * zeroes until we come back and overwrite.  This is not logically
-	 * necessary on standard Unix filesystems (unwritten space will read as
-	 * zeroes anyway), but it should help to avoid fragmentation. The dummy
-	 * pages aren't WAL-logged though.
-	 */
-	while (blkno > wstate->btws_pages_written)
-	{
-		if (!wstate->btws_zeropage)
-			wstate->btws_zeropage = (Page) palloc_aligned(BLCKSZ,
-														  PG_IO_ALIGN_SIZE,
-														  MCXT_ALLOC_ZERO);
-		/* don't set checksum for all-zero page */
-		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
-				   wstate->btws_pages_written++,
-				   wstate->btws_zeropage,
-				   true);
-	}
-
-	PageSetChecksumInplace(page, blkno);
-
-	/*
-	 * Now write the page.  There's no need for smgr to schedule an fsync for
-	 * this write; we'll do it ourselves before ending the build.
-	 */
-	if (blkno == wstate->btws_pages_written)
-	{
-		/* extending the file... */
-		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM, blkno,
-				   page, true);
-		wstate->btws_pages_written++;
-	}
-	else
-	{
-		/* overwriting a block we zero-filled before */
-		smgrwrite(RelationGetSmgr(wstate->index), MAIN_FORKNUM, blkno,
-				  page, true);
-	}
-
-	pfree(page);
+	smgr_bulk_write(wstate->bulkstate, blkno, buf, true);
+	/* smgr_bulk_write took ownership of 'buf' */
 }
 
 /*
@@ -703,7 +649,7 @@ _bt_pagestate(BTWriteState *wstate, uint32 level)
 	BTPageState *state = (BTPageState *) palloc0(sizeof(BTPageState));
 
 	/* create initial page for level */
-	state->btps_page = _bt_blnewpage(level);
+	state->btps_buf = _bt_blnewpage(wstate, level);
 
 	/* and assign it a page position */
 	state->btps_blkno = wstate->btws_pages_alloced++;
@@ -839,6 +785,7 @@ static void
 _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 			 Size truncextra)
 {
+	BulkWriteBuffer nbuf;
 	Page		npage;
 	BlockNumber nblkno;
 	OffsetNumber last_off;
@@ -853,7 +800,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 	 */
 	CHECK_FOR_INTERRUPTS();
 
-	npage = state->btps_page;
+	nbuf = state->btps_buf;
+	npage = (Page) nbuf;
 	nblkno = state->btps_blkno;
 	last_off = state->btps_lastoff;
 	last_truncextra = state->btps_lastextra;
@@ -909,6 +857,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 		/*
 		 * Finish off the page and write it out.
 		 */
+		BulkWriteBuffer obuf = nbuf;
 		Page		opage = npage;
 		BlockNumber oblkno = nblkno;
 		ItemId		ii;
@@ -916,7 +865,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 		IndexTuple	oitup;
 
 		/* Create new page of same level */
-		npage = _bt_blnewpage(state->btps_level);
+		nbuf = _bt_blnewpage(wstate, state->btps_level);
+		npage = (Page) nbuf;
 
 		/* and assign it a page position */
 		nblkno = wstate->btws_pages_alloced++;
@@ -1028,10 +978,10 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 		}
 
 		/*
-		 * Write out the old page.  We never need to touch it again, so we can
-		 * free the opage workspace too.
+		 * Write out the old page. _bt_blwritepage takes ownership of the
+		 * 'opage' buffer.
 		 */
-		_bt_blwritepage(wstate, opage, oblkno);
+		_bt_blwritepage(wstate, obuf, oblkno);
 
 		/*
 		 * Reset last_off to point to new page
@@ -1064,7 +1014,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 	_bt_sortaddtup(npage, itupsz, itup, last_off,
 				   !isleaf && last_off == P_FIRSTKEY);
 
-	state->btps_page = npage;
+	state->btps_buf = nbuf;
 	state->btps_blkno = nblkno;
 	state->btps_lastoff = last_off;
 }
@@ -1116,7 +1066,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	BTPageState *s;
 	BlockNumber rootblkno = P_NONE;
 	uint32		rootlevel = 0;
-	Page		metapage;
+	BulkWriteBuffer metabuf;
 
 	/*
 	 * Each iteration of this loop completes one more level of the tree.
@@ -1127,7 +1077,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		BTPageOpaque opaque;
 
 		blkno = s->btps_blkno;
-		opaque = BTPageGetOpaque(s->btps_page);
+		opaque = BTPageGetOpaque((Page) s->btps_buf);
 
 		/*
 		 * We have to link the last page on this level to somewhere.
@@ -1161,9 +1111,9 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		 * This is the rightmost page, so the ItemId array needs to be slid
 		 * back one slot.  Then we can dump out the page.
 		 */
-		_bt_slideleft(s->btps_page);
-		_bt_blwritepage(wstate, s->btps_page, s->btps_blkno);
-		s->btps_page = NULL;	/* writepage freed the workspace */
+		_bt_slideleft((Page) s->btps_buf);
+		_bt_blwritepage(wstate, s->btps_buf, s->btps_blkno);
+		s->btps_buf = NULL;		/* writepage took ownership of the buffer */
 	}
 
 	/*
@@ -1172,10 +1122,10 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	 * set to point to "P_NONE").  This changes the index to the "valid" state
 	 * by filling in a valid magic number in the metapage.
 	 */
-	metapage = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
-	_bt_initmetapage(metapage, rootblkno, rootlevel,
+	metabuf = smgr_bulk_get_buf(wstate->bulkstate);
+	_bt_initmetapage((Page) metabuf, rootblkno, rootlevel,
 					 wstate->inskey->allequalimage);
-	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
+	_bt_blwritepage(wstate, metabuf, BTREE_METAPAGE);
 }
 
 /*
@@ -1197,6 +1147,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	int64		tuples_done = 0;
 	bool		deduplicate;
 
+	wstate->bulkstate = smgr_bulk_start_rel(wstate->index, MAIN_FORKNUM);
+
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
 
@@ -1352,7 +1304,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 				 */
 				dstate->maxpostingsize = MAXALIGN_DOWN((BLCKSZ * 10 / 100)) -
 					sizeof(ItemIdData);
-				Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
+				Assert(dstate->maxpostingsize <= BTMaxItemSize((Page) state->btps_buf) &&
 					   dstate->maxpostingsize <= INDEX_SIZE_MASK);
 				dstate->htids = palloc(dstate->maxpostingsize);
 
@@ -1422,18 +1374,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 
 	/* Close down final pages and write the metapage */
 	_bt_uppershutdown(wstate, state);
-
-	/*
-	 * When we WAL-logged index pages, we must nonetheless fsync index files.
-	 * Since we're building outside shared buffers, a CHECKPOINT occurring
-	 * during the build has no way to flush the previously written data to
-	 * disk (indeed it won't know the index even exists).  A crash later on
-	 * would replay WAL from the checkpoint, therefore it wouldn't replay our
-	 * earlier WAL entries. If we do not fsync those pages here, they might
-	 * still not be on disk when the crash occurs.
-	 */
-	if (wstate->btws_use_wal)
-		smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
+	smgr_bulk_finish(wstate->bulkstate);
 }
 
 /*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 98b1da20d58..1b70c5a59fd 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -25,7 +25,7 @@
 #include "catalog/index.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
-#include "storage/smgr.h"
+#include "storage/bulk_write.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -155,42 +155,27 @@ spgbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 void
 spgbuildempty(Relation index)
 {
-	Buffer		metabuffer,
-				rootbuffer,
-				nullbuffer;
-
-	/*
-	 * Initialize the meta page and root pages
-	 */
-	metabuffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
-	LockBuffer(metabuffer, BUFFER_LOCK_EXCLUSIVE);
-	rootbuffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
-	LockBuffer(rootbuffer, BUFFER_LOCK_EXCLUSIVE);
-	nullbuffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
-	LockBuffer(nullbuffer, BUFFER_LOCK_EXCLUSIVE);
-
-	Assert(BufferGetBlockNumber(metabuffer) == SPGIST_METAPAGE_BLKNO);
-	Assert(BufferGetBlockNumber(rootbuffer) == SPGIST_ROOT_BLKNO);
-	Assert(BufferGetBlockNumber(nullbuffer) == SPGIST_NULL_BLKNO);
+	BulkWriteState *bulkstate;
+	BulkWriteBuffer buf;
 
-	START_CRIT_SECTION();
+	bulkstate = smgr_bulk_start_rel(index, INIT_FORKNUM);
 
-	SpGistInitMetapage(BufferGetPage(metabuffer));
-	MarkBufferDirty(metabuffer);
-	SpGistInitBuffer(rootbuffer, SPGIST_LEAF);
-	MarkBufferDirty(rootbuffer);
-	SpGistInitBuffer(nullbuffer, SPGIST_LEAF | SPGIST_NULLS);
-	MarkBufferDirty(nullbuffer);
+	/* Construct metapage. */
+	buf = smgr_bulk_get_buf(bulkstate);
+	SpGistInitMetapage((Page) buf);
+	smgr_bulk_write(bulkstate, SPGIST_METAPAGE_BLKNO, buf, true);
 
-	log_newpage_buffer(metabuffer, true);
-	log_newpage_buffer(rootbuffer, true);
-	log_newpage_buffer(nullbuffer, true);
+	/* Likewise for the root page. */
+	buf = smgr_bulk_get_buf(bulkstate);
+	SpGistInitPage((Page) buf, SPGIST_LEAF);
+	smgr_bulk_write(bulkstate, SPGIST_ROOT_BLKNO, buf, true);
 
-	END_CRIT_SECTION();
+	/* Likewise for the null-tuples root page. */
+	buf = smgr_bulk_get_buf(bulkstate);
+	SpGistInitPage((Page) buf, SPGIST_LEAF | SPGIST_NULLS);
+	smgr_bulk_write(bulkstate, SPGIST_NULL_BLKNO, buf, true);
 
-	UnlockReleaseBuffer(metabuffer);
-	UnlockReleaseBuffer(rootbuffer);
-	UnlockReleaseBuffer(nullbuffer);
+	smgr_bulk_finish(bulkstate);
 }
 
 /*
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index b155c03386e..a17cf4bb0cc 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -28,6 +28,7 @@
 #include "catalog/storage.h"
 #include "catalog/storage_xlog.h"
 #include "miscadmin.h"
+#include "storage/bulk_write.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
 #include "utils/hsearch.h"
@@ -451,14 +452,11 @@ void
 RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 					ForkNumber forkNum, char relpersistence)
 {
-	PGIOAlignedBlock buf;
-	Page		page;
 	bool		use_wal;
 	bool		copying_initfork;
 	BlockNumber nblocks;
 	BlockNumber blkno;
-
-	page = (Page) buf.data;
+	BulkWriteState *bulkstate;
 
 	/*
 	 * The init fork for an unlogged relation in many respects has to be
@@ -477,16 +475,21 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 	use_wal = XLogIsNeeded() &&
 		(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
 
+	bulkstate = smgr_bulk_start_smgr(dst, forkNum, use_wal);
+
 	nblocks = smgrnblocks(src, forkNum);
 
 	for (blkno = 0; blkno < nblocks; blkno++)
 	{
+		BulkWriteBuffer buf;
+
 		/* If we got a cancel signal during the copy of the data, quit */
 		CHECK_FOR_INTERRUPTS();
 
-		smgrread(src, forkNum, blkno, buf.data);
+		buf = smgr_bulk_get_buf(bulkstate);
+		smgrread(src, forkNum, blkno, (Page) buf);
 
-		if (!PageIsVerifiedExtended(page, blkno,
+		if (!PageIsVerifiedExtended((Page) buf, blkno,
 									PIV_LOG_WARNING | PIV_REPORT_STAT))
 		{
 			/*
@@ -507,34 +510,13 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 		}
 
 		/*
-		 * WAL-log the copied page. Unfortunately we don't know what kind of a
-		 * page this is, so we have to log the full page including any unused
-		 * space.
-		 */
-		if (use_wal)
-			log_newpage(&dst->smgr_rlocator.locator, forkNum, blkno, page, false);
-
-		PageSetChecksumInplace(page, blkno);
-
-		/*
-		 * Now write the page.  We say skipFsync = true because there's no
-		 * need for smgr to schedule an fsync for this write; we'll do it
-		 * ourselves below.
+		 * Queue the page for WAL-logging and writing out.  Unfortunately we
+		 * don't know what kind of a page this is, so we have to log the full
+		 * page including any unused space.
 		 */
-		smgrextend(dst, forkNum, blkno, buf.data, true);
+		smgr_bulk_write(bulkstate, blkno, buf, false);
 	}
-
-	/*
-	 * When we WAL-logged rel pages, we must nonetheless fsync them.  The
-	 * reason is that since we're copying outside shared buffers, a CHECKPOINT
-	 * occurring during the copy has no way to flush the previously written
-	 * data to disk (indeed it won't know the new rel even exists).  A crash
-	 * later on would replay WAL from the checkpoint, therefore it wouldn't
-	 * replay our earlier WAL entries. If we do not fsync those pages here,
-	 * they might still not be on disk when the crash occurs.
-	 */
-	if (use_wal || copying_initfork)
-		smgrimmedsync(dst, forkNum);
+	smgr_bulk_finish(bulkstate);
 }
 
 /*
diff --git a/src/backend/storage/smgr/Makefile b/src/backend/storage/smgr/Makefile
index 596b564656f..1d0b98764f9 100644
--- a/src/backend/storage/smgr/Makefile
+++ b/src/backend/storage/smgr/Makefile
@@ -13,6 +13,7 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = \
+	bulk_write.o \
 	md.o \
 	smgr.o
 
diff --git a/src/backend/storage/smgr/bulk_write.c b/src/backend/storage/smgr/bulk_write.c
new file mode 100644
index 00000000000..934c28e4603
--- /dev/null
+++ b/src/backend/storage/smgr/bulk_write.c
@@ -0,0 +1,303 @@
+/*-------------------------------------------------------------------------
+ *
+ * bulk_write.c
+ *	  Efficiently and reliably populate a new relation
+ *
+ * The assumption is that no other backends access the relation while we are
+ * loading it, so we can take some shortcuts.  Do not mix operations through
+ * the regular buffer manager and the bulk loading interface!
+ *
+ * We bypass the buffer manager to avoid the locking overhead, and call
+ * smgrextend() directly.  A downside is that the pages will need to be
+ * re-read into shared buffers on first use after the build finishes.  That's
+ * usually a good tradeoff for large relations, and for small relations, the
+ * overhead isn't very significant compared to creating the relation in the
+ * first place.
+ *
+ * The pages are WAL-logged if needed.  To save on WAL header overhead, we
+ * WAL-log several pages in one record.
+ *
+ * One tricky point is that because we bypass the buffer manager, we need to
+ * register the relation for fsyncing at the next checkpoint ourselves, and
+ * make sure that the relation is correctly fsync'd by us or the checkpointer
+ * even if a checkpoint happens concurrently.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/smgr/bulk_write.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xloginsert.h"
+#include "access/xlogrecord.h"
+#include "storage/bufmgr.h"
+#include "storage/bufpage.h"
+#include "storage/bulk_write.h"
+#include "storage/proc.h"
+#include "storage/smgr.h"
+#include "utils/rel.h"
+
+#define MAX_PENDING_WRITES XLR_MAX_BLOCK_ID
+
+static const PGIOAlignedBlock zero_buffer = {{0}};	/* worth BLCKSZ */
+
+typedef struct PendingWrite
+{
+	BulkWriteBuffer buf;
+	BlockNumber blkno;
+	bool		page_std;
+} PendingWrite;
+
+/*
+ * Bulk writer state for one relation fork.
+ */
+typedef struct BulkWriteState
+{
+	/* Information about the target relation we're writing */
+	/*
+	 * FIXME: 'smgr' might get invalidated. Hopefully
+	 * https://postgr.es/m/CA%2BhUKGJ8NTvqLHz6dqbQnt2c8XCki4r2QvXjBQcXpVwxTY_pvA%40mail.gmail.com
+	 * is merged before this.
+	 */
+	SMgrRelation smgr;
+	ForkNumber	forknum;
+	bool		use_wal;
+
+	/* We keep several writes queued, and WAL-log them in batches */
+	int			npending;
+	PendingWrite pending_writes[MAX_PENDING_WRITES];
+
+	/* Current size of the relation */
+	BlockNumber pages_written;
+
+	/* The RedoRecPtr at the time that the bulk operation started */
+	XLogRecPtr	start_RedoRecPtr;
+
+	MemoryContext memcxt;
+} BulkWriteState;
+
+static void smgr_bulk_flush(BulkWriteState *bulkstate);
+
+/*
+ * Start a bulk write operation on a relation fork.
+ */
+BulkWriteState *
+smgr_bulk_start_rel(Relation rel, ForkNumber forknum)
+{
+	return smgr_bulk_start_smgr(RelationGetSmgr(rel),
+								forknum,
+								RelationNeedsWAL(rel) || forknum == INIT_FORKNUM);
+}
+
+/*
+ * Start a bulk write operation on a relation fork.
+ *
+ * This is like smgr_bulk_start_rel, but can be used without a relcache entry.
+ */
+BulkWriteState *
+smgr_bulk_start_smgr(SMgrRelation smgr, ForkNumber forknum, bool use_wal)
+{
+	BulkWriteState *state;
+
+	state = palloc(sizeof(BulkWriteState));
+	state->smgr = smgr;
+	state->forknum = forknum;
+	state->use_wal = use_wal;
+
+	state->npending = 0;
+	state->pages_written = 0;
+
+	state->start_RedoRecPtr = GetRedoRecPtr();
+
+	/*
+	 * Remember the memory context.  We will use it to allocate all the
+	 * buffers later.
+	 */
+	state->memcxt = CurrentMemoryContext;
+
+	return state;
+}
+
+/*
+ * Finish bulk write operation.
+ *
+ * This WAL-logs and flushes any remaining pending writes to disk, and fsyncs
+ * the relation if needed.
+ */
+void
+smgr_bulk_finish(BulkWriteState *bulkstate)
+{
+	/* WAL-log and flush any remaining pages */
+	smgr_bulk_flush(bulkstate);
+
+	/*
+	 * When we wrote out the pages, we passed skipFsync=true to avoid the
+	 * overhead of registering all the writes with the checkpointer.  Register
+	 * the whole relation now.
+	 *
+	 * There is one hole in that idea: If a checkpoint occurred while we were
+	 * writing the pages, it already missed fsyncing the pages we had written
+	 * before the checkpoint started.  A crash later on would replay the WAL
+	 * starting from the checkpoint, therefore it wouldn't replay our earlier
+	 * WAL records.  So if a checkpoint started after the bulk write, fsync
+	 * the files now.
+	 */
+	if (!SmgrIsTemp(bulkstate->smgr))
+	{
+		/*
+		 * Prevent a checkpoint from starting between the GetRedoRecPtr() and
+		 * smgrregistersync() calls.
+		 */
+		Assert((MyProc->delayChkptFlags & DELAY_CHKPT_START) == 0);
+		MyProc->delayChkptFlags |= DELAY_CHKPT_START;
+
+		if (bulkstate->start_RedoRecPtr != GetRedoRecPtr())
+		{
+			/*
+			 * A checkpoint occurred and it didn't know about our writes, so
+			 * fsync() the relation ourselves.
+			 */
+			MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
+			smgrimmedsync(bulkstate->smgr, bulkstate->forknum);
+			elog(DEBUG1, "flushed relation because a checkpoint occurred concurrently");
+		}
+		else
+		{
+			smgrregistersync(bulkstate->smgr, bulkstate->forknum);
+			MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
+		}
+	}
+}
+
+static int
+buffer_cmp(const void *a, const void *b)
+{
+	const PendingWrite *bufa = (const PendingWrite *) a;
+	const PendingWrite *bufb = (const PendingWrite *) b;
+
+	/* We should not see duplicated writes for the same block */
+	Assert(bufa->blkno != bufb->blkno);
+	if (bufa->blkno > bufb->blkno)
+		return 1;
+	else
+		return -1;
+}
+
+/*
+ * Finish all the pending writes.
+ */
+static void
+smgr_bulk_flush(BulkWriteState *bulkstate)
+{
+	int			npending = bulkstate->npending;
+	PendingWrite *pending_writes = bulkstate->pending_writes;
+
+	if (npending == 0)
+		return;
+
+	if (npending > 1)
+		qsort(pending_writes, npending, sizeof(PendingWrite), buffer_cmp);
+
+	if (bulkstate->use_wal)
+	{
+		BlockNumber blknos[MAX_PENDING_WRITES];
+		Page		pages[MAX_PENDING_WRITES];
+		bool		page_std = true;
+
+		for (int i = 0; i < npending; i++)
+		{
+			blknos[i] = pending_writes[i].blkno;
+			pages[i] = pending_writes[i].buf->data;
+
+			/*
+			 * If any of the pages use !page_std, we log them all as such.
+			 * That's a bit wasteful, but in practice, a mix of standard and
+			 * non-standard page layout is rare.  None of the built-in AMs do
+			 * that.
+			 */
+			if (!pending_writes[i].page_std)
+				page_std = false;
+		}
+		log_newpages(&bulkstate->smgr->smgr_rlocator.locator, bulkstate->forknum,
+					 npending, blknos, pages, page_std);
+	}
+
+	for (int i = 0; i < npending; i++)
+	{
+		BlockNumber blkno = pending_writes[i].blkno;
+		Page		page = pending_writes[i].buf->data;
+
+		PageSetChecksumInplace(page, blkno);
+
+		if (blkno >= bulkstate->pages_written)
+		{
+			/*
+			 * If we have to write pages nonsequentially, fill in the space
+			 * with zeroes until we come back and overwrite.  This is not
+			 * logically necessary on standard Unix filesystems (unwritten
+			 * space will read as zeroes anyway), but it should help to avoid
+			 * fragmentation.  The dummy pages aren't WAL-logged though.
+			 */
+			while (blkno > bulkstate->pages_written)
+			{
+				/* don't set checksum for all-zero page */
+				smgrextend(bulkstate->smgr, bulkstate->forknum,
+						   bulkstate->pages_written++,
+						   &zero_buffer,
+						   true);
+			}
+
+			smgrextend(bulkstate->smgr, bulkstate->forknum, blkno, page, true);
+			bulkstate->pages_written = pending_writes[i].blkno + 1;
+		}
+		else
+			smgrwrite(bulkstate->smgr, bulkstate->forknum, blkno, page, true);
+		pfree(page);
+	}
+
+	bulkstate->npending = 0;
+}
+
+/*
+ * Queue write of 'buf'.
+ *
+ * NB: this takes ownership of 'buf'!
+ *
+ * You are only allowed to write a given block once as part of one bulk write
+ * operation.
+ */
+void
+smgr_bulk_write(BulkWriteState *bulkstate, BlockNumber blocknum, BulkWriteBuffer buf, bool page_std)
+{
+	PendingWrite *w;
+
+	w = &bulkstate->pending_writes[bulkstate->npending++];
+	w->buf = buf;
+	w->blkno = blocknum;
+	w->page_std = page_std;
+
+	if (bulkstate->npending == MAX_PENDING_WRITES)
+		smgr_bulk_flush(bulkstate);
+}
+
+/*
+ * Allocate a new buffer which can later be written with smgr_bulk_write().
+ *
+ * There is no function to free a buffer.  When you pass it to
+ * smgr_bulk_write(), it takes ownership and frees it when it's no longer
+ * needed.
+ *
+ * This is currently implemented as a simple palloc, but could be implemented
+ * using a ring buffer or larger chunks in the future, so don't rely on it.
+ */
+BulkWriteBuffer
+smgr_bulk_get_buf(BulkWriteState *bulkstate)
+{
+	return MemoryContextAllocAligned(bulkstate->memcxt, BLCKSZ, PG_IO_ALIGN_SIZE, 0);
+}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index b1e9932a291..233f6987f59 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1236,6 +1236,49 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
 	}
 }
 
+/*
+ * mdregistersync() -- Mark whole relation as needing fsync
+ */
+void
+mdregistersync(SMgrRelation reln, ForkNumber forknum)
+{
+	int			segno;
+	int			min_inactive_seg;
+
+	/*
+	 * NOTE: mdnblocks makes sure we have opened all active segments, so that
+	 * the loop below will get them all!
+	 */
+	mdnblocks(reln, forknum);
+
+	min_inactive_seg = segno = reln->md_num_open_segs[forknum];
+
+	/*
+	 * Temporarily open inactive segments, then close them after sync.  There
+	 * may be some inactive segments left opened after error, but that is
+	 * harmless.  We don't bother to clean them up and take a risk of further
+	 * trouble.  The next mdclose() will soon close them.
+	 */
+	while (_mdfd_openseg(reln, forknum, segno, 0) != NULL)
+		segno++;
+
+	while (segno > 0)
+	{
+		MdfdVec    *v = &reln->md_seg_fds[forknum][segno - 1];
+
+		register_dirty_segment(reln, forknum, v);
+
+		/* Close inactive segments immediately */
+		if (segno > min_inactive_seg)
+		{
+			FileClose(v->mdfd_vfd);
+			_fdvec_resize(reln, forknum, segno - 1);
+		}
+
+		segno--;
+	}
+}
+
 /*
  * mdimmedsync() -- Immediately sync a relation to stable storage.
  *
@@ -1255,7 +1298,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 
 	/*
 	 * NOTE: mdnblocks makes sure we have opened all active segments, so that
-	 * fsync loop will get them all!
+	 * the loop below will get them all!
 	 */
 	mdnblocks(reln, forknum);
 
diff --git a/src/backend/storage/smgr/meson.build b/src/backend/storage/smgr/meson.build
index 003d5e30dd4..6d91b18fe67 100644
--- a/src/backend/storage/smgr/meson.build
+++ b/src/backend/storage/smgr/meson.build
@@ -1,6 +1,7 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
 backend_sources += files(
+  'bulk_write.c',
   'md.c',
   'smgr.c',
 )
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 563a0be5c74..ea4aed63dab 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -68,6 +68,7 @@ typedef struct f_smgr
 	void		(*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
 								  BlockNumber nblocks);
 	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+	void		(*smgr_registersync) (SMgrRelation reln, ForkNumber forknum);
 } f_smgr;
 
 static const f_smgr smgrsw[] = {
@@ -89,6 +90,7 @@ static const f_smgr smgrsw[] = {
 		.smgr_nblocks = mdnblocks,
 		.smgr_truncate = mdtruncate,
 		.smgr_immedsync = mdimmedsync,
+		.smgr_registersync = mdregistersync,
 	}
 };
 
@@ -583,6 +585,14 @@ smgrreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
  * on disk at return, only dumped out to the kernel.  However,
  * provisions will be made to fsync the write before the next checkpoint.
  *
+ * NB: The mechanism to ensure fsync at next checkpoint assumes that there is
+ * something that prevents a concurrent checkpoint from "racing ahead" of the
+ * write.  One way to prevent that is by holding a lock on the buffer; the
+ * buffer manager's writes are protected by that.  The bulk writer facility in
+ * bulk_write.c checks the redo pointer and calls smgrimmedsync() if a
+ * checkpoint happened; that relies on the fact that no other backend can
+ * be concurrently modifying the page.
+ *
  * skipFsync indicates that the caller will make other provisions to
  * fsync the relation, so we needn't bother.  Temporary relations also
  * do not require fsync.
@@ -700,6 +710,24 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks, BlockNumber *nb
 	}
 }
 
+/*
+ * smgrregistersync() -- Request a relation to be sync'd at next checkpoint
+ *
+ * This can be used after calling smgrwrite() or smgrextend() with skipFsync =
+ * true, to register the fsyncs that were skipped earlier.
+ *
+ * Note: be mindful that a checkpoint could already have happened between the
+ * smgrwrite or smgrextend calls and this!  In that case, the checkpoint
+ * already missed fsyncing this relation, and you should use smgrimmedsync
+ * instead.  Most callers should use the bulk loading facility in bulk_write.c
+ * instead, which handles all that.
+ */
+void
+smgrregistersync(SMgrRelation reln, ForkNumber forknum)
+{
+	smgrsw[reln->smgr_which].smgr_registersync(reln, forknum);
+}
+
 /*
  * smgrimmedsync() -- Force the specified relation to stable storage.
  *
@@ -722,6 +750,9 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks, BlockNumber *nb
  * Note that you need to do FlushRelationBuffers() first if there is
  * any possibility that there are dirty buffers for the relation;
  * otherwise the sync is not very meaningful.
+ *
+ * Most callers should use the bulk loading facility in bulk_write.c
+ * instead of calling this directly.
  */
 void
 smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
diff --git a/src/include/storage/bulk_write.h b/src/include/storage/bulk_write.h
new file mode 100644
index 00000000000..6d4532d05e1
--- /dev/null
+++ b/src/include/storage/bulk_write.h
@@ -0,0 +1,40 @@
+/*-------------------------------------------------------------------------
+ *
+ * bulk_write.h
+ *	  Efficiently and reliably populate a new relation
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/bulk_write.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BULK_WRITE_H
+#define BULK_WRITE_H
+
+#include "storage/smgr.h"
+#include "utils/rel.h"
+
+typedef struct BulkWriteState BulkWriteState;
+
+/*
+ * Temporary buffer to hold a page until it's written out. Use
+ * smgr_bulk_get_buf() to reserve one of these.  This is a separate typedef to
+ * distinguish it from other block-sized buffers passed around in the system.
+ */
+typedef PGIOAlignedBlock *BulkWriteBuffer;
+
+/* forward declared from smgr.h */
+struct SMgrRelationData;
+
+extern BulkWriteState *smgr_bulk_start_rel(Relation rel, ForkNumber forknum);
+extern BulkWriteState *smgr_bulk_start_smgr(struct SMgrRelationData *smgr, ForkNumber forknum, bool use_wal);
+
+extern BulkWriteBuffer smgr_bulk_get_buf(BulkWriteState *bulkstate);
+extern void smgr_bulk_write(BulkWriteState *bulkstate, BlockNumber blocknum, BulkWriteBuffer buf, bool page_std);
+
+extern void smgr_bulk_finish(BulkWriteState *bulkstate);
+
+#endif							/* BULK_WRITE_H */
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 7c181e5a171..620f10abdeb 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -43,6 +43,7 @@ extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
 extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
 					   BlockNumber nblocks);
 extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void mdregistersync(SMgrRelation reln, ForkNumber forknum);
 
 extern void ForgetDatabaseSyncRequests(Oid dbid);
 extern void DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 527cd2a0568..d6d24487763 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -110,6 +110,7 @@ extern BlockNumber smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum);
 extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum,
 						 int nforks, BlockNumber *nblocks);
 extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void smgrregistersync(SMgrRelation reln, ForkNumber forknum);
 extern void AtEOXact_SMgr(void);
 extern bool ProcessBarrierSmgrRelease(void);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 7e866e3c3d0..96ef87dab75 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -332,6 +332,8 @@ BuildAccumulator
 BuiltinScript
 BulkInsertState
 BulkInsertStateData
+BulkWriteBuffer
+BulkWriteState
 CACHESIGN
 CAC_state
 CCFastEqualFN
@@ -2017,6 +2019,7 @@ PendingFsyncEntry
 PendingRelDelete
 PendingRelSync
 PendingUnlinkEntry
+PendingWrite
 PendingWriteback
 PerLockTagEntry
 PerlInterpreter
-- 
2.39.2

#11Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#10)
Re: Relation bulk write facility

Committed this. Thanks everyone!

--
Heikki Linnakangas
Neon (https://neon.tech)

#12Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#11)
Re: Relation bulk write facility

On 23/02/2024 16:27, Heikki Linnakangas wrote:

Committed this. Thanks everyone!

Buildfarm animals 'sifaka' and 'longfin' are not happy, I will investigate..

--
Heikki Linnakangas
Neon (https://neon.tech)

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#12)
Re: Relation bulk write facility

Heikki Linnakangas <hlinnaka@iki.fi> writes:

Buildfarm animals 'sifaka' and 'longfin' are not happy, I will investigate..

Those are mine, let me know if you need local investigation.

regards, tom lane

#14Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Tom Lane (#13)
Re: Relation bulk write facility

On 23/02/2024 17:15, Tom Lane wrote:

Heikki Linnakangas <hlinnaka@iki.fi> writes:

Buildfarm animals 'sifaka' and 'longfin' are not happy, I will investigate..

Those are mine, let me know if you need local investigation.

Thanks, the error message was clear enough:

bulk_write.c:78:3: error: redefinition of typedef 'BulkWriteState' is a C11 feature [-Werror,-Wtypedef-redefinition]
} BulkWriteState;
^
../../../../src/include/storage/bulk_write.h:20:31: note: previous definition is here
typedef struct BulkWriteState BulkWriteState;
^
1 error generated.
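
In other words, the header already declares the typedef, and the .c file then
spelled out the struct with the typedef name again. As a minimal sketch of the
pattern the warning complains about (the struct member here is just for
illustration):

/* bulk_write.h */
typedef struct BulkWriteState BulkWriteState;

/* bulk_write.c */
typedef struct BulkWriteState
{
	int			npending;
} BulkWriteState;			/* second typedef with the same name: C11-only */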

Fixed now, but I'm a bit surprised that neither other buildfarm members nor cirrus CI
caught that. I also tried to reproduce it locally by adding
-Wtypedef-redefinition, but my version of clang didn't produce any
warnings. Are there any extra compiler options on those animals or
something?

--
Heikki Linnakangas
Neon (https://neon.tech)

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#14)
Re: Relation bulk write facility

Heikki Linnakangas <hlinnaka@iki.fi> writes:

Thanks, the error message was clear enough:

bulk_write.c:78:3: error: redefinition of typedef 'BulkWriteState' is a C11 feature [-Werror,-Wtypedef-redefinition]
} BulkWriteState;

Fixed now, but I'm a bit surprised other buildfarm members nor cirrus CI
caught that. I also tried to reproduce it locally by adding
-Wtypedef-redefinition, but my version of clang didn't produce any
warnings. Are there any extra compiler options on those animals or
something?

They use Apple's standard compiler (clang 15 or so), but

'CC' => 'ccache clang -std=gnu99',

so maybe the -std has something to do with it. I installed that
(or -std=gnu90 as appropriate to branch) on most of my build
setups back when we started the C99 push.

regards, tom lane

#16Noah Misch
noah@leadboat.com
In reply to: Heikki Linnakangas (#11)
Re: Relation bulk write facility

On Fri, Feb 23, 2024 at 04:27:34PM +0200, Heikki Linnakangas wrote:

Committed this. Thanks everyone!

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2024-02-24%2015%3A13%3A14 got:
TRAP: failed Assert("(uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer)"), File: "md.c", Line: 472, PID: 43188608

with this stack trace:
#5 0x10005cf0 in ExceptionalCondition (conditionName=0x1015d790 <XLogBeginInsert+80> "`", fileName=0x0, lineNumber=16780064) at assert.c:66
#6 0x102daba8 in mdextend (reln=0x1042628c <PageSetChecksumInplace+44>, forknum=812540744, blocknum=33, buffer=0x306e6000, skipFsync=812539904) at md.c:472
#7 0x102d6760 in smgrextend (reln=0x306e6670, forknum=812540744, blocknum=33, buffer=0x306e6000, skipFsync=812539904) at smgr.c:541
#8 0x104c8dac in smgr_bulk_flush (bulkstate=0x306e6000) at bulk_write.c:245
#9 0x107baf24 in _bt_blwritepage (wstate=0x100d0a14 <datum_image_eq@AF65_7+404>, buf=0x304f13b0, blkno=807631240) at nbtsort.c:638
#10 0x107bccd8 in _bt_buildadd (wstate=0x104c9384 <smgr_bulk_start_rel+132>, state=0x304eb190, itup=0xe10, truncextra=805686672) at nbtsort.c:984
#11 0x107bc86c in _bt_sort_dedup_finish_pending (wstate=0x3b6, state=0x19, dstate=0x3) at nbtsort.c:1036
#12 0x107bc188 in _bt_load (wstate=0x10, btspool=0x0, btspool2=0x0) at nbtsort.c:1331
#13 0x107bd4ec in _bt_leafbuild (btspool=0x101589fc <ProcessInvalidationMessages+188>, btspool2=0x0) at nbtsort.c:571
#14 0x107be028 in btbuild (heap=0x304d2a00, index=0x4e1f, indexInfo=0x3) at nbtsort.c:329
#15 0x1013538c in index_build (heapRelation=0x2, indexRelation=0x10bdc518 <getopt_long+2464664>, indexInfo=0x2, isreindex=10, parallel=false) at index.c:3047
#16 0x101389e0 in index_create (heapRelation=0x1001aac0 <palloc+192>, indexRelationName=0x20 <error: Cannot access memory at address 0x20>, indexRelationId=804393376, parentIndexRelid=805686672,
parentConstraintId=268544704, relFileNumber=805309688, indexInfo=0x3009a328, indexColNames=0x30237a20, accessMethodId=403, tableSpaceId=0, collationIds=0x304d29d8, opclassIds=0x304d29f8,
opclassOptions=0x304d2a18, coloptions=0x304d2a38, reloptions=0, flags=0, constr_flags=0, allow_system_table_mods=false, is_internal=false, constraintId=0x2ff211b4) at index.c:1260
#17 0x1050342c in DefineIndex (tableId=19994, stmt=0x2ff21370, indexRelationId=0, parentIndexId=0, parentConstraintId=0, total_parts=0, is_alter_table=false, check_rights=true, check_not_in_use=true,
skip_build=false, quiet=false) at indexcmds.c:1204
#18 0x104b4474 in ProcessUtilitySlow (pstate=<error reading variable>, pstmt=0x3009a408, queryString=0x30099730 "CREATE INDEX dupindexcols_i ON dupindexcols (f1, id, f1 text_pattern_ops);",

If there are other ways I should poke at it, let me know.

#17Thomas Munro
thomas.munro@gmail.com
In reply to: Noah Misch (#16)
Re: Relation bulk write facility

On Sun, Feb 25, 2024 at 6:24 AM Noah Misch <noah@leadboat.com> wrote:

On Fri, Feb 23, 2024 at 04:27:34PM +0200, Heikki Linnakangas wrote:

Committed this. Thanks everyone!

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2024-02-24%2015%3A13%3A14 got:
TRAP: failed Assert("(uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer)"), File: "md.c", Line: 472, PID: 43188608

with this stack trace:
#5 0x10005cf0 in ExceptionalCondition (conditionName=0x1015d790 <XLogBeginInsert+80> "`", fileName=0x0, lineNumber=16780064) at assert.c:66
#6 0x102daba8 in mdextend (reln=0x1042628c <PageSetChecksumInplace+44>, forknum=812540744, blocknum=33, buffer=0x306e6000, skipFsync=812539904) at md.c:472
#7 0x102d6760 in smgrextend (reln=0x306e6670, forknum=812540744, blocknum=33, buffer=0x306e6000, skipFsync=812539904) at smgr.c:541
#8 0x104c8dac in smgr_bulk_flush (bulkstate=0x306e6000) at bulk_write.c:245

So that's:

static const PGIOAlignedBlock zero_buffer = {{0}}; /* worth BLCKSZ */

...
smgrextend(bulkstate->smgr, bulkstate->forknum,
bulkstate->pages_written++,
&zero_buffer,
true);

... where PGIOAlignedBlock is:

typedef union PGIOAlignedBlock
{
#ifdef pg_attribute_aligned
pg_attribute_aligned(PG_IO_ALIGN_SIZE)
#endif
char data[BLCKSZ];
...

We see this happen with both xlc and gcc (new enough to know how to do
this). One idea would be that the AIX *linker* is unable to align it,
as that is the common tool-chain component here (and unlike stack and
heap objects, this scope is the linker's job). There is a
pre-existing example of a zero-buffer that is at file scope like that:
pg_prewarm.c. Perhaps it doesn't get tested?
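
For reference, that pre-existing buffer is (going from memory, so treat this
as a sketch rather than a verbatim quote) just a plain file-scope variable
along these lines:

	static PGIOAlignedBlock blockbuffer;	/* in pg_prewarm.c */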

Hmm.

#18Noah Misch
noah@leadboat.com
In reply to: Thomas Munro (#17)
Re: Relation bulk write facility

On Sun, Feb 25, 2024 at 07:52:16AM +1300, Thomas Munro wrote:

On Sun, Feb 25, 2024 at 6:24 AM Noah Misch <noah@leadboat.com> wrote:

On Fri, Feb 23, 2024 at 04:27:34PM +0200, Heikki Linnakangas wrote:

Committed this. Thanks everyone!

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2024-02-24%2015%3A13%3A14 got:
TRAP: failed Assert("(uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer)"), File: "md.c", Line: 472, PID: 43188608

with this stack trace:
#5 0x10005cf0 in ExceptionalCondition (conditionName=0x1015d790 <XLogBeginInsert+80> "`", fileName=0x0, lineNumber=16780064) at assert.c:66
#6 0x102daba8 in mdextend (reln=0x1042628c <PageSetChecksumInplace+44>, forknum=812540744, blocknum=33, buffer=0x306e6000, skipFsync=812539904) at md.c:472
#7 0x102d6760 in smgrextend (reln=0x306e6670, forknum=812540744, blocknum=33, buffer=0x306e6000, skipFsync=812539904) at smgr.c:541
#8 0x104c8dac in smgr_bulk_flush (bulkstate=0x306e6000) at bulk_write.c:245

So that's:

static const PGIOAlignedBlock zero_buffer = {{0}}; /* worth BLCKSZ */

...
smgrextend(bulkstate->smgr, bulkstate->forknum,
bulkstate->pages_written++,
&zero_buffer,
true);

... where PGIOAlignedBlock is:

typedef union PGIOAlignedBlock
{
#ifdef pg_attribute_aligned
pg_attribute_aligned(PG_IO_ALIGN_SIZE)
#endif
char data[BLCKSZ];
...

We see this happen with both xlc and gcc (new enough to know how to do
this). One idea would be that the AIX *linker* is unable to align it,
as that is the common tool-chain component here (and unlike stack and
heap objects, this scope is the linker's job). There is a
pre-existing example of a zero-buffer that is at file scope like that:
pg_prewarm.c. Perhaps it doesn't get tested?

Hmm.

GCC docs do say "For some linkers, the maximum supported alignment may be very
very small.", but AIX "man LD" says "data sections are aligned on a boundary
so as to satisfy the alignment of all CSECTs in the sections". It also has -H
and -K flags to force some particular higher alignment.

On GNU/Linux x64, gcc correctly records alignment=2**12 for the associated
section (.rodata for bulk_write.o zero_buffer, .bss for pg_prewarm.o
blockbuffer). If I'm reading this right, neither AIX gcc nor xlc is marking
the section with sufficient alignment, in bulk_write.o or pg_prewarm.o:

$ /opt/cfarm/binutils-latest/bin/objdump --section-headers ~/farm/*/HEAD/pgsqlkeep.*/src/backend/storage/smgr/bulk_write.o

/home/nm/farm/gcc64/HEAD/pgsqlkeep.2024-02-24_00-03-22/src/backend/storage/smgr/bulk_write.o: file format aix5coff64-rs6000

Sections:
Idx Name Size VMA LMA File off Algn
0 .text 0000277c 0000000000000000 0000000000000000 000000f0 2**2
CONTENTS, ALLOC, LOAD, RELOC, CODE
1 .data 000000e4 000000000000277c 000000000000277c 0000286c 2**3
CONTENTS, ALLOC, LOAD, RELOC, DATA
2 .debug 0001f7ea 0000000000000000 0000000000000000 00002950 2**3
CONTENTS

/home/nm/farm/xlc32/HEAD/pgsqlkeep.2024-02-24_15-12-23/src/backend/storage/smgr/bulk_write.o: file format aixcoff-rs6000

Sections:
Idx Name Size VMA LMA File off Algn
0 .text 00000880 00000000 00000000 00000180 2**2
CONTENTS, ALLOC, LOAD, RELOC, CODE
1 .data 0000410c 00000880 00000880 00000a00 2**3
CONTENTS, ALLOC, LOAD, RELOC, DATA
2 .bss 00000000 0000498c 0000498c 00000000 2**3
ALLOC
3 .debug 00008448 00000000 00000000 00004b24 2**3
CONTENTS
4 .except 00000018 00000000 00000000 00004b0c 2**3
CONTENTS, LOAD

$ /opt/cfarm/binutils-latest/bin/objdump --section-headers ~/farm/*/HEAD/pgsqlkeep.*/contrib/pg_prewarm/pg_prewarm.o

/home/nm/farm/gcc32/HEAD/pgsqlkeep.2024-01-21_03-16-12/contrib/pg_prewarm/pg_prewarm.o: file format aixcoff-rs6000

Sections:
Idx Name Size VMA LMA File off Algn
0 .text 00000a6c 00000000 00000000 000000b4 2**2
CONTENTS, ALLOC, LOAD, RELOC, CODE
1 .data 00000044 00000a6c 00000a6c 00000b20 2**3
CONTENTS, ALLOC, LOAD, RELOC, DATA
2 .bss 00002550 00000ab0 00000ab0 00000000 2**3
ALLOC
3 .debug 0001c50e 00000000 00000000 00000b64 2**3
CONTENTS

/home/nm/farm/gcc64/HEAD/pgsqlkeep.2024-02-15_17-13-04/contrib/pg_prewarm/pg_prewarm.o: file format aix5coff64-rs6000

Sections:
Idx Name Size VMA LMA File off Algn
0 .text 00000948 0000000000000000 0000000000000000 00000138 2**2
CONTENTS, ALLOC, LOAD, RELOC, CODE
1 .data 00000078 0000000000000948 0000000000000948 00000a80 2**3
CONTENTS, ALLOC, LOAD, RELOC, DATA
2 .bss 00002640 00000000000009c0 00000000000009c0 00000000 2**3
ALLOC
3 .debug 0001d887 0000000000000000 0000000000000000 00000af8 2**3
CONTENTS

#19Thomas Munro
thomas.munro@gmail.com
In reply to: Noah Misch (#18)
Re: Relation bulk write facility

On Sun, Feb 25, 2024 at 8:50 AM Noah Misch <noah@leadboat.com> wrote:

On GNU/Linux x64, gcc correctly records alignment=2**12 for the associated
section (.rodata for bulk_write.o zero_buffer, .bss for pg_prewarm.o
blockbuffer). If I'm reading this right, neither AIX gcc nor xlc is marking
the section with sufficient alignment, in bulk_write.o or pg_prewarm.o:

Ah, that is a bit of a hazard that we should probably document.

I guess the ideas to fix this would be: use smgrzeroextend() instead
of this coding, and/or perhaps look at the coding of pg_pwrite_zeros()
(function-local static) for any other place that needs such a thing,
if it would be satisfied by function-local scope?
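
For the first idea, the gap-filling loop in smgr_bulk_flush() could roughly
become something like this (an untested sketch of the suggestion, not a
patch):

	if (blkno > bulkstate->pages_written)
	{
		/* extend with zero pages; no aligned zero buffer of our own needed */
		smgrzeroextend(bulkstate->smgr, bulkstate->forknum,
					   bulkstate->pages_written,
					   blkno - bulkstate->pages_written,
					   true);
		bulkstate->pages_written = blkno;
	}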

#20Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#19)
Re: Relation bulk write facility

On Sun, Feb 25, 2024 at 9:12 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Sun, Feb 25, 2024 at 8:50 AM Noah Misch <noah@leadboat.com> wrote:

On GNU/Linux x64, gcc correctly records alignment=2**12 for the associated
section (.rodata for bulk_write.o zero_buffer, .bss for pg_prewarm.o
blockbuffer). If I'm reading this right, neither AIX gcc nor xlc is marking
the section with sufficient alignment, in bulk_write.o or pg_prewarm.o:

Ah, that is a bit of a hazard that we should probably document.

I guess the ideas to fix this would be: use smgrzeroextend() instead
of this coding, and/or perhaps look at the coding of pg_pwrite_zeros()
(function-local static) for any other place that needs such a thing,
if it would be satisfied by function-local scope?

Erm, wait, how does that function-local static object work differently?

#21Noah Misch
noah@leadboat.com
In reply to: Thomas Munro (#20)
Re: Relation bulk write facility

On Sun, Feb 25, 2024 at 09:13:47AM +1300, Thomas Munro wrote:

On Sun, Feb 25, 2024 at 9:12 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Sun, Feb 25, 2024 at 8:50 AM Noah Misch <noah@leadboat.com> wrote:

On GNU/Linux x64, gcc correctly records alignment=2**12 for the associated
section (.rodata for bulk_write.o zero_buffer, .bss for pg_prewarm.o
blockbuffer). If I'm reading this right, neither AIX gcc nor xlc is marking
the section with sufficient alignment, in bulk_write.o or pg_prewarm.o:

Ah, that is a bit of a hazard that we should probably document.

I guess the ideas to fix this would be: use smgrzeroextend() instead
of this coding, and/or perhaps look at the coding of pg_pwrite_zeros()
(function-local static) for any other place that needs such a thing,
if it would be satisfied by function-local scope?

True. Alternatively, could arrange for "#define PG_O_DIRECT 0" on AIX, which
disables the alignment assertions (and debug_io_direct).
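
That is, roughly this shape at the top of the #ifdef chain in
src/include/storage/fd.h (a sketch of the idea only; the _AIX test and the
tail of the chain are from memory):

#if defined(_AIX)
#define		PG_O_DIRECT 0
#elif defined(O_DIRECT) && defined(pg_attribute_aligned)
#define		PG_O_DIRECT O_DIRECT
#elif defined(F_NOCACHE)
#define		PG_O_DIRECT 0x80000000
#else
#define		PG_O_DIRECT 0
#endif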

Erm, wait, how does that function-local static object work differently?

I don't know specifically, but I expect they're different parts of the gcc
implementation. Aligning an xcoff section may entail some xcoff-specific gcc
component. Aligning a function-local object just changes the early
instructions of the function; it's independent of the object format.

#22Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#18)
Re: Relation bulk write facility

Hi,

On 2024-02-24 11:50:24 -0800, Noah Misch wrote:

We see this happen with both xlc and gcc (new enough to know how to do
this). One idea would be that the AIX *linker* is unable to align it,
as that is the common tool-chain component here (and unlike stack and
heap objects, this scope is the linker's job). There is a
pre-existing example of a zero-buffer that is at file scope like that:
pg_prewarm.c. Perhaps it doesn't get tested?

Hmm.

GCC docs do say "For some linkers, the maximum supported alignment may be very
very small.", but AIX "man LD" says "data sections are aligned on a boundary
so as to satisfy the alignment of all CSECTs in the sections". It also has -H
and -K flags to force some particular higher alignment.

Some xlc manual [1]https://www.ibm.com/docs/en/SSGH2K_13.1.2/com.ibm.compilers.aix.doc/proguide.pdf states that

n must be a positive power of 2, or NIL. NIL can be specified as either
__attribute__((aligned())) or __attribute__((aligned)); this is the same as
specifying the maximum system alignment (16 bytes on all UNIX platforms).

Which does seems to suggest that this is a platform restriction.

Let's just drop AIX. This isn't the only alignment issue we've found and the
solution for those isn't so much a fix as forcing everyone to carefully only
look into one direction and not notice the cliffs to either side.

Greetings,

Andres Freund

[1]: https://www.ibm.com/docs/en/SSGH2K_13.1.2/com.ibm.compilers.aix.doc/proguide.pdf

#23Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andres Freund (#22)
Re: Relation bulk write facility

On 24 February 2024 23:29:36 EET, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2024-02-24 11:50:24 -0800, Noah Misch wrote:

We see this happen with both xlc and gcc (new enough to know how to do
this). One idea would be that the AIX *linker* is unable to align it,
as that is the common tool-chain component here (and unlike stack and
heap objects, this scope is the linker's job). There is a
pre-existing example of a zero-buffer that is at file scope like that:
pg_prewarm.c. Perhaps it doesn't get tested?

Hmm.

GCC docs do say "For some linkers, the maximum supported alignment may be very
very small.", but AIX "man LD" says "data sections are aligned on a boundary
so as to satisfy the alignment of all CSECTs in the sections". It also has -H
and -K flags to force some particular higher alignment.

Some xlc manual [1] states that

n must be a positive power of 2, or NIL. NIL can be specified as either
__attribute__((aligned())) or __attribute__((aligned)); this is the same as
specifying the maximum system alignment (16 bytes on all UNIX platforms).

Which does seems to suggest that this is a platform restriction.

My reading of that paragraph is that you can set it to any power of two, and it should work. 16 bytes is just what you get if you set it to NIL.

Let's just drop AIX. This isn't the only alignment issue we've found and the
solution for those isn't so much a fix as forcing everyone to carefully only
look into one direction and not notice the cliffs to either side.

I think the way that decision should go is that as long as someone is willing to step up and do the work to keep AIX support going, we support it. To be clear, that someone is not me. Anyone willing to do it?

Regarding the issue at hand, perhaps we should define PG_IO_ALIGN_SIZE as 16 on AIX, if that's the best the linker can do on that platform.

We could also make the allocation 2*PG_IO_ALIGN_SIZE and round up the starting address ourselves to PG_IO_ALIGN_SIZE. Or allocate it in the heap.
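
Something along these lines, as a sketch (zero_buffer_raw is a made-up name):

	/* instead of the file-scope "static const PGIOAlignedBlock zero_buffer" */
	static char zero_buffer_raw[BLCKSZ + PG_IO_ALIGN_SIZE];
	char	   *zero_buffer = (char *) TYPEALIGN(PG_IO_ALIGN_SIZE, zero_buffer_raw);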

- Heikki

#24Thomas Munro
thomas.munro@gmail.com
In reply to: Heikki Linnakangas (#23)
Re: Relation bulk write facility

On Sun, Feb 25, 2024 at 11:06 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Regarding the issue at hand, perhaps we should define PG_IO_ALIGN_SIZE as 16 on AIX, if that's the best the linker can do on that platform.

You'll probably get either an error or silently fall back to buffered
I/O, if direct I/O is enabled and you try to read/write a badly
aligned buffer. That's documented (they offer finfo() to query it,
but it's always 4KB for the same sort of reasons as it is on every
other OS).

#25Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#24)
Re: Relation bulk write facility

On Sun, Feb 25, 2024 at 11:16 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Sun, Feb 25, 2024 at 11:06 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Regarding the issue at hand, perhaps we should define PG_IO_ALIGN_SIZE as 16 on AIX, if that's the best the linker can do on that platform.

You'll probably get either an error or silently fall back to buffered
I/O, if direct I/O is enabled and you try to read/write a badly
aligned buffer. That's documented (they offer finfo() to query it,
but it's always 4KB for the same sort of reasons as it is on every
other OS).

I guess it's the latter ("to work efficiently" sounds like it isn't
going to reject the request):

https://www.ibm.com/docs/en/aix/7.3?topic=tuning-direct-io

If you make it < 4KB then all direct I/O would be affected, not just
this one place, so then you might as well just not allow direct I/O on
AIX at all, to avoid giving a false impression that it does something.
(Note that if we think the platform lacks O_DIRECT we don't make those
assertions about alignment).

FWIW I'm aware of one other thing that is wrong with our direct I/O
support on AIX: it should perhaps be using a different flag. I
created a wiki page to defer thinking about any AIX issues
until/unless at least one real, live user shows up, which hasn't
happened yet: https://wiki.postgresql.org/wiki/AIX

#26Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Thomas Munro (#25)
1 attachment(s)
Re: Relation bulk write facility

On 25/02/2024 00:37, Thomas Munro wrote:

On Sun, Feb 25, 2024 at 11:16 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Sun, Feb 25, 2024 at 11:06 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Regarding the issue at hand, perhaps we should define PG_IO_ALIGN_SIZE as 16 on AIX, if that's the best the linker can do on that platform.

You'll probably get either an error or silently fall back to buffered
I/O, if direct I/O is enabled and you try to read/write a badly
aligned buffer. That's documented (they offer finfo() to query it,
but it's always 4KB for the same sort of reasons as it is on every
other OS).

I guess it's the latter ("to work efficiently" sounds like it isn't
going to reject the request):

https://www.ibm.com/docs/en/aix/7.3?topic=tuning-direct-io

If you make it < 4KB then all direct I/O would be affected, not just
this one place, so then you might as well just not allow direct I/O on
AIX at all, to avoid giving a false impression that it does something.
(Note that if we think the platform lacks O_DIRECT we don't make those
assertions about alignment).

FWIW I'm aware of one other thing that is wrong with our direct I/O
support on AIX: it should perhaps be using a different flag. I
created a wiki page to defer thinking about any AIX issues
until/unless at least one real, live user shows up, which hasn't
happened yet: https://wiki.postgresql.org/wiki/AIX

Here's a patch that effectively disables direct I/O on AIX. I'm inclined
to commit this as a quick fix to make the buildfarm green again.

I agree with Andres though, that unless someone raises their hand and
volunteers to properly maintain the AIX support, we should drop it. The
current AIX buildfarm members are running AIX 7.1, which has been out of
support since May 2023
(https://www.ibm.com/support/pages/aix-support-lifecycle-information).
See also older thread on this [0]/messages/by-id/20220702183354.a6uhja35wta7agew@alap3.anarazel.de.

Noah, you're running the current AIX buildfarm animals. How much effort
are you interested to put into AIX support?

[0]: /messages/by-id/20220702183354.a6uhja35wta7agew@alap3.anarazel.de
/messages/by-id/20220702183354.a6uhja35wta7agew@alap3.anarazel.de

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachments:

0001-Disable-O_DIRECT-on-AIX.patch (text/x-patch; charset=UTF-8)
From 3e206b952b8011c4bab97c7c87ea693832137999 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Sun, 25 Feb 2024 13:33:07 +0200
Subject: [PATCH 1/1] Disable O_DIRECT on AIX

AIX apparently doesn't support the __attribute__((aligned(a)))
attribute, when a > 16. That means that PGIOAlignedBlock is not
correctly aligned. That showed up as an assertion failure when writing
pages with the new bulk_write.c facility, because it uses "static
const" of type PGIOAlignedBlock.

pg_prewarm.c also uses a static variable of type PGIOAlignedBlock, and
it's not clear why it hasn't tripped the assertion.

There surely would be ways to force the alignment on AIX, but at least
this should make the buildfarm green again, and it doesn't seem worth
spending more effort on this right now. We might soon drop AIX support
altogether unless someone steps up to the plate to maintain it
properly and set up a more recent buildfarm animal for it.

Discussion: https://www.postgresql.org/message-id/20240224195024.af@rfd.leadboat.com
---
 src/include/c.h          | 14 ++++++++++----
 src/include/port/aix.h   |  3 +++
 src/include/storage/fd.h |  2 +-
 3 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/src/include/c.h b/src/include/c.h
index 2e3ea206e1..9a464bd592 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -186,6 +186,12 @@
 #define pg_attribute_noreturn() __attribute__((noreturn))
 #define pg_attribute_packed() __attribute__((packed))
 #define HAVE_PG_ATTRIBUTE_NORETURN 1
+
+/* Some platforms (AIX) support the aligned attribute only for small values */
+#if !defined(PG_MAX_ATTRIBUTE_ALIGN) || PG_MAX_ATTRIBUTE_ALIGN >= PG_IO_ALIGN_SIZE
+#define pg_attribute_io_aligned __attribute__((aligned(PG_IO_ALIGN_SIZE)))
+#endif
+
 #elif defined(_MSC_VER)
 /*
  * MSVC supports aligned.  noreturn is also possible but in MSVC it is
@@ -1123,8 +1129,8 @@ typedef union PGAlignedBlock
  */
 typedef union PGIOAlignedBlock
 {
-#ifdef pg_attribute_aligned
-	pg_attribute_aligned(PG_IO_ALIGN_SIZE)
+#ifdef pg_attribute_io_aligned
+	pg_attribute_io_aligned
 #endif
 	char		data[BLCKSZ];
 	double		force_align_d;
@@ -1134,8 +1140,8 @@ typedef union PGIOAlignedBlock
 /* Same, but for an XLOG_BLCKSZ-sized buffer */
 typedef union PGAlignedXLogBlock
 {
-#ifdef pg_attribute_aligned
-	pg_attribute_aligned(PG_IO_ALIGN_SIZE)
+#ifdef pg_attribute_io_aligned
+	pg_attribute_io_aligned
 #endif
 	char		data[XLOG_BLCKSZ];
 	double		force_align_d;
diff --git a/src/include/port/aix.h b/src/include/port/aix.h
index 5b1159c578..405151746d 100644
--- a/src/include/port/aix.h
+++ b/src/include/port/aix.h
@@ -12,3 +12,6 @@
 #if defined(__ILP32__) && defined(__IBMC__)
 #define PG_FORCE_DISABLE_INLINE
 #endif
+
+/* pg_attribute_aligned(n) works only up to 16 bytes on AIX. */
+#define PG_MAX_ATTRIBUTE_ALIGN	16
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 60bba5c970..579cb3142b 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -88,7 +88,7 @@ extern PGDLLIMPORT int max_safe_fds;
  * idea on a Unix).  We can only use it if the compiler will correctly align
  * PGIOAlignedBlock for us, though.
  */
-#if defined(O_DIRECT) && defined(pg_attribute_aligned)
+#if defined(O_DIRECT) && defined(pg_attribute_io_aligned)
 #define		PG_O_DIRECT O_DIRECT
 #elif defined(F_NOCACHE)
 #define		PG_O_DIRECT 0x80000000
-- 
2.39.2

#27Noah Misch
noah@leadboat.com
In reply to: Heikki Linnakangas (#26)
1 attachment(s)
Re: Relation bulk write facility

On Sun, Feb 25, 2024 at 04:34:47PM +0200, Heikki Linnakangas wrote:

I agree with Andres though, that unless someone raises their hand and
volunteers to properly maintain the AIX support, we should drop it.

There's no way forward in which AIX support stops doing net harm. Even if AIX
enthusiasts intercepted would-be buildfarm failures and fixed them before
buildfarm.postgresql.org could see them, the damage from the broader community
seeing the AIX-specific code would outweigh the benefits of AIX support. I've
now disabled the animals for v17+, though each may do one more run before
picking up the disable.

My upthread observation about xcoff section alignment was a red herring. gcc
populates symbol-level alignment, and section-level alignment is unnecessary
if symbol-level alignment is correct. The simplest workaround for $SUBJECT
AIX failure would be to remove the "const", based on the results of the
attached test program. The pg_prewarm.c var is like al4096_static in the
outputs below, hence the lack of trouble there. The bulk_write.c var is like
al4096_static_const_initialized.

==== gcc 8.3.0
al4096 4096 @ 0x11000c000 (mod 0)
al4096_initialized 4096 @ 0x110000fd0 (mod 4048 - BUG)
al4096_const 4096 @ 0x11000f000 (mod 0)
al4096_const_initialized 4096 @ 0x10000cd00 (mod 3328 - BUG)
al4096_static 4096 @ 0x110005000 (mod 0)
al4096_static_initialized 4096 @ 0x110008000 (mod 0)
al4096_static_const 4096 @ 0x100000c10 (mod 3088 - BUG)
al4096_static_const_initialized 4096 @ 0x100003c10 (mod 3088 - BUG)
==== xlc 12.01.0000.0000
al4096 4096 @ 0x110008000 (mod 0)
al4096_initialized 4096 @ 0x110004000 (mod 0)
al4096_const 4096 @ 0x11000b000 (mod 0)
al4096_const_initialized 4096 @ 0x100007000 (mod 0)
al4096_static 4096 @ 0x11000e000 (mod 0)
al4096_static_initialized 4096 @ 0x110001000 (mod 0)
al4096_static_const 4096 @ 0x110011000 (mod 0)
al4096_static_const_initialized 4096 @ 0x1000007d0 (mod 2000 - BUG)
==== ibm-clang 17.1.1.2
al4096 4096 @ 0x110001000 (mod 0)
al4096_initialized 4096 @ 0x110004000 (mod 0)
al4096_const 4096 @ 0x100001000 (mod 0)
al4096_const_initialized 4096 @ 0x100005000 (mod 0)
al4096_static 4096 @ 0x110008000 (mod 0)
al4096_static_initialized 4096 @ 0x11000b000 (mod 0)
al4096_static_const 4096 @ 0x100009000 (mod 0)
al4096_static_const_initialized 4096 @ 0x10000d000 (mod 0)

Attachments:

align.c (text/plain; charset=us-ascii)
#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#27)
Re: Relation bulk write facility

Noah Misch <noah@leadboat.com> writes:

On Sun, Feb 25, 2024 at 04:34:47PM +0200, Heikki Linnakangas wrote:

I agree with Andres though, that unless someone raises their hand and
volunteers to properly maintain the AIX support, we should drop it.

There's no way forward in which AIX support stops doing net harm. Even if AIX
enthusiasts intercepted would-be buildfarm failures and fixed them before
buildfarm.postgresql.org could see them, the damage from the broader community
seeing the AIX-specific code would outweigh the benefits of AIX support. I've
now disabled the animals for v17+, though each may do one more run before
picking up the disable.

So, we now need to strip the remnants of AIX support from the code and
docs? I don't see that much of it, but it's misleading to leave it
there.

(BTW, I still want to nuke the remaining snippets of HPPA support.
I don't think it does anybody any good to make it look like that's
still expected to work.)

regards, tom lane

#29Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#28)
Re: Relation bulk write facility

On Mon, Feb 26, 2024 at 1:21 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

So, we now need to strip the remnants of AIX support from the code and
docs? I don't see that much of it, but it's misleading to leave it
there.

(BTW, I still want to nuke the remaining snippets of HPPA support.
I don't think it does anybody any good to make it look like that's
still expected to work.)

+1 for removing things that don't work (or that we think probably don't work).

--
Robert Haas
EDB: http://www.enterprisedb.com

#30Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#29)
Re: Relation bulk write facility

On Mon, Feb 26, 2024 at 09:42:03AM +0530, Robert Haas wrote:

On Mon, Feb 26, 2024 at 1:21 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

So, we now need to strip the remnants of AIX support from the code and
docs? I don't see that much of it, but it's misleading to leave it
there.

(BTW, I still want to nuke the remaining snippets of HPPA support.
I don't think it does anybody any good to make it look like that's
still expected to work.)

+1 for removing things that don't work (or that we think probably don't work).

Seeing this stuff eat developer time debugging weird issues, while having
very limited impact for end users, is sad, so +1 for cleaning up any
remnants if this disappears.
--
Michael

#31Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Michael Paquier (#30)
1 attachment(s)
Re: Relation bulk write facility

On 26/02/2024 06:18, Michael Paquier wrote:

On Mon, Feb 26, 2024 at 09:42:03AM +0530, Robert Haas wrote:

On Mon, Feb 26, 2024 at 1:21 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

So, we now need to strip the remnants of AIX support from the code and
docs? I don't see that much of it, but it's misleading to leave it
there.

(BTW, I still want to nuke the remaining snippets of HPPA support.
I don't think it does anybody any good to make it look like that's
still expected to work.)

+1 for removing things that don't work (or that we think probably don't work).

Seeing this stuff eat developer time because of the debugging of weird
issues while having a very limited impact for end-users is sad, so +1
for a cleanup of any remnants if this disappears.

Here's a patch to fully remove AIX support.

One small issue that warrants some discussion (in sanity_check.sql):

--- When ALIGNOF_DOUBLE==4 (e.g. AIX), the C ABI may impose 8-byte alignment on
+-- When MAXIMUM_ALIGNOF==8 but ALIGNOF_DOUBLE==4, the C ABI may impose 8-byte alignment
-- some of the C types that correspond to TYPALIGN_DOUBLE SQL types.  To ensure
-- catalog C struct layout matches catalog tuple layout, arrange for the tuple
-- offset of each fixed-width, attalign='d' catalog column to be divisible by 8
-- unconditionally.  Keep such columns before the first NameData column of the
-- catalog, since packagers can override NAMEDATALEN to an odd number.
+-- (XXX: I'm not sure if any of the supported platforms have MAXIMUM_ALIGNOF==8 and
+-- ALIGNOF_DOUBLE==4.  Perhaps we should just require that
+-- ALIGNOF_DOUBLE==MAXIMUM_ALIGNOF)

What do y'all think of adding a check for
ALIGNOF_DOUBLE==MAXIMUM_ALIGNOF to configure.ac and meson.build? It's
not a requirement today, but I believe AIX was the only platform where
that was not true. With AIX gone, that combination won't be tested, and
we will probably break it sooner or later.
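
In C terms, the invariant we would start relying on is roughly this, as a
sketch (both macros come from pg_config.h):

	StaticAssertDecl(ALIGNOF_DOUBLE == MAXIMUM_ALIGNOF,
					 "ALIGNOF_DOUBLE is expected to equal MAXIMUM_ALIGNOF");

though a configure/meson-time check would catch it earlier in the build.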

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachments:

0001-Remove-AIX-support.patch (text/x-patch; charset=UTF-8)
From 59a66f507365ff9cb8c462d8206be285e3e2632d Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 28 Feb 2024 00:22:23 +0400
Subject: [PATCH 1/1] Remove AIX support

There isn't a lot of user demand for AIX support, and no one has stepped
up to the plate to properly maintain it, so it's best to remove it
altogether. AIX is still supported for stable versions.

The acute issue that triggered this decision was that after commit
8af2565248, the AIX buildfarm members have been hitting this
assertion:

    TRAP: failed Assert("(uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer)"), File: "md.c", Line: 472, PID: 2949728

Apparently the "pg_attribute_aligned(a)" attribute doesn't work on AIX
(linker?) for values larger than PG_IO_ALIGN_SIZE. That could be
worked around, but we decided to just drop the AIX support instead.

Discussion: https://www.postgresql.org/message-id/20240224172345.32%40rfd.leadboat.com
---
 Makefile                                      |   2 +
 config/c-compiler.m4                          |   2 +-
 configure                                     | 301 +-----------------
 configure.ac                                  |  34 +-
 doc/src/sgml/dfunc.sgml                       |  19 --
 doc/src/sgml/installation.sgml                | 119 +------
 doc/src/sgml/runtime.sgml                     |  23 --
 meson.build                                   |  27 +-
 src/Makefile.shlib                            |  30 --
 src/backend/Makefile                          |  20 --
 src/backend/meson.build                       |  15 -
 src/backend/port/aix/mkldexport.sh            |  61 ----
 src/backend/utils/error/elog.c                |   2 -
 src/backend/utils/misc/ps_status.c            |   4 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |   3 -
 src/bin/pg_verifybackup/t/008_untar.pl        |   3 -
 src/bin/pg_verifybackup/t/010_client_untar.pl |   3 -
 src/include/c.h                               |  18 +-
 src/include/port/aix.h                        |  14 -
 src/include/port/atomics.h                    |   6 +-
 src/include/storage/s_lock.h                  |  31 +-
 src/interfaces/libpq/Makefile                 |   2 +-
 src/interfaces/libpq/meson.build              |   5 +-
 src/makefiles/Makefile.aix                    |  39 ---
 src/port/README                               |   2 +-
 src/port/strerror.c                           |   2 -
 src/template/aix                              |  25 --
 src/test/regress/Makefile                     |   5 -
 src/test/regress/expected/sanity_check.out    |   5 +-
 src/test/regress/sql/sanity_check.sql         |   5 +-
 src/tools/gen_export.pl                       |  11 +-
 src/tools/pginclude/cpluspluscheck            |   1 -
 src/tools/pginclude/headerscheck              |   1 -
 33 files changed, 52 insertions(+), 788 deletions(-)
 delete mode 100755 src/backend/port/aix/mkldexport.sh
 delete mode 100644 src/include/port/aix.h
 delete mode 100644 src/makefiles/Makefile.aix
 delete mode 100644 src/template/aix

diff --git a/Makefile b/Makefile
index 9bc1a4ec17b..8a2ec9396b6 100644
--- a/Makefile
+++ b/Makefile
@@ -13,6 +13,8 @@
 
 # AIX make defaults to building *every* target of the first rule.  Start with
 # a single-target, empty rule to make the other targets non-default.
+# (We don't support AIX anymore, but if someone tries to build on AIX anyway,
+# at least they'll get the instructions to run 'configure' first.)
 all:
 
 all check install installdirs installcheck installcheck-parallel uninstall clean distclean maintainer-clean dist distcheck world check-world install-world installcheck-world:
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 5db02b2ab75..3268a780bb0 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -137,7 +137,7 @@ if test x"$pgac_cv__128bit_int" = xyes ; then
   AC_CACHE_CHECK([for __int128 alignment bug], [pgac_cv__128bit_int_bug],
   [AC_RUN_IFELSE([AC_LANG_PROGRAM([
 /* This must match the corresponding code in c.h: */
-#if defined(__GNUC__) || defined(__SUNPRO_C) || defined(__IBMC__)
+#if defined(__GNUC__) || defined(__SUNPRO_C)
 #define pg_attribute_aligned(a) __attribute__((aligned(a)))
 #elif defined(_MSC_VER)
 #define pg_attribute_aligned(a) __declspec(align(a))
diff --git a/configure b/configure
index 6b87e5c9a8c..4e2dba338be 100755
--- a/configure
+++ b/configure
@@ -2987,7 +2987,6 @@ else
 # --with-template not given
 
 case $host_os in
-     aix*) template=aix ;;
   cygwin*|msys*) template=cygwin ;;
   darwin*) template=darwin ;;
 dragonfly*) template=netbsd ;;
@@ -3917,10 +3916,10 @@ fi
 
 
 
-case $template in
-  aix) pgac_cc_list="gcc xlc"; pgac_cxx_list="g++ xlC";;
-    *) pgac_cc_list="gcc cc"; pgac_cxx_list="g++ c++";;
-esac
+# If you don't specify a list of compilers to test, the AC_PROG_CC and
+# AC_PROG_CXX macros test for a long list of unsupported compilers.
+pgac_cc_list="gcc cc"
+pgac_cxx_list="g++ c++"
 
 ac_ext=c
 ac_cpp='$CPP $CPPFLAGS'
@@ -6874,190 +6873,6 @@ if test x"$pgac_cv_prog_CXX_cxxflags__fno_strict_aliasing" = x"yes"; then
 fi
 
 
-elif test "$PORTNAME" = "aix"; then
-  # AIX's xlc has to have strict aliasing turned off too
-
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CC} supports -qnoansialias, for CFLAGS" >&5
-$as_echo_n "checking whether ${CC} supports -qnoansialias, for CFLAGS... " >&6; }
-if ${pgac_cv_prog_CC_cflags__qnoansialias+:} false; then :
-  $as_echo_n "(cached) " >&6
-else
-  pgac_save_CFLAGS=$CFLAGS
-pgac_save_CC=$CC
-CC=${CC}
-CFLAGS="${CFLAGS} -qnoansialias"
-ac_save_c_werror_flag=$ac_c_werror_flag
-ac_c_werror_flag=yes
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h.  */
-
-int
-main ()
-{
-
-  ;
-  return 0;
-}
-_ACEOF
-if ac_fn_c_try_compile "$LINENO"; then :
-  pgac_cv_prog_CC_cflags__qnoansialias=yes
-else
-  pgac_cv_prog_CC_cflags__qnoansialias=no
-fi
-rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
-ac_c_werror_flag=$ac_save_c_werror_flag
-CFLAGS="$pgac_save_CFLAGS"
-CC="$pgac_save_CC"
-fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_prog_CC_cflags__qnoansialias" >&5
-$as_echo "$pgac_cv_prog_CC_cflags__qnoansialias" >&6; }
-if test x"$pgac_cv_prog_CC_cflags__qnoansialias" = x"yes"; then
-  CFLAGS="${CFLAGS} -qnoansialias"
-fi
-
-
-  { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CXX} supports -qnoansialias, for CXXFLAGS" >&5
-$as_echo_n "checking whether ${CXX} supports -qnoansialias, for CXXFLAGS... " >&6; }
-if ${pgac_cv_prog_CXX_cxxflags__qnoansialias+:} false; then :
-  $as_echo_n "(cached) " >&6
-else
-  pgac_save_CXXFLAGS=$CXXFLAGS
-pgac_save_CXX=$CXX
-CXX=${CXX}
-CXXFLAGS="${CXXFLAGS} -qnoansialias"
-ac_save_cxx_werror_flag=$ac_cxx_werror_flag
-ac_cxx_werror_flag=yes
-ac_ext=cpp
-ac_cpp='$CXXCPP $CPPFLAGS'
-ac_compile='$CXX -c $CXXFLAGS $CPPFLAGS conftest.$ac_ext >&5'
-ac_link='$CXX -o conftest$ac_exeext $CXXFLAGS $CPPFLAGS $LDFLAGS conftest.$ac_ext $LIBS >&5'
-ac_compiler_gnu=$ac_cv_cxx_compiler_gnu
-
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h.  */
-
-int
-main ()
-{
-
-  ;
-  return 0;
-}
-_ACEOF
-if ac_fn_cxx_try_compile "$LINENO"; then :
-  pgac_cv_prog_CXX_cxxflags__qnoansialias=yes
-else
-  pgac_cv_prog_CXX_cxxflags__qnoansialias=no
-fi
-rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
-ac_ext=c
-ac_cpp='$CPP $CPPFLAGS'
-ac_compile='$CC -c $CFLAGS $CPPFLAGS conftest.$ac_ext >&5'
-ac_link='$CC -o conftest$ac_exeext $CFLAGS $CPPFLAGS $LDFLAGS conftest.$ac_ext $LIBS >&5'
-ac_compiler_gnu=$ac_cv_c_compiler_gnu
-
-ac_cxx_werror_flag=$ac_save_cxx_werror_flag
-CXXFLAGS="$pgac_save_CXXFLAGS"
-CXX="$pgac_save_CXX"
-fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_prog_CXX_cxxflags__qnoansialias" >&5
-$as_echo "$pgac_cv_prog_CXX_cxxflags__qnoansialias" >&6; }
-if test x"$pgac_cv_prog_CXX_cxxflags__qnoansialias" = x"yes"; then
-  CXXFLAGS="${CXXFLAGS} -qnoansialias"
-fi
-
-
-
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CC} supports -qlonglong, for CFLAGS" >&5
-$as_echo_n "checking whether ${CC} supports -qlonglong, for CFLAGS... " >&6; }
-if ${pgac_cv_prog_CC_cflags__qlonglong+:} false; then :
-  $as_echo_n "(cached) " >&6
-else
-  pgac_save_CFLAGS=$CFLAGS
-pgac_save_CC=$CC
-CC=${CC}
-CFLAGS="${CFLAGS} -qlonglong"
-ac_save_c_werror_flag=$ac_c_werror_flag
-ac_c_werror_flag=yes
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h.  */
-
-int
-main ()
-{
-
-  ;
-  return 0;
-}
-_ACEOF
-if ac_fn_c_try_compile "$LINENO"; then :
-  pgac_cv_prog_CC_cflags__qlonglong=yes
-else
-  pgac_cv_prog_CC_cflags__qlonglong=no
-fi
-rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
-ac_c_werror_flag=$ac_save_c_werror_flag
-CFLAGS="$pgac_save_CFLAGS"
-CC="$pgac_save_CC"
-fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_prog_CC_cflags__qlonglong" >&5
-$as_echo "$pgac_cv_prog_CC_cflags__qlonglong" >&6; }
-if test x"$pgac_cv_prog_CC_cflags__qlonglong" = x"yes"; then
-  CFLAGS="${CFLAGS} -qlonglong"
-fi
-
-
-  { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CXX} supports -qlonglong, for CXXFLAGS" >&5
-$as_echo_n "checking whether ${CXX} supports -qlonglong, for CXXFLAGS... " >&6; }
-if ${pgac_cv_prog_CXX_cxxflags__qlonglong+:} false; then :
-  $as_echo_n "(cached) " >&6
-else
-  pgac_save_CXXFLAGS=$CXXFLAGS
-pgac_save_CXX=$CXX
-CXX=${CXX}
-CXXFLAGS="${CXXFLAGS} -qlonglong"
-ac_save_cxx_werror_flag=$ac_cxx_werror_flag
-ac_cxx_werror_flag=yes
-ac_ext=cpp
-ac_cpp='$CXXCPP $CPPFLAGS'
-ac_compile='$CXX -c $CXXFLAGS $CPPFLAGS conftest.$ac_ext >&5'
-ac_link='$CXX -o conftest$ac_exeext $CXXFLAGS $CPPFLAGS $LDFLAGS conftest.$ac_ext $LIBS >&5'
-ac_compiler_gnu=$ac_cv_cxx_compiler_gnu
-
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h.  */
-
-int
-main ()
-{
-
-  ;
-  return 0;
-}
-_ACEOF
-if ac_fn_cxx_try_compile "$LINENO"; then :
-  pgac_cv_prog_CXX_cxxflags__qlonglong=yes
-else
-  pgac_cv_prog_CXX_cxxflags__qlonglong=no
-fi
-rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
-ac_ext=c
-ac_cpp='$CPP $CPPFLAGS'
-ac_compile='$CC -c $CFLAGS $CPPFLAGS conftest.$ac_ext >&5'
-ac_link='$CC -o conftest$ac_exeext $CFLAGS $CPPFLAGS $LDFLAGS conftest.$ac_ext $LIBS >&5'
-ac_compiler_gnu=$ac_cv_c_compiler_gnu
-
-ac_cxx_werror_flag=$ac_save_cxx_werror_flag
-CXXFLAGS="$pgac_save_CXXFLAGS"
-CXX="$pgac_save_CXX"
-fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_prog_CXX_cxxflags__qlonglong" >&5
-$as_echo "$pgac_cv_prog_CXX_cxxflags__qlonglong" >&6; }
-if test x"$pgac_cv_prog_CXX_cxxflags__qlonglong" = x"yes"; then
-  CXXFLAGS="${CXXFLAGS} -qlonglong"
-fi
-
-
 fi
 
 # If the compiler knows how to hide symbols, add the switch needed for that to
@@ -7212,103 +7027,6 @@ if test x"$pgac_cv_prog_CXX_cxxflags__fvisibility_inlines_hidden" = x"yes"; then
 fi
 
   have_visibility_attribute=$pgac_cv_prog_CC_cflags__fvisibility_hidden
-elif test "$PORTNAME" = "aix"; then
-  # Note that xlc accepts -fvisibility=hidden as a file.
-  { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CC} supports -qvisibility=hidden, for CFLAGS_SL_MODULE" >&5
-$as_echo_n "checking whether ${CC} supports -qvisibility=hidden, for CFLAGS_SL_MODULE... " >&6; }
-if ${pgac_cv_prog_CC_cflags__qvisibility_hidden+:} false; then :
-  $as_echo_n "(cached) " >&6
-else
-  pgac_save_CFLAGS=$CFLAGS
-pgac_save_CC=$CC
-CC=${CC}
-CFLAGS="${CFLAGS_SL_MODULE} -qvisibility=hidden"
-ac_save_c_werror_flag=$ac_c_werror_flag
-ac_c_werror_flag=yes
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h.  */
-
-int
-main ()
-{
-
-  ;
-  return 0;
-}
-_ACEOF
-if ac_fn_c_try_compile "$LINENO"; then :
-  pgac_cv_prog_CC_cflags__qvisibility_hidden=yes
-else
-  pgac_cv_prog_CC_cflags__qvisibility_hidden=no
-fi
-rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
-ac_c_werror_flag=$ac_save_c_werror_flag
-CFLAGS="$pgac_save_CFLAGS"
-CC="$pgac_save_CC"
-fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_prog_CC_cflags__qvisibility_hidden" >&5
-$as_echo "$pgac_cv_prog_CC_cflags__qvisibility_hidden" >&6; }
-if test x"$pgac_cv_prog_CC_cflags__qvisibility_hidden" = x"yes"; then
-  CFLAGS_SL_MODULE="${CFLAGS_SL_MODULE} -qvisibility=hidden"
-fi
-
-
-  { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CXX} supports -qvisibility=hidden, for CXXFLAGS_SL_MODULE" >&5
-$as_echo_n "checking whether ${CXX} supports -qvisibility=hidden, for CXXFLAGS_SL_MODULE... " >&6; }
-if ${pgac_cv_prog_CXX_cxxflags__qvisibility_hidden+:} false; then :
-  $as_echo_n "(cached) " >&6
-else
-  pgac_save_CXXFLAGS=$CXXFLAGS
-pgac_save_CXX=$CXX
-CXX=${CXX}
-CXXFLAGS="${CXXFLAGS_SL_MODULE} -qvisibility=hidden"
-ac_save_cxx_werror_flag=$ac_cxx_werror_flag
-ac_cxx_werror_flag=yes
-ac_ext=cpp
-ac_cpp='$CXXCPP $CPPFLAGS'
-ac_compile='$CXX -c $CXXFLAGS $CPPFLAGS conftest.$ac_ext >&5'
-ac_link='$CXX -o conftest$ac_exeext $CXXFLAGS $CPPFLAGS $LDFLAGS conftest.$ac_ext $LIBS >&5'
-ac_compiler_gnu=$ac_cv_cxx_compiler_gnu
-
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h.  */
-
-int
-main ()
-{
-
-  ;
-  return 0;
-}
-_ACEOF
-if ac_fn_cxx_try_compile "$LINENO"; then :
-  pgac_cv_prog_CXX_cxxflags__qvisibility_hidden=yes
-else
-  pgac_cv_prog_CXX_cxxflags__qvisibility_hidden=no
-fi
-rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
-ac_ext=c
-ac_cpp='$CPP $CPPFLAGS'
-ac_compile='$CC -c $CFLAGS $CPPFLAGS conftest.$ac_ext >&5'
-ac_link='$CC -o conftest$ac_exeext $CFLAGS $CPPFLAGS $LDFLAGS conftest.$ac_ext $LIBS >&5'
-ac_compiler_gnu=$ac_cv_c_compiler_gnu
-
-ac_cxx_werror_flag=$ac_save_cxx_werror_flag
-CXXFLAGS="$pgac_save_CXXFLAGS"
-CXX="$pgac_save_CXX"
-fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_prog_CXX_cxxflags__qvisibility_hidden" >&5
-$as_echo "$pgac_cv_prog_CXX_cxxflags__qvisibility_hidden" >&6; }
-if test x"$pgac_cv_prog_CXX_cxxflags__qvisibility_hidden" = x"yes"; then
-  CXXFLAGS_SL_MODULE="${CXXFLAGS_SL_MODULE} -qvisibility=hidden"
-fi
-
-  have_visibility_attribute=$pgac_cv_prog_CC_cflags__qvisibility_hidden
-  # Old xlc versions (<13.1) don't have support for -qvisibility. Use expfull to force
-  # all extension module symbols to be exported.
-  if test "$pgac_cv_prog_CC_cflags__qvisibility_hidden" != "yes"; then
-    CFLAGS_SL_MODULE="$CFLAGS_SL_MODULE -Wl,-b,expfull"
-  fi
 fi
 
 if test "$have_visibility_attribute" = "yes"; then
@@ -13166,8 +12884,7 @@ fi
 
 fi
 
-# Note: We can test for libldap_r only after we know PTHREAD_LIBS;
-# also, on AIX, we may need to have openssl in LIBS for this step.
+# Note: We can test for libldap_r only after we know PTHREAD_LIBS
 if test "$with_ldap" = yes ; then
   _LIBS="$LIBS"
   if test "$PORTNAME" != "win32"; then
@@ -15025,10 +14742,6 @@ fi
 # spelling it understands, because it conflicts with
 # __declspec(restrict). Therefore we define pg_restrict to the
 # appropriate definition, which presumably won't conflict.
-#
-# Allow platforms with buggy compilers to force restrict to not be
-# used by setting $FORCE_DISABLE_RESTRICT=yes in the relevant
-# template.
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for C/C++ restrict keyword" >&5
 $as_echo_n "checking for C/C++ restrict keyword... " >&6; }
 if ${ac_cv_c_restrict+:} false; then :
@@ -15075,7 +14788,7 @@ _ACEOF
  ;;
  esac
 
-if test "$ac_cv_c_restrict" = "no" -o "x$FORCE_DISABLE_RESTRICT" = "xyes"; then
+if test "$ac_cv_c_restrict" = "no"; then
   pg_restrict=""
 else
   pg_restrict="$ac_cv_c_restrict"
@@ -17391,7 +17104,7 @@ else
 /* end confdefs.h.  */
 
 /* This must match the corresponding code in c.h: */
-#if defined(__GNUC__) || defined(__SUNPRO_C) || defined(__IBMC__)
+#if defined(__GNUC__) || defined(__SUNPRO_C)
 #define pg_attribute_aligned(a) __attribute__((aligned(a)))
 #elif defined(_MSC_VER)
 #define pg_attribute_aligned(a) __declspec(align(a))
diff --git a/configure.ac b/configure.ac
index 6e64ece11da..e8851daa798 100644
--- a/configure.ac
+++ b/configure.ac
@@ -62,7 +62,6 @@ PGAC_ARG_REQ(with, template, [NAME], [override operating system template],
 # --with-template not given
 
 case $host_os in
-     aix*) template=aix ;;
   cygwin*|msys*) template=cygwin ;;
   darwin*) template=darwin ;;
 dragonfly*) template=netbsd ;;
@@ -374,10 +373,10 @@ AC_DEFINE_UNQUOTED([XLOG_BLCKSZ], ${XLOG_BLCKSZ}, [
 # variable.
 PGAC_ARG_REQ(with, CC, [CMD], [set compiler (deprecated)], [CC=$with_CC])
 
-case $template in
-  aix) pgac_cc_list="gcc xlc"; pgac_cxx_list="g++ xlC";;
-    *) pgac_cc_list="gcc cc"; pgac_cxx_list="g++ c++";;
-esac
+# If you don't specify a list of compilers to test, the AC_PROG_CC and
+# AC_PROG_CXX macros test for a long list of unsupported compilers.
+pgac_cc_list="gcc cc"
+pgac_cxx_list="g++ c++"
 
 AC_PROG_CC([$pgac_cc_list])
 AC_PROG_CC_C99()
@@ -594,12 +593,6 @@ elif test "$ICC" = yes; then
   # Make sure strict aliasing is off (though this is said to be the default)
   PGAC_PROG_CC_CFLAGS_OPT([-fno-strict-aliasing])
   PGAC_PROG_CXX_CFLAGS_OPT([-fno-strict-aliasing])
-elif test "$PORTNAME" = "aix"; then
-  # AIX's xlc has to have strict aliasing turned off too
-  PGAC_PROG_CC_CFLAGS_OPT([-qnoansialias])
-  PGAC_PROG_CXX_CFLAGS_OPT([-qnoansialias])
-  PGAC_PROG_CC_CFLAGS_OPT([-qlonglong])
-  PGAC_PROG_CXX_CFLAGS_OPT([-qlonglong])
 fi
 
 # If the compiler knows how to hide symbols, add the switch needed for that to
@@ -618,16 +611,6 @@ if test "$GCC" = yes -o "$SUN_STUDIO_CC" = yes ; then
   PGAC_PROG_VARCXX_VARFLAGS_OPT(CXX, CXXFLAGS_SL_MODULE, [-fvisibility=hidden])
   PGAC_PROG_VARCXX_VARFLAGS_OPT(CXX, CXXFLAGS_SL_MODULE, [-fvisibility-inlines-hidden])
   have_visibility_attribute=$pgac_cv_prog_CC_cflags__fvisibility_hidden
-elif test "$PORTNAME" = "aix"; then
-  # Note that xlc accepts -fvisibility=hidden as a file.
-  PGAC_PROG_CC_VAR_OPT(CFLAGS_SL_MODULE, [-qvisibility=hidden])
-  PGAC_PROG_VARCXX_VARFLAGS_OPT(CXX, CXXFLAGS_SL_MODULE, [-qvisibility=hidden])
-  have_visibility_attribute=$pgac_cv_prog_CC_cflags__qvisibility_hidden
-  # Old xlc versions (<13.1) don't have support for -qvisibility. Use expfull to force
-  # all extension module symbols to be exported.
-  if test "$pgac_cv_prog_CC_cflags__qvisibility_hidden" != "yes"; then
-    CFLAGS_SL_MODULE="$CFLAGS_SL_MODULE -Wl,-b,expfull"
-  fi
 fi
 
 if test "$have_visibility_attribute" = "yes"; then
@@ -1407,8 +1390,7 @@ if test "$with_zstd" = yes ; then
   AC_CHECK_LIB(zstd, ZSTD_compress, [], [AC_MSG_ERROR([library 'zstd' is required for ZSTD support])])
 fi
 
-# Note: We can test for libldap_r only after we know PTHREAD_LIBS;
-# also, on AIX, we may need to have openssl in LIBS for this step.
+# Note: We can test for libldap_r only after we know PTHREAD_LIBS
 if test "$with_ldap" = yes ; then
   _LIBS="$LIBS"
   if test "$PORTNAME" != "win32"; then
@@ -1666,12 +1648,8 @@ PGAC_TYPE_LOCALE_T
 # spelling it understands, because it conflicts with
 # __declspec(restrict). Therefore we define pg_restrict to the
 # appropriate definition, which presumably won't conflict.
-#
-# Allow platforms with buggy compilers to force restrict to not be
-# used by setting $FORCE_DISABLE_RESTRICT=yes in the relevant
-# template.
 AC_C_RESTRICT
-if test "$ac_cv_c_restrict" = "no" -o "x$FORCE_DISABLE_RESTRICT" = "xyes"; then
+if test "$ac_cv_c_restrict" = "no"; then
   pg_restrict=""
 else
   pg_restrict="$ac_cv_c_restrict"
diff --git a/doc/src/sgml/dfunc.sgml b/doc/src/sgml/dfunc.sgml
index 554f9fac4ce..b94aefcd0ca 100644
--- a/doc/src/sgml/dfunc.sgml
+++ b/doc/src/sgml/dfunc.sgml
@@ -202,23 +202,4 @@ gcc -G -o foo.so foo.o
   server expects to find the shared library files.
  </para>
 
-<!--
-Under AIX, object files are compiled normally but building the shared
-library requires a couple of steps.  First, create the object file:
-.nf
-cc <other flags> -c foo.c
-.fi
-You must then create a symbol \*(lqexports\*(rq file for the object
-file:
-.nf
-mkldexport foo.o `pwd` &gt; foo.exp
-.fi
-Finally, you can create the shared library:
-.nf
-ld <other flags> -H512 -T512 -o foo.so -e _nostart \e
-   -bI:.../lib/postgres.exp -bE:foo.exp foo.o \e
-   -lm -lc 2>/dev/null
-.fi
-  -->
-
 </sect2>
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index ed5b285a5ee..d901576d98e 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -3401,7 +3401,7 @@ export MANPATH
   <para>
    <productname>PostgreSQL</productname> can be expected to work on current
    versions of these operating systems: Linux, Windows,
-   FreeBSD, OpenBSD, NetBSD, DragonFlyBSD, macOS, AIX, Solaris, and illumos.
+   FreeBSD, OpenBSD, NetBSD, DragonFlyBSD, macOS, Solaris, and illumos.
    Other Unix-like systems may also work but are not currently
    being tested.  In most cases, all CPU architectures supported by
    a given operating system will work.  Look in
@@ -3445,123 +3445,6 @@ export MANPATH
    installation issues.
   </para>
 
-  <sect2 id="installation-notes-aix">
-   <title>AIX</title>
-
-   <indexterm zone="installation-notes-aix">
-    <primary>AIX</primary>
-    <secondary>installation on</secondary>
-   </indexterm>
-
-   <para>
-    You can use GCC or the native IBM compiler <command>xlc</command>
-    to build <productname>PostgreSQL</productname>
-    on <productname>AIX</productname>.
-   </para>
-
-   <para>
-    <productname>AIX</productname> versions before 7.1 are no longer
-    tested nor supported by the <productname>PostgreSQL</productname>
-    community.
-   </para>
-
-   <sect3 id="installation-notes-aix-mem-management">
-    <title>Memory Management</title>
-    <!-- https://archives.postgresql.org/message-id/603bgqmpl9.fsf@dba2.int.libertyrms.com -->
-
-    <para>
-     AIX can be somewhat peculiar with regards to the way it does
-     memory management.  You can have a server with many multiples of
-     gigabytes of RAM free, but still get out of memory or address
-     space errors when running applications.  One example
-     is loading of extensions failing with unusual errors.
-     For example, running as the owner of the PostgreSQL installation:
-<screen>
-=# CREATE EXTENSION plperl;
-ERROR:  could not load library "/opt/dbs/pgsql/lib/plperl.so": A memory address is not in the address space for the process.
-</screen>
-    Running as a non-owner in the group possessing the PostgreSQL
-    installation:
-<screen>
-=# CREATE EXTENSION plperl;
-ERROR:  could not load library "/opt/dbs/pgsql/lib/plperl.so": Bad address
-</screen>
-     Another example is out of memory errors in the PostgreSQL server
-     logs, with every memory allocation near or greater than 256 MB
-     failing.
-    </para>
-
-    <para>
-     The overall cause of all these problems is the default bittedness
-     and memory model used by the server process.  By default, all
-     binaries built on AIX are 32-bit.  This does not depend upon
-     hardware type or kernel in use.  These 32-bit processes are
-     limited to 4 GB of memory laid out in 256 MB segments using one
-     of a few models.  The default allows for less than 256 MB in the
-     heap as it shares a single segment with the stack.
-    </para>
-
-    <para>
-     In the case of the <literal>plperl</literal> example, above,
-     check your umask and the permissions of the binaries in your
-     PostgreSQL installation.  The binaries involved in that example
-     were 32-bit and installed as mode 750 instead of 755.  Due to the
-     permissions being set in this fashion, only the owner or a member
-     of the possessing group can load the library.  Since it isn't
-     world-readable, the loader places the object into the process'
-     heap instead of the shared library segments where it would
-     otherwise be placed.
-    </para>
-
-    <para>
-     The <quote>ideal</quote> solution for this is to use a 64-bit
-     build of PostgreSQL, but that is not always practical, because
-     systems with 32-bit processors can build, but not run, 64-bit
-     binaries.
-    </para>
-
-    <para>
-     If a 32-bit binary is desired, set <symbol>LDR_CNTRL</symbol> to
-     <literal>MAXDATA=0x<replaceable>n</replaceable>0000000</literal>,
-     where 1 &lt;= n &lt;= 8, before starting the PostgreSQL server,
-     and try different values and <filename>postgresql.conf</filename>
-     settings to find a configuration that works satisfactorily.  This
-     use of <symbol>LDR_CNTRL</symbol> tells AIX that you want the
-     server to have <symbol>MAXDATA</symbol> bytes set aside for the
-     heap, allocated in 256 MB segments.  When you find a workable
-     configuration,
-     <command>ldedit</command> can be used to modify the binaries so
-     that they default to using the desired heap size.  PostgreSQL can
-     also be rebuilt, passing <literal>configure
-     LDFLAGS="-Wl,-bmaxdata:0x<replaceable>n</replaceable>0000000"</literal>
-     to achieve the same effect.
-    </para>
-
-    <para>
-     For a 64-bit build, set <envar>OBJECT_MODE</envar> to 64 and
-     pass <literal>CC="gcc -maix64"</literal>
-     and <literal>LDFLAGS="-Wl,-bbigtoc"</literal>
-     to <command>configure</command>.  (Options for
-    <command>xlc</command> might differ.)  If you omit the export of
-    <envar>OBJECT_MODE</envar>, your build may fail with linker errors.  When
-    <envar>OBJECT_MODE</envar> is set, it tells AIX's build utilities
-    such as <command>ar</command>, <command>as</command>, and <command>ld</command> what
-    type of objects to default to handling.
-    </para>
-
-    <para>
-     By default, overcommit of paging space can happen.  While we have
-     not seen this occur, AIX will kill processes when it runs out of
-     memory and the overcommit is accessed.  The closest to this that
-     we have seen is fork failing because the system decided that
-     there was not enough memory for another process.  Like many other
-     parts of AIX, the paging space allocation method and
-     out-of-memory kill is configurable on a system- or process-wide
-     basis if this becomes a problem.
-    </para>
-   </sect3>
-  </sect2>
-
   <sect2 id="installation-notes-cygwin">
    <title>Cygwin</title>
 
diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml
index 64753d9c014..6047b8171d4 100644
--- a/doc/src/sgml/runtime.sgml
+++ b/doc/src/sgml/runtime.sgml
@@ -891,29 +891,6 @@ psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: No such
 
 
     <variablelist>
-     <varlistentry>
-      <term><systemitem class="osname">AIX</systemitem>
-      <indexterm><primary>AIX</primary><secondary>IPC configuration</secondary></indexterm>
-      </term>
-      <listitem>
-       <para>
-        It should not be necessary to do
-        any special configuration for such parameters as
-        <varname>SHMMAX</varname>, as it appears this is configured to
-        allow all memory to be used as shared memory.  That is the
-        sort of configuration commonly used for other databases such
-        as <application>DB/2</application>.</para>
-
-       <para> It might, however, be necessary to modify the global
-       <command>ulimit</command> information in
-       <filename>/etc/security/limits</filename>, as the default hard
-       limits for file sizes (<varname>fsize</varname>) and numbers of
-       files (<varname>nofiles</varname>) might be too low.
-       </para>
-      </listitem>
-     </varlistentry>
-
-
      <varlistentry>
       <term><systemitem class="osname">FreeBSD</systemitem>
       <indexterm><primary>FreeBSD</primary><secondary>IPC configuration</secondary></indexterm>
diff --git a/meson.build b/meson.build
index 8ed51b6aae8..c637a26e275 100644
--- a/meson.build
+++ b/meson.build
@@ -196,26 +196,7 @@ endif
 # that purpose.
 portname = host_system
 
-if host_system == 'aix'
-  library_path_var = 'LIBPATH'
-
-  export_file_format = 'aix'
-  export_fmt = '-Wl,-bE:@0@'
-  mod_link_args_fmt = ['-Wl,-bI:@0@']
-  mod_link_with_dir = 'libdir'
-  mod_link_with_name = '@0@.imp'
-
-  # M:SRE sets a flag indicating that an object is a shared library. Seems to
-  # work in some circumstances without, but required in others.
-  ldflags_sl += '-Wl,-bM:SRE'
-  ldflags_be += '-Wl,-brtllib'
-
-  # Native memset() is faster, tested on:
-  # - AIX 5.1 and 5.2, XLC 6.0 (IBM's cc)
-  # - AIX 5.3 ML3, gcc 4.0.1
-  memset_loop_limit = 0
-
-elif host_system == 'cygwin'
+if host_system == 'cygwin'
   sema_kind = 'unnamed_posix'
   cppflags += '-D_GNU_SOURCE'
   dlsuffix = '.dll'
@@ -1571,7 +1552,7 @@ if cc.links('''
   if not meson.is_cross_build()
     r = cc.run('''
     /* This must match the corresponding code in c.h: */
-    #if defined(__GNUC__) || defined(__SUNPRO_C) || defined(__IBMC__)
+    #if defined(__GNUC__) || defined(__SUNPRO_C)
     #define pg_attribute_aligned(a) __attribute__((aligned(a)))
     #elif defined(_MSC_VER)
     #define pg_attribute_aligned(a) __declspec(align(a))
@@ -2371,10 +2352,6 @@ endif
 # conflict.
 #
 # We assume C99 support, so we don't need to make this conditional.
-#
-# XXX: Historically we allowed platforms to disable restrict in template
-# files, but that was only added for AIX when building with XLC, which we
-# don't support yet.
 cdata.set('pg_restrict', '__restrict')
 
 
diff --git a/src/Makefile.shlib b/src/Makefile.shlib
index 8ca51ca03f7..986c71b5d20 100644
--- a/src/Makefile.shlib
+++ b/src/Makefile.shlib
@@ -106,20 +106,6 @@ ifdef SO_MAJOR_VERSION
 override CPPFLAGS += -DSO_MAJOR_VERSION=$(SO_MAJOR_VERSION)
 endif
 
-ifeq ($(PORTNAME), aix)
-  LINK.shared		= $(COMPILER)
-  ifdef SO_MAJOR_VERSION
-    shlib		= lib$(NAME)$(DLSUFFIX).$(SO_MAJOR_VERSION)
-  endif
-  haslibarule   = yes
-  # $(exports_file) is also usable as an import file
-  exports_file		= lib$(NAME).exp
-  BUILD.exports		= ( echo '\#! $(shlib)'; $(AWK) '/^[^\#]/ {printf "%s\n",$$1}' $< ) > $@
-  ifneq (,$(SHLIB_EXPORTS))
-    LINK.shared		+= -Wl,-bE:$(exports_file)
-  endif
-endif
-
 ifeq ($(PORTNAME), darwin)
   ifdef soname
     # linkable library
@@ -268,14 +254,6 @@ $(stlib): $(OBJS) | $(SHLIB_PREREQS)
 	touch $@
 endif #haslibarule
 
-# AIX wraps shared libraries inside a static library, can be used both
-# for static and shared linking
-ifeq ($(PORTNAME), aix)
-$(stlib): $(shlib)
-	rm -f $(stlib)
-	$(AR) $(AROPT) $(stlib) $(shlib)
-endif # aix
-
 ifeq (,$(filter cygwin win32,$(PORTNAME)))
 
 # Normal case
@@ -289,11 +267,8 @@ ifneq ($(shlib), $(shlib_major))
 endif
 # Make sure we have a link to a name without any version numbers
 ifneq ($(shlib), $(shlib_bare))
-# except on AIX, where that's not a thing
-ifneq ($(PORTNAME), aix)
 	rm -f $(shlib_bare)
 	$(LN_S) $(shlib) $(shlib_bare)
-endif # aix
 endif # shlib_bare
 endif # shlib_major
 
@@ -401,10 +376,6 @@ install-lib-static: $(stlib) installdirs-lib
 
 install-lib-shared: $(shlib) installdirs-lib
 ifdef soname
-# we don't install $(shlib) on AIX
-# (see http://archives.postgresql.org/message-id/52EF20B2E3209443BC37736D00C3C1380A6E79FE@EXADV1.host.magwien.gv.at)
-ifneq ($(PORTNAME), aix)
-	$(INSTALL_SHLIB) $< '$(DESTDIR)$(libdir)/$(shlib)'
 ifneq ($(PORTNAME), cygwin)
 ifneq ($(PORTNAME), win32)
 ifneq ($(shlib), $(shlib_major))
@@ -419,7 +390,6 @@ ifneq ($(shlib), $(shlib_bare))
 endif
 endif # not win32
 endif # not cygwin
-endif # not aix
 ifneq (,$(findstring $(PORTNAME),win32 cygwin))
 	$(INSTALL_SHLIB) $< '$(DESTDIR)$(bindir)/$(shlib)'
 endif
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 7d2ea7d54a6..d66e2a4b9fa 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -62,14 +62,12 @@ all: submake-libpgport submake-catalog-headers submake-utils-headers postgres $(
 
 ifneq ($(PORTNAME), cygwin)
 ifneq ($(PORTNAME), win32)
-ifneq ($(PORTNAME), aix)
 
 postgres: $(OBJS)
 	$(CC) $(CFLAGS) $(call expand_subsys,$^) $(LDFLAGS) $(LIBS) -o $@
 
 endif
 endif
-endif
 
 ifeq ($(PORTNAME), cygwin)
 
@@ -96,24 +94,6 @@ libpostgres.a: postgres
 
 endif # win32
 
-ifeq ($(PORTNAME), aix)
-
-postgres: $(POSTGRES_IMP)
-	$(CC) $(CFLAGS) $(call expand_subsys,$(OBJS)) $(LDFLAGS) -Wl,-bE:$(top_builddir)/src/backend/$(POSTGRES_IMP) $(LIBS) -Wl,-brtllib -o $@
-
-# Linking to a single .o with -r is a lot faster than building a .a or passing
-# all objects to MKLDEXPORT.
-#
-# It looks alluring to use $(CC) -r instead of ld -r, but that doesn't
-# trivially work with gcc, due to gcc specific static libraries linked in with
-# -r.
-$(POSTGRES_IMP): $(OBJS)
-	ld -r -o SUBSYS.o $(call expand_subsys,$^)
-	$(MKLDEXPORT) SUBSYS.o . > $@
-	@rm -f SUBSYS.o
-
-endif # aix
-
 $(top_builddir)/src/port/libpgport_srv.a: | submake-libpgport
 
 
diff --git a/src/backend/meson.build b/src/backend/meson.build
index 8767aaba678..436c04af080 100644
--- a/src/backend/meson.build
+++ b/src/backend/meson.build
@@ -91,21 +91,6 @@ if cc.get_id() == 'msvc'
   # be restricted to b_pch=true.
   backend_link_with += postgres_lib
 
-elif host_system == 'aix'
-  # The '.' argument leads mkldexport.sh to emit "#! .", which refers to the
-  # main executable, allowing extension libraries to resolve their undefined
-  # symbols to symbols in the postgres binary.
-  postgres_imp = custom_target('postgres.imp',
-    command: [files('port/aix/mkldexport.sh'), '@INPUT@', '.'],
-    input: postgres_lib,
-    output: 'postgres.imp',
-    capture: true,
-    install: true,
-    install_dir: dir_lib,
-    build_by_default: false,
-  )
-  backend_link_args += '-Wl,-bE:@0@'.format(postgres_imp.full_path())
-  backend_link_depends += postgres_imp
 endif
 
 backend_input = []
diff --git a/src/backend/port/aix/mkldexport.sh b/src/backend/port/aix/mkldexport.sh
deleted file mode 100755
index adf3793e868..00000000000
--- a/src/backend/port/aix/mkldexport.sh
+++ /dev/null
@@ -1,61 +0,0 @@
-#!/bin/sh
-#
-# mkldexport
-#	create an AIX exports file from an object file
-#
-# src/backend/port/aix/mkldexport.sh
-#
-# Usage:
-#	mkldexport objectfile [location]
-# where
-#	objectfile is the current location of the object file.
-#	location is the eventual (installed) location of the
-#		object file (if different from the current
-#		working directory).
-#
-# [This file comes from the Postgres 4.2 distribution. - ay 7/95]
-#
-# Header: /usr/local/devel/postgres/src/tools/mkldexport/RCS/mkldexport.sh,v 1.2 1994/03/13 04:59:12 aoki Exp
-#
-
-# setting this to nm -B might be better
-# ... due to changes in AIX 4.x ...
-# ... let us search in different directories - Gerhard Reithofer
-if [ -x /usr/ucb/nm ]
-then NM=/usr/ucb/nm
-elif [ -x /usr/bin/nm ]
-then NM=/usr/bin/nm
-elif [ -x /usr/ccs/bin/nm ]
-then NM=/usr/ccs/bin/nm
-elif [ -x /usr/usg/bin/nm ]
-then NM=/usr/usg/bin/nm
-else echo "Fatal error: cannot find `nm' ... please check your installation."
-     exit 1
-fi
-
-CMDNAME=`basename $0`
-if [ -z "$1" ]; then
-	echo "Usage: $CMDNAME object [location]"
-	exit 1
-fi
-OBJNAME=`basename $1`
-if [ "`basename $OBJNAME`" != "`basename $OBJNAME .o`" ]; then
-	OBJNAME=`basename $OBJNAME .o`.so
-fi
-if [ -z "$2" ]; then
-	echo '#!'
-else
-	if [ "$2" = "." ]; then
-		# for the base executable (AIX 4.2 and up)
-		echo '#! .'
-	else
-		echo '#!' $2
-	fi
-fi
-$NM -BCg $1 | \
-	egrep ' [TDB] ' | \
-	sed -e 's/.* //' | \
-	egrep -v '\$' | \
-	sed -e 's/^[.]//' | \
-	sort | \
-	uniq
diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index bba00a0087f..c9719f358b6 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -911,9 +911,7 @@ errcode_for_file_access(void)
 			/* Wrong object type or state */
 		case ENOTDIR:			/* Not a directory */
 		case EISDIR:			/* Is a directory */
-#if defined(ENOTEMPTY) && (ENOTEMPTY != EEXIST) /* same code on AIX */
 		case ENOTEMPTY:			/* Directory not empty */
-#endif
 			edata->sqlerrcode = ERRCODE_WRONG_OBJECT_TYPE;
 			break;
 
diff --git a/src/backend/utils/misc/ps_status.c b/src/backend/utils/misc/ps_status.c
index 8f77f4b563a..ddb45a6bce8 100644
--- a/src/backend/utils/misc/ps_status.c
+++ b/src/backend/utils/misc/ps_status.c
@@ -52,7 +52,7 @@ bool		update_process_title = DEFAULT_UPDATE_PROCESS_TITLE;
 #define PS_USE_SETPROCTITLE_FAST
 #elif defined(HAVE_SETPROCTITLE)
 #define PS_USE_SETPROCTITLE
-#elif defined(__linux__) || defined(_AIX) || defined(__sun) || defined(__darwin__)
+#elif defined(__linux__) || defined(__sun) || defined(__darwin__)
 #define PS_USE_CLOBBER_ARGV
 #elif defined(WIN32)
 #define PS_USE_WIN32
@@ -62,7 +62,7 @@ bool		update_process_title = DEFAULT_UPDATE_PROCESS_TITLE;
 
 
 /* Different systems want the buffer padded differently */
-#if defined(_AIX) || defined(__linux__) || defined(__darwin__)
+#if defined(__linux__) || defined(__darwin__)
 #define PS_PADDING '\0'
 #else
 #define PS_PADDING ' '
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 86cc01a640b..fc6b00224f6 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -400,9 +400,6 @@ is(scalar(@tblspc_tars), 1, 'one tablespace tar was created');
 SKIP:
 {
 	my $tar = $ENV{TAR};
-	# don't check for a working tar here, to accommodate various odd
-	# cases such as AIX. If tar doesn't work the init_from_backup below
-	# will fail.
 	skip "no tar program available", 1
 	  if (!defined $tar || $tar eq '');
 
diff --git a/src/bin/pg_verifybackup/t/008_untar.pl b/src/bin/pg_verifybackup/t/008_untar.pl
index 30d9f3f7f0f..b4235541ab0 100644
--- a/src/bin/pg_verifybackup/t/008_untar.pl
+++ b/src/bin/pg_verifybackup/t/008_untar.pl
@@ -103,9 +103,6 @@ for my $tc (@test_configuration)
 	  SKIP:
 		{
 			my $tar = $ENV{TAR};
-			# don't check for a working tar here, to accommodate various odd
-			# cases such as AIX. If tar doesn't work the init_from_backup below
-			# will fail.
 			skip "no tar program available", 1
 			  if (!defined $tar || $tar eq '');
 
diff --git a/src/bin/pg_verifybackup/t/010_client_untar.pl b/src/bin/pg_verifybackup/t/010_client_untar.pl
index 45010d79ac8..e9c2bcd7aad 100644
--- a/src/bin/pg_verifybackup/t/010_client_untar.pl
+++ b/src/bin/pg_verifybackup/t/010_client_untar.pl
@@ -133,9 +133,6 @@ for my $tc (@test_configuration)
 	  SKIP:
 		{
 			my $tar = $ENV{TAR};
-			# don't check for a working tar here, to accommodate various odd
-			# cases such as AIX. If tar doesn't work the init_from_backup below
-			# will fail.
 			skip "no tar program available", 1
 			  if (!defined $tar || $tar eq '');
 
diff --git a/src/include/c.h b/src/include/c.h
index 2e3ea206e1c..cc53f154041 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -105,8 +105,6 @@
  * GCC: https://gcc.gnu.org/onlinedocs/gcc/Type-Attributes.html
  * Clang: https://clang.llvm.org/docs/AttributeReference.html
  * Sunpro: https://docs.oracle.com/cd/E18659_01/html/821-1384/gjzke.html
- * XLC: https://www.ibm.com/support/knowledgecenter/SSGH2K_13.1.2/com.ibm.xlc131.aix.doc/language_ref/function_attributes.html
- * XLC: https://www.ibm.com/support/knowledgecenter/SSGH2K_13.1.2/com.ibm.xlc131.aix.doc/language_ref/type_attrib.html
  */
 
 /*
@@ -171,8 +169,8 @@
 #define PG_USED_FOR_ASSERTS_ONLY pg_attribute_unused()
 #endif
 
-/* GCC and XLC support format attributes */
-#if defined(__GNUC__) || defined(__IBMC__)
+/* GCC support format attributes */
+#if defined(__GNUC__)
 #define pg_attribute_format_arg(a) __attribute__((format_arg(a)))
 #define pg_attribute_printf(f,a) __attribute__((format(PG_PRINTF_ATTRIBUTE, f, a)))
 #else
@@ -180,8 +178,8 @@
 #define pg_attribute_printf(f,a)
 #endif
 
-/* GCC, Sunpro and XLC support aligned, packed and noreturn */
-#if defined(__GNUC__) || defined(__SUNPRO_C) || defined(__IBMC__)
+/* GCC and Sunpro support aligned, packed and noreturn */
+#if defined(__GNUC__) || defined(__SUNPRO_C)
 #define pg_attribute_aligned(a) __attribute__((aligned(a)))
 #define pg_attribute_noreturn() __attribute__((noreturn))
 #define pg_attribute_packed() __attribute__((packed))
@@ -212,8 +210,8 @@
  * choose not to.  But, if possible, don't force inlining in unoptimized
  * debug builds.
  */
-#if (defined(__GNUC__) && __GNUC__ > 3 && defined(__OPTIMIZE__)) || defined(__SUNPRO_C) || defined(__IBMC__)
-/* GCC > 3, Sunpro and XLC support always_inline via __attribute__ */
+#if (defined(__GNUC__) && __GNUC__ > 3 && defined(__OPTIMIZE__)) || defined(__SUNPRO_C)
+/* GCC > 3 and Sunpro support always_inline via __attribute__ */
 #define pg_attribute_always_inline __attribute__((always_inline)) inline
 #elif defined(_MSC_VER)
 /* MSVC has a special keyword for this */
@@ -229,8 +227,8 @@
  * for proper cost attribution.  Note that unlike the pg_attribute_XXX macros
  * above, this should be placed before the function's return type and name.
  */
-/* GCC, Sunpro and XLC support noinline via __attribute__ */
-#if (defined(__GNUC__) && __GNUC__ > 2) || defined(__SUNPRO_C) || defined(__IBMC__)
+/* GCC and Sunpro support noinline via __attribute__ */
+#if (defined(__GNUC__) && __GNUC__ > 2) || defined(__SUNPRO_C)
 #define pg_noinline __attribute__((noinline))
 /* msvc via declspec */
 #elif defined(_MSC_VER)
diff --git a/src/include/port/aix.h b/src/include/port/aix.h
deleted file mode 100644
index 5b1159c5785..00000000000
--- a/src/include/port/aix.h
+++ /dev/null
@@ -1,14 +0,0 @@
-/*
- * src/include/port/aix.h
- */
-#define CLASS_CONFLICT
-#define DISABLE_XOPEN_NLS
-
-/*
- * "IBM XL C/C++ for AIX, V12.1" miscompiles, for 32-bit, some inline
- * expansions of ginCompareItemPointers() "long long" arithmetic.  To take
- * advantage of inlining, build a 64-bit PostgreSQL.
- */
-#if defined(__ILP32__) && defined(__IBMC__)
-#define PG_FORCE_DISABLE_INLINE
-#endif
diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index bf151037f72..504349080d4 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -84,11 +84,9 @@
  * using compiler intrinsics are a good idea.
  */
 /*
- * gcc or compatible, including clang and icc.  Exclude xlc.  The ppc64le "IBM
- * XL C/C++ for Linux, V13.1.2" emulates gcc, but __sync_lock_test_and_set()
- * of one-byte types elicits SIGSEGV.  That bug was gone by V13.1.5 (2016-12).
+ * gcc or compatible, including clang and icc.
  */
-#if (defined(__GNUC__) || defined(__INTEL_COMPILER)) && !(defined(__IBMC__) || defined(__IBMCPP__))
+#if defined(__GNUC__) || defined(__INTEL_COMPILER)
 #include "port/atomics/generic-gcc.h"
 #elif defined(_MSC_VER)
 #include "port/atomics/generic-msvc.h"
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index 69582f4ae71..29ac6cdcd92 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -414,12 +414,6 @@ typedef unsigned int slock_t;
  * an isync is a sufficient synchronization barrier after a lwarx/stwcx loop.
  * But if the spinlock is in ordinary memory, we can use lwsync instead for
  * better performance.
- *
- * Ordinarily, we'd code the branches here using GNU-style local symbols, that
- * is "1f" referencing "1:" and so on.  But some people run gcc on AIX with
- * IBM's assembler as backend, and IBM's assembler doesn't do local symbols.
- * So hand-code the branch offsets; fortunately, all PPC instructions are
- * exactly 4 bytes each, so it's not too hard to count.
  */
 static __inline__ int
 tas(volatile slock_t *lock)
@@ -430,15 +424,17 @@ tas(volatile slock_t *lock)
 	__asm__ __volatile__(
 "	lwarx   %0,0,%3,1	\n"
 "	cmpwi   %0,0		\n"
-"	bne     $+16		\n"		/* branch to li %1,1 */
+"	bne     1f			\n"
 "	addi    %0,%0,1		\n"
 "	stwcx.  %0,0,%3		\n"
-"	beq     $+12		\n"		/* branch to lwsync */
+"	beq     2f			\n"
+"1: \n"
 "	li      %1,1		\n"
-"	b       $+12		\n"		/* branch to end of asm sequence */
+"	b       3f			\n"
+"2: \n"
 "	lwsync				\n"
 "	li      %1,0		\n"
-
+"3: \n"
 :	"=&b"(_t), "=r"(_res), "+m"(*lock)
 :	"r"(lock)
 :	"memory", "cc");
@@ -666,21 +662,6 @@ tas(volatile slock_t *lock)
 
 #if !defined(HAS_TEST_AND_SET)	/* We didn't trigger above, let's try here */
 
-#if defined(_AIX)	/* AIX */
-/*
- * AIX (POWER)
- */
-#define HAS_TEST_AND_SET
-
-#include <sys/atomic_op.h>
-
-typedef int slock_t;
-
-#define TAS(lock)			_check_lock((slock_t *) (lock), 0, 1)
-#define S_UNLOCK(lock)		_clear_lock((slock_t *) (lock), 0)
-#endif	 /* _AIX */
-
-
 /* These are in sunstudio_(sparc|x86).s */
 
 #if defined(__SUNPRO_C) && (defined(__i386) || defined(__x86_64__) || defined(__sparc__) || defined(__sparc))
diff --git a/src/interfaces/libpq/Makefile b/src/interfaces/libpq/Makefile
index 083ca6f4cce..fe2af575c5d 100644
--- a/src/interfaces/libpq/Makefile
+++ b/src/interfaces/libpq/Makefile
@@ -114,7 +114,7 @@ backend_src = $(top_srcdir)/src/backend
 # coding rule.
 libpq-refs-stamp: $(shlib)
 ifneq ($(enable_coverage), yes)
-ifeq (,$(filter aix solaris,$(PORTNAME)))
+ifeq (,$(filter solaris,$(PORTNAME)))
 	@if nm -A -u $< 2>/dev/null | grep -v -e __cxa_atexit -e __tsan_func_exit | grep exit; then \
 		echo 'libpq must not be calling any function which invokes exit'; exit 1; \
 	fi
diff --git a/src/interfaces/libpq/meson.build b/src/interfaces/libpq/meson.build
index a47b6f425dd..be6fadaea23 100644
--- a/src/interfaces/libpq/meson.build
+++ b/src/interfaces/libpq/meson.build
@@ -54,9 +54,8 @@ libpq_c_args = ['-DSO_MAJOR_VERSION=5']
 #    libpq_st, and {pgport,common}_shlib for libpq_sh
 #
 # We could try to avoid building the source files twice, but it probably adds
-# more complexity than its worth (AIX doesn't support link_whole yet, reusing
-# object files requires also linking to the library on windows or breaks
-# precompiled headers).
+# more complexity than its worth (reusing object files requires also linking
+# to the library on windows or breaks precompiled headers).
 libpq_st = static_library('libpq',
   libpq_sources,
   include_directories: [libpq_inc],
diff --git a/src/makefiles/Makefile.aix b/src/makefiles/Makefile.aix
deleted file mode 100644
index dd16a7a0378..00000000000
--- a/src/makefiles/Makefile.aix
+++ /dev/null
@@ -1,39 +0,0 @@
-# MAKE_EXPORTS is required for svr4 loaders that want a file of
-# symbol names to tell them what to export/import.
-MAKE_EXPORTS= true
-
-# -blibpath must contain ALL directories where we should look for libraries
-libpath := $(shell echo $(subst -L,:,$(filter -L/%,$(LDFLAGS))) | sed -e's/ //g'):/usr/lib:/lib
-
-# when building with gcc, need to make sure that libgcc can be found
-ifeq ($(GCC), yes)
-libpath := $(libpath):$(dir $(shell gcc -print-libgcc-file-name))
-endif
-
-rpath = -Wl,-blibpath:'$(rpathdir)$(libpath)'
-
-LDFLAGS_SL += -Wl,-bnoentry -Wl,-H512 -Wl,-bM:SRE
-
-# gcc needs to know it's building a shared lib, otherwise it'll not emit
-# correct code / link to the right support libraries
-ifeq ($(GCC), yes)
-LDFLAGS_SL += -shared
-endif
-
-# env var name to use in place of LD_LIBRARY_PATH
-ld_library_path_var = LIBPATH
-
-
-POSTGRES_IMP= postgres.imp
-
-ifdef PGXS
-BE_DLLLIBS= -Wl,-bI:$(pkglibdir)/$(POSTGRES_IMP)
-else
-BE_DLLLIBS= -Wl,-bI:$(top_builddir)/src/backend/$(POSTGRES_IMP)
-endif
-
-MKLDEXPORT_DIR=src/backend/port/aix
-MKLDEXPORT=$(top_srcdir)/$(MKLDEXPORT_DIR)/mkldexport.sh
-
-%$(DLSUFFIX): %.o
-	$(CC) $(CFLAGS) $*.o $(LDFLAGS) $(LDFLAGS_SL) -o $@ $(BE_DLLLIBS)
diff --git a/src/port/README b/src/port/README
index 97f18a62338..ed5c54a72fa 100644
--- a/src/port/README
+++ b/src/port/README
@@ -28,5 +28,5 @@ applications.
 from libpgport are linked first.  This avoids having applications
 dependent on symbols that are _used_ by libpq, but not intended to be
 exported by libpq.  libpq's libpgport usage changes over time, so such a
-dependency is a problem.  Windows, Linux, AIX, and macOS use an export
+dependency is a problem.  Windows, Linux, and macOS use an export
 list to control the symbols exported by libpq.
diff --git a/src/port/strerror.c b/src/port/strerror.c
index 1070a49802e..4918ba821c1 100644
--- a/src/port/strerror.c
+++ b/src/port/strerror.c
@@ -214,10 +214,8 @@ get_errno_symbol(int errnum)
 			return "ENOTCONN";
 		case ENOTDIR:
 			return "ENOTDIR";
-#if defined(ENOTEMPTY) && (ENOTEMPTY != EEXIST) /* same code on AIX */
 		case ENOTEMPTY:
 			return "ENOTEMPTY";
-#endif
 		case ENOTSOCK:
 			return "ENOTSOCK";
 #ifdef ENOTSUP
diff --git a/src/template/aix b/src/template/aix
deleted file mode 100644
index 47fa8990a7c..00000000000
--- a/src/template/aix
+++ /dev/null
@@ -1,25 +0,0 @@
-# src/template/aix
-
-# Set default options if using xlc.  This formerly included -qsrcmsg, but that
-# option elicits internal compiler errors from xlc v16.1.0.  Note: configure
-# will add -qnoansialias if the compiler accepts it, even if user specifies a
-# non-default CFLAGS setting.
-if test "$GCC" != yes ; then
-  case $host_os in
-    *)
-      CFLAGS="-O2 -qmaxmem=16384"
-      ;;
-  esac
-
-  # Due to a compiler bug, see 20171013023536.GA492146@rfd.leadboat.com for details,
-  # force restrict not to be used when compiling with xlc.
-  FORCE_DISABLE_RESTRICT=yes
-fi
-
-# Extra CFLAGS for code that will go into a shared library
-CFLAGS_SL=""
-
-# Native memset() is faster, tested on:
-# 	AIX 5.1 and 5.2, XLC 6.0 (IBM's cc)
-# 	AIX 5.3 ML3, gcc 4.0.1
-MEMSET_LOOP_LIMIT=0
diff --git a/src/test/regress/Makefile b/src/test/regress/Makefile
index 7c665ff892d..6409a485e84 100644
--- a/src/test/regress/Makefile
+++ b/src/test/regress/Makefile
@@ -7,11 +7,6 @@
 # GNU make uses a make file named "GNUmakefile" in preference to "Makefile"
 # if it exists.  Postgres is shipped with a "GNUmakefile".
 
-
-# AIX make defaults to building *every* target of the first rule.  Start with
-# a single-target, empty rule to make the other targets non-default.
-all:
-
 all install clean check installcheck:
 	@echo "You must use GNU make to use Postgres.  It may be installed"
 	@echo "on your system with the name 'gmake'."
diff --git a/src/test/regress/expected/sanity_check.out b/src/test/regress/expected/sanity_check.out
index c5c675b7508..82972b70e6f 100644
--- a/src/test/regress/expected/sanity_check.out
+++ b/src/test/regress/expected/sanity_check.out
@@ -26,12 +26,15 @@ SELECT relname, relkind
 (0 rows)
 
 --
--- When ALIGNOF_DOUBLE==4 (e.g. AIX), the C ABI may impose 8-byte alignment on
+-- When MAXIMUM_ALIGNOF==8 but ALIGNOF_DOUBLE==4, the C ABI may impose 8-byte alignment
 -- some of the C types that correspond to TYPALIGN_DOUBLE SQL types.  To ensure
 -- catalog C struct layout matches catalog tuple layout, arrange for the tuple
 -- offset of each fixed-width, attalign='d' catalog column to be divisible by 8
 -- unconditionally.  Keep such columns before the first NameData column of the
 -- catalog, since packagers can override NAMEDATALEN to an odd number.
+-- (XXX: I'm not sure if any of the supported platforms have MAXIMUM_ALIGNOF==8 and
+-- ALIGNOF_DOUBLE==4.  Perhaps we should just require that
+-- ALIGNOF_DOUBLE==MAXIMUM_ALIGNOF)
 --
 WITH check_columns AS (
  SELECT relname, attname,
diff --git a/src/test/regress/sql/sanity_check.sql b/src/test/regress/sql/sanity_check.sql
index 7f338d191c6..2e9d5ebef3f 100644
--- a/src/test/regress/sql/sanity_check.sql
+++ b/src/test/regress/sql/sanity_check.sql
@@ -21,12 +21,15 @@ SELECT relname, relkind
        AND relfilenode <> 0;
 
 --
--- When ALIGNOF_DOUBLE==4 (e.g. AIX), the C ABI may impose 8-byte alignment on
+-- When MAXIMUM_ALIGNOF==8 but ALIGNOF_DOUBLE==4, the C ABI may impose 8-byte alignment
 -- some of the C types that correspond to TYPALIGN_DOUBLE SQL types.  To ensure
 -- catalog C struct layout matches catalog tuple layout, arrange for the tuple
 -- offset of each fixed-width, attalign='d' catalog column to be divisible by 8
 -- unconditionally.  Keep such columns before the first NameData column of the
 -- catalog, since packagers can override NAMEDATALEN to an odd number.
+-- (XXX: I'm not sure if any of the supported platforms have MAXIMUM_ALIGNOF==8 and
+-- ALIGNOF_DOUBLE==4.  Perhaps we should just require that
+-- ALIGNOF_DOUBLE==MAXIMUM_ALIGNOF)
 --
 WITH check_columns AS (
  SELECT relname, attname,
diff --git a/src/tools/gen_export.pl b/src/tools/gen_export.pl
index 888c8a197a9..d9fdaaaf6d0 100644
--- a/src/tools/gen_export.pl
+++ b/src/tools/gen_export.pl
@@ -16,12 +16,11 @@ GetOptions(
 	'input:s' => \$input,
 	'output:s' => \$output) or die "wrong arguments";
 
-if (not(   $format eq 'aix'
-		or $format eq 'darwin'
+if (not(   $format eq 'darwin'
 		or $format eq 'gnu'
 		or $format eq 'win'))
 {
-	die "$0: $format is not yet handled (only aix, darwin, gnu, win are)\n";
+	die "$0: $format is not yet handled (only darwin, gnu, win are)\n";
 }
 
 open(my $input_handle, '<', $input)
@@ -56,11 +55,7 @@ while (<$input_handle>)
 	}
 	elsif (/^(\S+)\s+(\S+)/)
 	{
-		if ($format eq 'aix')
-		{
-			print $output_handle "$1\n";
-		}
-		elsif ($format eq 'darwin')
+		if ($format eq 'darwin')
 		{
 			print $output_handle "_$1\n";
 		}
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index 7edfc44b49a..a46ff52cc1b 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -61,7 +61,6 @@ do
 
 	# These files are platform-specific, and c.h will include the
 	# one that's relevant for our current platform anyway.
-	test "$f" = src/include/port/aix.h && continue
 	test "$f" = src/include/port/cygwin.h && continue
 	test "$f" = src/include/port/darwin.h && continue
 	test "$f" = src/include/port/freebsd.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index 84b892b5c51..0e2d7f537ef 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -57,7 +57,6 @@ do
 
 	# These files are platform-specific, and c.h will include the
 	# one that's relevant for our current platform anyway.
-	test "$f" = src/include/port/aix.h && continue
 	test "$f" = src/include/port/cygwin.h && continue
 	test "$f" = src/include/port/darwin.h && continue
 	test "$f" = src/include/port/freebsd.h && continue
-- 
2.39.2

#32Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#31)
Re: Relation bulk write facility

Heikki Linnakangas <hlinnaka@iki.fi> writes:

What do y'all think of adding a check for
ALIGNOF_DOUBLE==MAXIMUM_ALIGNOF to configure.ac and meson.build? It's
not a requirement today, but I believe AIX was the only platform where
that was not true. With AIX gone, that combination won't be tested, and
we will probably break it sooner or later.

+1, and then probably revert the whole test addition of 79b716cfb7a.
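
For readers following along, the invariant can be sketched as a standalone C
translation unit. This is illustrative only, not the committed
configure.ac/meson.build wording; PROBE_ALIGNOF and APPROX_MAXIMUM_ALIGNOF are
stand-ins for the values configure derives into pg_config.h, measured here the
classic way (offset of a member that follows a char), which is roughly what
Autoconf's AC_CHECK_ALIGNOF does:

/* align_check_sketch.c - illustrative only, not PostgreSQL source.
 * The probe structs measure each type's struct-member alignment; the
 * APPROX_* macro approximates MAXIMUM_ALIGNOF, whose real value comes
 * from pg_config.h.
 */
#include <stddef.h>

typedef struct { char pad; double    val; } probe_double;
typedef struct { char pad; long long val; } probe_int64;
typedef struct { char pad; long      val; } probe_long;

#define PROBE_ALIGNOF(probe_t) offsetof(probe_t, val)
#define MAX2(a, b) ((a) > (b) ? (a) : (b))

#define APPROX_MAXIMUM_ALIGNOF \
	MAX2(PROBE_ALIGNOF(probe_long), \
		 MAX2(PROBE_ALIGNOF(probe_double), PROBE_ALIGNOF(probe_int64)))

_Static_assert(PROBE_ALIGNOF(probe_double) == APPROX_MAXIMUM_ALIGNOF,
			   "ALIGNOF_DOUBLE must equal MAXIMUM_ALIGNOF");

int main(void) { return 0; }

On the platforms still in the buildfarm the two values match (8 and 8 on the
common 64-bit ABIs; 4 and 4 on 32-bit i386, like lapwing below); AIX, where
double's alignment fell below MAXIMUM_ALIGNOF, was the odd one out.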

I did a quick scrape of the buildfarm, and identified these as the
only animals reporting ALIGNOF_DOUBLE less than 8:

$ grep 'alignment of double' alignments | grep -v ' 8$'
hornet | 2024-02-22 16:26:16 | checking alignment of double... 4
lapwing | 2024-02-27 12:40:15 | checking alignment of double... (cached) 4
mandrill | 2024-02-19 01:03:47 | checking alignment of double... 4
sungazer | 2024-02-21 00:22:48 | checking alignment of double... 4
tern | 2024-02-22 13:25:12 | checking alignment of double... 4

With AIX out of the picture, lapwing will be the only remaining
animal testing MAXALIGN less than 8. That seems like a single
point of failure ... should we spin up another couple 32-bit
animals? I had supposed that my faithful old PPC animal mamba
was helping to check this, but I see that under NetBSD it's
joined the ALIGNOF_DOUBLE==8 crowd.

regards, tom lane

#33Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#32)
Re: Relation bulk write facility

Hi,

On 2024-02-27 15:45:45 -0500, Tom Lane wrote:

Heikki Linnakangas <hlinnaka@iki.fi> writes:
With AIX out of the picture, lapwing will be the only remaining
animal testing MAXALIGN less than 8. That seems like a single
point of failure ... should we spin up another couple 32-bit
animals? I had supposed that my faithful old PPC animal mamba
was helping to check this, but I see that under NetBSD it's
joined the ALIGNOF_DOUBLE==8 crowd.

I can set up an i386 animal, albeit on an amd64 kernel. But I don't think the
latter matters.

Greetings,

Andres Freund

#34Thomas Munro
thomas.munro@gmail.com
In reply to: Heikki Linnakangas (#31)
Re: Relation bulk write facility

On Wed, Feb 28, 2024 at 9:24 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Here's a patch to fully remove AIX support.

--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -3401,7 +3401,7 @@ export MANPATH
   <para>
    <productname>PostgreSQL</productname> can be expected to work on current
    versions of these operating systems: Linux, Windows,
-   FreeBSD, OpenBSD, NetBSD, DragonFlyBSD, macOS, AIX, Solaris, and illumos.
+   FreeBSD, OpenBSD, NetBSD, DragonFlyBSD, macOS, Solaris, and illumos.

There is also a little roll-of-honour of operating systems we used to
support, just a couple of paragraphs down, where AIX should appear.

#35Noah Misch
noah@leadboat.com
In reply to: Heikki Linnakangas (#31)
Re: Relation bulk write facility

On Wed, Feb 28, 2024 at 12:24:01AM +0400, Heikki Linnakangas wrote:

Here's a patch to fully remove AIX support.

Subject: [PATCH 1/1] Remove AIX support

There isn't a lot of user demand for AIX support, no one has stepped
up to the plate to properly maintain it, so it's best to remove it

Regardless of how someone were to step up to maintain it, we'd be telling them
such contributions have negative value and must stop. We're expelling AIX due
to low demand, compiler bugs, its ABI, and its shlib symbol export needs.

altogether. AIX is still supported for stable versions.

The acute issue that triggered this decision was that after commit
8af2565248, the AIX buildfarm members have been hitting this
assertion:

TRAP: failed Assert("(uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer)"), File: "md.c", Line: 472, PID: 2949728

Apparently the "pg_attribute_aligned(a)" attribute doesn't work on AIX
(linker?) for values larger than PG_IO_ALIGN_SIZE.

No; see /messages/by-id/20240225194322.a5@rfd.leadboat.com

#36Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#31)
Re: Relation bulk write facility

Hi,

On 2024-02-28 00:24:01 +0400, Heikki Linnakangas wrote:

Here's a patch to fully remove AIX support.

Thomas mentioned to me that cfbot failed with this applied:
https://cirrus-ci.com/task/6348635416297472
https://api.cirrus-ci.com/v1/artifact/task/6348635416297472/log/tmp_install/log/initdb-template.log

initdb: error while loading shared libraries: libpq.so.5: cannot open shared object file: No such file or directory

While I couldn't reproduce the failure, I did notice that locally with the
patch applied, system libpq ended up getting used. Which isn't pre-installed
in the CI environment, explaining the failure.

The problem is due to this hunk:

@@ -401,10 +376,6 @@ install-lib-static: $(stlib) installdirs-lib

install-lib-shared: $(shlib) installdirs-lib
ifdef soname
-# we don't install $(shlib) on AIX
-# (see http://archives.postgresql.org/message-id/52EF20B2E3209443BC37736D00C3C1380A6E79FE@EXADV1.host.magwien.gv.at)
-ifneq ($(PORTNAME), aix)
- $(INSTALL_SHLIB) $< '$(DESTDIR)$(libdir)/$(shlib)'
ifneq ($(PORTNAME), cygwin)
ifneq ($(PORTNAME), win32)
ifneq ($(shlib), $(shlib_major))

So the versioned name didn't end up getting installed anymore, leading to
broken symlinks in the install directory.

diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 86cc01a640b..fc6b00224f6 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -400,9 +400,6 @@ is(scalar(@tblspc_tars), 1, 'one tablespace tar was created');
SKIP:
{
my $tar = $ENV{TAR};
-	# don't check for a working tar here, to accommodate various odd
-	# cases such as AIX. If tar doesn't work the init_from_backup below
-	# will fail.
skip "no tar program available", 1
if (!defined $tar || $tar eq '');

Maybe better to not remove the whole comment, just the reference to AIX?

diff --git a/src/test/regress/sql/sanity_check.sql b/src/test/regress/sql/sanity_check.sql
index 7f338d191c6..2e9d5ebef3f 100644
--- a/src/test/regress/sql/sanity_check.sql
+++ b/src/test/regress/sql/sanity_check.sql
@@ -21,12 +21,15 @@ SELECT relname, relkind
AND relfilenode <> 0;
--
--- When ALIGNOF_DOUBLE==4 (e.g. AIX), the C ABI may impose 8-byte alignment on
+-- When MAXIMUM_ALIGNOF==8 but ALIGNOF_DOUBLE==4, the C ABI may impose 8-byte alignment
-- some of the C types that correspond to TYPALIGN_DOUBLE SQL types.  To ensure
-- catalog C struct layout matches catalog tuple layout, arrange for the tuple
-- offset of each fixed-width, attalign='d' catalog column to be divisible by 8
-- unconditionally.  Keep such columns before the first NameData column of the
-- catalog, since packagers can override NAMEDATALEN to an odd number.
+-- (XXX: I'm not sure if any of the supported platforms have MAXIMUM_ALIGNOF==8 and
+-- ALIGNOF_DOUBLE==4.  Perhaps we should just require that
+-- ALIGNOF_DOUBLE==MAXIMUM_ALIGNOF)
--
WITH check_columns AS (
SELECT relname, attname,

I agree, this should be an error, and we should then remove the test.

Greetings,

Andres Freund

#37Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#33)
Re: Relation bulk write facility

Hi,

On 2024-02-27 12:59:14 -0800, Andres Freund wrote:

On 2024-02-27 15:45:45 -0500, Tom Lane wrote:

Heikki Linnakangas <hlinnaka@iki.fi> writes:
With AIX out of the picture, lapwing will be the only remaining
animal testing MAXALIGN less than 8. That seems like a single
point of failure ... should we spin up another couple 32-bit
animals? I had supposed that my faithful old PPC animal mamba
was helping to check this, but I see that under NetBSD it's
joined the ALIGNOF_DOUBLE==8 crowd.

I can set up an i386 animal, albeit on an amd64 kernel. But I don't think the
latter matters.

That animal is now running, named "adder". Due to a typo there are still
spurious errors on the older branches, but I've triggered those to be re-run.

Currently adder builds with autoconf on older branches and with meson on newer
ones. Is it worth setting up two animals so we cover both ac and meson with 32
bit on 16/HEAD?

There's something odd about how we fail when not specifying the correct PERL
at configure time:
/home/bf/bf-build/adder/REL_13_STABLE/pgsql.build/../pgsql/src/pl/plperl/Util.c: loadable library and perl binaries are mismatched (got first handshake key 0x93c0080, needed 0x9580080)

Not sure what gets linked against what wrongly. But I'm also not sure it's
worth the energy to investigate.

Greetings,

Andres Freund

#38Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andres Freund (#36)
Re: Relation bulk write facility

Committed, after fixing the various little things you pointed out:

On 28/02/2024 00:30, Thomas Munro wrote:

--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -3401,7 +3401,7 @@ export MANPATH
<para>
<productname>PostgreSQL</productname> can be expected to work on current
versions of these operating systems: Linux, Windows,
-   FreeBSD, OpenBSD, NetBSD, DragonFlyBSD, macOS, AIX, Solaris, and illumos.
+   FreeBSD, OpenBSD, NetBSD, DragonFlyBSD, macOS, Solaris, and illumos.

There is also a little roll-of-honour of operating systems we used to
support, just a couple of paragraphs down, where AIX should appear.

Added.

On 28/02/2024 05:52, Noah Misch wrote:

Regardless of how someone were to step up to maintain it, we'd be telling them
such contributions have negative value and must stop. We're expelling AIX due
to low demand, compiler bugs, its ABI, and its shlib symbol export needs.

Reworded.

Apparently the "pg_attribute_aligned(a)" attribute doesn't work on AIX
(linker?) for values larger than PG_IO_ALIGN_SIZE.

No; see /messages/by-id/20240225194322.a5@rfd.leadboat.com

Ok, reworded.

On 28/02/2024 11:26, Andres Freund wrote:

On 2024-02-28 00:24:01 +0400, Heikki Linnakangas wrote:
The problem is due to this hunk:

@@ -401,10 +376,6 @@ install-lib-static: $(stlib) installdirs-lib

install-lib-shared: $(shlib) installdirs-lib
ifdef soname
-# we don't install $(shlib) on AIX
-# (see http://archives.postgresql.org/message-id/52EF20B2E3209443BC37736D00C3C1380A6E79FE@EXADV1.host.magwien.gv.at)
-ifneq ($(PORTNAME), aix)
- $(INSTALL_SHLIB) $< '$(DESTDIR)$(libdir)/$(shlib)'
ifneq ($(PORTNAME), cygwin)
ifneq ($(PORTNAME), win32)
ifneq ($(shlib), $(shlib_major))

So the versioned name didn't end up getting installed anymore, leading to
broken symlinks in the install directory.

Fixed, thanks!

diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 86cc01a640b..fc6b00224f6 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -400,9 +400,6 @@ is(scalar(@tblspc_tars), 1, 'one tablespace tar was created');
SKIP:
{
my $tar = $ENV{TAR};
-	# don't check for a working tar here, to accommodate various odd
-	# cases such as AIX. If tar doesn't work the init_from_backup below
-	# will fail.
skip "no tar program available", 1
if (!defined $tar || $tar eq '');

Maybe better to not remove the whole comment, just the reference to AIX?

Ok, done

diff --git a/src/test/regress/sql/sanity_check.sql b/src/test/regress/sql/sanity_check.sql
index 7f338d191c6..2e9d5ebef3f 100644
--- a/src/test/regress/sql/sanity_check.sql
+++ b/src/test/regress/sql/sanity_check.sql
@@ -21,12 +21,15 @@ SELECT relname, relkind
AND relfilenode <> 0;
--
--- When ALIGNOF_DOUBLE==4 (e.g. AIX), the C ABI may impose 8-byte alignment on
+-- When MAXIMUM_ALIGNOF==8 but ALIGNOF_DOUBLE==4, the C ABI may impose 8-byte alignment
-- some of the C types that correspond to TYPALIGN_DOUBLE SQL types.  To ensure
-- catalog C struct layout matches catalog tuple layout, arrange for the tuple
-- offset of each fixed-width, attalign='d' catalog column to be divisible by 8
-- unconditionally.  Keep such columns before the first NameData column of the
-- catalog, since packagers can override NAMEDATALEN to an odd number.
+-- (XXX: I'm not sure if any of the supported platforms have MAXIMUM_ALIGNOF==8 and
+-- ALIGNOF_DOUBLE==4.  Perhaps we should just require that
+-- ALIGNOF_DOUBLE==MAXIMUM_ALIGNOF)
--
WITH check_columns AS (
SELECT relname, attname,

I agree, this should be an error, and we should then remove the test.

Done.
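
As an aside, the hazard that the removed test guarded against can be sketched
with made-up numbers. The HYPOTHETICAL_* constants below are assumptions
standing in for an old AIX-like ABI rather than pg_config.h values; the point
is only that the tuple-layout offset (which PostgreSQL computes with
ALIGNOF_DOUBLE for attalign='d' columns) and the compiler's struct-member
offset can disagree when the two alignments differ:

/* offset_mismatch_sketch.c - hypothetical numbers, not PostgreSQL code.
 * Models a column of an 8-byte-aligned C type (e.g. int64) declared with
 * attalign='d' on an ABI where double is only 4-byte aligned.
 */
#include <stdio.h>

#define HYPOTHETICAL_ALIGNOF_DOUBLE 4	/* alignment used for 'd' tuple columns */
#define HYPOTHETICAL_ALIGNOF_INT64  8	/* alignment the C ABI gives the member */
#define TYPEALIGN(alignval, len) (((len) + (alignval) - 1) & ~((alignval) - 1))

int main(void)
{
	/* the column follows a single 4-byte column */
	unsigned tuple_off  = TYPEALIGN(HYPOTHETICAL_ALIGNOF_DOUBLE, 4);	/* = 4 */
	unsigned struct_off = TYPEALIGN(HYPOTHETICAL_ALIGNOF_INT64, 4);		/* = 8 */

	printf("tuple offset = %u, struct offset = %u\n", tuple_off, struct_off);
	return 0;
}

Keeping such columns at offsets divisible by 8, as the removed test enforced,
papered over that disagreement; requiring ALIGNOF_DOUBLE == MAXIMUM_ALIGNOF
should make the mismatched case unreachable instead.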

--
Heikki Linnakangas
Neon (https://neon.tech)

#39Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#38)
Re: Relation bulk write facility

Heikki Linnakangas <hlinnaka@iki.fi> writes:

On 28/02/2024 00:30, Thomas Munro wrote:

I agree, this should be an error, and we should then remove the test.

Done.

The commit that added that test added a support function
"get_columns_length" which is now unused. Should we get rid of that
as well?

regards, tom lane

#40Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Tom Lane (#39)
Re: Relation bulk write facility

On 28/02/2024 18:04, Tom Lane wrote:

Heikki Linnakangas <hlinnaka@iki.fi> writes:

On 28/02/2024 00:30, Thomas Munro wrote:

I agree, this should be an error, and we should then remove the test.

Done.

The commit that added that test added a support function
"get_columns_length" which is now unused. Should we get rid of that
as well?

I see you just removed it; thanks!

--
Heikki Linnakangas
Neon (https://neon.tech)

#41Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#40)
Re: Relation bulk write facility

Heikki Linnakangas <hlinnaka@iki.fi> writes:

On 28/02/2024 18:04, Tom Lane wrote:

The commit that added that test added a support function
"get_columns_length" which is now unused. Should we get rid of that
as well?

I see you just removed it; thanks!

In the no-good-deed-goes-unpunished department: crake is reporting
that this broke our cross-version upgrade tests. I'll go fix that.

regards, tom lane

#42Michael Banck
mbanck@gmx.net
In reply to: Andres Freund (#22)
Remove AIX Support (was: Re: Relation bulk write facility)

Hi,

On Sat, Feb 24, 2024 at 01:29:36PM -0800, Andres Freund wrote:

Let's just drop AIX. This isn't the only alignment issue we've found and the
solution for those isn't so much a fix as forcing everyone to carefully only
look into one direction and not notice the cliffs to either side.

While I am not against dropping AIX (and certainly won't step up to
maintain it just for fun), I don't think burying this inside some
"Relation bulk write facility" thread is helpful; I have changed the
thread title as a first step.

The commit message says there is not a lot of user demand and that might
be right, but contrary to other fringe OSes that got removed like HPPA
or Irix, I believe Postgres on AIX is still used in production and if
so, probably in a mission-critical manner at some old-school
institutions (in fact, one of our customers does just that) and not as a
thought-experiment. It is probably well-known among Postgres hackers
that AIX support is problematic/a burden, but the current users might
not be aware of this.

Not sure what to do about this (especially now that this has been
committed), maybe there should have been be a public deprecation notice
first for v17... On the other hand, that might not work if important
features like direct-IO would have to be bumped from v17 just because of
AIX.

I posted about this on Twitter and Mastodon to see whether anybody
would complain, and did not get a lot of feedback.

In any case, users will have a couple of years to migrate as usual if
they upgrade to v16.

Michael

#43Daniel Gustafsson
daniel@yesql.se
In reply to: Michael Banck (#42)
Re: Remove AIX Support (was: Re: Relation bulk write facility)

On 29 Feb 2024, at 09:13, Michael Banck <mbanck@gmx.net> wrote:

In any case, users will have a couple of years to migrate as usual if
they upgrade to v16.

As you say, there are many years left of AIX being supported so there is plenty
of runway for planning a migration.

--
Daniel Gustafsson

#44Andres Freund
andres@anarazel.de
In reply to: Michael Banck (#42)
Re: Remove AIX Support (was: Re: Relation bulk write facility)

Hi,

On 2024-02-29 09:13:04 +0100, Michael Banck wrote:

The commit message says there is not a lot of user demand and that might
be right, but contrary to other fringe OSes that got removed like HPPA
or Irix, I believe Postgres on AIX is still used in production and if
so, probably in a mission-critical manner at some old-school
institutions (in fact, one of our customers does just that) and not as a
thought-experiment. It is probably well-known among Postgres hackers
that AIX support is problematic/a burden, but the current users might
not be aware of this.

Then these users should have paid somebody to actually do maintenance work on
the AIX support, so it doesn't regularly stand in the way of implementing
various things.

Greetings,

Andres Freund

#45Michael Banck
mbanck@gmx.net
In reply to: Andres Freund (#44)
Re: Remove AIX Support (was: Re: Relation bulk write facility)

Hi,

On Thu, Feb 29, 2024 at 12:57:31AM -0800, Andres Freund wrote:

On 2024-02-29 09:13:04 +0100, Michael Banck wrote:

The commit message says there is not a lot of user demand and that might
be right, but contrary to other fringe OSes that got removed like HPPA
or Irix, I believe Postgres on AIX is still used in production and if
so, probably in a mission-critical manner at some old-school
institutions (in fact, one of our customers does just that) and not as a
thought-experiment. It is probably well-known among Postgres hackers
that AIX support is problematic/a burden, but the current users might
not be aware of this.

Then these users should have paid somebody to actually do maintenance work on
the AIX support, so it doesn't regularly stand in the way of implementing
various things.

Right, absolutely.

But: did we ever tell them to do that? I don't think it's reasonable to
expect them to follow -hackers and jump in when somebody grumbles
about AIX being a burden somewhere deep down a thread...

Michael

#46Andres Freund
andres@anarazel.de
In reply to: Michael Banck (#45)
Re: Remove AIX Support (was: Re: Relation bulk write facility)

Hi,

On 2024-02-29 10:24:24 +0100, Michael Banck wrote:

On Thu, Feb 29, 2024 at 12:57:31AM -0800, Andres Freund wrote:

On 2024-02-29 09:13:04 +0100, Michael Banck wrote:

The commit message says there is not a lot of user demand and that might
be right, but contrary to other fringe OSes that got removed like HPPA
or Irix, I believe Postgres on AIX is still used in production and if
so, probably in a mission-critical manner at some old-school
institutions (in fact, one of our customers does just that) and not as a
thought-experiment. It is probably well-known among Postgres hackers
that AIX support is problematic/a burden, but the current users might
not be aware of this.

Then these users should have paid somebody to actually do maintenance work on
the AIX support, so it doesn't regularly stand in the way of implementing
various things.

Right, absolutely.

But: did we ever tell them to do that? I don't think it's reasonable to
expect them to follow -hackers and jump in when somebody grumbles
about AIX being a burden somewhere deep down a thread...

Well, the thing is that it's commonly going to be deep down some threads that
portability problems cause pain. This is far from the only time. Just a few
threads:

/messages/by-id/CA+TgmoauCAv+p4Z57PqgVgNxsApxKs3Yh9mDLdUDB8fep-s=1w@mail.gmail.com
/messages/by-id/CA+hUKGK=DOC+hE-62FKfZy=Ybt5uLkrg3zCZD-jFykM-iPn8yw@mail.gmail.com
/messages/by-id/20230124165814.2njc7gnvubn2amh6@awork3.anarazel.de
/messages/by-id/2385119.1696354473@sss.pgh.pa.us
/messages/by-id/20221005200710.luvw5evhwf6clig6@awork3.anarazel.de
/messages/by-id/20220820204401.vrf5kejih6jofvqb@awork3.anarazel.de
/messages/by-id/E1oWpzF-002EG4-AG@gemulon.postgresql.org

This is far from all.

The only platform rivalling AIX on the pain-caused metric is Windows. And at
least that can be tested via CI (or locally). We've been relying on the gcc
buildfarm to be able to maintain AIX at all, and that's not a resource that
scales to many users.

Greetings,

Andres Freund

#47Daniel Gustafsson
daniel@yesql.se
In reply to: Michael Banck (#45)
Re: Remove AIX Support (was: Re: Relation bulk write facility)

On 29 Feb 2024, at 10:24, Michael Banck <mbanck@gmx.net> wrote:
On Thu, Feb 29, 2024 at 12:57:31AM -0800, Andres Freund wrote:

Then these users should have paid somebody to actually do maintenance work on
the AIX support, so it doesn't regularly stand in the way of implementing
various things.

Right, absolutely.

But: did we ever tell them to do that? I don't think it's reasonable to
expect them to follow -hackers and jump in when somebody grumbles
about AIX being a burden somewhere deep down a thread...

Having spent a fair bit of time within open source projects that companies rely
on, my experience is that the companies who need to hear such news have zero
interaction with the project and most of the time don't even know the project
community exists. That conversely also means that the project doesn't know they
exist. If their consultants and suppliers, who have a higher probability of
knowing this, haven't told them, then it's highly unlikely that anything we say
will get across.

--
Daniel Gustafsson

#48Phil Florent
philflorent@hotmail.com
In reply to: Andres Freund (#46)
RE: Remove AIX Support (was: Re: Relation bulk write facility)

Hi,
Historically, many of the public hospitals I work for have had IBM Power hardware.
The SMT8 (8 threads per core) capability of the Power CPUs is useful to lower Oracle licence & support costs. We are migrating to PostgreSQL, and it runs very well on Power, especially since the (relatively) recent parallel execution features of the RDBMS match the CPU's capabilities very well.
We chose to run PostgreSQL on Debian/Power (Little Endian), since ppc64le is an official Debian port, so no AIX. The only problem is that we still need to access Oracle databases, and it can be useful to read them directly with oracle_fdw, but that tool needs Oracle's Instant Client, which of course is not open source. Oracle provides a binary, but they don't provide patches for Debian/Power Little Endian (a strange situation...). Just to say that we did of course choose Linux for PostgreSQL, but sometimes things are not so easy... We could have chosen AIX, and we would still have a question mark about interoperability.
Best regards,
Phil

#49Noah Misch
noah@leadboat.com
In reply to: Heikki Linnakangas (#11)
Re: Relation bulk write facility

On Fri, Feb 23, 2024 at 04:27:34PM +0200, Heikki Linnakangas wrote:

Committed this. Thanks everyone!

Commit 8af2565 wrote:

--- /dev/null
+++ b/src/backend/storage/smgr/bulk_write.c
+/*
+ * Finish bulk write operation.
+ *
+ * This WAL-logs and flushes any remaining pending writes to disk, and fsyncs
+ * the relation if needed.
+ */
+void
+smgr_bulk_finish(BulkWriteState *bulkstate)
+{
+	/* WAL-log and flush any remaining pages */
+	smgr_bulk_flush(bulkstate);
+
+	/*
+	 * When we wrote out the pages, we passed skipFsync=true to avoid the
+	 * overhead of registering all the writes with the checkpointer.  Register
+	 * the whole relation now.
+	 *
+	 * There is one hole in that idea: If a checkpoint occurred while we were
+	 * writing the pages, it already missed fsyncing the pages we had written
+	 * before the checkpoint started.  A crash later on would replay the WAL
+	 * starting from the checkpoint, therefore it wouldn't replay our earlier
+	 * WAL records.  So if a checkpoint started after the bulk write, fsync
+	 * the files now.
+	 */
+	if (!SmgrIsTemp(bulkstate->smgr))
+	{

Shouldn't this be "if (bulkstate->use_wal)"? The GetRedoRecPtr()-based
decision is irrelevant to the !wal case. Either we don't need fsync at all
(TEMP or UNLOGGED) or smgrDoPendingSyncs() will do it (wal_level=minimal). I
don't see any functional problem, but this likely arranges for an unnecessary
sync when a checkpoint starts between mdcreate() and here. (The mdcreate()
sync may also be unnecessary, but that's longstanding.)

+		/*
+		 * Prevent a checkpoint from starting between the GetRedoRecPtr() and
+		 * smgrregistersync() calls.
+		 */
+		Assert((MyProc->delayChkptFlags & DELAY_CHKPT_START) == 0);
+		MyProc->delayChkptFlags |= DELAY_CHKPT_START;
+
+		if (bulkstate->start_RedoRecPtr != GetRedoRecPtr())
+		{
+			/*
+			 * A checkpoint occurred and it didn't know about our writes, so
+			 * fsync() the relation ourselves.
+			 */
+			MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
+			smgrimmedsync(bulkstate->smgr, bulkstate->forknum);
+			elog(DEBUG1, "flushed relation because a checkpoint occurred concurrently");
+		}
+		else
+		{
+			smgrregistersync(bulkstate->smgr, bulkstate->forknum);
+			MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
+		}
+	}
+}

This is an elegant optimization.
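
As an aside, for anyone reading the thread without the patch at hand, this is
roughly how a caller drives the facility end to end. This is a simplified
sketch only, loosely modelled on the in-tree callers; "rel" and "nblocks"
stand in for whatever the caller actually has:

    BulkWriteState *bulkstate;

    bulkstate = smgr_bulk_start_rel(rel, MAIN_FORKNUM);

    for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
    {
        /* get a buffer owned by the bulk write facility and fill in a page */
        BulkWriteBuffer buf = smgr_bulk_get_buf(bulkstate);

        PageInit((Page) buf, BLCKSZ, 0);
        /* ... add tuples or index entries to the page ... */

        /* hand the page over; it may be buffered and WAL-logged in batches */
        smgr_bulk_write(bulkstate, blkno, buf, true);
    }

    /* WAL-log and write out any remaining pages, then handle the fsync */
    smgr_bulk_finish(bulkstate);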

#50Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Noah Misch (#49)
Re: Relation bulk write facility

Thanks for poking at this!

On 01/07/2024 23:52, Noah Misch wrote:

Commit 8af2565 wrote:

--- /dev/null
+++ b/src/backend/storage/smgr/bulk_write.c
+/*
+ * Finish bulk write operation.
+ *
+ * This WAL-logs and flushes any remaining pending writes to disk, and fsyncs
+ * the relation if needed.
+ */
+void
+smgr_bulk_finish(BulkWriteState *bulkstate)
+{
+	/* WAL-log and flush any remaining pages */
+	smgr_bulk_flush(bulkstate);
+
+	/*
+	 * When we wrote out the pages, we passed skipFsync=true to avoid the
+	 * overhead of registering all the writes with the checkpointer.  Register
+	 * the whole relation now.
+	 *
+	 * There is one hole in that idea: If a checkpoint occurred while we were
+	 * writing the pages, it already missed fsyncing the pages we had written
+	 * before the checkpoint started.  A crash later on would replay the WAL
+	 * starting from the checkpoint, therefore it wouldn't replay our earlier
+	 * WAL records.  So if a checkpoint started after the bulk write, fsync
+	 * the files now.
+	 */
+	if (!SmgrIsTemp(bulkstate->smgr))
+	{

Shouldn't this be "if (bulkstate->use_wal)"? The GetRedoRecPtr()-based
decision is irrelevant to the !wal case. Either we don't need fsync at all
(TEMP or UNLOGGED) or smgrDoPendingSyncs() will do it (wal_level=minimal).

The point of GetRedoRecPtr() is to detect if a checkpoint has started
concurrently. It works for that purpose whether or not the bulk load is
WAL-logged. It is not compared with the LSNs of WAL records written by
the bulk load.

Unlogged tables do need to be fsync'd. The scenario is:

1. Bulk load an unlogged table.
2. Shut down Postgres cleanly
3. Pull power plug from server, and restart.

We talked about this earlier in the "Unlogged relation copy is not
fsync'd" thread [1]/messages/by-id/65e94fc8-ce1d-dd02-3be3-fda0fe8f2965@iki.fi. I had already forgotten about that; that bug
actually still exists in back branches, and we should fix it.

[1]: /messages/by-id/65e94fc8-ce1d-dd02-3be3-fda0fe8f2965@iki.fi
/messages/by-id/65e94fc8-ce1d-dd02-3be3-fda0fe8f2965@iki.fi

I don't see any functional problem, but this likely arranges for an
unnecessary sync when a checkpoint starts between mdcreate() and
here. (The mdcreate() sync may also be unnecessary, but that's
longstanding.)

Hmm, yes we might do two fsyncs() with wal_level=minimal, unnecessarily.
It seems hard to eliminate the redundancy. smgr_bulk_finish() could skip
the fsync, if it knew that smgrDoPendingSyncs() will do it later.
However, smgrDoPendingSyncs() might also decide to WAL-log the relation
instead of fsyncing it, and in that case we do still need the fsync.

Fortunately, fsync() on a file that's already flushed to disk is pretty
cheap.

--
Heikki Linnakangas
Neon (https://neon.tech)

#51Noah Misch
noah@leadboat.com
In reply to: Heikki Linnakangas (#50)
Re: Relation bulk write facility

On Tue, Jul 02, 2024 at 12:53:05AM +0300, Heikki Linnakangas wrote:

On 01/07/2024 23:52, Noah Misch wrote:

Commit 8af2565 wrote:

--- /dev/null
+++ b/src/backend/storage/smgr/bulk_write.c
+/*
+ * Finish bulk write operation.
+ *
+ * This WAL-logs and flushes any remaining pending writes to disk, and fsyncs
+ * the relation if needed.
+ */
+void
+smgr_bulk_finish(BulkWriteState *bulkstate)
+{
+	/* WAL-log and flush any remaining pages */
+	smgr_bulk_flush(bulkstate);
+
+	/*
+	 * When we wrote out the pages, we passed skipFsync=true to avoid the
+	 * overhead of registering all the writes with the checkpointer.  Register
+	 * the whole relation now.
+	 *
+	 * There is one hole in that idea: If a checkpoint occurred while we were
+	 * writing the pages, it already missed fsyncing the pages we had written
+	 * before the checkpoint started.  A crash later on would replay the WAL
+	 * starting from the checkpoint, therefore it wouldn't replay our earlier
+	 * WAL records.  So if a checkpoint started after the bulk write, fsync
+	 * the files now.
+	 */
+	if (!SmgrIsTemp(bulkstate->smgr))
+	{

Shouldn't this be "if (bulkstate->use_wal)"? The GetRedoRecPtr()-based
decision is irrelevant to the !wal case. Either we don't need fsync at all
(TEMP or UNLOGGED) or smgrDoPendingSyncs() will do it (wal_level=minimal).

The point of GetRedoRecPtr() is to detect if a checkpoint has started
concurrently. It works for that purpose whether or not the bulk load is
WAL-logged. It is not compared with the LSNs of WAL records written by the
bulk load.

I think the significance of start_RedoRecPtr is that it precedes all the records
needed to recreate the bulk write. If start_RedoRecPtr==GetRedoRecPtr() and
we crash after commit, we're indifferent to whether the rel gets synced at a
checkpoint before that crash or rebuilt from WAL after that crash. If
start_RedoRecPtr!=GetRedoRecPtr(), some WAL of the bulk write is already
deleted, so only smgrimmedsync() suffices. Overall, while it is not compared
with LSNs in WAL records, it's significant only to the extent that such a WAL
record exists. What am I missing?

Unlogged tables do need to be fsync'd. The scenario is:

1. Bulk load an unlogged table.
2. Shut down Postgres cleanly
3. Pull power plug from server, and restart.

We talked about this earlier in the "Unlogged relation copy is not fsync'd"
thread [1]. I had already forgotten about that; that bug actually still
exists in back branches, and we should fix it..

[1] /messages/by-id/65e94fc8-ce1d-dd02-3be3-fda0fe8f2965@iki.fi

Ah, that's right. I agree this code suffices for unlogged. As a further
optimization, it would be valid to ignore GetRedoRecPtr() for unlogged and
always call smgrregistersync(). (For any rel, smgrimmedsync() improves on
smgrregistersync() only if we fail to reach the shutdown checkpoint. Without
a shutdown checkpoint, unlogged rels get reset anyway.)

I don't see any functional problem, but this likely arranges for an
unnecessary sync when a checkpoint starts between mdcreate() and
here. (The mdcreate() sync may also be unnecessary, but that's
longstanding.)

Hmm, yes we might do two fsyncs() with wal_level=minimal, unnecessarily. It
seems hard to eliminate the redundancy. smgr_bulk_finish() could skip the
fsync, if it knew that smgrDoPendingSyncs() will do it later. However,
smgrDoPendingSyncs() might also decide to WAL-log the relation instead of
fsyncing it, and in that case we do still need the fsync.

We do not need the fsync in the "WAL-log the relation instead" case; see
/messages/by-id/20230921062210.GA110358@rfd.leadboat.com

So maybe like this:

    if (use_wal)        /* includes init forks */
        current logic;
    else if (unlogged)
        smgrregistersync;
    /* else temp || (permanent && wal_level=minimal): nothing to do */
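
Spelled out, a rough sketch of how that branching could look in
smgr_bulk_finish() is below; "relpersistence" is hypothetical here, since
nothing in the bulk write state currently records the relation's persistence:

    if (bulkstate->use_wal)
    {
        /*
         * WAL-logged (includes init forks): keep the current logic, i.e.
         * compare start_RedoRecPtr with GetRedoRecPtr() under
         * DELAY_CHKPT_START and call smgrregistersync() or smgrimmedsync()
         * accordingly.
         */
    }
    else if (relpersistence == RELPERSISTENCE_UNLOGGED)
    {
        /* Must survive a clean shutdown: let the checkpointer fsync it. */
        smgrregistersync(bulkstate->smgr, bulkstate->forknum);
    }
    else
    {
        /*
         * Temporary relation, or permanent relation with wal_level=minimal:
         * nothing to do here; smgrDoPendingSyncs() handles the latter at
         * commit.
         */
    }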

Fortunately, fsync() on a file that's already flushed to disk is pretty
cheap.

Yep. I'm more concerned about future readers wondering why the function is
using LSNs to decide what to do about data that doesn't appear in WAL. A
comment could be another way to fix that, though.

#52Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Noah Misch (#51)
1 attachment(s)
Re: Relation bulk write facility

On 02/07/2024 02:24, Noah Misch wrote:

On Tue, Jul 02, 2024 at 12:53:05AM +0300, Heikki Linnakangas wrote:

On 01/07/2024 23:52, Noah Misch wrote:

Commit 8af2565 wrote:

--- /dev/null
+++ b/src/backend/storage/smgr/bulk_write.c
+/*
+ * Finish bulk write operation.
+ *
+ * This WAL-logs and flushes any remaining pending writes to disk, and fsyncs
+ * the relation if needed.
+ */
+void
+smgr_bulk_finish(BulkWriteState *bulkstate)
+{
+	/* WAL-log and flush any remaining pages */
+	smgr_bulk_flush(bulkstate);
+
+	/*
+	 * When we wrote out the pages, we passed skipFsync=true to avoid the
+	 * overhead of registering all the writes with the checkpointer.  Register
+	 * the whole relation now.
+	 *
+	 * There is one hole in that idea: If a checkpoint occurred while we were
+	 * writing the pages, it already missed fsyncing the pages we had written
+	 * before the checkpoint started.  A crash later on would replay the WAL
+	 * starting from the checkpoint, therefore it wouldn't replay our earlier
+	 * WAL records.  So if a checkpoint started after the bulk write, fsync
+	 * the files now.
+	 */
+	if (!SmgrIsTemp(bulkstate->smgr))
+	{

Shouldn't this be "if (bulkstate->use_wal)"? The GetRedoRecPtr()-based
decision is irrelevant to the !wal case. Either we don't need fsync at all
(TEMP or UNLOGGED) or smgrDoPendingSyncs() will do it (wal_level=minimal).

The point of GetRedoRecPtr() is to detect if a checkpoint has started
concurrently. It works for that purpose whether or not the bulk load is
WAL-logged. It is not compared with the LSNs of WAL records written by the
bulk load.

I think the significance of start_RedoRecPtr is that it precedes all the records
needed to recreate the bulk write. If start_RedoRecPtr==GetRedoRecPtr() and
we crash after commit, we're indifferent to whether the rel gets synced at a
checkpoint before that crash or rebuilt from WAL after that crash. If
start_RedoRecPtr!=GetRedoRecPtr(), some WAL of the bulk write is already
deleted, so only smgrimmedsync() suffices. Overall, while it is not compared
with LSNs in WAL records, it's significant only to the extent that such a WAL
record exists. What am I missing?

You're right. You pointed out below that we don't need to register or
immediately fsync the relation if it was not WAL-logged; I missed that.

In the alternative universe where we did need to fsync() even in the !use_wal
case, the point of the start_RedoRecPtr==GetRedoRecPtr() comparison was to
detect whether the last checkpoint "missed" fsyncing the files that we wrote.
But the point is moot now.

Unlogged tables do need to be fsync'd. The scenario is:

1. Bulk load an unlogged table.
2. Shut down Postgres cleanly
3. Pull power plug from server, and restart.

We talked about this earlier in the "Unlogged relation copy is not fsync'd"
thread [1]. I had already forgotten about that; that bug actually still
exists in back branches, and we should fix it..

[1] /messages/by-id/65e94fc8-ce1d-dd02-3be3-fda0fe8f2965@iki.fi

Ah, that's right. I agree this code suffices for unlogged. As a further
optimization, it would be valid to ignore GetRedoRecPtr() for unlogged and
always call smgrregistersync(). (For any rel, smgrimmedsync() improves on
smgrregistersync() only if we fail to reach the shutdown checkpoint. Without
a shutdown checkpoint, unlogged rels get reset anyway.)

I don't see any functional problem, but this likely arranges for an
unnecessary sync when a checkpoint starts between mdcreate() and
here. (The mdcreate() sync may also be unnecessary, but that's
longstanding.)

Hmm, yes we might do two fsyncs() with wal_level=minimal, unnecessarily. It
seems hard to eliminate the redundancy. smgr_bulk_finish() could skip the
fsync, if it knew that smgrDoPendingSyncs() will do it later. However,
smgrDoPendingSyncs() might also decide to WAL-log the relation instead of
fsyncing it, and in that case we do still need the fsync.

We do not need the fsync in the "WAL-log the relation instead" case; see
/messages/by-id/20230921062210.GA110358@rfd.leadboat.com

Ah, true, I missed that log_newpage_range() loads the pages to the
buffer cache and dirties them. That kinds of sucks actually, I wish it
didn't need to dirty the buffers.

So maybe like this:

    if (use_wal)        /* includes init forks */
        current logic;
    else if (unlogged)
        smgrregistersync;
    /* else temp || (permanent && wal_level=minimal): nothing to do */

Makes sense, except that we cannot distinguish between unlogged
relations and permanent relations with !use_wal here.

It would be nice to have a relpersistence flag in SMgrRelation. I remember
wanting to have that before, although I don't remember what the context
was exactly.
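
For illustration only, that might look something like this; none of it exists
today:

    /* hypothetical field in SMgrRelationData (smgr.h) */
    char        smgr_relpersistence;    /* RELPERSISTENCE_*, or 0 if unknown */

    /* ...which would let smgr_bulk_finish() do: */
    if (bulkstate->smgr->smgr_relpersistence == RELPERSISTENCE_UNLOGGED)
        smgrregistersync(bulkstate->smgr, bulkstate->forknum);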

Fortunately, fsync() on a file that's already flushed to disk is pretty
cheap.

Yep. I'm more concerned about future readers wondering why the function is
using LSNs to decide what to do about data that doesn't appear in WAL. A
comment could be another way to fix that, though.

Agreed, this is all very subtle, and deserves a good comment. What do
you think of the attached?

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachments:

0001-Relax-fsyncing-at-end-of-bulk-load-that-was-not-WAL-.patchtext/x-patch; charset=UTF-8; name=0001-Relax-fsyncing-at-end-of-bulk-load-that-was-not-WAL-.patchDownload
From 6a7a2f34b2134b055c629789aa18a4ad0c4b50a9 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 2 Jul 2024 14:30:29 +0300
Subject: [PATCH 1/1] Relax fsyncing at end of bulk load that was not
 WAL-logged

And improve the comments.
---
 src/backend/storage/smgr/bulk_write.c | 71 ++++++++++++++++++++++-----
 1 file changed, 60 insertions(+), 11 deletions(-)

diff --git a/src/backend/storage/smgr/bulk_write.c b/src/backend/storage/smgr/bulk_write.c
index 4a10ece4c39..f66d718c7be 100644
--- a/src/backend/storage/smgr/bulk_write.c
+++ b/src/backend/storage/smgr/bulk_write.c
@@ -132,19 +132,68 @@ smgr_bulk_finish(BulkWriteState *bulkstate)
 	smgr_bulk_flush(bulkstate);
 
 	/*
-	 * When we wrote out the pages, we passed skipFsync=true to avoid the
-	 * overhead of registering all the writes with the checkpointer.  Register
-	 * the whole relation now.
-	 *
-	 * There is one hole in that idea: If a checkpoint occurred while we were
-	 * writing the pages, it already missed fsyncing the pages we had written
-	 * before the checkpoint started.  A crash later on would replay the WAL
-	 * starting from the checkpoint, therefore it wouldn't replay our earlier
-	 * WAL records.  So if a checkpoint started after the bulk write, fsync
-	 * the files now.
+	 * Fsync the relation, or ask the checkpoint to register it, if necessary.
 	 */
-	if (!SmgrIsTemp(bulkstate->smgr))
+	if (SmgrIsTemp(bulkstate->smgr))
 	{
+		/* Temporary relations don't need to be fsync'd, ever */
+	}
+	else if (!bulkstate->use_wal)
+	{
+		/*
+		 * This is either an unlogged relation, or a permanent relation but we
+		 * skipped WAL-logging because wal_level=minimal:
+		 *
+		 * A) Unlogged relation
+		 *
+		 *    Unlogged relations will go away on crash, but they need to be
+		 *    fsync'd on a clean shutdown. It's sufficient to call
+		 *    smgrregistersync(), that ensures that the checkpointer will
+		 *    flush it at the shutdown checkpoint. (It will flush it on the
+		 *    next online checkpoint too, which is not strictly necessary.)
+		 *
+		 *    Note that the init-fork of an unlogged relation is not
+		 *    considered unlogged for our purposes. It's treated like a
+		 *    regular permanent relation. The callers will pass use_wal=true
+		 *    for the init fork.
+		 *
+		 * B) Permanent relation, WAL-logging skipped because wal_level=minimal
+		 *
+		 *    This is a new relation, and we didn't WAL-log the pages as we
+		 *    wrote, but they need to be fsync'd before commit.
+		 *
+		 *    We don't need to do that here, however. The fsync() is done at
+		 *    commit, by smgrDoPendingSyncs() (*).
+		 *
+		 *    (*) smgrDoPendingSyncs() might decide to WAL-log the whole
+		 *    relation at commit instead of fsyncing it, if the relation was
+		 *    very small, but it's smgrDoPendingSyncs() responsibility in any
+		 *    case.
+		 *
+		 * We cannot distinguish the two here, so conservatively assume it's
+		 * an unlogged relation. A permanent relation with wal_level=minimal
+		 * would require no actions, see above.
+		 */
+		smgrregistersync(bulkstate->smgr, bulkstate->forknum);
+	}
+	else
+	{
+		/*
+		 * Permanent relation, WAL-logged normally.
+		 *
+		 * We already WAL-logged all the pages, so they will be replayed from
+		 * WAL on crash. However, when we wrote out the pages, we passed
+		 * skipFsync=true to avoid the overhead of registering all the writes
+		 * with the checkpointer.  Register the whole relation now.
+		 *
+		 * There is one hole in that idea: If a checkpoint occurred while we
+		 * were writing the pages, it already missed fsyncing the pages we had
+		 * written before the checkpoint started.  A crash later on would
+		 * replay the WAL starting from the checkpoint, therefore it wouldn't
+		 * replay our earlier WAL records.  So if a checkpoint started after
+		 * the bulk write, fsync the files now.
+		 */
+
 		/*
 		 * Prevent a checkpoint from starting between the GetRedoRecPtr() and
 		 * smgrregistersync() calls.
-- 
2.39.2

#53Noah Misch
noah@leadboat.com
In reply to: Heikki Linnakangas (#52)
Re: Relation bulk write facility

On Tue, Jul 02, 2024 at 02:42:50PM +0300, Heikki Linnakangas wrote:

On 02/07/2024 02:24, Noah Misch wrote:

On Tue, Jul 02, 2024 at 12:53:05AM +0300, Heikki Linnakangas wrote:

log_newpage_range() loads the pages to the buffer
cache and dirties them. That kind of sucks actually, I wish it didn't need
to dirty the buffers.

Agreed.

Fortunately, fsync() on a file that's already flushed to disk is pretty
cheap.

Yep. I'm more concerned about future readers wondering why the function is
using LSNs to decide what to do about data that doesn't appear in WAL. A
comment could be another way to fix that, though.

Agreed, this is all very subtle, and deserves a good comment. What do you
think of the attached?

Looks good. Thanks. pgindent doesn't preserve all your indentation, but it
doesn't make things objectionable, either.

#54Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Noah Misch (#53)
Re: Relation bulk write facility

On 03/07/2024 06:41, Noah Misch wrote:

On Tue, Jul 02, 2024 at 02:42:50PM +0300, Heikki Linnakangas wrote:

On 02/07/2024 02:24, Noah Misch wrote:

On Tue, Jul 02, 2024 at 12:53:05AM +0300, Heikki Linnakangas wrote:

Fortunately, fsync() on a file that's already flushed to disk is pretty
cheap.

Yep. I'm more concerned about future readers wondering why the function is
using LSNs to decide what to do about data that doesn't appear in WAL. A
comment could be another way to fix that, though.

Agreed, this is all very subtle, and deserves a good comment. What do you
think of the attached?

Looks good. Thanks. pgindent doesn't preserve all your indentation, but it
doesn't make things objectionable, either.

Committed, thanks!

(Sorry for the delay, I had forgotten about this already and found it
only now sedimented at the bottom of my inbox)

--
Heikki Linnakangas
Neon (https://neon.tech)